In the previous post, we demonstrated how to efficiently compute co-occurrences with matrix algebra and use those calculations to recommend books to users. Though we saw some sensible recommendations come out of this approach, it also suffers from a number of issues, including:

- The
*Gatsby Problem*: popular books tend to be over represented in the recommendations. - The scale of recommendations is difficult to interpret and compare. Is there a “good” score for a recommendation?

In this post, we’re going to recast our model in terms of probabilities as a first attempt to address these issues.

## The Probability Model

In this post, we’ll develop a model to answer the question:

If someone has read book A, what is the probability that they have also read book B?

$P(B | A)$

This is a *conditional probability* model. Notice how different this model is from the co-occurrence approach we’ve taken so far. The co-occurrence model can only answer a question about *joint probability*:

What is the probability that $A$ and $B$ are read together?

$P(A \cap B)$

An important feature of this new model is that $P(A|B) \neq P(B|A)$ where of course $P(A \cap B) = P(B \cap A)$. This should give us some hope that we may be able to take the popularity bias of a book into account.

## The matrix model reloaded

In previous posts, we worked with the $n$ users by $m$ books matrix $U$. In this matrix, the entry $u_{ij}=1$ when user $i$ has read books $j$ and is zero otherwise. With $U$ defined this way, the matrix product $R=U^TU$ turns out to give co-occurrence counts, i.e. $r_{ij}$ is the number of users who have read both books $i$ and $j$.

Now we’ll build the new matrix $\tilde{R}$ where $\tilde{r}_{ij}$ models the probability that a user has read book $i$ given they have read book $j$. For this, we first recall the definition of conditional probability: $$P(i|j) = \frac{P(i,j)}{P(j)}.$$

We model the joint probability $P(i,j)$ as: $$P(i,j) = \frac{[R – D]_{ij}}{\sigma}$$ where:

- $D$ is the $m \times m$ diagonal matrix with $$d_{ii} = r_{ii} = \sum_{k} u_{ki}^2 =\sum_{k} u_{ki}$$ for $1 \leq i \leq m$
- The scalar constant $\sigma$ is given by: $$\sigma = \sum_{i=1}^{m} \sum_{j=1}^m r_{ij} – d_{ij}.$$

The only slightly mysterious term in this expression is the subtraction of the matrix $D$ from $R$. The reason for this is that the diagonal of $U^TU$ needs to be “zeroed” out since it’s not a valid co-occurrence. The unconditional probability $P(j)$ is then computed simply as a row sum of the joint distribution, i.e.: $$P(j) = \sum_{i} P(i,j).$$ So we have all the ingredients to compute $P(i|j)$ directly.

Let the $m$-dimensional query vector $\vec{q}$ be the discrete probability distribution that characterizes what a user has read so far. Since $\vec{q}$ is a discrete probability distribution, $0 \leq q_i \leq 1$ for all $i$ and $$\sum_{i} q_i =1.$$ For now, we can think of $\vec{q}$ as giving “weights” to all the books a user has read so far. For example, $\vec{q}=\vec{e}_i$ if the user has read just book $i$. It’s somewhat less clear what $\vec{q}$ means when several components are non-zero, but we’ll make this explicit when we take up Markov-chain models in later posts.

Now let $P$ be the matrix with $p_{ij} = P(i|j)$

This all works assuming we can compute $R=U^TU$. However, even when $U$ is sparse and of modest dimension, $U^TU$ can still be far too big to compute directly or hold in memory because in general there is no guarantee that the product of two sparse matrices will be sparse. So can we compute recommendations incrementally as we did with the co-occurrence recommender?

The answer is yes. To do this:

- Compute once and store the row sums of $R – D$. This can be done efficiently by computing as $$\vec{s} = U^T (U\mathbf{1}_m) – D\mathbf{1}_m$$ where $\mathbf{1}_m$ is an $m$-vector of 1s.
- Let $s_i$ be $i$-th component of $\vec{s}$. Set $S$ equal to the diagonal matrix with i-th entry equal to $1/s_i$ when $s_i>0$ and $1$ otherwise.

Let the $m$-dimensional query vector $\vec{q}$ be the discrete probability distribution that characterizes what a user has read so far. Since $\vec{q}$ is a discrete probability distribution, $0 \leq q_i \leq 1$ for all $i$ and $$\sum_{i} q_i =1.$$ For now, we can think of $\vec{q}$ as giving “weights” to all the books a user has read so far. For example, $\vec{q}=\vec{e}_i$ if the user has read just book $i$ and $P\vec{q}$ is the probability distribution $P( \cdot | i)$. It’s somewhat less clear what $\vec{q}$ means when several components are non-zero, but we’ll make this explicit when we take up Markov-chain models in later posts.

We can then compute $P\vec{q}$ as: $$\vec{y} = S \vec{q}$$ $$P\vec{q} = U^T(U\vec{y}) – D\vec{y}.$$ This computation is highly efficient provided we perform the sparse matrix multiplications in the order dictated by the parentheses in this expression.

Here’s a simple numerical example comparing the full computation with the iterative version:

```
import scipy as sp
import numpy as np
U = sp.matrix(
[
# books
# 0 1 2 3 4
[0,1,0,1,0], # User A
[0,0,1,0,1], # User B
[1,0,1,0,1], # User C
[0,1,0,0,1], # User D
[1,1,1,1,1], # User E
[0,1,1,0,1], # User F
[0,0,1,1,0], # User G
[1,1,0,1,0], # User H
]
)
#
# The non-iterative computation
#
R = U.transpose().dot(U)
d = np.diag(R)
D = np.diag(d)
sigma = (R-D).sum()
R_tilde = (R-D)/sigma
P = R_tilde / R_tilde.sum(axis=0)
assert R_tilde.sum() == 1.0, "The joint distribution must sum to 1"
assert np.isclose(P.sum(axis=0), 1.0).all(), "Each column of the conditional distribution must sum to 1"
#
# The iterative computation
#
n,m = U.shape
s = U.transpose().dot(U.dot(np.ones((m, 1)))) - D.dot(np.ones((m,1)))
s[np.isclose(s,0)] = 1
S = np.diagflat(1/s)
#
# Example 1: Check each e_i.
#
Q = np.eye(5)
for i in range(Q.shape[-1]):
q = Q[:,i].reshape(-1,1)
y = S.dot(q)
r = U.transpose().dot(U.dot(y)) - D.dot(y)
assert np.isclose(r.sum(), 1.0).all(), "The conditional distribution must sum to 1"
assert np.all(np.isclose(P[:,i], r)), "The non-iterative and iterative approaches must agree"
#
# Example 2: Check some random distributions
#
for _ in range(30):
q = np.random.rand(5,1)
q = q / q.sum()
y = S.dot(q)
r = U.transpose().dot(U.dot(y)) - D.dot(y)
assert np.isclose(r.sum(), 1.0).all(), "The conditional distribution must sum to 1"
assert np.all(np.isclose(P.dot(q), r)), "The non-iterative and iterative approaches must agree"
```

The complete implementation of this probability-based model using sparse matrices is available here.

## Recommendations with Goodreads Data

We can now compare the results of the co-occurrence recommendation approach to the probability model for our running set of evaluation queries.

description | query | rank | cooccurrence result | cooccurrence score | probability result | probability score |
---|---|---|---|---|---|---|

classic post-modern literature | The Sheltering Sky by Bowles | 1 | The Sheltering Sky by Paul Bowles | 1369 | The Great Gatsby by F. Scott Fitzgerald | 0.0012 |

classic post-modern literature | The Sheltering Sky by Bowles | 2 | The Great Gatsby by F. Scott Fitzgerald | 1058 | The Catcher in the Rye by J.D. Salinger | 0.001 |

classic post-modern literature | The Sheltering Sky by Bowles | 3 | The Catcher in the Rye by J.D. Salinger | 999 | Animal Farm by George Orwell | 0.001 |

classic post-modern literature | The Sheltering Sky by Bowles | 4 | To Kill a Mockingbird by Harper Lee | 925 | Of Mice and Men by John Steinbeck | 0.001 |

classic post-modern literature | The Sheltering Sky by Bowles | 5 | 1984 by George Orwell | 876 | 1984 by George Orwell | 0.0009 |

classic literature | The Brothers Karamazov | 1 | 1984 by George Orwell | 316 | 1984 by George Orwell | 0.0021 |

classic literature | The Brothers Karamazov | 2 | The Great Gatsby by F. Scott Fitzgerald | 312 | To Kill a Mockingbird by Harper Lee | 0.0019 |

classic literature | The Brothers Karamazov | 3 | The Catcher in the Rye by J.D. Salinger | 276 | The Great Gatsby by F. Scott Fitzgerald | 0.0019 |

classic literature | The Brothers Karamazov | 4 | To Kill a Mockingbird by Harper Lee | 275 | The Catcher in the Rye by J.D. Salinger | 0.0017 |

classic literature | The Brothers Karamazov | 5 | Animal Farm by George Orwell | 257 | Animal Farm by George Orwell | 0.0016 |

a classic but technical math book | counterexamples in analysis | 1 | Counterexamples in Analysis by Bernard R. Gelbaum | 4 | Meta Math!: The Quest for Omega (Peter N. Nevraumont Books) by Gregory Chaitin | 0.0016 |

a classic but technical math book | counterexamples in analysis | 2 | The Fault in Our Stars by John Green | 2 | Orange Is the New Black by Piper Kerman | 0.0016 |

a classic but technical math book | counterexamples in analysis | 3 | Calculus by Michael Spivak | 2 | What Is Mathematics?: An Elementary Approach to Ideas and Methods by Richard Courant | 0.0016 |

a classic but technical math book | counterexamples in analysis | 4 | Harry Potter and the Methods of Rationality by Eliezer Yudkowsky | 2 | Men of Mathematics by Eric Temple Bell | 0.0016 |

a classic but technical math book | counterexamples in analysis | 5 | Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character by Richard Feynman | 2 | A Brief History of Time: From the Big Bang to Black Holes by Stephen Hawking | 0.0016 |

children's book | goodnight moon | 1 | Goodnight Moon by Margaret Wise Brown | 16117 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 0.0012 |

children's book | goodnight moon | 2 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 11706 | The Great Gatsby by F. Scott Fitzgerald | 0.0011 |

children's book | goodnight moon | 3 | To Kill a Mockingbird by Harper Lee | 10875 | 1984 by George Orwell | 0.0011 |

children's book | goodnight moon | 4 | Where the Wild Things Are by Maurice Sendak | 10485 | Harry Potter and the Chamber of Secrets (Harry Potter, #2) by J.K. Rowling | 0.0011 |

children's book | goodnight moon | 5 | The Hunger Games (The Hunger Games, #1) by Suzanne Collins | 10481 | Green Eggs and Ham by Dr. Seuss | 0.0011 |

a 2015 best-seller | girl on the train | 1 | The Girl on the Train by Paula Hawkins | 5478 | The Hunger Games (The Hunger Games, #1) by Suzanne Collins | 0.002 |

a 2015 best-seller | girl on the train | 2 | The Hunger Games (The Hunger Games, #1) by Suzanne Collins | 4374 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 0.0018 |

a 2015 best-seller | girl on the train | 3 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 3702 | Twilight (Twilight, #1) by Stephenie Meyer | 0.0017 |

a 2015 best-seller | girl on the train | 4 | Twilight (Twilight, #1) by Stephenie Meyer | 3487 | Catching Fire (The Hunger Games, #2) by Suzanne Collins | 0.0015 |

a 2015 best-seller | girl on the train | 5 | To Kill a Mockingbird by Harper Lee | 3283 | The Fault in Our Stars by John Green | 0.0015 |

a mystery/thriller series | a letter of mary | 1 | A Letter of Mary (Mary Russell, #3) by Laurie R. King | 1403 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 0.0011 |

a mystery/thriller series | a letter of mary | 2 | The Beekeeper's Apprentice (Mary Russell, #1) by Laurie R. King | 1307 | The Da Vinci Code (Robert Langdon, #2) by Dan Brown | 0.0011 |

a mystery/thriller series | a letter of mary | 3 | A Monstrous Regiment of Women (Mary Russell, #2) by Laurie R. King | 1302 | To Kill a Mockingbird by Harper Lee | 0.001 |

a mystery/thriller series | a letter of mary | 4 | The Moor (Mary Russell, #4) by Laurie R. King | 1066 | Harry Potter and the Deathly Hallows (Harry Potter, #7) by J.K. Rowling | 0.001 |

a mystery/thriller series | a letter of mary | 5 | O Jerusalem (Mary Russell, #5) by Laurie R. King | 996 | Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) by J.K. Rowling | 0.001 |

obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 1 | The Last Englishman: The Life of J. L. Carr by Byron Rogers | 6 | Slaughterhouse-Five by Kurt Vonnegut Jr. | 0.0009 |

obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 2 | We Have Always Lived in the Castle by Shirley Jackson | 3 | Breakfast of Champions by Kurt Vonnegut Jr. | 0.0009 |

obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 3 | A Month in the Country by J.L. Carr | 3 | Lord of the Flies by William Golding | 0.0009 |

obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 4 | The Harpole Report by J.L. Carr | 3 | The Catcher in the Rye by J.D. Salinger | 0.0009 |

obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 5 | From the Mouth of the Whale by Sjon | 3 | 1984 by George Orwell | 0.0009 |

classic philosophy text | the world as will and representation | 1 | The World as Will and Representation, Vol 1 by Arthur Schopenhauer | 687 | 1984 by George Orwell | 0.0022 |

classic philosophy text | the world as will and representation | 2 | 1984 by George Orwell | 393 | Thus Spoke Zarathustra by Friedrich Nietzsche | 0.002 |

classic philosophy text | the world as will and representation | 3 | The World as Will and Representation, Vol 2 by Arthur Schopenhauer | 387 | Animal Farm by George Orwell | 0.0019 |

classic philosophy text | the world as will and representation | 4 | Thus Spoke Zarathustra by Friedrich Nietzsche | 357 | The Stranger by Albert Camus | 0.0019 |

classic philosophy text | the world as will and representation | 5 | Animal Farm by George Orwell | 338 | The Republic by Plato | 0.0016 |

Though we see some changes in results (a mix of improvements and non-improvements), we continue to see popular books dominating the list of recommendations. We’ll continue to build on the probability approach developed in this post throughout the series to further mitigate the “Gatsby Problem”. But up next, we’ll take a brief detour to review Markov chains which will be an important tool for analyzing the behavior of our probability based models.