Recommendation Systems: From Co-occurrence Counts to Probabilities

In the previous post, we demonstrated how to efficiently compute co-occurrences with matrix algebra and use those calculations to recommend books to users. Though we saw some sensible recommendations come out of this approach, it also suffers from a number of issues, including:

  • The Gatsby Problem: popular books tend to be overrepresented in the recommendations.
  • The scale of recommendations is difficult to interpret and compare. Is there a “good” score for a recommendation?

In this post, we’re going to recast our model in terms of probabilities as a first attempt to address these issues.

The Probability Model

In this post, we’ll develop a model to answer the question:

If someone has read book A, what is the probability that they have also read book B?

$P(B | A)$

This is a conditional probability model. Notice how different this model is from the co-occurrence approach we’ve taken so far. The co-occurrence model can only answer a question about joint probability:

What is the probability that $A$ and $B$ are read together?

$P(A \cap B)$

An important feature of this new model is that in general $P(A|B) \neq P(B|A)$, whereas of course $P(A \cap B) = P(B \cap A)$. This should give us some hope that we may be able to take a book’s popularity bias into account.
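To make the asymmetry concrete, here’s a toy calculation with made-up numbers: suppose 50 of 100 users have read The Great Gatsby, 10 have read some niche book $B$, and 8 have read both. Then $$P(\text{Gatsby} \mid B) = \frac{8/100}{10/100} = 0.8, \qquad P(B \mid \text{Gatsby}) = \frac{8/100}{50/100} = 0.16,$$ even though the joint probability $P(\text{Gatsby} \cap B) = 8/100$ is the same in both directions.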

The Matrix Model Reloaded

In previous posts, we worked with the $n$ users by $m$ books matrix $U$. In this matrix, the entry $u_{ij}=1$ when user $i$ has read book $j$ and is zero otherwise. With $U$ defined this way, the matrix product $R=U^TU$ turns out to give co-occurrence counts, i.e. $r_{ij}$ is the number of users who have read both books $i$ and $j$.
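As a quick worked example, take two users and three books: $$U = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \qquad R = U^TU = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$$ Here $r_{12} = 1$ because exactly one user has read both books 1 and 2, while the diagonal entry $r_{22} = 2$ counts the users who have read book 2.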

Now we’ll build a new matrix $\tilde{R}$ whose entry $\tilde{r}_{ij}$ models the joint probability that a user has read both books $i$ and $j$; the conditional probabilities we’re after will follow from it. To see how, first recall the definition of conditional probability: $$P(i|j) = \frac{P(i,j)}{P(j)}.$$

We model the joint probability $P(i,j)$ as: $$\tilde{r}_{ij} = P(i,j) = \frac{[R - D]_{ij}}{\sigma}$$ where:

  • $D$ is the $m \times m$ diagonal matrix with $$d_{ii} = r_{ii} = \sum_{k} u_{ki}^2 =\sum_{k} u_{ki}$$ for $1 \leq i \leq m$
  • The scalar constant $\sigma$ is given by: $$\sigma = \sum_{i=1}^{m} \sum_{j=1}^m \left( r_{ij} - d_{ij} \right).$$
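For some intuition about $\sigma$: if $k_u = \sum_j u_{uj}$ is the number of books user $u$ has read, then $\sum_{i,j} r_{ij} = \sum_u k_u^2$ and $\sum_i d_{ii} = \sum_u k_u$, so $$\sigma = \sum_u k_u (k_u - 1),$$ the total number of ordered pairs of distinct books read by the same user.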

The only slightly mysterious term in this expression is the subtraction of the matrix $D$ from $R$. The reason is that the diagonal of $U^TU$ needs to be “zeroed” out: a book trivially co-occurs with itself, which is not a valid co-occurrence between distinct books. The unconditional probability $P(j)$ is then computed simply as a column sum of the joint distribution, i.e.: $$P(j) = \sum_{i} P(i,j).$$ So we have all the ingredients to compute $P(i|j)$ directly.

Now let $P$ be the $m \times m$ matrix with $p_{ij} = P(i|j)$; each column of $P$ is then a probability distribution over books.
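In fact, the normalizer $\sigma$ cancels when we form the conditional. Writing $s_j = \sum_k (r_{kj} - d_{kj})$ for the $j$-th column sum of $R - D$ (equivalently its $j$-th row sum, since $R - D$ is symmetric): $$P(i|j) = \frac{P(i,j)}{P(j)} = \frac{[R - D]_{ij}/\sigma}{s_j/\sigma} = \frac{[R - D]_{ij}}{s_j},$$ so in matrix form $P = (R - D)S$, where $S$ is the diagonal matrix with entries $1/s_j$. This observation is what makes the incremental computation below possible.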

This all works assuming we can actually compute $R=U^TU$. However, even when $U$ is sparse, $U^TU$ can be far too big to compute directly or hold in memory: in general there is no guarantee that the product of two sparse matrices will be sparse. So can we compute recommendations incrementally, as we did with the co-occurrence recommender?

The answer is yes. To do this:

  • Compute once and store the row sums of $R - D$. This can be done efficiently as $$\vec{s} = U^T (U\mathbf{1}_m) - D\mathbf{1}_m$$ where $\mathbf{1}_m$ is an $m$-vector of 1s.
  • Let $s_i$ be the $i$-th component of $\vec{s}$. Set $S$ equal to the diagonal matrix whose $i$-th diagonal entry is $1/s_i$ when $s_i > 0$ and $1$ otherwise.

Let the $m$-dimensional query vector $\vec{q}$ be the discrete probability distribution that characterizes what a user has read so far. Since $\vec{q}$ is a discrete probability distribution, $0 \leq q_i \leq 1$ for all $i$ and $$\sum_{i} q_i =1.$$ For now, we can think of $\vec{q}$ as giving “weights” to all the books a user has read so far. For example, $\vec{q}=\vec{e}_i$ if the user has read just book $i$ and $P\vec{q}$ is the probability distribution $P( \cdot | i)$. It’s somewhat less clear what $\vec{q}$ means when several components are non-zero, but we’ll make this explicit when we take up Markov-chain models in later posts.

Since $P = (R - D)S$, we can then compute $P\vec{q}$ as: $$\vec{y} = S \vec{q}$$ $$P\vec{q} = U^T(U\vec{y}) - D\vec{y}.$$ This computation is highly efficient provided we perform the sparse matrix-vector products in the order dictated by the parentheses in this expression.

Here’s a simple numerical example comparing the full computation with the iterative version:

import numpy as np

U = np.array(
    [
    #    books    
    #    0 1 2 3 4   
        [0,1,0,1,0], # User A
        [0,0,1,0,1], # User B
        [1,0,1,0,1], # User C
        [0,1,0,0,1], # User D
        [1,1,1,1,1], # User E
        [0,1,1,0,1], # User F
        [0,0,1,1,0], # User G
        [1,1,0,1,0], # User H
    ] 
)

#
# The non-iterative computation
#
R = U.transpose().dot(U)  # co-occurrence counts
d = np.diag(R)            # diagonal: number of readers per book
D = np.diag(d)
sigma = (R - D).sum()
R_tilde = (R - D) / sigma  # the joint distribution P(i, j)

P = R_tilde / R_tilde.sum(axis=0)  # conditional P(i | j): normalize each column

assert np.isclose(R_tilde.sum(), 1.0), "The joint distribution must sum to 1"
assert np.isclose(P.sum(axis=0), 1.0).all(), "Each column of the conditional distribution must sum to 1"

#
# The iterative computation
#
n, m = U.shape
# Row sums of R - D, computed without materializing R = U^T U
s = U.transpose().dot(U.dot(np.ones((m, 1)))) - D.dot(np.ones((m, 1)))
s[np.isclose(s, 0)] = 1  # guard against division by zero
S = np.diagflat(1 / s)

#
# Example 1: Check each e_i.
#
Q = np.eye(m)
for i in range(Q.shape[-1]):
    q = Q[:, i].reshape(-1, 1)
    y = S.dot(q)
    r = U.transpose().dot(U.dot(y)) - D.dot(y)
    assert np.isclose(r.sum(), 1.0), "The conditional distribution must sum to 1"
    assert np.all(np.isclose(P[:, [i]], r)), "The non-iterative and iterative approaches must agree"
    
#
# Example 2: Check some random distributions
#
for _ in range(30):
    q = np.random.rand(m, 1)
    q = q / q.sum()  # normalize to a probability distribution
    y = S.dot(q)
    r = U.transpose().dot(U.dot(y)) - D.dot(y)
    assert np.isclose(r.sum(), 1.0), "The conditional distribution must sum to 1"
    assert np.all(np.isclose(P.dot(q), r)), "The non-iterative and iterative approaches must agree"

The complete implementation of this probability-based model using sparse matrices is available here.
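To give a flavor of what that looks like, here is a minimal sketch of the incremental computation with scipy.sparse; the random 0/1 matrix, sizes, and variable names below are illustrative, not taken from the linked implementation:

import numpy as np
from scipy import sparse

# Illustrative sparse 0/1 reads matrix: 10,000 users by 2,000 books.
U = sparse.random(10_000, 2_000, density=0.01, format="csr",
                  random_state=np.random.default_rng(0))
U.data[:] = 1.0

# Diagonal of U^T U: for a 0/1 matrix, just the per-book read counts.
d = np.asarray(U.sum(axis=0)).ravel()

# Row sums of R - D, computed without ever forming the m-by-m matrix R.
ones = np.ones(U.shape[1])
s = U.T.dot(U.dot(ones)) - d
s[np.isclose(s, 0)] = 1.0  # guard against division by zero

# Query: the user has read only book 0.
q = np.zeros(U.shape[1])
q[0] = 1.0

# P q via two sparse matrix-vector products; y = S q needs no matrix S.
y = q / s
p = U.T.dot(U.dot(y)) - d * y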

Recommendations with Goodreads Data

We can now compare the results of the co-occurrence recommendation approach to the probability model for our running set of evaluation queries.

| description | query | rank | co-occurrence result | co-occurrence score | probability result | probability score |
| --- | --- | --- | --- | --- | --- | --- |
| classic post-modern literature | The Sheltering Sky by Bowles | 1 | The Sheltering Sky by Paul Bowles | 1369 | The Great Gatsby by F. Scott Fitzgerald | 0.0012 |
| classic post-modern literature | The Sheltering Sky by Bowles | 2 | The Great Gatsby by F. Scott Fitzgerald | 1058 | The Catcher in the Rye by J.D. Salinger | 0.001 |
| classic post-modern literature | The Sheltering Sky by Bowles | 3 | The Catcher in the Rye by J.D. Salinger | 999 | Animal Farm by George Orwell | 0.001 |
| classic post-modern literature | The Sheltering Sky by Bowles | 4 | To Kill a Mockingbird by Harper Lee | 925 | Of Mice and Men by John Steinbeck | 0.001 |
| classic post-modern literature | The Sheltering Sky by Bowles | 5 | 1984 by George Orwell | 876 | 1984 by George Orwell | 0.0009 |
| classic literature | The Brothers Karamazov | 1 | 1984 by George Orwell | 316 | 1984 by George Orwell | 0.0021 |
| classic literature | The Brothers Karamazov | 2 | The Great Gatsby by F. Scott Fitzgerald | 312 | To Kill a Mockingbird by Harper Lee | 0.0019 |
| classic literature | The Brothers Karamazov | 3 | The Catcher in the Rye by J.D. Salinger | 276 | The Great Gatsby by F. Scott Fitzgerald | 0.0019 |
| classic literature | The Brothers Karamazov | 4 | To Kill a Mockingbird by Harper Lee | 275 | The Catcher in the Rye by J.D. Salinger | 0.0017 |
| classic literature | The Brothers Karamazov | 5 | Animal Farm by George Orwell | 257 | Animal Farm by George Orwell | 0.0016 |
| a classic but technical math book | counterexamples in analysis | 1 | Counterexamples in Analysis by Bernard R. Gelbaum | 4 | Meta Math!: The Quest for Omega (Peter N. Nevraumont Books) by Gregory Chaitin | 0.0016 |
| a classic but technical math book | counterexamples in analysis | 2 | The Fault in Our Stars by John Green | 2 | Orange Is the New Black by Piper Kerman | 0.0016 |
| a classic but technical math book | counterexamples in analysis | 3 | Calculus by Michael Spivak | 2 | What Is Mathematics?: An Elementary Approach to Ideas and Methods by Richard Courant | 0.0016 |
| a classic but technical math book | counterexamples in analysis | 4 | Harry Potter and the Methods of Rationality by Eliezer Yudkowsky | 2 | Men of Mathematics by Eric Temple Bell | 0.0016 |
| a classic but technical math book | counterexamples in analysis | 5 | Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character by Richard Feynman | 2 | A Brief History of Time: From the Big Bang to Black Holes by Stephen Hawking | 0.0016 |
| children's book | goodnight moon | 1 | Goodnight Moon by Margaret Wise Brown | 16117 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 0.0012 |
| children's book | goodnight moon | 2 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 11706 | The Great Gatsby by F. Scott Fitzgerald | 0.0011 |
| children's book | goodnight moon | 3 | To Kill a Mockingbird by Harper Lee | 10875 | 1984 by George Orwell | 0.0011 |
| children's book | goodnight moon | 4 | Where the Wild Things Are by Maurice Sendak | 10485 | Harry Potter and the Chamber of Secrets (Harry Potter, #2) by J.K. Rowling | 0.0011 |
| children's book | goodnight moon | 5 | The Hunger Games (The Hunger Games, #1) by Suzanne Collins | 10481 | Green Eggs and Ham by Dr. Seuss | 0.0011 |
| a 2015 best-seller | girl on the train | 1 | The Girl on the Train by Paula Hawkins | 5478 | The Hunger Games (The Hunger Games, #1) by Suzanne Collins | 0.002 |
| a 2015 best-seller | girl on the train | 2 | The Hunger Games (The Hunger Games, #1) by Suzanne Collins | 4374 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 0.0018 |
| a 2015 best-seller | girl on the train | 3 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 3702 | Twilight (Twilight, #1) by Stephenie Meyer | 0.0017 |
| a 2015 best-seller | girl on the train | 4 | Twilight (Twilight, #1) by Stephenie Meyer | 3487 | Catching Fire (The Hunger Games, #2) by Suzanne Collins | 0.0015 |
| a 2015 best-seller | girl on the train | 5 | To Kill a Mockingbird by Harper Lee | 3283 | The Fault in Our Stars by John Green | 0.0015 |
| a mystery/thriller series | a letter of mary | 1 | A Letter of Mary (Mary Russell, #3) by Laurie R. King | 1403 | Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling | 0.0011 |
| a mystery/thriller series | a letter of mary | 2 | The Beekeeper's Apprentice (Mary Russell, #1) by Laurie R. King | 1307 | The Da Vinci Code (Robert Langdon, #2) by Dan Brown | 0.0011 |
| a mystery/thriller series | a letter of mary | 3 | A Monstrous Regiment of Women (Mary Russell, #2) by Laurie R. King | 1302 | To Kill a Mockingbird by Harper Lee | 0.001 |
| a mystery/thriller series | a letter of mary | 4 | The Moor (Mary Russell, #4) by Laurie R. King | 1066 | Harry Potter and the Deathly Hallows (Harry Potter, #7) by J.K. Rowling | 0.001 |
| a mystery/thriller series | a letter of mary | 5 | O Jerusalem (Mary Russell, #5) by Laurie R. King | 996 | Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) by J.K. Rowling | 0.001 |
| obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 1 | The Last Englishman: The Life of J. L. Carr by Byron Rogers | 6 | Slaughterhouse-Five by Kurt Vonnegut Jr. | 0.0009 |
| obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 2 | We Have Always Lived in the Castle by Shirley Jackson | 3 | Breakfast of Champions by Kurt Vonnegut Jr. | 0.0009 |
| obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 3 | A Month in the Country by J.L. Carr | 3 | Lord of the Flies by William Golding | 0.0009 |
| obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 4 | The Harpole Report by J.L. Carr | 3 | The Catcher in the Rye by J.D. Salinger | 0.0009 |
| obscure biography of the semi-obscure novelist J.L. Carr | the last englishman by byron rogers | 5 | From the Mouth of the Whale by Sjon | 3 | 1984 by George Orwell | 0.0009 |
| classic philosophy text | the world as will and representation | 1 | The World as Will and Representation, Vol 1 by Arthur Schopenhauer | 687 | 1984 by George Orwell | 0.0022 |
| classic philosophy text | the world as will and representation | 2 | 1984 by George Orwell | 393 | Thus Spoke Zarathustra by Friedrich Nietzsche | 0.002 |
| classic philosophy text | the world as will and representation | 3 | The World as Will and Representation, Vol 2 by Arthur Schopenhauer | 387 | Animal Farm by George Orwell | 0.0019 |
| classic philosophy text | the world as will and representation | 4 | Thus Spoke Zarathustra by Friedrich Nietzsche | 357 | The Stranger by Albert Camus | 0.0019 |
| classic philosophy text | the world as will and representation | 5 | Animal Farm by George Orwell | 338 | The Republic by Plato | 0.0016 |

Though we see some changes in the results (a mix of improvements and regressions), popular books continue to dominate the recommendations. We’ll keep building on the probability approach developed in this post throughout the series to further mitigate the “Gatsby Problem”. Up next, though, we’ll take a brief detour to review Markov chains, which will be an important tool for analyzing the behavior of our probability-based models.