Recommendation Systems: Co-occurrence Calculations

In this first post in a series on recommendation systems, we’re going to develop a powerful but highly intuitive representation for user behavior that will allow us to easily make recommendations. Since we’re going to be making heavy use of the Goodreads data set in the series, we’ll formulate our basic recommendation system problem as recommending what book a user should read next.

Suppose we have $N$ users and $M$ books. Each of our users may have read any number of the $M$ books. Let’s represent the user $i$ with the vector $u_i$ which has $1$ in component $j$ if user $i$ has read book $j$ and has a $0$ otherwise.

We can stack our $N$ user vectors into an $N \times M$ matrix $U$. Now let’s do something with this user matrix that may be a bit non-intuitive at first. Set $$R = U^T U.$$ The matrix $R$ is an $M \times M$ matrix and we’re going to demonstrate that the entries of $R$ have a very practical meaning. For this, we’ll work out the details for a small example.

Suppose we have $8$ users named $A,B,\ldots,H$ and $5$ books titled $0,1,\dots,4$. Let the $U$ matrix be:

import numpy as np

# rows are users, columns are books; U[i, j] == 1 iff user i has read book j
U = np.array(
    [
    #    books
    #    0 1 2 3 4
        [0,1,0,1,0], # User A
        [0,0,1,0,1], # User B
        [1,0,1,0,1], # User C
        [0,1,0,0,1], # User D
        [1,1,1,1,1], # User E
        [0,1,1,0,1], # User F
        [0,0,1,1,0], # User G
        [1,1,0,1,0], # User H
    ]
)

This matrix tells us, for example, user $D$ has read books $1$ and $4$ but no others.
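We can check this programmatically (the `users` string below is just my shorthand for the alphabetical row order):

users = "ABCDEFGH"
np.nonzero(U[users.index("D")])[0]  # -> array([1, 4]): books 1 and 4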

Now we compute $R$:

R = U.transpose().dot(U)
#    0 1 2 3 4
# 0 [3 2 2 2 2]
# 1 [2 5 2 3 3]
# 2 [2 2 5 2 4]
# 3 [2 3 2 4 1]
# 4 [2 3 4 1 5]

The matrix $R$ is called a co-occurrence matrix. The entry $R_{i,j}$ for $i \neq j$ counts the number of users who have read both book $i$ and book $j$. The diagonal entries count the total number of times each book has been read. So, for this example, $R$ tells us that 4 users have read both books 2 and 4, and 3 users in total have read book 0.

That’s pretty cool! Here’s why this calculation works. For a diagonal element: $$R_{ii} = \sum_{k} U^T_{ik} U_{ki}=\sum_{k} U_{ki}^2=\sum_{k} U_{ki},$$ where the last equality holds because every entry of $U$ is $0$ or $1$. This expression is just the sum of column $i$ of $U$, which is the total number of times book $i$ has been read. For an off-diagonal element $$R_{ij} = \sum_{k} U^T_{ik}U_{kj}=\sum_{k}U_{ki}U_{kj}.$$ When $U_{ki}=U_{kj}=1$, user $k$ has read both $i$ and $j$; in all other cases $U_{ki}U_{kj}=0$. So the sum simply counts the number of users who have read both $i$ and $j$.
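As a quick sanity check (a sketch reusing the `U` and `R` from above), we can rebuild $R$ entry by entry with exactly this counting argument and confirm that it matches the matrix product:

# rebuild R by counting users directly
N, M = U.shape                        # N users, M books
R_check = np.zeros((M, M), dtype=int)
for i in range(M):
    for j in range(M):
        # U[k, i] * U[k, j] is 1 exactly when user k has read both i and j
        R_check[i, j] = sum(U[k, i] * U[k, j] for k in range(N))

assert (R_check == R).all()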

From the matrix $R$, we can pretty easily come up with our first recommendation algorithm. To recommend the next book to a user who just finished book $i$, we look at the $i$-th row (or column) of $R$ and pick the books with the largest counts, skipping the diagonal entry $R_{ii}$ so that we don’t recommend the book the user just read.
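Here’s a minimal sketch of that lookup, assuming the `U` and `R` computed above (the `recommend` helper is my own name, not a library function):

def recommend(R, i, k=2):
    """Return up to k books for a user who just finished book i."""
    counts = R[i].copy()
    counts[i] = -1                      # never recommend the book itself
    top = np.argsort(counts)[::-1][:k]  # indices of the largest counts
    return [b for b in top if counts[b] > 0]

recommend(R, 1)  # -> [4, 3]: books 3 and 4 tie at 3 co-reads each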

It turns out we don’t even need to compute the full matrix $R$ to make this kind of recommendation. Start by letting $\vec{q}$ be the $M\times 1$ vector with $0$ in every component except the $i$-th component, where it has a $1$. Note that $R \vec{q}$ is simply the $i$-th column of $R$, which is exactly what we need to search to find the books to recommend. We can break the computation of this quantity into two steps, since $$R \vec{q} = (U^T U) \vec{q} = U^T(U\vec{q}).$$ So we first compute $U\vec{q}$ and then multiply the result by $U^T$. This ordering means we never have to form the $M \times M$ matrix $R$ at all: the intermediate vector $U\vec{q}$ flags the users who have read book $i$, and multiplying by $U^T$ adds up the reading histories of exactly those users.

Here’s how we implement the two-stage calculation of co-occurrences for our example problem:

# build the query vector -- the user has just read book 1
q = np.array([0, 1, 0, 0, 0]).reshape(-1, 1)

# Compute Rq = U^T U q in two steps
y = U.dot(q)
r = U.transpose().dot(y)

# r:
# 0 [2]
# 1 [5]
# 2 [2]
# 3 [3]
# 4 [3]

Books 1 and 3 were read together by 3 users, and so were books 1 and 4; these are the largest counts apart from book 1 itself, so the model recommends the user read books 3 and 4 next.
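Picking those recommendations out of `r` programmatically is just a sort (a sketch in the same spirit as before; I mask out the query book so it can’t recommend itself):

scores = r.ravel().copy()
scores[1] = -1                    # exclude the book the user just read
np.argsort(scores)[::-1][:2]      # -> array([4, 3]): books 3 and 4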

In the next post, we’ll show how to scale this calculation so that we can apply it to the actual Goodreads data set and generate real book recommendations. While there are certainly some nice things about this simple co-occurrence model, we’ll also see that a major drawback is that it tends to recommend very popular books (e.g. things everyone was forced to read in high school). I call this the Gatsby Problem and we’ll explore possible solutions to it throughout the series.