Recommendation Systems: A Co-occurrence Recommender

In the previous post of the series, we developed a co-occurence model for book recommendations. The model is similar to Amazon’s highly successful “Customers who bought…” feature. Now it’s time to apply this simple model to some real data to make recommendations.

Preprocessing

As usual, we’ll work with the Goodreads dataset. I described the structure of this dataset here. Our goal in this post is to build out the matrix structure for supporting co-occurence calculations. There are a few steps to preprocessing this data:

  • The interactions.csv dataset uses an internal, sequential ID for books and users. The book_id_map.csv and user_id_map.csv provide a mapping from the interal id to the global book and user ids used by good reads. We need to translate all the interactions to the global IDs so we can associate books and users correctly.
  • Some users have read only 1 book and some books have been read only once. In the former case, we can’t make co-occurence recommendations and in the later we should be skeptical given the lack of data. We’ll filter these low frequency, “noisy” records out. This will make our recommendations stronger and also significantly reduce the size of the problem.
  • Book and author info is held in separate files (as though they were normalized for storage in a database). We need to denormalize this data so that we have book and author metadata together in a convenient representation.
  • Finally, because we are going to build a user-book matrix structure from this data, we will also need to have a sequential index to track the positions of users and books in the rows and columns of the matrix.

The result of all this preprocessing is that we will create a set of two lists books and interactions. Interactions is a list of lists where the list at index i is the list of indexes of the books read by user i. Books is a list of dicts where the dict at index j is the metadata for book j. The data in JSON representation looks something like this:

{
  "books": [
    {
      "id": 7327624,
      "index": 0,
      "title": "The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",
      "authors": [
        "Barbara Hambly"
      ]
    },
    {
      "id": 6066819,
      "index": 1,
      "title": "Best Friends Forever",
      "authors": [
        "Jennifer Weiner"
      ]
    },
    {
      "id": 287141,
      "index": 2,
      "title": "The Aeneid for Boys and Girls",
      "authors": [
        "Alfred J. Church"
      ]
    },
    {
      "id": 6066812,
      "index": 3,
      "title": "All's Fairy in Love and War (Avalon: Web of Magic, #8)",
      "authors": [
        "Rachel Roberts"
      ]
    }
  ],
  "interactions": [
    [
      1446751,
      708450,
      1436475,
      584799
    ],
    [
      1082384,
      211143,
      1352904,
      625894,
      822344
    ],
    [
      65446,
      371683,
      1062261,
      787840
    ],
    [
      1388662,
      1345813,
      1525439
    ]
  ]
}

Here is the complete preprocessing script to generate the goodreads data in this format:

Matrix Representation

Now we develop the matrix representation we described in the last section. The matrix $U$ tracks users along its rows and books across its columns. If user $i$ has read book $j$ then $u_{ij}=1$. Otherwise, $u_{ij}=0$.

There are about 1.5M books tracked in our data after preprocessing. Any user has read only a tiny portion of the total number of books. So nearly all entries of $U$ will be $0$. So to save space and reduce computation time, we should use a sparse matrix representation.

In scipy, it easiest to construct a matrix in COO format. For calculations, we’ll want CSR format so we cast the result to CSR which can be done very efficiently.

Making Recommendations

Now that we have the matrix $U$, we can begin to make our first recommendations! The remaining pieces we need are:

  • A way to map a book into its matrix index
  • A way to create a query vector $q$ from the index(es) of a book(s)
  • A scoring function that computes $U^TUq$ and sorts the results

First let’s load the data from disk:

import json
from scipy.sparse import load_npz

U = load_npz("goodreads/U.npz")

with open("goodreads/interactions.json") as f:
    books = json.load(f)["books"]

Now let’s create a very crude book to index lookup function:

def book_index_lookup(title, author=None):
    '''
    Lookup a book index
    title: str : title of the book must contain this string
    author: str: optional require first author of the book to contain this string
    '''
    indexes = []
    for book in books:
        if title.lower() in book["title"].lower():
            if author:
                if author.lower() in book["authors"][0].lower(): # only check first author
                    indexes.append(book["index"])
                    print(book)
            else:
                indexes.append(book["index"])
                print(book)
    return indexes

We can use this function to find the indexes of our favorite books, for example: book_index_lookup("sheltering sky", "bowles"). Notice that many books (especially very popular ones) have a number of editions! E.g.

{'id': 1104411, 'index': 74658, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 10433296, 'index': 123405, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 1104365, 'index': 158924, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 13621757, 'index': 393809, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 1104412, 'index': 447614, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 11963271, 'index': 500628, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 1104363, 'index': 541122, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles', 'Michael Hofmann']} {'id': 16077498, 'index': 630073, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles', 'Jennifer Connelly']} {'id': 1334454, 'index': 715822, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 12048, 'index': 829340, 'title': "The Sheltering Sky, Let it Come Down, The Spider's House", 'authors': ['Paul Bowles', 'Daniel Halpern']} {'id': 760472, 'index': 981161, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 243598, 'index': 1072439, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 12053, 'index': 1214210, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 904647, 'index': 1509922, 'title': 'The Sheltering Sky', 'authors': ['Paul Bowles']} {'id': 20707302, 'index': 1519437, 'title': 'Sheltering Sky', 'authors': ['Paul Bowles']}

Collapsing these editions into a canonical representation is a problem we’ll need to address later.

Once we have the indexes of our favorite books, we need to create our query vector. This vector representation should be sparse so that it is compatible with multiplication by $U$.

import numpy as np
from scipy.sparse import coo_matrix
    
def build_query_vector(indexes):
    # build the query vector
    data, I, J = [], [], []
    for idx in indexes:
        if idx:
            I.append(idx)
            J.append(0)
            data.append(1)
    q = coo_matrix((data, (I, J)), shape=(U.shape[-1], 1), dtype=np.float64).tocsr()
    return q

Finally, we will score our recommendations for the query vector $q$. The score for book $i$ is the $i$-th entry of the product $U^TUq$. So we simply need to sort this sparse product, keeping track of the indices.

def score(q, number=10):
    y = U.transpose().dot(U.dot(q))
    recs = [{
            "book": books[i],
            "score": float(score),
         } for i, score in zip(y.indices, y.data)
    ]
    recs.sort(key=lambda x: x["score"], reverse=True)
    return recs[0:number]

Now we’re ready to look at some recommendations!

import json

def recommend(title, author=None):
    indexes = book_index_lookup(title, author)
    q = build_query_vector(indexes)
    for hit in score(q,5):
        print(json.dumps(hit, indent=2))

# classic post-modern literature
recommend("sheltering sky", "bowles")
# {
#   "author": "Paul Bowles",
#   "title": "The Sheltering Sky",
#   "score": 1369.0
# }
# {
#   "author": "F. Scott Fitzgerald",
#   "title": "The Great Gatsby",
#   "score": 1064.0
# }
# {
#   "author": "J.D. Salinger",
#   "title": "The Catcher in the Rye",
#   "score": 1003.0
# }
# {
#   "author": "Harper Lee",
#   "title": "To Kill a Mockingbird",
#   "score": 929.0
# }
# {
#   "author": "George Orwell, George Orwell, Erich Fromm",
#   "title": "1984",
#   "score": 881.0
# }

# classic literature
recommend("the brothers karamazov")
# {
#   "author": "Fyodor Dostoyevsky, Richard Pevear, Larissa Volokhonsky",
#   "title": "The Brothers Karamazov",
#   "score": 14149.0
# }
# {
#   "author": "George Orwell, George Orwell, Erich Fromm",
#   "title": "1984",
#   "score": 8822.0
# }
# {
#   "author": "F. Scott Fitzgerald",
#   "title": "The Great Gatsby",
#   "score": 8498.0
# }
# {
#   "author": "J.D. Salinger",
#   "title": "The Catcher in the Rye",
#   "score": 7921.0
# }
# {
#   "author": "Harper Lee",
#   "title": "To Kill a Mockingbird",
#   "score": 7811.0
# }


# a more technical math book
recommend("counterexamples in analysis")
# {
#   "author": "Bernard R. Gelbaum, John M.H. Olmsted",
#   "title": "Counterexamples in Analysis",
#   "score": 4.0
# }
# {
#   "author": "John Green",
#   "title": "The Fault in Our Stars",
#   "score": 2.0
# }
# {
#   "author": "Michael Spivak",
#   "title": "Calculus",
#   "score": 2.0
# }
# {
#   "author": "Eliezer Yudkowsky",
#   "title": "Harry Potter and the Methods of Rationality",
#   "score": 2.0
# }
# {
#   "author": "Richard Feynman",
#   "title": "Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character",
#   "score": 2.0
# }

# children's book
recommend("goodnight moon")
# {
#   "author": "Margaret Wise Brown, Clement Hurd",
#   "title": "Goodnight Moon",
#   "score": 16117.0
# }
# {
#   "author": "J.K. Rowling, Mary GrandPre",
#   "title": "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",
#   "score": 11706.0
# }
# {
#   "author": "Harper Lee",
#   "title": "To Kill a Mockingbird",
#   "score": 10875.0
# }
# {
#   "author": "Maurice Sendak",
#   "title": "Where the Wild Things Are",
#   "score": 10485.0
# }
# {
#   "author": "Suzanne Collins",
#   "title": "The Hunger Games (The Hunger Games, #1)",
#   "score": 10481.0
# }

# 2015 best seller
recommend("girl on the train")
# {
#   "author": "Paula Hawkins",
#   "title": "The Girl on the Train",
#   "score": 85683.0
# }
# {
#   "author": "Suzanne Collins",
#   "title": "The Hunger Games (The Hunger Games, #1)",
#   "score": 52960.0
# }
# {
#   "author": "J.K. Rowling, Mary GrandPre",
#   "title": "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",
#   "score": 46939.0
# }
# {
#   "author": "John Green",
#   "title": "The Fault in Our Stars",
#   "score": 41521.0
# }
# {
#   "author": "Harper Lee",
#   "title": "To Kill a Mockingbird",
#   "score": 40854.0
# }

# a series
recommend("a letter of mary")
# {
#   "author": "Laurie R. King",
#   "title": "A Letter of Mary (Mary Russell, #3)",
#   "score": 1403.0
# }
# {
#   "author": "Laurie R. King",
#   "title": "The Beekeeper's Apprentice (Mary Russell, #1)",
#   "score": 1307.0
# }
# {
#   "author": "Laurie R. King",
#   "title": "A Monstrous Regiment of Women (Mary Russell, #2)",
#   "score": 1302.0
# }
# {
#   "author": "Laurie R. King",
#   "title": "The Moor (Mary Russell, #4)",
#   "score": 1066.0
# }
# {
#   "author": "Laurie R. King",
#   "title": "O Jerusalem (Mary Russell, #5)",
#   "score": 996.0
# }

# obscure biography of the semi-obscure novelist J.L. Carr
recommend("the last englishman", "byron rogers")
# {
#   "author": "Byron Rogers",
#   "title": "The Last Englishman: The Life of J. L. Carr",
#   "score": 6.0
# }
# {
#   "author": "Shirley Jackson, Jonathan Lethem",
#   "title": "We Have Always Lived in the Castle",
#   "score": 3.0
# }
# {
#   "author": "J.L. Carr, Michael Holroyd",
#   "title": "A Month in the Country",
#   "score": 3.0
# }
# {
#   "author": "J.L. Carr",
#   "title": "The Harpole Report",
#   "score": 3.0
# }
# {
#   "author": "Sjon, Victoria Cribb",
#   "title": "From the Mouth of the Whale",
#   "score": 3.0
# }

# classic philosophy text
recommend("the world as will and representation")
# {
#   "author": "Arthur Schopenhauer, E.F.J. Payne",
#   "title": "The World as Will and Representation, Vol 1",
#   "score": 687.0
# }
# {
#   "author": "George Orwell, George Orwell, Erich Fromm",
#   "title": "1984",
#   "score": 393.0
# }
# {
#   "author": "Arthur Schopenhauer, E.F.J. Payne",
#   "title": "The World as Will and Representation, Vol 2",
#   "score": 387.0
# }
# {
#   "author": "Friedrich Nietzsche, Walter Kaufmann",
#   "title": "Thus Spoke Zarathustra",
#   "score": 357.0
# }
# {
#   "author": "George Orwell",
#   "title": "Animal Farm",
#   "score": 338.0
# }

What can we say about these recommendations? At least a few things should stand out:

  • The input book itself appears at the top of the list. This is expected based on the model. The score is the total number of times the book has been read by users in the data.
  • The recommendations presented seem to draw us into the right genre; if the input is philosophy we get some philosophy, if the input is mathematics we get mathematics, etc.
  • The recommendations skew strongly towards popular books. This is not surprising. If you’ve read The Sheltering Sky, you’ve probably also read the Great Gatsby. But that doesn’t mean The Great Gatsby has all that much in common with The Sheltering Sky.

In short, these recommendations appear to be on the right track but we need to deal with The Gatsby Problem where very popular books tend to dominate results. We’ll look at this issue in the next post in the series. Of course evaluating results by inspecting a few hand-picked examples can only get us so far which is a problem we’ll pick up much later on in the series. There are also of course many optimizations we might like to make to these algorithms to use less memory, handle more results, or deliver results faster. We’ll look at some of these issues later as well.

All that said, we have our first recommendation system. Here’s a simplified script that ties the co-occurence based recommendation engine together: