---
title: Collaborative Filtering with Slope One Predictors
author: Chris Hodapp
date: January 30, 2018
tags: technobabble, machine learning
---

# Needs a brief intro
# Needs a summary at the end

Suppose you have a large number of users, and a large number of movies. Users have watched movies, and they've provided ratings for some of them (perhaps just simple numerical ratings, 1 to 10 stars). However, they've all watched different movies, and for any given user, it's only a tiny fraction of the total movies. Now, you want to predict how some user will rate some movie they haven't rated, based on what they (and other users) have rated.

That's a common problem, especially when generalized from 'movies' to anything else, and one with many approaches. (To put some technical terms to it, this is the [[https://en.wikipedia.org/wiki/Collaborative_filtering][collaborative filtering]] approach to [[https://en.wikipedia.org/wiki/Recommender_system][recommender systems]]. [[http://www.mmds.org/][Mining of Massive Datasets]] is an excellent free text in which to read more in depth on this, particularly chapter 9.)

Slope One Predictors are one such approach to collaborative filtering, described in the paper [[https://arxiv.org/pdf/cs/0702144v1.pdf][Slope One Predictors for Online Rating-Based Collaborative Filtering]]. Despite the complex-sounding name, they are wonderfully simple to understand and implement, and very fast. I'll give a contrived example below to explain them.

Consider a user Bob. Bob is enthusiastic, but has rather simple tastes: he mostly just watches Clint Eastwood movies. In fact, he's watched and rated nearly all of them, and basically nothing else. Now, suppose we want to predict how much Bob will like something completely different and unheard of (to him at least), like... I don't know... /Citizen Kane/.

Here's Slope One in a nutshell:

1. First, find the users who rated both /Citizen Kane/ *and* any of the Clint Eastwood movies that Bob rated.
2. Now, for each movie that comes up above, compute a *deviation* which tells us: On average, how differently (i.e. how much higher or lower) did users rate /Citizen Kane/ compared to this movie? (For instance, we'll have a number for how /Citizen Kane/ was rated compared to /Dirty Harry/, and perhaps it's +0.6 - meaning that on average, users who rated both movies rated /Citizen Kane/ about 0.6 stars above /Dirty Harry/. We'd have another deviation for /Citizen Kane/ compared to /Gran Torino/, another for /Citizen Kane/ compared to /The Good, the Bad and the Ugly/, and so on - for every movie that Bob rated, provided that other users who rated /Citizen Kane/ also rated the movie.)
3. If that deviation between /Citizen Kane/ and /Dirty Harry/ was +0.6, it's reasonable that adding 0.6 to Bob's rating of /Dirty Harry/ would give one prediction of how Bob might rate /Citizen Kane/. We can then generate more predictions based on the ratings he gave the other movies - anything for which we could compute a deviation.
4. To turn this into a single prediction, we could just average all those predictions together.

One variant, Weighted Slope One, is nearly identical. The only difference is in how we average those predictions in step #4. In Slope One, every deviation counts equally, no matter how many users' rating differences were averaged together to produce it. In Weighted Slope One, deviations that came from larger numbers of users count for more (because, presumably, they are better estimates).

Or, in other words: If only one person rated both /Citizen Kane/ and the lesser-known Eastwood classic /Revenge of the Creature/, and they happened to think that /Revenge of the Creature/ deserved 3 more stars, then with Slope One, this deviation of -3 would carry exactly as much weight as thousands of people rating /Citizen Kane/ as about 0.5 stars below /The Good, the Bad and the Ugly/.
In Weighted Slope One, that latter deviation would count for thousands of times as much. The example makes it sound a bit more drastic than it is.

The Python library [[http://surpriselib.com/][Surprise]] (a [[https://www.scipy.org/scikits.html][scikit]]) has an implementation of this algorithm, and the Benchmarks section of that page shows its performance compared to some other methods.

/TODO/: Show a simple Python implementation of this (Jupyter notebook?)

* Linear Algebra Tricks

Those who aren't familiar with matrix methods or linear algebra can probably skip this section. Everything I've described above, you can compute given just some data to work with ([[https://grouplens.org/datasets/movielens/100k/][movielens 100k]], perhaps?) and some basic arithmetic. You don't need any complicated numerical methods. However, the entire Slope One method can be implemented in a very fast and simple way with a couple of matrix operations.

First, we need to have our data encoded as a *utility matrix*. In a utility matrix, each row represents one user, each column represents one item (a movie, in our case), and each element represents a user's rating of an item. If we have $n$ users and $m$ movies, then this is an $n \times m$ matrix $U$ for which $U_{k,i}$ is user $k$'s rating for movie $i$ - assuming we've numbered our users and our movies.

Users have typically rated only a fraction of movies, and so most of the elements of this matrix are unknown. We can represent this with another $n \times m$ matrix (specifically a binary matrix), a 'mask' $M$ in which $M_{k,i}$ is 1 if user $k$ supplied a rating for movie $i$, and otherwise 0.

I mentioned *deviation* above and gave an informal definition of it.
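That informal recipe - deviations from co-rated pairs, then an average of adjusted ratings - can be sketched in plain Python. (This is a toy sketch of my own, not the Surprise implementation; the nested-dict data layout and the function names are my own invention.)

#+BEGIN_SRC python
def deviations(ratings, target):
    """For each movie co-rated with `target`, return (deviation, count):
    the average of rating(target) - rating(movie) over the users who
    rated both, plus how many such users there were."""
    devs = {}
    for user_ratings in ratings.values():
        if target not in user_ratings:
            continue
        for movie, r in user_ratings.items():
            if movie == target:
                continue
            total, n = devs.get(movie, (0.0, 0))
            devs[movie] = (total + (user_ratings[target] - r), n + 1)
    return {m: (total / n, n) for m, (total, n) in devs.items()}

def slope_one_predict(ratings, user, target, weighted=False):
    """Predict `user`'s rating of `target` from everyone's ratings."""
    devs = deviations(ratings, target)
    num = den = 0.0
    for movie, r in ratings[user].items():
        if movie not in devs:
            continue
        dev, n = devs[movie]
        w = n if weighted else 1  # Weighted Slope One: weight by co-rater count
        num += (r + dev) * w      # one per-movie prediction: rating + deviation
        den += w
    return num / den if den else None

# Toy data: {user: {movie: stars}}
ratings = {
    "alice": {"Citizen Kane": 9, "Dirty Harry": 8, "Gran Torino": 7},
    "bill":  {"Citizen Kane": 8, "Dirty Harry": 8},
    "bob":   {"Dirty Harry": 9, "Gran Torino": 8},
}
print(slope_one_predict(ratings, "bob", "Citizen Kane"))  # → 9.75
#+END_SRC

Here /Citizen Kane/ is rated 0.5 above /Dirty Harry/ (two co-raters) and 2 above /Gran Torino/ (one co-rater), so Bob's two per-movie predictions are 9.5 and 10, averaging to 9.75; the weighted variant instead gives (9.5·2 + 10·1)/3.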
The paper gives a formal but rather terse definition of the average deviation of item $i$ with respect to item $j$:

$$\textrm{dev}_{j,i} = \sum_{u \in S_{j,i}(\chi)} \frac{u_j - u_i}{card(S_{j,i}(\chi))}$$

where:

- $u_j$ and $u_i$ mean: user $u$'s ratings for movies $j$ and $i$, respectively
- $u \in S_{j,i}(\chi)$ means: all users $u$ who, in the dataset we're training on, provided a rating for both movie $i$ and movie $j$
- $card$ is the cardinality of that set, i.e. for ${card(S_{j,i}(\chi))}$ it is just how many users rated both $i$ and $j$.

That denominator does depend on $i$ and $j$, but doesn't depend on the summation variable $u$, so it can be pulled out; we can also split up the summation as long as it is kept over the same set:

$$\textrm{dev}_{j,i} = \frac{1}{card(S_{j,i}(\chi))} \sum_{u \in S_{j,i}(\chi)} (u_j - u_i) = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i\right)$$

# TODO: These need some actual matrices to illustrate

Let's start with computing ${card(S_{j,i}(\chi))}$, the number of users who rated both movie $i$ and movie $j$. Consider column $i$ of the mask $M$. Each value in this column equals 1 if the respective user rated movie $i$, or 0 if they did not. Clearly, simply summing up column $i$ would tell us how many users rated movie $i$, and the same applies to column $j$ for movie $j$.

Now, suppose we take the element-wise logical AND of columns $i$ and $j$. The resultant column has a 1 only where both corresponding elements were 1 - where a user rated both $i$ and $j$. If we sum up this column, we have exactly the number we need: the number of users who rated both $i$ and $j$.

Some might notice that, on a binary matrix, "elementwise logical AND" is just "elementwise multiplication", thus "sum of elementwise logical AND" is just "sum of elementwise multiplication", which is: dot product. That is, ${card(S_{j,i}(\chi))}=M_j \bullet M_i$ if we use $M_i$ and $M_j$ for columns $i$ and $j$ of $M$.
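As a quick sanity check with numpy (the mask here is made up): the dot product of two mask columns really does count the users who rated both movies.

#+BEGIN_SRC python
import numpy as np

# Mask for 4 users x 3 movies: 1 where a rating exists.
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 1, 1]])

i, j = 0, 1
# Only users 0 and 2 rated both movie 0 and movie 1:
print(M[:, j] @ M[:, i])  # → 2
#+END_SRC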
However, we'd like to compute deviation as a matrix for all $i$ and $j$, so we'll likewise need ${card(S_{j,i}(\chi))}$ for every single combination of $i$ and $j$ - that is, we need a dot product between every single pair of columns from $M$.

Incidentally, "dot product of every pair of columns" happens to be almost exactly matrix multiplication; note that for matrices $A$ and $B$, element $(x,y)$ of the matrix product $AB$ is just the dot product of /row/ $x$ of $A$ and /column/ $y$ of $B$ - and the matrix product as a whole has this dot product between every row of $A$ and every column of $B$. We wanted the dot product of every column of $M$ with every column of $M$, which is easy: just transpose $M$ for one operand. Then, we can compute our count matrix like this:

$$C=M^\top M$$

Thus $C_{i,j}$ is the dot product of column $i$ of $M$ and column $j$ of $M$ - or, the number of users who rated both movies $i$ and $j$.

That was the first half of what we needed for $\textrm{dev}_{j,i}$. We still need the other half:

$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$

We can apply a similar trick here. Consider first what $\sum_{u \in S_{j,i}(\chi)} u_j$ means: it is the sum of only those ratings of movie $j$ that were done by a user who also rated movie $i$. Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those ratings of movie $i$ that were done by a user who also rated movie $j$. (Note the symmetry: it's over the same set of users, because it's always the users who rated both $i$ and $j$.)

# TODO: Finish that section (mostly translate from code notes)

* Implementation

#+BEGIN_SRC python
print("foo")
#+END_SRC
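Putting the pieces above together, here is one way the matrix formulation might look in numpy. This is my own sketch, not code from the paper or from Surprise; it assumes the convention that a 0 entry in the utility matrix means 'unrated', and it uses the masking trick for the "other half" as well: multiplying ratings of $j$ by mask column $i$ zeroes out exactly the non-co-raters, so $\sum_{u \in S_{j,i}(\chi)} u_j$ is element $(j,i)$ of $U^\top M$.

#+BEGIN_SRC python
import numpy as np

def weighted_slope_one(U):
    """Weighted Slope One over a utility matrix U (users x movies),
    where 0 means 'unrated'. Returns a full matrix of predictions;
    entries the user already rated are kept as their actual rating."""
    U = np.asarray(U, dtype=float)
    M = (U > 0).astype(float)   # mask: 1 where a rating exists
    C = M.T @ M                 # C[i, j]: how many users rated both i and j
    D = U.T @ M - M.T @ U       # D[j, i]: sum over co-raters of (u_j - u_i)
    # Weighted prediction for user u and target j:
    #   P[u, j] = sum_i (dev[j, i] + U[u, i]) * C[j, i] / sum_i C[j, i]
    # over movies i that u rated. Since dev[j, i] * C[j, i] is just
    # D[j, i], the deviations never need to be divided out explicitly:
    num = M @ D.T + U @ C       # C is symmetric, so C.T == C
    den = M @ C
    with np.errstate(invalid="ignore", divide="ignore"):
        P = np.where(den > 0, num / den, 0.0)
    return np.where(M > 0, U, P)   # keep known ratings as-is

# Same toy data as before: rows are alice, bill, bob; columns are
# Citizen Kane, Dirty Harry, Gran Torino; 0 = unrated.
U = np.array([[9, 8, 7],
              [8, 8, 0],
              [0, 9, 8]])
print(weighted_slope_one(U))
#+END_SRC

On this toy matrix, Bob's predicted rating for /Citizen Kane/ comes out to (9.5·2 + 10·1)/3 ≈ 9.67, matching the hand calculation: deviations of +0.5 (two co-raters) and +2 (one co-rater) over his ratings of 9 and 8.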