---
title: Collaborative Filtering with Slope One Predictors
author: Chris Hodapp
date: January 30, 2018
tags: technobabble, machine learning
---

# Needs a brief intro
# Needs a summary at the end

Suppose you have a large number of users, and a large number of movies. Users have watched movies, and they've provided ratings for some of them (perhaps just simple numerical ratings, 1 to 10 stars). However, they've all watched different movies, and for any given user, it's only a tiny fraction of the total movies. Now, you want to predict how some user will rate some movie they haven't rated, based on what they (and other users) have rated.

That's a common problem, especially when generalized from 'movies' to anything else, and one with many approaches. (To put some technical terms to it, this is the [[https://en.wikipedia.org/wiki/Collaborative_filtering][collaborative filtering]] approach to [[https://en.wikipedia.org/wiki/Recommender_system][recommender systems]]. [[http://www.mmds.org/][Mining of Massive Datasets]] is an excellent free text in which to read more in depth on this, particularly chapter 9.)

Slope One Predictors are one such approach to collaborative filtering, described in the paper [[https://arxiv.org/pdf/cs/0702144v1.pdf][Slope One Predictors for Online Rating-Based Collaborative Filtering]]. Despite the complex-sounding name, they are wonderfully simple to understand and implement, and very fast. I'll give a contrived example below to explain them.

Consider a user Bob. Bob is enthusiastic, but has rather simple tastes: he mostly just watches Clint Eastwood movies. In fact, he's watched and rated nearly all of them, and basically nothing else. Now, suppose we want to predict how much Bob will like something completely different and unheard of (to him at least), like... I don't know... /Citizen Kane/.

Here's Slope One in a nutshell:

1. First, find the users who rated both /Citizen Kane/ *and* any of the Clint Eastwood movies that Bob rated.
2. Now, for each movie that comes up above, compute a *deviation* which tells us: On average, how differently (i.e. how much higher or lower) did users rate /Citizen Kane/ compared to this movie? (For instance, we'll have a number for how /Citizen Kane/ was rated compared to /Dirty Harry/, and perhaps it's +0.6 - meaning that on average, users who rated both movies rated /Citizen Kane/ about 0.6 stars above /Dirty Harry/. We'd have another deviation for /Citizen Kane/ compared to /Gran Torino/, another for /Citizen Kane/ compared to /The Good, the Bad and the Ugly/, and so on - for every movie that Bob rated, provided that other users who rated /Citizen Kane/ also rated the movie.)
3. If that deviation between /Citizen Kane/ and /Dirty Harry/ was +0.6, it's reasonable that adding 0.6 to Bob's rating of /Dirty Harry/ would give one prediction of how Bob might rate /Citizen Kane/. We can then generate more predictions based on the ratings he gave the other movies - anything for which we could compute a deviation.
4. To turn this into a single prediction, we could just average all those predictions together.

One variant, Weighted Slope One, is nearly identical. The only difference is in how we average those predictions in step #4. In Slope One, every deviation counts equally, no matter how many users' rating differences were averaged together to produce it. In Weighted Slope One, deviations that came from larger numbers of users count for more (because, presumably, they are better estimates).

Or, in other words: If only one person rated both /Citizen Kane/ and the lesser-known Eastwood classic /Revenge of the Creature/, and they happened to think that /Revenge of the Creature/ deserved 3 more stars, then with Slope One, this deviation of -3 would carry exactly as much weight as thousands of people rating /Citizen Kane/ as about 0.5 stars below /The Good, the Bad and the Ugly/.
In Weighted Slope One, that latter deviation would count for thousands of times as much. The example makes it sound a bit more drastic than it is.

The Python library [[http://surpriselib.com/][Surprise]] (a [[https://www.scipy.org/scikits.html][scikit]]) has an implementation of this algorithm, and the Benchmarks section of that page shows its performance compared to some other methods.

/TODO/: Show a simple Python implementation of this (Jupyter notebook?)

* Linear Algebra Tricks

Those who aren't familiar with matrix methods or linear algebra can probably skip this section. Everything I've described above, you can compute given just some data to work with ([[https://grouplens.org/datasets/movielens/100k/][movielens 100k]], perhaps?) and some basic arithmetic. You don't need any complicated numerical methods. However, the entire Slope One method can be implemented in a very fast and simple way with a couple of matrix operations.

First, we need to have our data encoded as a *utility matrix*. In a utility matrix, each row represents one user, each column represents one item (a movie, in our case), and each element represents a user's rating of an item. If we have $n$ users and $m$ movies, then this is an $n \times m$ matrix $U$ for which $U_{k,i}$ is user $k$'s rating for movie $i$ - assuming we've numbered our users and our movies.

Users have typically rated only a fraction of movies, and so most of the elements of this matrix are unknown. We can represent this with another $n \times m$ matrix (specifically a binary matrix), a 'mask' $M$ in which $M_{k,i}$ is 1 if user $k$ supplied a rating for movie $i$, and otherwise 0.

I mentioned *deviation* above and gave an informal definition of it.
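That informal recipe - deviations from co-rated pairs, then an average of adjusted ratings - can be sketched in plain Python. (This is a toy sketch of my own, not the Surprise implementation; the nested-dict data layout and the function names are my own invention.)

#+BEGIN_SRC python
def deviations(ratings, target):
    """For each movie co-rated with `target`, return (deviation, count):
    the average of rating(target) - rating(movie) over the users who
    rated both, plus how many such users there were."""
    devs = {}
    for user_ratings in ratings.values():
        if target not in user_ratings:
            continue
        for movie, r in user_ratings.items():
            if movie == target:
                continue
            total, n = devs.get(movie, (0.0, 0))
            devs[movie] = (total + (user_ratings[target] - r), n + 1)
    return {m: (total / n, n) for m, (total, n) in devs.items()}

def slope_one_predict(ratings, user, target, weighted=False):
    """Predict `user`'s rating of `target` from everyone's ratings."""
    devs = deviations(ratings, target)
    num = den = 0.0
    for movie, r in ratings[user].items():
        if movie not in devs:
            continue
        dev, n = devs[movie]
        w = n if weighted else 1  # Weighted Slope One: weight by co-rater count
        num += (r + dev) * w      # one per-movie prediction: rating + deviation
        den += w
    return num / den if den else None

# Toy data: {user: {movie: stars}}
ratings = {
    "alice": {"Citizen Kane": 9, "Dirty Harry": 8, "Gran Torino": 7},
    "bill":  {"Citizen Kane": 8, "Dirty Harry": 8},
    "bob":   {"Dirty Harry": 9, "Gran Torino": 8},
}
print(slope_one_predict(ratings, "bob", "Citizen Kane"))  # → 9.75
#+END_SRC

Here /Citizen Kane/ is rated 0.5 above /Dirty Harry/ (two co-raters) and 2 above /Gran Torino/ (one co-rater), so Bob's two per-movie predictions are 9.5 and 10, averaging to 9.75; the weighted variant instead gives (9.5·2 + 10·1)/3.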
The paper gives a formal but rather terse definition of the average deviation of item $i$ with respect to item $j$:

$$\textrm{dev}_{j,i} = \sum_{u \in S_{j,i}(\chi)} \frac{u_j - u_i}{card(S_{j,i}(\chi))}$$

where:

- $u_j$ and $u_i$ mean: user $u$'s ratings for movies $j$ and $i$, respectively
- $u \in S_{j,i}(\chi)$ means: all users $u$ who, in the dataset we're training on, provided a rating for both movie $i$ and movie $j$
- $card$ is the cardinality of that set, i.e. for ${card(S_{j,i}(\chi))}$ it is just how many users rated both $i$ and $j$.

That denominator does depend on $i$ and $j$, but doesn't depend on the summation variable $u$, so it can be pulled out; we can also split up the summation as long as it is kept over the same set:

$$\textrm{dev}_{j,i} = \frac{1}{card(S_{j,i}(\chi))} \sum_{u \in S_{j,i}(\chi)} (u_j - u_i) = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i\right)$$

# TODO: These need some actual matrices to illustrate

Let's start with computing ${card(S_{j,i}(\chi))}$, the number of users who rated both movie $i$ and movie $j$. Consider column $i$ of the mask $M$. Each value in this column equals 1 if the respective user rated movie $i$, or 0 if they did not. Clearly, simply summing up column $i$ would tell us how many users rated movie $i$, and the same applies to column $j$ for movie $j$.

Now, suppose we take the element-wise logical AND of columns $i$ and $j$. The resultant column has a 1 only where both corresponding elements were 1 - where a user rated both $i$ and $j$. If we sum up this column, we have exactly the number we need: the number of users who rated both $i$ and $j$.

Some might notice that, on a binary matrix, "elementwise logical AND" is just "elementwise multiplication", thus "sum of elementwise logical AND" is just "sum of elementwise multiplication", which is: dot product. That is, ${card(S_{j,i}(\chi))}=M_j \bullet M_i$ if we use $M_i$ and $M_j$ for columns $i$ and $j$ of $M$.
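As a quick sanity check with numpy (the mask here is made up): the dot product of two mask columns really does count the users who rated both movies.

#+BEGIN_SRC python
import numpy as np

# Mask for 4 users x 3 movies: 1 where a rating exists.
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 1, 1]])

i, j = 0, 1
# Only users 0 and 2 rated both movie 0 and movie 1:
print(M[:, j] @ M[:, i])  # → 2
#+END_SRC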
However, we'd like to compute deviation as a matrix for all $i$ and $j$, so we'll likewise need ${card(S_{j,i}(\chi))}$ for every single combination of $i$ and $j$ - that is, we need a dot product between every single pair of columns from $M$.

Incidentally, "dot product of every pair of columns" happens to be almost exactly matrix multiplication; note that for matrices $A$ and $B$, element $(x,y)$ of the matrix product $AB$ is just the dot product of /row/ $x$ of $A$ and /column/ $y$ of $B$ - and the matrix product as a whole has this dot product between every row of $A$ and every column of $B$. We wanted the dot product of every column of $M$ with every column of $M$, which is easy: just transpose $M$ for one operand. Then, we can compute our count matrix like this:

$$C=M^\top M$$

Thus $C_{i,j}$ is the dot product of column $i$ of $M$ and column $j$ of $M$ - or, the number of users who rated both movies $i$ and $j$.

That was the first half of what we needed for $\textrm{dev}_{j,i}$. We still need the other half:

$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$

We can apply a similar trick here. Consider first what $\sum_{u \in S_{j,i}(\chi)} u_j$ means: it is the sum of only those ratings of movie $j$ that were done by a user who also rated movie $i$. Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those ratings of movie $i$ that were done by a user who also rated movie $j$. (Note the symmetry: it's over the same set of users, because it's always the users who rated both $i$ and $j$.)

# TODO: Finish that section (mostly translate from code notes)

* Implementation

#+BEGIN_SRC python
print("foo")
#+END_SRC
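Putting the pieces above together, here is one way the matrix formulation might look in numpy. This is my own sketch, not code from the paper or from Surprise; it assumes the convention that a 0 entry in the utility matrix means 'unrated', and it uses the masking trick for the "other half" as well: multiplying ratings of $j$ by mask column $i$ zeroes out exactly the non-co-raters, so $\sum_{u \in S_{j,i}(\chi)} u_j$ is element $(j,i)$ of $U^\top M$.

#+BEGIN_SRC python
import numpy as np

def weighted_slope_one(U):
    """Weighted Slope One over a utility matrix U (users x movies),
    where 0 means 'unrated'. Returns a full matrix of predictions;
    entries the user already rated are kept as their actual rating."""
    U = np.asarray(U, dtype=float)
    M = (U > 0).astype(float)   # mask: 1 where a rating exists
    C = M.T @ M                 # C[i, j]: how many users rated both i and j
    D = U.T @ M - M.T @ U       # D[j, i]: sum over co-raters of (u_j - u_i)
    # Weighted prediction for user u and target j:
    #   P[u, j] = sum_i (dev[j, i] + U[u, i]) * C[j, i] / sum_i C[j, i]
    # over movies i that u rated. Since dev[j, i] * C[j, i] is just
    # D[j, i], the deviations never need to be divided out explicitly:
    num = M @ D.T + U @ C       # C is symmetric, so C.T == C
    den = M @ C
    with np.errstate(invalid="ignore", divide="ignore"):
        P = np.where(den > 0, num / den, 0.0)
    return np.where(M > 0, U, P)   # keep known ratings as-is

# Same toy data as before: rows are alice, bill, bob; columns are
# Citizen Kane, Dirty Harry, Gran Torino; 0 = unrated.
U = np.array([[9, 8, 7],
              [8, 8, 0],
              [0, 9, 8]])
print(weighted_slope_one(U))
#+END_SRC

On this toy matrix, Bob's predicted rating for /Citizen Kane/ comes out to (9.5·2 + 10·1)/3 ≈ 9.67, matching the hand calculation: deviations of +0.5 (two co-raters) and +2 (one co-rater) over his ratings of 9 and 8.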