---
title: Collaborative Filtering with Slope One Predictors
author: Chris Hodapp
date: January 30, 2018
tags: technobabble, machine learning
---

This post is a quick look at Slope One predictors, a collaborative
filtering technique that is easy to understand, easy to implement,
and fast. I'll set up the problem it solves, walk through the
algorithm with a contrived example, and then show how the whole
computation reduces to a couple of matrix operations.

Suppose you have a large number of users, and a large number of
movies. Users have watched movies, and they've provided ratings for
some of them (perhaps just simple numerical ratings, 1 to 10 stars).
However, they've all watched different movies, and for any given user,
it's only a tiny fraction of the total movies.

Now, you want to predict how some user will rate some movie they
haven't rated, based on what they (and other users) have rated.

That's a common problem, especially when generalized from 'movies' to
anything else, and one with many approaches. (To put some technical
terms to it, this is the [[https://en.wikipedia.org/wiki/Collaborative_filtering][collaborative filtering]] approach to
[[https://en.wikipedia.org/wiki/Recommender_system][recommender systems]]. [[http://www.mmds.org/][Mining of Massive Datasets]] is an excellent free
text that covers this in more depth, particularly in chapter 9.)

Slope One Predictors are one such approach to collaborative filtering,
described in the paper [[https://arxiv.org/pdf/cs/0702144v1.pdf][Slope One Predictors for Online Rating-Based
Collaborative Filtering]]. Despite the complex-sounding name, they are
wonderfully simple to understand and implement, and very fast.

I'll give a contrived example below to explain them.

Consider a user Bob. Bob is enthusiastic, but has rather simple
tastes: he mostly just watches Clint Eastwood movies. In fact, he's
watched and rated nearly all of them, and basically nothing else.

Now, suppose we want to predict how much Bob will like something
completely different and unheard of (to him at least), like... I don't
know... /Citizen Kane/.

Here's Slope One in a nutshell:

1. First, find the users who rated both /Citizen Kane/ *and* any of
   the Clint Eastwood movies that Bob rated.
2. Now, for each movie that comes up above, compute a *deviation*
   which tells us: On average, how differently (i.e. how much higher
   or lower) did users rate /Citizen Kane/ compared to this movie?
   (For instance, we'll have a number for how /Citizen Kane/ was
   rated compared to /Dirty Harry/, and perhaps it's +0.6 - meaning
   that on average, users who rated both movies rated /Citizen Kane/
   about 0.6 stars above /Dirty Harry/. We'd have another deviation
   for /Citizen Kane/ compared to /Gran Torino/, another for /Citizen
   Kane/ compared to /The Good, the Bad and the Ugly/, and so on -
   for every movie that Bob rated, provided that other users who
   rated /Citizen Kane/ also rated the movie.)
3. If that deviation between /Citizen Kane/ and /Dirty Harry/ was
   +0.6, it's reasonable that adding 0.6 to Bob's rating of /Dirty
   Harry/ would give one prediction of how Bob might rate /Citizen
   Kane/. We can then generate more predictions based on the ratings
   he gave the other movies - anything for which we could compute a
   deviation.
4. To turn this into a single prediction, we could just average all
   those predictions together.

One variant, Weighted Slope One, is nearly identical. The only
difference is in how we average those predictions in step #4. In
Slope One, every deviation counts equally, no matter how many users'
ratings were averaged together to produce it. In Weighted Slope One,
deviations that came from larger numbers of users count for more
(because, presumably, they are better estimates).

Or, in other words: If only one person rated both /Citizen Kane/ and
the lesser-known Eastwood classic /Revenge of the Creature/, and they
happened to think that /Revenge of the Creature/ deserved 3 more
stars, then with Slope One, this deviation of -3 would carry exactly
as much weight as thousands of people rating /Citizen Kane/ as about
0.5 stars below /The Good, the Bad and the Ugly/. In Weighted Slope
One, that latter deviation would count for thousands of times as
much. The example makes it sound a bit more drastic than it is.

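Written out (this is, up to notation, how the paper states it), with
$c_{j,i}$ counting the users who rated both movies $i$ and $j$, the
Weighted Slope One prediction of user $u$'s rating for movie $j$ is:

$$p(u)_j = \frac{\sum_{i} \left(\textrm{dev}_{j,i} + u_i\right) c_{j,i}}{\sum_{i} c_{j,i}}$$

where $i$ ranges over the movies that $u$ has rated, and
$\textrm{dev}_{j,i}$ is the deviation described above. Dropping the
$c_{j,i}$ weights recovers plain Slope One.
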
The Python library [[http://surpriselib.com/][Surprise]] (a [[https://www.scipy.org/scikits.html][scikit]]) has an implementation of this
algorithm, and the Benchmarks section of that page shows its
performance compared to some other methods.

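To make it concrete, here's a minimal sketch of Weighted Slope One in
plain Python. The function and variable names (and the toy ratings
below) are my own inventions for illustration, not from the paper or
from Surprise:

#+BEGIN_SRC python
def weighted_slope_one(ratings, user, target):
    """Predict `user`'s rating of movie `target`.

    `ratings` maps each user to a dict of {movie: rating}."""
    total, weight = 0.0, 0
    for movie, user_rating in ratings[user].items():
        # Rating differences (target - movie) from every user who
        # rated both movies; these average to the deviation.
        diffs = [r[target] - r[movie]
                 for r in ratings.values()
                 if target in r and movie in r]
        if not diffs:
            continue
        deviation = sum(diffs) / len(diffs)
        # One prediction per movie the user rated, weighted by how
        # many users the deviation came from.  (For plain Slope One,
        # drop the len(diffs) factors.)
        total += (user_rating + deviation) * len(diffs)
        weight += len(diffs)
    return total / weight if weight else None

ratings = {
    "bob":   {"Dirty Harry": 9, "Gran Torino": 8},
    "alice": {"Dirty Harry": 7, "Citizen Kane": 8},
    "carol": {"Gran Torino": 6, "Citizen Kane": 7},
}
print(weighted_slope_one(ratings, "bob", "Citizen Kane"))  # 9.5
#+END_SRC

(In this toy data each deviation comes from a single user, so the
weights change nothing; with real data they would.)
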
* Linear Algebra Tricks

Those who aren't familiar with matrix methods or algebra can probably
skip this section. Everything I've described above, you can compute
given just some data to work with ([[https://grouplens.org/datasets/movielens/100k/][movielens 100k]], perhaps?) and some
basic arithmetic. You don't need any complicated numerical methods.

However, the entire Slope One method can be implemented in a very fast
and simple way with a couple of matrix operations.

First, we need to have our data encoded as a *utility matrix*. In a
utility matrix, each row represents one user, each column represents
one item (a movie, in our case), and each element represents a user's
rating of an item. If we have $n$ users and $m$ movies, then this is
an $n \times m$ matrix $U$ for which $U_{k,i}$ is user $k$'s rating
for movie $i$ - assuming we've numbered our users and our movies.

Users have typically rated only a fraction of movies, and so most of
the elements of this matrix are unknown. We can represent this with
another $n \times m$ matrix (specifically a binary matrix), a 'mask'
$M$ in which $M_{k,i}$ is 1 if user $k$ supplied a rating for movie
$i$, and otherwise 0.

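As a quick sketch of what that encoding looks like in NumPy (the
triples here are made up for illustration; a real dataset like
movielens would supply them), unknown ratings simply stay 0 in $U$,
and $M$ records which entries are real:

#+BEGIN_SRC python
import numpy as np

# Hypothetical (user, movie, rating) triples, 0-indexed:
triples = [(0, 0, 9), (0, 1, 8), (1, 0, 7),
           (1, 2, 8), (2, 1, 6), (2, 2, 7)]
n, m = 3, 3  # n users, m movies

U = np.zeros((n, m))  # utility matrix: ratings, 0 where unknown
M = np.zeros((n, m))  # mask: 1 where a rating exists
for user, movie, rating in triples:
    U[user, movie] = rating
    M[user, movie] = 1.0
#+END_SRC
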
I mentioned *deviation* above and gave an informal definition of it.
The paper gives a formal, rather terse definition of the average
deviation of item $i$ with respect to item $j$:

$$\textrm{dev}_{j,i} = \sum_{u \in S_{j,i}(\chi)} \frac{u_j - u_i}{card(S_{j,i}(\chi))}$$

where:
- $u_j$ and $u_i$ mean: user $u$'s ratings for movies $j$ and $i$,
  respectively
- $u \in S_{j,i}(\chi)$ means: all users $u$ who, in the dataset we're
  training on, provided a rating for both movie $i$ and movie $j$
- $card$ is the cardinality of that set, i.e.
  ${card(S_{j,i}(\chi))}$ is just how many users rated both $i$ and
  $j$.

That denominator does depend on $i$ and $j$, but not on the summation
variable $u$, so it can be pulled out; we can also split up the
summation, as long as both halves are kept over the same set of users:

$$\textrm{dev}_{j,i} = \frac{1}{card(S_{j,i}(\chi))} \sum_{u \in S_{j,i}(\chi)} (u_j - u_i) = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i\right)$$

Let's start with computing ${card(S_{j,i}(\chi))}$, the number of
users who rated both movie $i$ and movie $j$. Consider column $i$ of
the mask $M$: each value in this column equals 1 if the respective
user rated movie $i$, and 0 if they did not. Clearly, simply summing
up column $i$ would tell us how many users rated movie $i$, and the
same applies to column $j$ for movie $j$.

Now, suppose we take the element-wise logical AND of columns $i$ and
$j$. The resultant column has a 1 only where both corresponding
elements were 1 - where a user rated both $i$ and $j$. If we sum up
this column, we have exactly the number we need: the number of users
who rated both $i$ and $j$.

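As a tiny made-up example with three users, suppose columns $i$ and
$j$ of $M$ are:

$$\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \wedge \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$$

Summing the result gives 1: exactly one user (the first) rated both
movies.
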
Some might notice that "elementwise logical AND" is just "elementwise
multiplication", thus "sum of elementwise logical AND" is just "sum of
elementwise multiplication", which is: dot product. That is,
${card(S_{j,i}(\chi))}=M_j \bullet M_i$ if we use $M_i$ and $M_j$ for
columns $i$ and $j$ of $M$.

However, we'd like to compute deviation as a matrix for all $i$ and
$j$, so we'll likewise need ${card(S_{j,i}(\chi))}$ for every single
combination of $i$ and $j$ - that is, we need a dot product between
every single pair of columns from $M$. Incidentally, "dot product of
every pair of columns" happens to be almost exactly matrix
multiplication; note that for matrices $A$ and $B$, element $(x,y)$ of
the matrix product $AB$ is just the dot product of /row/ $x$ of $A$
and /column/ $y$ of $B$ - and that matrix product as a whole has this
dot product between every row of $A$ and every column of $B$.

We wanted the dot product of every column of $M$ with every column of
$M$, which is easy: just transpose $M$ for one operand. Then, we can
compute our count matrix like this:

$$C=M^\top M$$

Thus $C_{i,j}$ is the dot product of column $i$ of $M$ and column $j$
of $M$ - or, the number of users who rated both movies $i$ and $j$.

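To illustrate with actual (made-up) matrices, reuse the two columns
from the example above as a full mask - three users, two movies:

$$M = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad C = M^\top M = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

The diagonal says two users rated each movie; the off-diagonal
$C_{1,2} = 1$ says exactly one user rated both.
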
That was the first half of what we needed for $\textrm{dev}_{j,i}$.
We still need the other half:

$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$

We can apply a similar trick here. Consider first what $\sum_{u \in
S_{j,i}(\chi)} u_j$ means: It is the sum of only those ratings of
movie $j$ that were made by a user who also rated movie $i$.
Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those
ratings of movie $i$ that were made by a user who also rated movie
$j$. (Note the symmetry: it's over the same set of users, because
it's always the users who rated both $i$ and $j$.)

The same masking trick turns these sums into matrix products,
assuming we store unknown ratings in $U$ as 0 (which the mask lets us
do safely, since real ratings start at 1 star). Then $U_{u,j}
M_{u,i}$ is nonzero only for users who rated both $j$ and $i$, so

$$\sum_{u \in S_{j,i}(\chi)} u_j = \sum_{u} U_{u,j} M_{u,i} = (U^\top M)_{j,i}$$

Writing $S = U^\top M$, the second sum is the same expression with
$i$ and $j$ swapped, i.e. $(S^\top)_{j,i}$, and so the whole
deviation matrix can be computed elementwise, wherever $C_{j,i} > 0$:

$$\textrm{dev}_{j,i} = \frac{S_{j,i} - (S^\top)_{j,i}}{C_{j,i}}$$

* Implementation

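Putting the pieces above together, here's a minimal NumPy sketch of
the matrix version. As before, the function names and toy data are my
own, for illustration only; for anything serious, Surprise's
implementation is a better starting point.

#+BEGIN_SRC python
import numpy as np

def slope_one_matrices(U, M):
    """Compute the deviation matrix D and pair-count matrix C.

    U: (users x movies) ratings, with 0 where unrated.
    M: (users x movies) binary mask, 1 where a rating exists."""
    C = M.T @ M  # C[i, j]: how many users rated both i and j
    S = U.T @ M  # S[j, i]: sum of ratings of j by users who also rated i
    with np.errstate(divide="ignore", invalid="ignore"):
        D = np.where(C > 0, (S - S.T) / C, 0.0)  # D[j, i] = dev(j, i)
    return D, C

def predict_weighted(U, M, D, C, user, movie):
    """Weighted Slope One prediction of `user`'s rating for `movie`."""
    rated = M[user].astype(bool)
    rated[movie] = False  # don't predict from the target itself
    counts = C[movie, rated]  # users behind each deviation
    preds = U[user, rated] + D[movie, rated]
    if counts.sum() == 0:
        return None
    return float((preds * counts).sum() / counts.sum())

# The same toy data as before: Bob, Alice, Carol rating Dirty Harry,
# Gran Torino, Citizen Kane (columns 0, 1, 2).
U = np.array([[9.0, 8.0, 0.0],
              [7.0, 0.0, 8.0],
              [0.0, 6.0, 7.0]])
M = (U > 0).astype(float)  # safe here, since ratings start at 1
D, C = slope_one_matrices(U, M)
print(predict_weighted(U, M, D, C, user=0, movie=2))  # 9.5, as before
#+END_SRC

* Summary

To predict how a user will rate a movie, Slope One looks at each
movie the user did rate, asks how much higher or lower other users
rated the target movie compared to it (the deviation), and averages
the resulting per-movie predictions. Weighted Slope One just weights
each of those predictions by how many users its deviation came from.
It's simple enough to implement in a few lines of Python, and with a
utility matrix and a mask, the whole computation reduces to a couple
of matrix products.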