Try to fix code style

Chris Hodapp 2020-02-01 09:39:22 -05:00
parent 35bb7686c8
commit 3a37dbc3b8
6 changed files with 173 additions and 110 deletions

.gitmodules vendored Normal file

@@ -0,0 +1,6 @@
[submodule "hugo_blag/themes/zen"]
path = hugo_blag/themes/zen
url = https://github.com/frjo/hugo-theme-zen.git
[submodule "hugo_blag/themes/nofancy"]
path = hugo_blag/themes/nofancy
url = https://github.com/gizak/nofancy.git

hugo_blag/config.toml

@@ -1,4 +1,31 @@
baseURL = "http://example.org/"
languageCode = "en-us"
title = "My New Hugo Site"
-theme = "indigo"
+# Want to use this, but the default code theme is hideous:
+#theme = "indigo"
+#theme = "zen"
+# This one *does* use 'highlight' below:
+theme = "nofancy"
+[params]
+# See themes/nofancy/static/highlight/styles for available options
+highlight="tomorrow"
+# Controls what items are listed in the top nav menu:
+# "none", or "categories"
+# If you have too many categories to fit in the top nav menu, set this to "none"
+topmenu="categories"
+# None of this is taking any effect, despite
+# https://gohugo.io/getting-started/configuration-markup#highlight:
+#[markup]
+#  [markup.highlight]
+#    codeFences = true
+#    guessSyntax = false
+#    hl_Lines = ""
+#    lineNoStart = 1
+#    lineNos = false
+#    lineNumbersInTable = true
+#    noClasses = true
+#    style = "monokai"
+#    tabWidth = 4


@@ -81,35 +81,34 @@ Download [MovieLens 20M](https://grouplens.org/datasets/movielens/20m/) and unco
For Python dependencies, everything I need is imported below: pandas, numpy, matplotlib, and scikit-learn.
-```python
+{{<highlight python>}}
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.sparse
import sklearn.model_selection
-```
+{{< / highlight >}}
# 3. Loading data
I don't explain this in detail. These are just standard calls in [Pandas](https://pandas.pydata.org/) and little details that are boring but essential:
-```python
+{{<highlight python>}}
ml = pd.read_csv("ml-20m/ratings.csv",
                 header=0,
                 dtype={"user_id": np.int32, "movie_id": np.int32, "rating": np.float32, "time": np.int64},
                 names=("user_id", "movie_id", "rating", "time"))
# Convert Unix seconds to a Pandas timestamp:
ml["time"] = pd.to_datetime(ml["time"], unit="s")
-```
+{{< / highlight >}}
Below is just to inspect that the data appears to be okay:
-```python
+{{<highlight python>}}
ml.info()
-```
+{{< / highlight >}}
<div class=result>
@@ -127,9 +126,9 @@ ml.info()
-```python
+{{<highlight python>}}
ml.describe()
-```
+{{< / highlight >}}
@@ -152,9 +151,9 @@ max|1.384930e+05|1.312620e+05|5.000000e+00
-```python
+{{<highlight python>}}
ml[:10]
-```
+{{< / highlight >}}
@@ -179,11 +178,11 @@ ml[:10]
-```python
+{{<highlight python>}}
max_user = int(ml["user_id"].max() + 1)
max_movie = int(ml["movie_id"].max() + 1)
max_user, max_movie, max_user * max_movie
-```
+{{< / highlight >}}
@@ -199,9 +198,9 @@ max_user, max_movie, max_user * max_movie
Computing what percent we have of all 'possible' ratings (i.e. every single movie & every single user), this data is rather sparse:
-```python
+{{<highlight python>}}
print("%.2f%%" % (100 * ml.shape[0] / (max_user * max_movie)))
-```
+{{< / highlight >}}
<div class=result>
@@ -217,27 +216,27 @@ This is partly just to explore the data a little, and partly because we need to
The dataset includes a lot of per-movie information too, but we only bother with the title so far:
-```python
+{{<highlight python>}}
names = pd.read_csv(
    "ml-20m/movies.csv", header=0,
    encoding = "ISO-8859-1", index_col=0,
    names=("movie_id", "movie_title"), usecols=[0,1])
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
movie_group = ml.groupby("movie_id")
movie_stats = names.\
    join(movie_group.size().rename("num_ratings")).\
    join(movie_group.mean()["rating"].rename("avg_rating"))
-```
+{{< / highlight >}}
Sorting by number of ratings and taking the top 25, this looks pretty sensible:
-```python
+{{<highlight python>}}
movie_stats.sort_values("num_ratings", ascending=False)[:25]
-```
+{{< / highlight >}}
@@ -279,9 +278,9 @@ movie_stats.sort_values("num_ratings", ascending=False)[:25]
Prior to anything else, split training/test data out with a specific random seed:
-```python
+{{<highlight python>}}
ml_train, ml_test = sklearn.model_selection.train_test_split(ml, test_size=0.25, random_state=12345678)
-```
+{{< / highlight >}}
# 4. Utility Matrix
@@ -314,15 +313,15 @@ later does this.
We'll convert to a utility matrix, for which the naive way is creating a dense matrix:
-```python
+{{<highlight python>}}
m = np.zeros((max_user, max_movie))
m[df["user_id"], df["movie_id"]] = df["rating"]
-```
+{{< / highlight >}}
...but we'd be dealing with an 18,179,137,922-element matrix that's a little bit unusable here (at least it is for me since I only have 32 GB RAM), so we'll use [sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html).
-```python
+{{<highlight python>}}
def df2mat(df):
    m = scipy.sparse.coo_matrix(
        (df["rating"], (df["user_id"], df["movie_id"])),
@@ -332,14 +331,14 @@ def df2mat(df):
ml_mat_train, ml_mask_train = df2mat(ml_train)
ml_mat_test, ml_mask_test = df2mat(ml_test)
-```
+{{< / highlight >}}
We need a mask for some later steps, hence the `m > 0` step. Ratings go only from 1 to 5, so values of 0 are automatically unknown/missing data, which fits with how sparse matrices work.
-```python
+{{<highlight python>}}
ml_mat_train
-```
+{{< / highlight >}}
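For reference, a minimal standalone sketch of the `df2mat` idea on a toy dataframe (the 2x3 shape and the values here are made up for illustration), showing how the COO construction works and how `m > 0` yields the boolean mask:

```python
import pandas as pd
import scipy.sparse

# Toy ratings; real ratings are 1-5, so 0 can safely mean "missing".
df = pd.DataFrame({"user_id":  [0, 1, 1],
                   "movie_id": [2, 0, 2],
                   "rating":   [4.0, 3.5, 5.0]})
m = scipy.sparse.coo_matrix(
    (df["rating"], (df["user_id"], df["movie_id"])),
    shape=(2, 3)).tocsr()
mask = m > 0  # sparse boolean matrix: True exactly where a rating exists
print(m.toarray())     # [[0.  0.  4. ]
                       #  [3.5 0.  5. ]]
print(mask.toarray())  # [[False False  True]
                       #  [ True False  True]]
```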
@@ -356,9 +355,9 @@ ml_mat_train
To demonstrate that the matrix and dataframe have the same data:
-```python
+{{<highlight python>}}
ml_train[:10]
-```
+{{< / highlight >}}
@@ -383,9 +382,9 @@ ml_train[:10]
-```python
+{{<highlight python>}}
list(ml_train.iloc[:10].rating)
-```
+{{< / highlight >}}
@@ -399,11 +398,11 @@ list(ml_train.iloc[:10].rating)
-```python
+{{<highlight python>}}
user_ids = list(ml_train.iloc[:10].user_id)
movie_ids = list(ml_train.iloc[:10].movie_id)
[ml_mat_train[u,i] for u,i in zip(user_ids, movie_ids)]
-```
+{{< / highlight >}}
@@ -472,15 +471,15 @@ will go through these steps with some real data. I arbitrarily chose
user 28812:
-```python
+{{<highlight python>}}
pd.set_option('display.max_rows', 10)
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
target_user = 28812
names.merge(ml_train[ml_train.user_id == target_user], right_on="movie_id", left_index=True)
-```
+{{< / highlight >}}
@@ -508,10 +507,10 @@ names.merge(ml_train[ml_train.user_id == target_user], right_on="movie_id", left
I picked *Home Alone*, movie ID 586, as the one we want to predict user 28812's rating on. This isn't completely arbitrary. I chose it because the testing data contains the actual rating and we can compare against it later.
-```python
+{{<highlight python>}}
target_movie = 586
names[names.index == target_movie]
-```
+{{< / highlight >}}
@@ -529,14 +528,14 @@ names[names.index == target_movie]
Now, from step #1 and about half of step #2: Which users also rated one of the movies that 28812 rated, *and* rated *Home Alone*? What were those ratings?
-```python
+{{<highlight python>}}
users_df = ml_train[ml_train.user_id == target_user][["movie_id"]]. \
    merge(ml_train, on="movie_id")[["movie_id", "user_id", "rating"]]. \
    merge(ml_train[ml_train.movie_id == target_movie], on="user_id"). \
    drop(["movie_id_y", "time"], axis=1)
# time is irrelevant to us, movie_id_y is just always 586 (the target movie)
users_df
-```
+{{< / highlight >}}
@@ -564,10 +563,10 @@ users_df
Each row has one user's ratings of both *Home Alone* (it's the `rating_y` column), and some other movie that 28812 rated (`rating_x`), so we can easily find the deviation of each individual rating - how much higher they rated *Home Alone* than the respective movie for `movie_id_x`:
-```python
+{{<highlight python>}}
users_df = users_df.assign(rating_dev = users_df.rating_y - users_df.rating_x)
users_df
-```
+{{< / highlight >}}
@@ -595,11 +594,11 @@ users_df
...and for the rest of step 2, turn this into an average deviation by grouping by movie ID. For the sake of displaying it, inner join with the dataframe that has movie titles:
-```python
+{{<highlight python>}}
pd.set_option('display.max_rows', 20)
rating_dev = users_df.groupby("movie_id_x").mean()["rating_dev"]
names.join(rating_dev, how="inner").sort_values("rating_dev")
-```
+{{< / highlight >}}
@@ -639,12 +638,12 @@ The numbers above then tell us that, on average, users who watched both movies r
For step 3, we can produce a prediction from each deviation above by adding it to each of user 28812's ratings for the respective movies:
-```python
+{{<highlight python>}}
df = ml_train[ml_train.user_id == target_user]. \
    join(rating_dev, on="movie_id")
df = df.assign(rating_adj = df["rating"] + df["rating_dev"])[["user_id", "movie_id", "rating", "rating_adj"]]
df.join(names, on="movie_id").sort_values("movie_title")
-```
+{{< / highlight >}}
@@ -682,9 +681,9 @@ df.join(names, on="movie_id").sort_values("movie_title")
That is, every 'adjusted' rating above (the `rating_adj` column) is something like: based on just this movie, what rating would we expect user 28812 to give *Home Alone*? Produce the final prediction by averaging all these:
-```python
+{{<highlight python>}}
df["rating_adj"].mean()
-```
+{{< / highlight >}}
@@ -700,9 +699,9 @@ df["rating_adj"].mean()
As mentioned above, we also happen to have the user's actual rating on *Home Alone* in the test set (i.e. we didn't train on it), so we can compare here:
-```python
+{{<highlight python>}}
ml_test[(ml_test.user_id == target_user) & (ml_test.movie_id == target_movie)]["rating"].iloc[0]
-```
+{{< / highlight >}}
@@ -722,10 +721,10 @@ That's quite close - though that may just be luck. It's hard to say from one poi
Take a look at the table below. This is a similar aggregation to what we just did to determine average deviation - but this instead counts up the number of ratings that went into each average deviation.
-```python
+{{<highlight python>}}
num_ratings = users_df.groupby("movie_id_x").count()["rating_dev"].rename("num_ratings")
names.join(num_ratings, how="inner").sort_values("num_ratings")
-```
+{{< / highlight >}}
@@ -776,11 +775,11 @@ an estimate it is.
This is easy to do, luckily:
-```python
+{{<highlight python>}}
df = df.join(num_ratings, on="movie_id")
df = df.assign(rating_weighted = df["rating_adj"] * df["num_ratings"])
df
-```
+{{< / highlight >}}
@@ -816,9 +815,9 @@ df
-```python
+{{<highlight python>}}
df["rating_weighted"].sum() / df["num_ratings"].sum()
-```
+{{< / highlight >}}
@@ -1020,7 +1019,7 @@ While $U$ and $M$ can be sparse matrices, $C$ and $D$ sort of must be dense matr
However, if we look at the $P(u)_j$ formula above, it refers only to row $j$ of $C$ and $D$, and the formulas for $C$ and $D$ make it easy to compute them by row if needed, or by blocks of rows according to what $u$ and $j$ we need. This is what I do below.
-```python
+{{<highlight python>}}
def slope_one(U, M, users, movies, approx=True):
    M_j = M[:,movies].T.multiply(1)
    U_j = U[:,movies].T
@@ -1036,15 +1035,15 @@ def slope_one(U, M, users, movies, approx=True):
    else:
        P_u_j = ((mask * (U_u + Dj)).sum(axis=1) - U_u[0,movies]) / np.maximum(mask.sum(axis=1), 1)
    return P_u_j
-```
+{{< / highlight >}}
To show that it actually gives the same result as above, and that the approximation produces seemingly no change here:
-```python
+{{<highlight python>}}
(slope_one(ml_mat_train, ml_mask_train, [target_user], [target_movie])[0],
 slope_one(ml_mat_train, ml_mask_train, [target_user], [target_movie], approx=False)[0])
-```
+{{< / highlight >}}
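To make the step-by-step procedure concrete in miniature, here is a tiny, purely illustrative Slope One on a made-up 3-user by 3-movie dense matrix. This is the unweighted scheme from the walkthrough above, not the post's sparse `slope_one`:

```python
import numpy as np

# ratings[user][movie]; 0 = missing. Values invented for illustration.
R = np.array([[5.0, 3.0, 2.0],
              [3.0, 4.0, 0.0],
              [0.0, 2.0, 5.0]])

def slope_one_toy(R, u, j):
    """Predict user u's rating of movie j from average deviations."""
    preds = []
    for i in range(R.shape[1]):
        if i == j or R[u, i] == 0:
            continue
        # Users who rated both movie j and movie i:
        both = (R[:, j] > 0) & (R[:, i] > 0)
        if not both.any():
            continue
        dev = (R[both, j] - R[both, i]).mean()  # avg deviation of j vs. i
        preds.append(R[u, i] + dev)             # per-movie adjusted rating
    return np.mean(preds)                       # final prediction

print(slope_one_toy(R, u=1, j=2))  # predict user 1's rating of movie 2: 2.5
```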
@@ -1060,7 +1059,7 @@ To show that it actually gives the same result as above, and that the approximat
This computes training error on a small part (1%) of the data, since doing it over the entire thing would be horrendously slow:
-```python
+{{<highlight python>}}
def slope_one_err(U, M, users, movies, true_ratings):
    # Keep 'users' and 'movies' small (couple hundred maybe)
    p = slope_one(U, M, users, movies)
@@ -1068,10 +1067,10 @@ def slope_one_err(U, M, users, movies, true_ratings):
    err_abs = np.abs(d).sum()
    err_sq = np.square(d).sum()
    return err_abs, err_sq
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
import multiprocessing
count = int(len(ml_train) * 0.01)
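For reference, the two error measures being accumulated here, where $d_k$ is the prediction error on rating $k$ and $n$ is the number of ratings scored:

$$\mathrm{MAE} = \frac{1}{n}\sum_{k}|d_k|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k}d_k^2}$$

This matches the code: the summed `err_abs` and `err_sq` values are divided by `count` in the blocks below, with a square root applied for the RMSE.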
@@ -1092,12 +1091,12 @@ with multiprocessing.Pool() as p:
    errs = p.map(err_part, idxs_split)
err_mae_train = sum([e[0] for e in errs]) / count
err_rms_train = np.sqrt(sum([e[1] for e in errs]) / count)
-```
+{{< / highlight >}}
and then likewise on 2% of the testing data (it's a smaller set to start):
-```python
+{{<highlight python>}}
import multiprocessing
count = int(len(ml_test) * 0.02)
@@ -1117,19 +1116,19 @@ with multiprocessing.Pool() as p:
    errs = p.map(err_part, idxs_split)
err_mae_test = sum([e[0] for e in errs]) / count
err_rms_test = np.sqrt(sum([e[1] for e in errs]) / count)
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
# These are used later for comparison:
test_results = [("", "Slope One", err_mae_test, err_rms_test)]
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
print("Training error: MAE={:.3f}, RMSE={:.3f}".format(err_mae_train, err_rms_train))
print("Testing error: MAE={:.3f}, RMSE={:.3f}".format(err_mae_test, err_rms_test))
-```
+{{< / highlight >}}
<div class=result>
@@ -1253,16 +1252,16 @@ The code below is a direct implementation of this by simply iteratively applying
## 6.5. Implementation
-```python
+{{<highlight python>}}
# Hyperparameters
gamma = 0.002
lambda_ = 0.02
num_epochs = 20
num_factors = 40
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
class SVDModel(object):
    def __init__(self, num_items, num_users, mean,
                 num_factors = 100, init_variance = 0.1):
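The body of `SVDModel` is mostly elided between these hunks. For reference, here is a minimal sketch of the standard biased-SVD SGD step that `update_by_gradient` presumably performs; this is an assumption for illustration, not the post's actual code, and it assumes factors are stored column-wise (matching the `svd4.q[:,ids_top]` indexing later):

```python
import numpy as np

def sgd_step(mu, b_i, b_u, q, p, i, u, r_ui, lambda_, gamma):
    # Prediction error for this (item, user, rating) triple:
    e = r_ui - (mu + b_i[i] + b_u[u] + q[:, i] @ p[:, u])
    # Gradient steps with L2 regularization:
    b_i[i] += gamma * (e - lambda_ * b_i[i])
    b_u[u] += gamma * (e - lambda_ * b_u[u])
    q_i = q[:, i].copy()  # save item factors before updating them
    q[:, i] += gamma * (e * p[:, u] - lambda_ * q_i)
    p[:, u] += gamma * (e * q_i - lambda_ * p[:, u])
```

Here `gamma` is the learning rate and `lambda_` the regularization strength, matching the hyperparameters defined above.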
@@ -1354,12 +1353,12 @@ class SVDModel(object):
            i, u, r_ui = items[idx], users[idx], ratings[idx]
            self.update_by_gradient(i, u, r_ui, lambda_, gamma)
        if epoch_callback: epoch_callback(self, epoch, num_epochs)
-```
+{{< / highlight >}}
## 6.6. Running & Testing
-```python
+{{<highlight python>}}
movies_train = ml_train["movie_id"].values
users_train = ml_train["user_id"].values
ratings_train = ml_train["rating"].values
@@ -1373,13 +1372,13 @@ def at_epoch(self, epoch, num_epochs):
        (self.b_i, self.b_u, self.p, self.q))
    print()
    print("Epoch {:02d}/{}; Training: MAE={:.3f} RMSE={:.3f}, Testing: MAE={:.3f} RMSE={:.3f}".format(epoch + 1, num_epochs, train_mae, train_rmse, test_mae, test_rmse))
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
svd40 = SVDModel(max_movie, max_user, ml["rating"].mean(), num_factors=num_factors)
svd40.train(movies_train, users_train, ratings_train, epoch_callback=at_epoch)
-```
+{{< / highlight >}}
<div class=result>
@@ -1428,10 +1427,10 @@ svd40.train(movies_train, users_train, ratings_train, epoch_callback=at_epoch)
-```python
+{{<highlight python>}}
test_rmse, test_mae = svd40.error(movies_test, users_test, ratings_test)
test_results.append(("", "SVD", test_mae, test_rmse))
-```
+{{< / highlight >}}
## 6.7. Visualization of Latent Space
@@ -1440,10 +1439,10 @@ I mentioned somewhere in here that this is a latent-factor model. The latent spa
The 40-dimensional space above might be a bit unruly to work with, but we can easily train on something lower, like a 4-dimensional space. We can then pick a few dimensions, and visualize where movies fit in this space.
-```python
+{{<highlight python>}}
svd4 = SVDModel(max_movie, max_user, ml["rating"].mean(), 4)
svd4.train(ml_train["movie_id"].values, ml_train["user_id"].values, ml_train["rating"].values, epoch_callback=at_epoch)
-```
+{{< / highlight >}}
<div class=result>
@@ -1494,29 +1493,29 @@ svd4.train(ml_train["movie_id"].values, ml_train["user_id"].values, ml_train["ra
To limit the data, we can use just the top movies (by number of ratings):
-```python
+{{<highlight python>}}
top = movie_stats.sort_values("num_ratings", ascending=False)[:100]
ids_top = top.index.values
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
factors = svd4.q[:,ids_top].T
means, stds = factors.mean(axis=0), factors.std(axis=0)
factors[:] = (factors - means) / stds
-```
+{{< / highlight >}}
So, here are the top 100 movies when plotted in the first two dimensions of the concept space:
-```python
+{{<highlight python>}}
plt.figure(figsize=(15,15))
markers = ["$ {} $".format("\ ".join(m.split(" ")[:-1])) for m in top["movie_title"][:50]]
for i,item in enumerate(factors[:50,:]):
    l = len(markers[i])
    plt.scatter(item[0], item[1], marker = markers[i], alpha=0.75, s = 50 * (l**2))
plt.show()
-```
+{{< / highlight >}}
![png](../images/2018-04-08-recommenders/output_94_0.png)
@@ -1525,14 +1524,14 @@ plt.show()
And here are the other two:
-```python
+{{<highlight python>}}
plt.figure(figsize=(15,15))
markers = ["$ {} $".format("\ ".join(m.split(" ")[:-1])) for m in top["movie_title"][50:]]
for i,item in enumerate(factors[50:,:]):
    l = len(markers[i])
    plt.scatter(item[2], item[3], marker = markers[i], alpha=0.75, s = 50 * (l**2))
plt.show()
-```
+{{< / highlight >}}
![png](../images/2018-04-08-recommenders/output_96_0.png)
@@ -1541,7 +1540,7 @@ plt.show()
Below is another way of visualizing. Neither the code nor the result is very pretty, but it divides the entire latent space into a 2D grid, identifies the top few movies (ranked by number of ratings) in each grid square, and prints the resultant grid.
-```python
+{{<highlight python>}}
def clean_title(s):
    remove = [", The", ", A", ", An"]
    s1 = " ".join(s.split(" ")[:-1])
@@ -1586,13 +1585,13 @@ def latent_factor_grid(latent_space, count=2):
            else:
                first_idxs[i,j] = -1
    return pd.DataFrame(first_titles)
-```
+{{< / highlight >}}
-```python
+{{<highlight python>}}
pd.set_option('display.max_rows', 500)
latent_factor_grid(svd4.q[:2,:])
-```
+{{< / highlight >}}
@@ -1627,9 +1626,9 @@ Both axes seem to start more on the low-brow side along the top left. There is
Here is the same thing for the other two dimensions in this latent space:
-```python
+{{<highlight python>}}
latent_factor_grid(svd4.q[2:,:])
-```
+{{< / highlight >}}
@@ -1666,11 +1665,11 @@ Some sensible axes seem to form here too. Moving from left to right (i.e. increa
We can also look at the per-movie bias parameters in the model - loosely, how much higher or lower a movie's rating is, beyond what interactions with user preferences seem to explain. Here are the top 10 and bottom 10; interestingly, while it seems to correlate with the average rating, it doesn't seem to do so especially strongly.
-```python
+{{<highlight python>}}
#bias = movie_stats.assign(bias = svd40.b_i[:-1]).sort_values("bias", ascending=False)
bias = movie_stats.join(pd.Series(svd40.b_i[:-1]).rename("bias")).sort_values("bias", ascending=False).dropna()
bias.iloc[:10]
-```
+{{< / highlight >}}
@@ -1695,9 +1694,9 @@ bias.iloc[:10]
-```python
+{{<highlight python>}}
bias.iloc[:-10:-1]
-```
+{{< / highlight >}}
@@ -1727,27 +1726,27 @@ bias.iloc[:-10:-1]
Results below are cross-validated, while our results above aren't, so the comparison may have some noise to it (i.e. if you change the random seed you may see our results perform much better or worse, while the Surprise results should be more consistent).
-```python
+{{<highlight python>}}
import surprise
from surprise.dataset import Dataset
-```
+{{< / highlight >}}
Note the `.iloc[::10]` below, which is a quick way to decimate the data by a factor of 10. Surprise seems to be less memory-efficient than my code above (at least, without any tuning whatsoever), so in order to test it I don't pass in the entire dataset.
-```python
+{{<highlight python>}}
reader = surprise.Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ml[["user_id", "movie_id", "rating"]].iloc[::10], reader)
cv=5
cv_random = surprise.model_selection.cross_validate(surprise.NormalPredictor(), data, cv=cv)
cv_sl1 = surprise.model_selection.cross_validate(surprise.SlopeOne(), data, cv=cv)
cv_svd = surprise.model_selection.cross_validate(surprise.SVD(), data, cv=cv)
-```
+{{< / highlight >}}
# 8. Overall results
-```python
+{{<highlight python>}}
get_record = lambda name, df: \
    ("Surprise", name, df["test_mae"].sum() / cv, df["test_rmse"].sum() / cv)
cv_data_surprise = [
@@ -1757,7 +1756,7 @@ pd.DataFrame.from_records(
    data=test_results + cv_data_surprise,
    columns=("Library", "Algorithm", "MAE (test)", "RMSE (test)"),
)
-```
+{{< / highlight >}}


@@ -0,0 +1,29 @@
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']],
displayMath: [['$$','$$'], ['\\[','\\]']],
processEscapes: true,
processEnvironments: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'],
TeX: { equationNumbers: { autoNumber: "AMS" },
extensions: ["AMSmath.js", "AMSsymbols.js"] }
}
});
</script>
<!-- script type="text/x-mathjax-config">
MathJax.Hub.Queue(function() {
// Fix <code> tags after MathJax finishes running. This is a
// hack to overcome a shortcoming of Markdown. Discussion at
// https://github.com/mojombo/jekyll/issues/199
var all = MathJax.Hub.getAllJax(), i;
for(i = 0; i < all.length; i += 1) {
all[i].SourceElement().parentNode.className += ' has-jax';
}
});
</script -->

hugo_blag/themes/nofancy Submodule

@@ -0,0 +1 @@
Subproject commit ae4670287c71f4c4aed91be9b3d3919846fd62c9

hugo_blag/themes/zen Submodule

@@ -0,0 +1 @@
Subproject commit b09452be937db32d659e2a255617256a4dca345b