Fix some MathJax annoyances

Chris Hodapp 2020-04-30 17:52:44 -04:00
parent 64b3ddb238
commit ca556b1243
3 changed files with 186 additions and 126 deletions

View File

@ -171,7 +171,7 @@ image.
Here's another comparison, this time a 1:1 crop from the center of an
image (shot at 40mm with [this lens][12-40mm], whose Amazon price
mysteriously is now $146 instead of the $23
I actually paid). Click the preview for a lossless PNG view, as JPEG
might eat some of the finer details, or [here][leaves-full] for the
full JPEG file (including raw, if you want to look around).

View File

@ -85,6 +85,7 @@ Below is just to inspect that data appears to be okay:
ml.info()
{{< / highlight >}}
{{< rawhtml >}}
<pre class="result">
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
@ -96,7 +97,7 @@ ml.info()
dtypes: datetime64[ns](1), float32(1), int32(2)
memory usage: 381.5 MB
</pre>
{{< /rawhtml >}}
{{<highlight python>}}
ml.describe()
@ -106,7 +107,6 @@ ml.describe()
| | user_id | movie_id | rating
|----|---------|----------|-------
count|2.000026e+07|2.000026e+07|2.000026e+07
@ -117,7 +117,6 @@ min|1.000000e+00|1.000000e+00|5.000000e-01
50%|6.914100e+04|2.167000e+03|3.500000e+00
75%|1.036370e+05|4.770000e+03|4.000000e+00
max|1.384930e+05|1.312620e+05|5.000000e+00
@ -131,7 +130,6 @@ ml[:10]
| | user_id | movie_id | rating | time
|----|--------|---------|-------|-----
0|1|2|3.5|2005-04-02 23:53:47
@ -144,7 +142,6 @@ ml[:10]
7|1|223|4.0|2005-04-02 23:46:13
8|1|253|4.0|2005-04-02 23:35:40
9|1|260|4.0|2005-04-02 23:33:46
@ -159,12 +156,13 @@ max_user, max_movie, max_user * max_movie
{{< rawhtml >}}
<pre class="result">
(138494, 131263, 18179137922)
</pre>
{{< /rawhtml >}}
Computing what percent we have of all 'possible' ratings (i.e. every single movie & every single user), this data is rather sparse:
@ -174,12 +172,13 @@ Computing what percent we have of all 'possible' ratings (i.e. every single movi
print("%.2f%%" % (100 * ml.shape[0] / (max_user * max_movie)))
{{< / highlight >}}
{{< rawhtml >}}
<pre class="result">
0.11%
</pre>
{{< /rawhtml >}}
## 3.1. Aggregation
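The aggregation step itself is elided from this hunk; as a rough, hypothetical sketch of how a table like `movie_stats` could be built with pandas (assuming `names` is a DataFrame of `movie_title` indexed by `movie_id`, as the later joins suggest):
{{<highlight python>}}
# Hypothetical sketch - not the post's actual code, which is not shown here.
# Count and average ratings per movie, then attach titles from `names`.
movie_stats = (
    ml.groupby("movie_id")["rating"]
      .agg(num_ratings="count", avg_rating="mean")
      .join(names)
)
movie_stats.sort_values("num_ratings", ascending=False)[:25]
{{< / highlight >}}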
@ -214,7 +213,6 @@ movie_stats.sort_values("num_ratings", ascending=False)[:25]
| movie_id | movie_title | num_ratings | avg_rating
|---------|------------|------------|-----------
296|Pulp Fiction (1994)|67310.0|4.174231
@ -242,7 +240,6 @@ movie_stats.sort_values("num_ratings", ascending=False)[:25]
608|Fargo (1996)|43272.0|4.112359
47|Seven (a.k.a. Se7en) (1995)|43249.0|4.053493
380|True Lies (1994)|43159.0|3.491149
@ -267,9 +264,9 @@ examples of this, check out section 11.3.2 in [MMDS](http://www.mmds.org/).)
In a utility matrix, each row represents one user, each column represents
one item (a movie, in our case), and each element represents a user's
rating of an item. If we have \\(n\\) users and \\(m\\) movies, then this is an
\\(n \times m\\) matrix \\(U\\) for which \\(U_{k,i}\\) is user \\(k\\)'s rating for
movie \\(i\\) - assuming we've numbered our users and our movies.
Users have typically rated only a fraction of movies, and so most of
the elements of this matrix are unknown. Algorithms represent this
@ -315,13 +312,14 @@ ml_mat_train
{{< rawhtml >}}
<pre class="result">
<138494x131263 sparse matrix of type '<class 'numpy.float32'>'
with 15000197 stored elements in Compressed Sparse Column format>
</pre>
{{< /rawhtml >}}
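The construction of `ml_mat_train` is also elided here; a minimal sketch of one way to build such a CSC matrix from `ml_train` (assuming `user_id` and `movie_id` can be used directly as row and column indices, which `max_user` and `max_movie` above suggest):
{{<highlight python>}}
# Hypothetical sketch - the post's actual construction is not shown in this hunk.
import numpy as np
import scipy.sparse as sp

ml_mat_train = sp.csc_matrix(
    (ml_train["rating"].astype(np.float32),         # the stored ratings
     (ml_train["user_id"], ml_train["movie_id"])),  # row = user, column = movie
    shape=(max_user, max_movie))
{{< / highlight >}}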
To demonstrate that the matrix and dataframe have the same data:
@ -335,7 +333,6 @@ ml_train[:10]
| | user_id | movie_id | rating | time
|----|--------|---------|-------|-----
13746918|94976|7371|4.5|2009-11-04 05:51:26
@ -348,7 +345,6 @@ ml_train[:10]
15311014|105846|4226|4.5|2004-07-30 18:12:26
8514776|58812|1285|4.0|2000-04-24 20:39:46
3802643|25919|3275|2.5|2010-06-18 00:48:40
@ -361,12 +357,13 @@ list(ml_train.iloc[:10].rating)
{{< rawhtml >}}
<pre class="result">
[4.5, 3.0, 3.0, 4.5, 4.0, 2.5, 5.0, 4.5, 4.0, 2.5]
</pre>
{{< /rawhtml >}}
@ -379,12 +376,13 @@ movie_ids = list(ml_train.iloc[:10].movie_id)
{{< rawhtml >}}
<pre class="result">
[4.5, 3.0, 3.0, 4.5, 4.0, 2.5, 5.0, 4.5, 4.0, 2.5]
</pre>
{{< /rawhtml >}}
Okay, enough of that; we can begin with some actual predictions.
@ -457,7 +455,6 @@ names.merge(ml_train[ml_train.user_id == target_user], right_on="movie_id", left
| | movie_title | user_id | movie_id | rating | time
|----|------------|--------|---------|-------|-----
4229884|Jumanji (1995)|28812|2|5.0|1996-09-23 02:08:39
@ -471,7 +468,6 @@ names.merge(ml_train[ml_train.user_id == target_user], right_on="movie_id", left
4229957|Independence Day (a.k.a. ID4) (1996)|28812|780|5.0|1996-09-23 02:09:02
4229959|Phenomenon (1996)|28812|802|5.0|1996-09-23 02:09:02
4229960|Die Hard (1988)|28812|1036|5.0|1996-09-23 02:09:02
@ -488,12 +484,13 @@ names[names.index == target_movie]
{{< rawhtml >}}
<pre class="result">
| movie_title | movie_id |
|------------|---------|-
586|Home Alone (1990)
</pre>
{{< /rawhtml >}}
@ -513,7 +510,6 @@ users_df
| | movie_id_x | user_id | rating_x | rating_y
|----|-----------|--------|---------|---------
0|329|17593|3.0|4.0
@ -527,8 +523,6 @@ users_df
522688|2|126271|3.0|4.0
522689|595|82760|2.0|4.0
522690|595|18306|4.5|5.0
@ -544,7 +538,6 @@ users_df
| | movie_id_x | user_id | rating_x | rating_y | rating_dev
|----|-----------|--------|---------|---------|-----------
0|329|17593|3.0|4.0|1.0
@ -558,7 +551,6 @@ users_df
522688|2|126271|3.0|4.0|1.0
522689|595|82760|2.0|4.0|2.0
522690|595|18306|4.5|5.0|0.5
@ -574,9 +566,6 @@ names.join(rating_dev, how="inner").sort_values("rating_dev")
| movie_id | movie_title | rating_dev
|---------|------------|-----------
318|Shawshank Redemption, The (1994)|-1.391784
@ -600,8 +589,6 @@ names.join(rating_dev, how="inner").sort_values("rating_dev")
173|Judge Dredd (1995)|0.518570
19|Ace Ventura: When Nature Calls (1995)|0.530155
160|Congo (1995)|0.559034
@ -620,8 +607,6 @@ df.join(names, on="movie_id").sort_values("movie_title")
| | user_id | movie_id | rating | rating_adj | movie_title
|----|--------|---------|-------|-----------|------------
4229920|28812|344|3.0|3.141987|Ace Ventura: Pet Detective (1994)
@ -645,7 +630,6 @@ df.join(names, on="movie_id").sort_values("movie_title")
4229892|28812|50|3.0|1.683520|Usual Suspects, The (1995)
4229903|28812|208|3.0|3.250881|Waterworld (1995)
4229919|28812|339|4.0|3.727966|While You Were Sleeping (1995)
@ -660,12 +644,13 @@ df["rating_adj"].mean()
{{< rawhtml >}}
<pre class="result">
4.087520122528076
</pre>
{{< /rawhtml >}}
As mentioned above, we also happen to have the user's actual rating on *Home Alone* in the test set (i.e. we didn't train on it), so we can compare here:
@ -678,12 +663,13 @@ ml_test[(ml_test.user_id == target_user) & (ml_test.movie_id == target_movie)]["
{{< rawhtml >}}
<pre class="result">
4.0
</pre>
{{< /rawhtml >}}
That's quite close - though that may just be luck. It's hard to say from one point.
@ -702,7 +688,6 @@ names.join(num_ratings, how="inner").sort_values("num_ratings")
| movie_id | movie_title | num_ratings
|---------|------------|------------
802|Phenomenon (1996)|3147
@ -726,7 +711,6 @@ names.join(num_ratings, how="inner").sort_values("num_ratings")
593|Silence of the Lambs, The (1991)|12120
480|Jurassic Park (1993)|13546
356|Forrest Gump (1994)|13847
@ -757,7 +741,6 @@ df
| | user_id | movie_id | rating | rating_adj | num_ratings | rating_weighted
|----|--------|---------|-------|-----------|------------|----------------
4229918|28812|329|4.0|3.767164|6365|23978.000326
@ -781,7 +764,6 @@ df
4229912|28812|296|4.0|2.883755|11893|34296.500678
4229884|28812|2|5.0|4.954595|7422|36773.001211
4229953|28812|595|4.0|3.515051|9036|31761.999825
@ -794,12 +776,13 @@ df["rating_weighted"].sum() / df["num_ratings"].sum()
{{< rawhtml >}}
<pre class="result">
4.02968199025023
</pre>
{{< /rawhtml >}}
It changes the answer, but only very slightly.
@ -818,8 +801,9 @@ eyes glaze over, you can probably just skip this section.
### 5.2.1. Short Answer
Let \\(U\\) be the utility matrix. Let \\(M\\) be a binary matrix for which \\(M_{i,j}=1\\) if user \\(i\\) rated movie \\(j\\), otherwise 0. Compute the model's matrices with:
{{< rawhtml >}}
<div>
$$
\begin{align}
@ -828,35 +812,40 @@ D &= \left(M^\top U - (M^\top U)^\top\right) /\ \textrm{max}(1, M^\top M)
\end{align}
$$
</div>
{{< /rawhtml >}}
where \\(/\\) is Hadamard (i.e. elementwise) division, and \\(\textrm{max}\\) is elementwise maximum with 1. Then, the below gives the prediction for how user \\(u\\) will rate movie \\(j\\):
{{< rawhtml >}}
<div>
$$
P(u)_j = \frac{[M_u \odot (C_j > 0)] \cdot (D_j + U_u) - U_{u,j}}{M_u \cdot (C_j > 0)}
$$
</div>
{{< /rawhtml >}}
\\(D_j\\) and \\(C_j\\) are row \\(j\\) of \\(D\\) and \\(C\\), respectively. \\(M_u\\) and \\(U_u\\) are column \\(u\\) of \\(M\\) and \\(U\\), respectively. \\(\odot\\) is elementwise multiplication.
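To make the short answer concrete, here is a toy, dense NumPy sketch of these formulas (my own illustration, not the post's implementation; rows of `U` and `M` are users, columns are movies). Note that with \\(D\\) computed exactly as above, \\(\textrm{dev}\_{j,i}\\) lands at index \\((i,j)\\), so the prediction reads a column of `D`:
{{<highlight python>}}
# Toy illustration of the short-answer formulas on a 3-user, 3-movie example.
import numpy as np

U = np.array([[5., 3., 0.],   # 0 means "not rated"
              [4., 0., 1.],
              [0., 2., 5.]])
M = (U > 0).astype(float)     # the mask of known ratings

C = M.T @ M                                      # co-rating counts
D = (M.T @ U - (M.T @ U).T) / np.maximum(1, C)   # deviations; dev_{j,i} sits at D[i, j]

def predict(u, j):
    usable = M[u] * (C[j] > 0)   # movies user u rated that co-occur with movie j
    return (usable @ (D[:, j] + U[u]) - U[u, j]) / (M[u] @ (C[j] > 0))

predict(0, 2)   # 4.0: dev(2,0) = -3 and dev(2,1) = +3, so ((5 - 3) + (3 + 3)) / 2
{{< / highlight >}}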
### 5.2.2. Long Answer
First, we need to have our data encoded as an \\(n \times m\\) utility
matrix (see a [few sections above](#Utility-Matrix) for the definition
of *utility matrix*).
As noted, most elements of this matrix are unknown as users have rated
only a fraction of movies. We can represent this with another
\\(n \times m\\) matrix (specifically a binary matrix), a 'mask' \\(M\\) in
which \\(M_{k,i}\\) is 1 if user \\(k\\) supplied a rating for movie \\(i\\), and
otherwise 0.
#### 5.2.2.1. Deviation Matrix
I mentioned *deviation* above and gave an informal definition of it.
The paper gives a formal but rather terse definition below of the
average deviation of item \\(i\\) with respect to item \\(j\\), and I
then separate out the summation a little:
{{< rawhtml >}}
<div>
$$
\begin{split}
@ -867,128 +856,154 @@ S_{j,i}(\chi)} u_j - u_i = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u
\end{split}
$$
</div>
{{< /rawhtml >}}
where:
- \\(u_j\\) and \\(u_i\\) mean: user \\(u\\)'s ratings for movies \\(j\\) and \\(i\\), respectively
- \\(u \in S_{j,i}(\chi)\\) means: all users \\(u\\) who, in the dataset we're
training on, provided a rating for both movie \\(i\\) and movie \\(j\\)
- \\(card\\) is the cardinality of that set, i.e.
\\({card(S_{j,i}(\chi))}\\) is how many users rated both \\(i\\) and
\\(j\\).
#### 5.2.2.2. Cardinality/Counts Matrix
Let's start with computing \\({card(S_{j,i}(\chi))}\\), the number of
users who rated both movie \\(i\\) and movie \\(j\\). Consider column \\(i\\) of
the mask \\(M\\). For each value in this column, it equals 1 if the
respective user rated movie \\(i\\), or 0 if they did not. Clearly,
simply summing up column \\(i\\) would tell us how many users rated movie
\\(i\\), and the same applies to column \\(j\\) for movie \\(j\\).
Now, suppose we take element-wise logical AND of columns \\(i\\) and \\(j\\).
The resultant column has a 1 only where both corresponding elements
were 1 - where a user rated both \\(i\\) and \\(j\\). If we sum up this
column, we have exactly the number we need: the number of users who
rated both \\(i\\) and \\(j\\). Some might notice that "elementwise logical
AND" is just "elementwise multiplication", thus "sum of elementwise
logical AND" is just "sum of elementwise multiplication", which is:
dot product. That is,
\\({card(S_{j,i}(\chi))}=M_j \cdot M_i\\) if we use \\(M_i\\) and \\(M_j\\) for
columns \\(i\\) and \\(j\\) of \\(M\\).
However, we'd like to compute deviation as a matrix for all \\(i\\) and
\\(j\\), so we'll likewise need \\({card(S_{j,i}(\chi))}\\) for every single
combination of \\(i\\) and \\(j\\) - that is, we need a dot product between
every single pair of columns from \\(M\\). This is incidentally just
matrix multiplication:
{{< rawhtml >}}
<div>
$$C=M^\top M$$
</div>
{{< /rawhtml >}}
since \\(C\_{i,j}=card(S\_{j,i}(\chi))\\) is the dot product of row \\(i\\) of \\(M^\top\\) - which is column
\\(i\\) of \\(M\\) - and column \\(j\\) of \\(M\\).
That was the first half of what we needed for \\(\textrm{dev}_{j,i}\\).
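A quick way to convince yourself of this with NumPy (a standalone check, not part of the post's code):
{{<highlight python>}}
import numpy as np

M = (np.random.rand(6, 4) > 0.5).astype(float)   # a random mask: 6 users, 4 movies
C = M.T @ M
# Entry (1, 3) of C is exactly the dot product of columns 1 and 3 of M,
# i.e. how many users rated both movie 1 and movie 3.
assert C[1, 3] == M[:, 1] @ M[:, 3]
{{< / highlight >}}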
We still need the other half:
{{< rawhtml >}}
<div>
$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$
</div>
{{< /rawhtml >}}
We can apply a similar trick here. Consider first what
\\(\sum\_{u \in S\_{j,i}(\chi)} u\_j\\) means: It is the sum of only those ratings of
movie \\(j\\) that were done by a user who also rated movie \\(i\\).
Likewise, \\(\sum\_{u \in S\_{j,i}(\chi)} u\_i\\) is the sum of only those
ratings of movie \\(i\\) that were done by a user who also rated movie
\\(j\\). (Note the symmetry: it's over the same set of users, because
it's always the users who rated both \\(i\\) and \\(j\\).)
Let's call the utility matrix \\(U\\), and use \\(U\_i\\) and \\(U\_j\\) to refer
to columns \\(i\\) and \\(j\\) of it (just as in \\(M\\)). \\(U\_i\\) has each rating
of movie \\(i\\), but we want only the sum of the ratings done by a user
who also rated movie \\(j\\). Like before, the dot product of \\(U\_i\\) and
\\(M\_j\\) (consider the definition of \\(M\_j\\)) computes this, and so:
{{< rawhtml >}}
<div>
$$\sum_{u \in S_{j,i}(\chi)} u_j = M_i \cdot U_j$$
</div>
{{< /rawhtml >}}
and as with \\(C\\), since we want every pairwise dot product, this summation just
equals element \\((i,j)\\) of \\(M^\top U\\). The other half of the summation,
\\(\sum\_{u \in S_{j,i}(\chi)} u\_i\\), equals \\(M\_j \cdot U\_i\\), which is just
the transpose of this matrix:
{{< rawhtml >}}
<div>
$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i = M^\top U - (M^\top U)^\top = M^\top U - U^\top M$$
</div>
{{< /rawhtml >}}
So, finally, we can compute an entire deviation matrix at once like:
{{< rawhtml >}}
<div>
$$D = \left(M^\top U - (M^\top U)^\top\right) /\ M^\top M$$
</div>
{{< /rawhtml >}}
where \\(/\\) is Hadamard (i.e. elementwise) division, and \\(D\_{j,i} = \textrm{dev}\_{j,i}\\).
By convention and to avoid division by zero, we treat the case where the denominator and numerator are both 0 as just equaling 0. This comes up only where no ratings exist for there to be a deviation - hence the `np.maximum(1, counts)` below.
#### 5.2.2.3. Prediction
Finally, the paper gives the formula to predict how user \\(u\\) will rate movie \\(j\\), and I write this in terms of our matrices:
{{< rawhtml >}}
<div>
$$
P(u)_j = \frac{1}{card(R_j)}\sum_{i\in R_j} \left(\textrm{dev}_{j,i}+u_i\right) = \frac{1}{card(R_j)}\sum_{i\in R_j} \left(D_{j,i} + U_{u,i} \right)
$$
</div>
{{< /rawhtml >}}
where \\(R\_j = \{i | i \in S(u), i \ne j, card(S\_{j,i}(\chi)) > 0\}\\), and \\(S(u)\\) is the set of movies that user \\(u\\) has rated. To unpack the paper's somewhat dense notation, the summation is over every movie \\(i\\) that user \\(u\\) rated and that at least one other user rated, except movie \\(j\\).
We can apply the usual trick yet one more time with a little effort. The summation already goes across a row of \\(U\\) and \\(D\\) (that is, user \\(u\\) is held constant), but covers only certain elements. This is equivalent to a dot product with a mask representing \\(R\_j\\). \\(M\_u\\), row \\(u\\) of the mask, already represents \\(S(u)\\), and \\(R\_j\\) is just \\(S(u)\\) with some more elements removed - which we can mostly represent with \\(M\_u \odot (C\_j > 0)\\) where \\(\odot\\) is elementwise product (i.e. Hadamard), \\(C\_j\\) is column/row \\(j\\) of \\(C\\) (it's symmetric), and where we abuse some notation to say that \\(C\_j > 0\\) is a binary vector. Likewise, \\(D\_j\\) is row \\(j\\) of \\(D\\). The one correction still required is that we subtract \\(u\_j\\) to cover for the \\(i \ne j\\) part of \\(R\_j\\). To abuse some more notation:
{{< rawhtml >}}
<div>
$$P(u)_j = \frac{[M_u \odot (C_j > 0)] \cdot (D_j + U_u) - U_{u,j}}{M_u \cdot (C_j > 0)}$$
</div>
{{< /rawhtml >}}
#### 5.2.2.4. Approximation
The paper also gives a formula that is a suitable approximation for larger data sets:
{{< rawhtml >}}
<div>
$$p^{S1}(u)_j = \bar{u} + \frac{1}{card(R_j)}\sum_{i\in R_j} \textrm{dev}_{j,i}$$
</div>
{{< /rawhtml >}}
where \\(\bar{u}\\) is user \\(u\\)'s average rating. This doesn't change the formula much; we can compute \\(\bar{u}\\) simply as column means of \\(U\\).
## 5.3. Implementation
I left out another detail: the formulas above can't really be implemented exactly as written on this dataset (though they work fine for the much smaller [ml-100k](https://grouplens.org/datasets/movielens/100k/)) because they use entirely too much memory.
While \\(U\\) and \\(M\\) can be sparse matrices, \\(C\\) and \\(D\\) sort of must be dense matrices, and for this particular dataset they are a bit too large to work with in memory in this form. Some judicious optimization, attention to datatypes, use of \\(C\\) and \\(D\\) being symmetric and skew-symmetric respectively, and care to avoid extra copies could probably work around this - but I don't do that here.
However, if we look at the \\(P(u)_j\\) formula above, it refers only to row \\(j\\) of \\(C\\) and \\(D\\), and the formulas for \\(C\\) and \\(D\\) make it easy to compute them by row if needed, or by blocks of rows according to what \\(u\\) and \\(j\\) we need. This is what I do below.
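As a rough sketch of the row-block idea (hypothetical names, not the post's code, which follows), rows \\(J\\) of \\(C\\) and \\(D\\) can be computed from the sparse \\(M\\) and \\(U\\) without ever materializing the full matrices:
{{<highlight python>}}
import numpy as np

def slope_one_rows(U, M, J):
    """Compute rows J of C and D from sparse U and M (users x movies)."""
    C_rows = (M[:, J].T @ M).toarray()                      # rows J of M^T M
    num    = (M[:, J].T @ U - (M.T @ U[:, J]).T).toarray()  # rows J of M^T U - (M^T U)^T
    D_rows = num / np.maximum(1, C_rows)
    return C_rows, D_rows
{{< / highlight >}}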
{{<highlight python>}}
@ -1020,12 +1035,13 @@ To show that it actually gives the same result as above, and that the approximat
{{< rawhtml >}}
<pre class="result">
(4.0875210502743862, 4.0875210502743862)
</pre>
{{< /rawhtml >}}
This computes training error on a small part (1%) of the data, since doing it over the entire thing would be horrendously slow:
@ -1102,13 +1118,14 @@ print("Training error: MAE={:.3f}, RMSE={:.3f}".format(err_mae_train, err_rms_t
print("Testing error: MAE={:.3f}, RMSE={:.3f}".format(err_mae_test, err_rms_test))
{{< / highlight >}}
{{< rawhtml >}}
<pre class="result">
Training error: MAE=0.640, RMSE=0.834
Testing error: MAE=0.657, RMSE=0.856
</pre>
{{< /rawhtml >}}
# 6. "SVD" algorithm
@ -1126,33 +1143,39 @@ References on this model are in a few different places:
## 6.2. Motivation
We again start from the \\(n \times m\\) utility matrix \\(U\\). As \\(m\\) and \\(n\\) tend to be quite large, \\(U\\) has a lot of degrees of freedom. If we want to be able to predict anything at all, we must assume some fairly strict constraints - and one form of this is assuming that we don't *really* have that many degrees of freedom, and that there are actually some much smaller latent factors controlling everything.
One common form of this is assuming that the rank of matrix \\(U\\) - its *actual* dimensionality - is much lower. Let's say its rank is \\(r\\). We could then represent \\(U\\) as the matrix product of smaller matrices, i.e. \\(U=P^\top Q\\) where \\(P\\) is an \\(r \times n\\) matrix and \\(Q\\) is \\(r \times m\\).
If we can find dense matrices \\(P\\) and \\(Q\\) such that \\(P^\top Q\\) equals, or approximately equals, \\(U\\) for the corresponding elements of \\(U\\) that are known, then \\(P^\top Q\\) also gives us predictions for the unknown elements of \\(U\\) - the ratings we don't know, but want to predict. Of course, \\(r\\) must be small enough here to prevent overfitting.
(What we're talking about above is [matrix completion](https://en.wikipedia.org/wiki/Matrix_completion) using low-rank [matrix decomposition/factorization](https://en.wikipedia.org/wiki/Matrix_decomposition). These are both subjects unto themselves. See the [matrix-completion-whirlwind](https://github.com/asberk/matrix-completion-whirlwind/blob/master/matrix_completion_master.ipynb) notebook for a much better explanation on that subject, and an implementation of [altMinSense/altMinComplete](https://arxiv.org/pdf/1212.0467).)
Ordinarily, we'd use something like SVD directly if we wanted to find matrices \\(P\\) and \\(Q\\) (or if we wanted to do any of about 15,000 other things, since SVD is basically magical matrix fairy dust). We can't really do that here due to the fact that large parts of \\(U\\) are unknown, and in some cases because \\(U\\) is just too large. One approach for working around this is the UV-decomposition algorithm that section 9.4 of [MMDS](http://www.mmds.org/) describes.
What we'll do below is a similar approach to UV decomposition that follows a common method: define a model, define an error function we want to minimize, find that error function's gradient with respect to the model's parameters, and then use gradient-descent to minimize that error function by nudging the parameters in the direction that decreases the error, i.e. the negative of their gradient. (More on this later.)
Matrices \\(Q\\) and \\(P\\) have some other neat properties too. Note that \\(Q\\) has \\(m\\) columns, each one \\(r\\)-dimensional - one column per movie. \\(P\\) has \\(n\\) columns, each one \\(r\\)-dimensional - one column per user. In effect, we can look at each column \\(i\\) of \\(Q\\) as the coordinates of movie \\(i\\) in "concept space" or "feature space" - a new \\(r\\)-dimensional space where each axis corresponds to something that seems to explain ratings. Likewise, we can look at each column \\(u\\) of \\(P\\) as how much user \\(u\\) "belongs" to each axis in concept space. "Feature vectors" is a common term to see.
In that sense, \\(P\\) and \\(Q\\) give us a model in which ratings are an interaction between properties of a movie, and a user's preferences. If we're using \\(U=P^\top Q\\) as our model, then every element of \\(U\\) is just the dot product of the feature vectors of the respective movie and user. That is, if \\(p_u\\) is column \\(u\\) of \\(P\\) and \\(q_i\\) is column \\(i\\) of \\(Q\\):
{{< rawhtml >}}
<div>
$$\hat{r}_{ui}=q_i^\top p_u$$
</div>
{{< /rawhtml >}}
However, some things aren't really interactions. Some movies are just (per the ratings) overall better or worse. Some users just tend to rate everything higher or lower. We need some sort of bias built into the model to comprehend this.
Let's call \\(b_i\\) the bias for movie \\(i\\), \\(b_u\\) the bias for user \\(u\\), and \\(\mu\\) the overall average rating. We can just add these into the model:
{{< rawhtml >}}
<div>
$$\hat{r}_{ui}=\mu + b_i + b_u + q_i^\top p_u$$
</div>
{{< /rawhtml >}}
This is the basic model we'll implement, and the same one described in the references at the top.
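In code, the prediction itself is tiny. A hypothetical sketch (the post's actual `SVDModel` class appears further down):
{{<highlight python>}}
# q: r x num_movies, p: r x num_users; b_i and b_u are per-movie / per-user biases.
def predict(mu, b_i, b_u, q, p, i, u):
    return mu + b_i[i] + b_u[u] + q[:, i] @ p[:, u]
{{< / highlight >}}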
@ -1160,28 +1183,32 @@ This is the basic model we'll implement, and the same one described in the refer
More formally, the prediction model is:
{{< rawhtml >}}
<div>
$$\hat{r}_{ui}=\mu + b_i + b_u + q_i^\top p_u$$
</div>
{{< /rawhtml >}}
where:
- \\(u\\) is a user
- \\(i\\) is an item
- \\(\hat{r}_{ui}\\) is user \\(u\\)'s predicted rating for item \\(i\\)
- \\(\mu\\) is the overall average rating
- our model parameters are:
    - \\(b_i\\), a per-item deviation for item \\(i\\)
    - \\(b_u\\), a per-user deviation for user \\(u\\)
    - \\(q_i\\) and \\(p_u\\), feature vectors for item \\(i\\) and user \\(u\\), respectively
The error function that we need to minimize is just sum-of-squared error between predicted and actual rating, plus \\(L\_2\\) regularization to prevent the biases and coordinates in "concept space" from becoming too huge:
$$E=\sum\_{r\_{ui} \in R\_{\textrm{train}}} \left(r\_{ui} - \hat{r}\_{ui}\right)^2 + \lambda\left(b\_i^2+b\_u^2 + \lvert\lvert q\_i\rvert\rvert^2 + \lvert\lvert p\_u\rvert\rvert^2\right)$$
## 6.4. Gradients & Gradient-Descent Updates
This error function is easily differentiable with respect to model parameters \\(b_i\\), \\(b_u\\), \\(q_i\\), and \\(p_u\\), so a normal approach for minimizing it is gradient descent. Finding the gradient with respect to \\(b_i\\) is straightforward:
{{< rawhtml >}}
<div>
$$
\begin{split}
@ -1191,9 +1218,12 @@ $$
\end{split}
$$
</div>
{{< /rawhtml >}}
Gradient with respect to \\(p_u\\) proceeds similarly:
{{< rawhtml >}}
<div>
$$
\begin{split}
@ -1205,9 +1235,12 @@ p_u}q_i^\top p_u \right) + 2 \lambda p_u \\
\end{split}
$$
</div>
{{< /rawhtml >}}
Gradient with respect to \\(b\_u\\) is identical in form to \\(b\_i\\), and gradient with respect to \\(q\_i\\) is identical in form to \\(p\_u\\), except that the variables switch places. The full gradients then have the standard form for gradient descent, i.e. a summation of a gradient term for each individual data point, so they turn easily into update rules for each parameter (which match the ones in the Surprise link) after absorbing the leading 2 into learning rate \\(\gamma\\) and separating out the summation over each data point. That's given below, with \\(e\_{ui}=r\_{ui} - \hat{r}\_{ui}\\):
{{< rawhtml >}}
<div>
$$
\begin{split}
@ -1218,6 +1251,8 @@ $$
\end{split}
$$
</div>
{{< /rawhtml >}}
The code below is a direct implementation of this by simply iteratively applying the above equations for each data point - in other words, stochastic gradient descent.
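As a standalone sketch of what one such per-rating update looks like (hypothetical names; the post's own implementation is what actually gets trained):
{{<highlight python>}}
def sgd_step(r, i, u, mu, b_i, b_u, q, p, gamma, lam):
    # Error for this single observed rating, using the current parameters.
    err = r - (mu + b_i[i] + b_u[u] + q[:, i] @ p[:, u])
    b_i[i] += gamma * (err - lam * b_i[i])
    b_u[u] += gamma * (err - lam * b_u[u])
    q_i_old = q[:, i].copy()
    q[:, i] += gamma * (err * p[:, u] - lam * q[:, i])
    p[:, u] += gamma * (err * q_i_old - lam * p[:, u])
{{< / highlight >}}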
@ -1352,6 +1387,7 @@ svd40 = SVDModel(max_movie, max_user, ml["rating"].mean(), num_factors=num_facto
svd40.train(movies_train, users_train, ratings_train, epoch_callback=at_epoch)
{{< / highlight >}}
{{< rawhtml >}}
<pre class="result">
6982/s 8928/s 10378/s 12877/s 15290/s 11574/s 13230/s
@ -1396,7 +1432,7 @@ svd40.train(movies_train, users_train, ratings_train, epoch_callback=at_epoch)
Epoch 20/20; Training: MAE=0.549 RMSE=0.717, Testing: MAE=0.600 RMSE=0.787
</pre>
{{< /rawhtml >}}
{{<highlight python>}}
@ -1416,6 +1452,7 @@ svd4 = SVDModel(max_movie, max_user, ml["rating"].mean(), 4)
svd4.train(ml_train["movie_id"].values, ml_train["user_id"].values, ml_train["rating"].values, epoch_callback=at_epoch)
{{< / highlight >}}
{{< rawhtml >}}
<pre class="result">
48199/s 33520/s 16937/s 13842/s 13607/s 15574/s 15431/s
@ -1460,7 +1497,7 @@ svd4.train(ml_train["movie_id"].values, ml_train["user_id"].values, ml_train["ra
Epoch 20/20; Training: MAE=0.599 RMSE=0.783, Testing: MAE=0.618 RMSE=0.809
</pre>
{{< /rawhtml >}}
To limit the data, we can use just the top movies (by number of ratings):
@ -1567,7 +1604,6 @@ latent_factor_grid(svd4.q[:2,:])
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
|----|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----
0||||||||||||||||
@ -1586,7 +1622,6 @@ latent_factor_grid(svd4.q[:2,:])
13||||||||Sound of Music; Spy Kids 2: The Island of Lost...|Bring It On; Legally Blonde|Fly Away Home; Parent Trap|Sense and Sensibility; Sex and the City|||||
14|||||||Babe; Babe: Pig in the City||||Twilight|||||
15||||||||||||||||
@ -1604,7 +1639,6 @@ latent_factor_grid(svd4.q[2:,:])
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
|----|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----
0||||||||||||||||
@ -1623,7 +1657,6 @@ latent_factor_grid(svd4.q[2:,:])
13||||||Nightmare on Elm Street 4: The Dream Master; F...|Wes Craven's New Nightmare (Nightmare on Elm S...|Friday the 13th; Exorcist III|Candyman; Texas Chainsaw Massacre 2|Mars Attacks!; Halloween|Evil Dead II (Dead by Dawn); Re-Animator|Night of the Living Dead; Dead Alive (Braindead)||Eraserhead||
14|||||||Nightmare on Elm Street 3: Dream Warriors; Fre...|Hellbound: Hellraiser II|Nightmare on Elm Street|||||||
15|||||||Bride of Chucky (Child's Play 4)||||Texas Chainsaw Massacre|||||
@ -1643,9 +1676,6 @@ bias.iloc[:10]
| movie_id | movie_title | num_ratings | avg_rating | bias
|---------|------------|------------|-----------|-----
318|Shawshank Redemption, The (1994)|63366.0|4.446990|1.015911
@ -1658,7 +1688,6 @@ bias.iloc[:10]
50|Usual Suspects, The (1995)|47006.0|4.334372|0.910651
102217|Bill Hicks: Revelations (1993)|50.0|3.990000|0.900622
527|Schindler's List (1993)|50054.0|4.310175|0.898633
@ -1672,7 +1701,6 @@ bias.iloc[:-10:-1]
| movie_id | movie_title | num_ratings | avg_rating | bias
|---------|------------|------------|-----------|-----
8859|SuperBabies: Baby Geniuses 2 (2004)|209.0|0.837321|-2.377202
@ -1684,7 +1712,6 @@ bias.iloc[:-10:-1]
4775|Glitter (2001)|685.0|1.124088|-2.047287
31698|Son of the Mask (2005)|467.0|1.252677|-2.022763
5739|Faces of Death 6 (1996)|174.0|1.261494|-2.004086
@ -1732,7 +1759,6 @@ pd.DataFrame.from_records(
| | Library | Algorithm | MAE (test) | RMSE (test)
|----|--------|----------|-----------|------------
0||Slope One|0.656514|0.856294
@ -1740,7 +1766,6 @@ pd.DataFrame.from_records(
2|Surprise|Random|1.144775|1.433753
3|Surprise|Slope One|0.704730|0.923331
4|Surprise|SVD|0.694890|0.900350

View File

@ -0,0 +1,35 @@
<!-- Copied from hugo-notepadium in order to:
- Remove dollar signs from inlineMath because it breaks too much
(I can't have dollar signs twice in one paragraph, even escaped
like \$).
-->
{{- if or (eq site.Params.math.enable true) (eq .Params.math true) -}}
{{- $use := "katex" -}}
{{- with site.Params.math -}}
{{- if and (isset . "use") (eq (.use | lower) "mathjax") -}}
{{- $use = "mathjax" -}}
{{- end -}}
{{- end -}}
{{- if eq $use "mathjax" -}}
{{- $url := "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS-MML_HTMLorMML" -}}
{{- $hash := "sha384-e/4/LvThKH1gwzXhdbY2AsjR3rm7LHWyhIG5C0jiRfn8AN2eTN5ILeztWw0H9jmN" -}}
<script defer type="text/javascript" src="{{- $url -}}" integrity="{{- $hash -}}" crossorigin="anonymous"></script>
<script
type="text/x-mathjax-config">MathJax.Hub.Config({ tex2jax: { inlineMath: [/*['$','$'], */['\\(','\\)']] } });</script>
{{- else -}}
{{- $url := "https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.css" -}}
{{- $hash := "sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" -}}
<link rel="stylesheet" href="{{- $url -}}" integrity="{{- $hash -}}" crossorigin="anonymous">
{{- $url := "https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.js" -}}
{{- $hash := "sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" -}}
<script defer src="{{- $url -}}" integrity="{{- $hash -}}" crossorigin="anonymous"></script>
{{- $url := "https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/contrib/auto-render.min.js" -}}
{{- $hash := "sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" -}}
<script defer src="{{- $url -}}" integrity="{{- $hash -}}" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
{{- end -}}
{{- end -}}