diff --git a/hugo_blag/content/posts/2018-04-08-recommender-systems-1.md b/hugo_blag/content/posts/2018-04-08-recommender-systems-1.md
index b996cf0..db1494c 100644
--- a/hugo_blag/content/posts/2018-04-08-recommender-systems-1.md
+++ b/hugo_blag/content/posts/2018-04-08-recommender-systems-1.md
@@ -936,7 +936,7 @@ matrix multiplication:
$$C=M^\top M$$
-since $C_{i,j}=card(S_{j,i}(\chi))$ is the dot product of row $i$ of $M^T$ - which is column
+since $C\_{i,j}=card(S\_{j,i}(\chi))$ is the dot product of row $i$ of $M^\top$ - which is column
$i$ of $M$ - and column $j$ of $M$.
That was the first half of what we needed for $\textrm{dev}_{j,i}$.
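+
+To make that concrete, here's a minimal NumPy sketch; the array contents are made up purely for illustration, and `M` is the 0/1 mask of which user rated which movie (users as rows, movies as columns), as in the formula above:
+
+```python
+import numpy as np
+
+# Toy mask for illustration: rows = users, columns = movies; 1 where a rating exists.
+M = np.array([
+    [1, 1, 0],
+    [1, 0, 1],
+    [0, 1, 1],
+    [1, 1, 1],
+])
+
+# C[i, j] = number of users who rated both movie i and movie j,
+# i.e. card(S_{j,i}(chi)), for every pair at once via one matrix product.
+C = M.T @ M
+
+# The diagonal simply counts how many users rated each movie.
+print(C)
+```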
@@ -946,19 +946,19 @@ We still need the other half:
$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$
-We can apply a similar trick here. Consider first what $\sum_{u \in
-S_{j,i}(\chi)} u_j$ means: It is the sum of only those ratings of
+We can apply a similar trick here. Consider first what $\sum\_{u \in
+S\_{j,i}(\chi)} u\_j$ means: It is the sum of only those ratings of
movie $j$ that were done by a user who also rated movie $i$.
-Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those
+Likewise, $\sum\_{u \in S\_{j,i}(\chi)} u\_i$ is the sum of only those
ratings of movie $i$ that were done by a user who also rated movie
$j$. (Note the symmetry: it's over the same set of users, because
it's always the users who rated both $i$ and $j$.)
-Let's call the utility matrix $U$, and use $U_i$ and $U_j$ to refer
-to columns $i$ and $j$ of it (just as in $M$). $U_i$ has each rating
+Let's call the utility matrix $U$, and use $U\_i$ and $U\_j$ to refer
+to columns $i$ and $j$ of it (just as in $M$). $U\_i$ has each rating
of movie $i$, but we want only the sum of the ratings done by a user
-who also rated movie $j$. Like before, the dot product of $U_i$ and
-$M_j$ (consider the definition of $M_j$) computes this, and so:
+who also rated movie $j$. Like before, the dot product of $U\_i$ and
+$M\_j$ (consider the definition of $M\_j$) computes this, and so:
$$\sum_{u \in S_{j,i}(\chi)} u_j = M_i \cdot U_j$$
@@ -966,7 +966,7 @@ $$\sum_{u \in S_{j,i}(\chi)} u_j = M_i \cdot U_j$$
and as with $C$, since we want every pairwise dot product, this summation just
equals element $(i,j)$ of $M^\top U$. The other half of the summation,
-$\sum_{u \in S_{j,i}(\chi)} u_i$, equals $M_j \cdot U_i$, which is just
+$\sum\_{u \in S\_{j,i}(\chi)} u\_i$, equals $M\_j \cdot U\_i$, which is just
the transpose of this matrix:
@@ -979,7 +979,7 @@ So, finally, we can compute an entire deviation matrix at once like:
$$D = \left(M^\top U - (M^\top U)^\top\right) /\ M^\top M$$
-where $/$ is Hadamard (i.e. elementwise) division, and $D_{j,i} = \textrm{dev}_{j,i}$.
+where $/$ is Hadamard (i.e. elementwise) division, and $D\_{i,j} = \textrm{dev}\_{j,i}$ (note the index order: $D$, like $\textrm{dev}$, is antisymmetric).
By convention and to avoid division by zero, we treat the case where the denominator and numerator are both 0 as just equaling 0. This comes up only where no ratings exist for there to be a deviation - hence the `np.maximum(1, counts)` below.
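+
+As a rough NumPy sketch of that computation (the toy `U` and the convention that a 0 entry means "not rated" are assumptions for illustration; the real thing is the code referenced below):
+
+```python
+import numpy as np
+
+# Toy utility matrix: rows = users, columns = movies, 0 = "not rated".
+U = np.array([
+    [5., 3., 0.],
+    [4., 0., 2.],
+    [0., 4., 5.],
+    [3., 5., 4.],
+])
+M = (U > 0).astype(float)   # 0/1 mask of which ratings exist
+
+counts = M.T @ M            # C: how many users rated both movie i and movie j
+MtU = M.T @ U               # element (i, j): sum of ratings of movie j by users who also rated movie i
+
+# Deviation matrix: D[i, j] = dev_{j,i}; np.maximum(1, counts) avoids 0/0 where nobody rated both.
+D = (MtU - MtU.T) / np.maximum(1, counts)
+```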
@@ -993,9 +993,9 @@ P(u)_j = \frac{1}{card(R_j)}\sum_{i\in R_j} \left(\textrm{dev}_{j,i}+u_i\right)
$$
-where $R_j = \{i | i \in S(u), i \ne j, card(S_{j,i}(\chi)) > 0\}$, and $S(u)$ is the set of movies that user $u$ has rated. To unpack the paper's somewhat dense notation, the summation is over every movie $i$ that user $u$ rated and that at least one other user rated, except movie $j$.
+where $R\_j = \{i | i \in S(u), i \ne j, card(S\_{j,i}(\chi)) > 0\}$, and $S(u)$ is the set of movies that user $u$ has rated. To unpack the paper's somewhat dense notation, the summation is over every movie $i$ (other than $j$ itself) that user $u$ rated and that at least one user rated together with movie $j$.
-We can apply the usual trick yet one more time with a little effort. The summation already goes across a row of $U$ and $D$ (that is, user $u$ is held constant), but covers only certain elements. This is equivalent to a dot product with a mask representing $R_j$. $M_u$, row $u$ of the mask, already represents $S(u)$, and $R_j$ is just $S(u)$ with some more elements removed - which we can mostly represent with $M_u \odot (C_j > 0)$ where $\odot$ is elementwise product (i.e. Hadamard), $C_j$ is column/row $j$ of $C$ (it's symmetric), and where we abuse some notation to say that $C_j > 0$ is a binary vector. Likewise, $D_j$ is row $j$ of $D$. The one correction still required is that we subtract $u_j$ to cover for the $i \ne j$ part of $R_j$. To abuse some more notation:
+We can apply the usual trick yet one more time with a little effort. The summation already runs across a single row of $U$ (user $u$ is held constant) and a single column of $D$ (movie $j$ is held constant), but covers only certain elements. This is equivalent to a dot product with a mask representing $R\_j$. $M\_u$, row $u$ of the mask, already represents $S(u)$, and $R\_j$ is just $S(u)$ with some more elements removed - which we can mostly represent with $M\_u \odot (C\_j > 0)$, where $\odot$ is the elementwise (i.e. Hadamard) product, $C\_j$ is column/row $j$ of $C$ (it's symmetric), and where we abuse some notation to say that $C\_j > 0$ is a binary vector. Likewise, $D\_j$ is column $j$ of $D$, which holds $\textrm{dev}\_{j,i}$ for every $i$. The one correction still required is that we subtract $u\_j$ to account for the $i \ne j$ part of $R\_j$. To abuse some more notation:
$$P(u)_j = \frac{[M_u \odot (C_j > 0)] \cdot (D_j + U_u) - U_{u,j}}{M_u \cdot (C_j > 0)}$$
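+
+Spelled out as deliberately unoptimized NumPy - continuing the sketch above, with `U`, `M`, `D`, and `counts` as before, and looping over movies for clarity rather than vectorizing:
+
+```python
+import numpy as np
+
+def predict_user(u, U, M, D, counts):
+    """Predict a rating of every movie for user index u, per the formula above (a sketch)."""
+    n_movies = U.shape[1]
+    co_rated = (counts > 0).astype(float)   # the "C_j > 0" binary vectors, one row per j
+    preds = np.zeros(n_movies)
+    for j in range(n_movies):
+        mask = M[u] * co_rated[j]           # R_j as a 0/1 vector over movies i
+        denom = mask.sum()                  # M_u . (C_j > 0)
+        # Subtracting U[u, j] cancels the i == j term (it's already 0 if u hasn't rated j).
+        num = mask @ (D[:, j] + U[u]) - U[u, j]
+        preds[j] = num / denom if denom > 0 else 0.0   # arbitrary fallback when there's nothing to go on
+    return preds
+
+print(predict_user(0, U, M, D, counts))
+```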
@@ -1204,8 +1204,8 @@ where:
- $b_u$, per-user deviation for user $u$
- $q_i$ and $p_u$, feature vectors for item $i$ and user $u$, respectively
-The error function that we need to minimize is just sum-of-squared error between predicted and actual rating, plus $L_2$ regularization to prevent the biases and coordinates in "concept space" from becoming too huge:
-$$E=\sum_{r_{ui} \in R_{\textrm{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2+b_u^2 + \lvert\lvert q_i\rvert\rvert^2 + \lvert\lvert p_u\rvert\rvert^2\right)$$
+The error function that we need to minimize is just sum-of-squared error between predicted and actual rating, plus $L\_2$ regularization to prevent the biases and coordinates in "concept space" from becoming too huge:
+$$E=\sum\_{r\_{ui} \in R\_{\textrm{train}}} \left(r\_{ui} - \hat{r}\_{ui}\right)^2 + \lambda\left(b\_i^2+b\_u^2 + \lvert\lvert q\_i\rvert\rvert^2 + \lvert\lvert p\_u\rvert\rvert^2\right)$$
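+
+As a tiny sketch of that error in plain Python/NumPy - the `(user, item, rating)` triples and every name here are made up for illustration, and the prediction is assumed to take the usual biased form $\mu + b\_u + b\_i + q\_i^\top p\_u$:
+
+```python
+import numpy as np
+
+def regularized_sse(ratings, mu, b_user, b_item, P, Q, lam):
+    """Sum-of-squared error plus the L2 penalty, over (u, i, r) training triples (a sketch)."""
+    E = 0.0
+    for u, i, r in ratings:
+        pred = mu + b_user[u] + b_item[i] + Q[i] @ P[u]   # assumed prediction form
+        err = r - pred
+        E += err ** 2 + lam * (b_item[i] ** 2 + b_user[u] ** 2
+                               + Q[i] @ Q[i] + P[u] @ P[u])
+    return E
+```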
## 6.4. Gradients & Gradient-Descent Updates
@@ -1235,7 +1235,7 @@ p_u}q_i^\top p_u \right) + 2 \lambda p_u \\
$$
-Gradient with respect to $b_u$ is identical form to $b_i$, and gradient with respect to $q_i$ is identical form to $p_u$, except that the variables switch places. The full gradients then have the standard form for gradient descent, i.e. a summation of a gradient term for each individual data point, so they turn easily into update rules for each parameter (which match the ones in the Surprise link) after absorbing the leading 2 into learning rate $\gamma$ and separating out the summation over each data point. That's given below, with $e_{ui}=r_{ui} - \hat{r}_{ui}$:
+The gradient with respect to $b\_u$ has the same form as the one for $b\_i$, and the gradient with respect to $q\_i$ has the same form as the one for $p\_u$, except that the variables switch places. The full gradients then have the standard form for gradient descent - a sum of one gradient term per data point - so they turn easily into update rules for each parameter (which match the ones in the Surprise link) after absorbing the leading 2 into the learning rate $\gamma$ and separating out the summation over data points. That's given below, with $e\_{ui}=r\_{ui} - \hat{r}\_{ui}$:
$$