More updates with drafts (Slope One, modularity)

This commit is contained in:
Chris Hodapp 2018-02-06 17:52:16 -05:00
parent c7695799e6
commit 0437bb31cd
6 changed files with 240 additions and 68 deletions

View File

@ -11,25 +11,35 @@ Why are old technological ideas that were "ahead of their time", but
which lost out to other ideas, worth studying?
We can see them as raw ideas that "modern" understanding never
refined - misguided fantasies or even just mistakes. The flip side of
this is that we can see them as ideas that are free of a nearly
inescapable modern context and all of the preconceptions and blinders
it carries.
In some of these visionaries is a valuable combination:
- they're detached from this modern context (by mere virtue of it not
existing yet),
- they have considerable experience, imagination, and foresight,
- they devoted time and effort to work extensively on something and to
communicate their thoughts, feelings, and analysis in a durable way.
To put it another way: They give us analysis done from a context
that is long gone. They help us think beyond our current context.
They help us answer the question, "What if we had taken a different
path back then?"
[[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][Epigram #53]] from Alan Perlis offers some relevant skepticism here: "So
many good ideas are never heard from again once they embark in a
voyage on the semantic gulf." My interpretation is that we tend to
idolize ideas, old and new, because they sound somehow different,
innovative, and groundbreaking, but attempts to analyze or actually
realize the ideas lead to a bleaker reality: perhaps the idea is
completely meaningless (the equivalent of a [[https://en.wiktionary.org/wiki/deepity][deepity]]), wildly
impractical, or a mere facade over what is already established.
* Examples
* Scratch
- Douglas Engelbart is perhaps one of the canonical examples of a person
@ -37,7 +47,4 @@ then?"
another. Alan Turing is an early example widely regarded for his
foresight.
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
- "Do you remember a time when..." only goes so far.

View File

@ -39,8 +39,8 @@ bits... It is not only necessary to make sure your own system is
designed to be made of modular parts. It is also necessary to realize
that your own system, no matter how big and wonderful it seems now,
should always be designed to be a part of another larger system." Les
Hatton in [[http://www.leshatton.org/TAIC2008-29-08-2008.html][The role of empiricism in improving the reliability of future software]]
even did an interesting derivation tying the defect
density in software to how it is broken into pieces.
"Abstraction" doesn't have quite the same consensus. In software, it's
@ -255,3 +255,7 @@ underneath, and this makes me wonder why it needs explicit support for
- https://www.reddit.com/r/programming/comments/4bjss2/an_11_line_npm_package_called_leftpad_with_only/
- http://www.freecode.com/articles/editorial-the-two-edged-sword
- https://en.wikipedia.org/wiki/Essential_complexity
- GObject framework: an object system that sits outside of any
particular language (though this is nothing particularly new)
- libgreen

View File

@ -3,6 +3,8 @@
#+DATE: December 12, 2017
#+TAGS: technobabble
I don't know if there's actually anything to write here.
There is a sort of parallel between the declarative nature of
computational graphs in TensorFlow, and functional programming
(possibly function-level - think of the J language and how important
@ -29,3 +31,12 @@ abstractions (RDD, DataFrame, Dataset)."
Spark does this with large-scale, distributed data. TensorFlow does
it with numerical calculations. Node-RED does it with irregular,
asynchronous data.
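To make that parallel concrete, here is a minimal sketch of the
declarative style, assuming the TensorFlow 1.x graph-mode API (the
names and numbers are just for illustration): the graph is described
first, like a function definition, and only evaluated later.
#+BEGIN_SRC python
import tensorflow as tf  # assumes the TensorFlow 1.x graph-mode API

# Building the graph is purely declarative: nothing is computed here.
# We only describe how tensors relate, much like composing functions.
x = tf.placeholder(tf.float32, shape=[None], name="x")
y = tf.reduce_mean(tf.square(x))

# Evaluation is a separate step: feed concrete data through the graph.
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))  # ~4.667
#+END_SRC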
- [[https://mxnet.incubator.apache.org/how_to/visualize_graph.html][mxnet: How to visualize Neural Networks as computation graph]]
- [[https://medium.com/intuitionmachine/pytorch-dynamic-computational-graphs-and-modular-deep-learning-7e7f89f18d1][PyTorch, Dynamic Computational Graphs and Modular Deep Learning]]
- [[https://github.com/WarBean/hyperboard][HyperBoard: A web-based dashboard for Deep Learning]]
- [[https://www.postgresql.org/docs/current/static/sql-explain.html][EXPLAIN in PostgreSQL]]
- http://tatiyants.com/postgres-query-plan-visualization/
- https://en.wikipedia.org/wiki/Dataflow_programming
- Pure Data!
- [[https://en.wikipedia.org/wiki/Orange_(software)][Orange]]?

View File

@ -21,8 +21,8 @@ references, and one particular [[https://github.com/fizyr/keras-retinanet][imple
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:
# TODO:
# Define mAP
@ -143,10 +143,9 @@ explores). The paper is fairly concise in describing FPNs; it only
takes it around 3 pages to explain their purpose, related work, and
their entire design. The remainder shows experimental results and
specific applications of FPNs. While it shows FPNs implemented on a
particular underlying network (ResNet, mentioned below), they were
made purposely to be very simple and adaptable to nearly any kind of
CNN.
To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_(image_processing)][image pyramids]]. The below
diagram illustrates an image pyramid:
@ -225,6 +224,16 @@ connections.
# Note C=256 and such
# TODO: Link to some good explanations
For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that the paper claims that the underlying network
- [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
- [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
* Anchors & Region Proposals
Recall from the last section what was said about feature maps, and that the
@ -339,3 +348,21 @@ is implemented with bog-standard convolutional networks...
* Inference
# Top N results
* References
# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere
1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id=SJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/

View File

@ -1,7 +1,13 @@
---
title: Collaborative Filtering with Slope One Predictors
author: Chris Hodapp
date: January 30, 2018
tags: technobabble, machine learning
---
# Needs a brief intro
# Needs a summary at the end
Suppose you have a large number of users, and a large number of
movies. Users have watched movies, and they've provided ratings for
@ -10,61 +16,178 @@ However, they've all watched different movies, and for any given user,
it's only a tiny fraction of the total movies.
Now, you want to predict how some user will rate some movie they
haven't rated, based on what they (and other users) have rated.
That's a common problem, especially when generalized from 'movies' to
anything else, and one with many approaches. (To put some technical
terms to it, this is the [[https://en.wikipedia.org/wiki/Collaborative_filtering][collaborative filtering]] approach to
[[https://en.wikipedia.org/wiki/Recommender_system][recommender systems]]. [[http://www.mmds.org/][Mining of Massive Datasets]] is an excellent free
text in which to read more in depth on this, particularly chapter 9.)
Slope One Predictors are one such approach to collaborative filtering,
described in the paper [[https://arxiv.org/pdf/cs/0702144v1.pdf][Slope One Predictors for Online Rating-Based
Collaborative Filtering]]. Despite the complex-sounding name, they are
wonderfully simple to understand and implement, and very fast.
I'll give a contrived example below to explain them.
Consider a user Bob. Bob is enthusiastic, but has rather simple
tastes: he mostly just watches Clint Eastwood movies. In fact, he's
watched and rated nearly all of them, and basically nothing else.
Now, suppose we want to predict how much Bob will like something
completely different and unheard of (to him at least), like... I don't
know... /Citizen Kane/.
Here's Slope One in a nutshell:
1. First, find the users who rated both /Citizen Kane/ *and* any of
the Clint Eastwood movies that Bob rated.
2. Now, for each movie that comes up above, compute a *deviation*
which tells us: On average, how differently (i.e. how much higher
or lower) did users rate /Citizen Kane/ compared to this movie? (For
instance, we'll have a number for how /Citizen Kane/ was rated
compared to /Dirty Harry/, and perhaps it's +0.6 - meaning that on
average, users who rated both movies rated /Citizen Kane/ about 0.6
stars above /Dirty Harry/. We'd have another deviation for
/Citizen Kane/ compared to /Gran Torino/, another for /Citizen
Kane/ compared to /The Good, the Bad and the Ugly/, and so on - for
every movie that Bob rated, provided that other users who rated
/Citizen Kane/ also rated the movie.)
3. If that deviation between /Citizen Kane/ and /Dirty Harry/ was
   +0.6, it's reasonable that adding 0.6 to Bob's rating of /Dirty
   Harry/ would give one prediction of how Bob might rate /Citizen
   Kane/. We can then generate more predictions based on the ratings
   he gave the other movies - anything for which we could compute a
   deviation.
4. To turn these into a single prediction, we could just average all
   those predictions together.
One variant, Weighted Slope One, is nearly identical. The only
difference is in how we average those predictions in step #4. In
Slope One, every deviation counts equally, no matter how many users
had differences in ratings averaged together to produce it. In
Weighted Slope One, deviations that came from larger numbers of users
count for more (because, presumably, they are better estimates).
Or, in other words: If only one person rated both /Citizen Kane/ and
the lesser-known Eastwood classic /Revenge of the Creature/, and they
happened to think that /Revenge of the Creature/ deserved at least 3
more stars, then with Slope One, this deviation of -3 would carry
exactly as much weight as thousands of people rating /Citizen Kane/ as
about 0.5 stars below /The Good, the Bad and the Ugly/. In Weighted
Slope One, that latter deviation would count for thousands of times as
much. The example makes it sound a bit more drastic than it is.
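To put a formula to that (using notation defined formally in the next
section: $\textrm{dev}_{j,i}$ is the average deviation of movie $j$
with respect to movie $i$, $c_{j,i}$ is the number of users who rated
both, and $S(u)$ is the set of movies user $u$ rated), the Weighted
Slope One prediction of user $u$'s rating of movie $j$ comes out as
something like:
$$P(u)_j = \frac{\sum_{i \in S(u) - \{j\}} \left(\textrm{dev}_{j,i} + u_i\right) c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}}$$
Plain Slope One is the same formula with every $c_{j,i}$ replaced
by 1.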
The Python library [[http://surpriselib.com/][Surprise]] (a [[https://www.scipy.org/scikits.html][scikit]]) has an implementation of this
algorithm, and the Benchmarks section of that page shows its
performance compared to some other methods.
/TODO/: Show a simple Python implementation of this (Jupyter
notebook?)
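In the meantime, here is a minimal sketch in plain Python
(dictionaries and loops, no libraries). The data layout and names are
my own for illustration, not anything from the paper:
#+BEGIN_SRC python
from collections import defaultdict

def slope_one_predict(ratings, user, target, weighted=True):
    """Predict `user`'s rating of movie `target`.

    `ratings` maps each user to a dict of {movie: rating}.  A sketch
    of (Weighted) Slope One; returns None if no prediction is
    possible.
    """
    # dev[i]: running total of (rating(target) - rating(i)) over
    # users who rated both; count[i]: how many such users.
    dev = defaultdict(float)
    count = defaultdict(int)
    for other_user, their_ratings in ratings.items():
        if target not in their_ratings:
            continue
        for movie, r in their_ratings.items():
            if movie != target and movie in ratings[user]:
                dev[movie] += their_ratings[target] - r
                count[movie] += 1

    # One prediction per rated movie, then a (possibly weighted) average.
    num, den = 0.0, 0.0
    for movie, c in count.items():
        deviation = dev[movie] / c
        weight = c if weighted else 1
        num += (ratings[user][movie] + deviation) * weight
        den += weight
    return num / den if den else None

# Tiny example: predict Bob's rating of "Citizen Kane".
ratings = {
    "alice": {"Dirty Harry": 4, "Citizen Kane": 5},
    "carol": {"Gran Torino": 3, "Citizen Kane": 4},
    "bob":   {"Dirty Harry": 5, "Gran Torino": 4},
}
print(slope_one_predict(ratings, "bob", "Citizen Kane"))  # 5.5
#+END_SRC
With ~weighted=False~ this degrades to plain Slope One: every
deviation gets an equal say.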
* Linear Algebra Tricks
Those who aren't familiar with matrix methods or algebra can probably
skip this section. Everything I've described above, you can compute
given just some data to work with ([[https://grouplens.org/datasets/movielens/100k/][movielens 100k]], perhaps?) and some
basic arithmetic. You don't need any complicated numerical methods.
However, the entire Slope One method can be implemented in a very fast
and simple way with a couple matrix operations.
First, we need to have our data encoded as a *utility matrix*. In a
utility matrix, each row represents one user, each column represents
one item (a movie, in our case), and each element represents a user's
rating of an item. If we have $n$ users and $m$ movies, then this is
an $n \times m$ matrix $U$ for which $U_{k,i}$ is user $k$'s rating
for movie $i$ - assuming we've numbered our users and our movies.
Users have typically rated only a fraction of movies, and so most of
the elements of this matrix are unknown. We can represent this with
another $n \times m$ matrix (specifically a binary matrix), a 'mask'
$M$ in which $M_{k,i}$ is 1 if user $k$ supplied a rating for movie
$i$, and otherwise 0.
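For a concrete (hypothetical) example, here's a tiny 3-user, 3-movie
dataset in NumPy, using 0 to stand in for "no rating" - a simplifying
assumption that only works if 0 is not a legal rating:
#+BEGIN_SRC python
import numpy as np

# Rows are users, columns are movies; 0 marks a missing rating.
U = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 2, 4]], dtype=float)

M = (U != 0).astype(float)  # the mask: 1 where a rating exists
#+END_SRC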
I mentioned *deviation* above and gave an informal definition of it.
The paper gives a formal but rather terse definition below of the
average deviation of item $j$ with respect to item $i$:
$$\textrm{dev}_{j,i} = \sum_{u \in S_{j,i}(\chi)} \frac{u_j - u_i}{card(S_{j,i}(\chi))}$$
where:
- $u_j$ and $u_i$ mean: user $u$'s ratings for movies $j$ and $i$, respectively
- $u \in S_{j,i}(\chi)$ means: all users $u$ who, in the dataset we're
training on, provided a rating for both movie $i$ and movie $j$
- $card$ is the cardinality of that set, i.e. for
${card(S_{j,i}(\chi))}$ it is just how many users rated both $i$ and
$j$.
That denominator does depend on $i$ and $j$, but it doesn't depend on
the summation variable $u$, so it can be pulled out of the sum; we can
also split up the summation, as long as both sums are kept over the
same set of users:
$$\textrm{dev}_{j,i} = \frac{1}{card(S_{j,i}(\chi))} \sum_{u \in
S_{j,i}(\chi)} u_j - u_i = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u
\in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i\right)$$
# TODO: These need some actual matrices to illustrate
Let's start with computing ${card(S_{j,i}(\chi))}$, the number of
users who rated both movie $i$ and movie $j$. Consider column $i$ of
the mask $M$. For each value in this column, it equals 1 if the
respective user rated movie $i$, or 0 if they did not. Clearly,
simply summing up column $i$ would tell us how many users rated movie
$i$, and the same applies to column $j$ for movie $j$.
Now, suppose we take element-wise logical AND of columns $i$ and $j$.
The resultant column has a 1 only where both corresponding elements
were 1 - where a user rated both $i$ and $j$. If we sum up this
column, we have exactly the number we need: the number of users who
rated both $i$ and $j$.
Some might notice that, on 0/1 values, "elementwise logical AND" is
just "elementwise multiplication", thus "sum of elementwise logical
AND" is just "sum of elementwise multiplication", which is: dot
product. That is, ${card(S_{j,i}(\chi))}=M_j \bullet M_i$ if we use
$M_i$ and $M_j$ for columns $i$ and $j$ of $M$.
However, we'd like to compute deviation as a matrix for all $i$ and
$j$, so we'll likewise need ${card(S_{j,i}(\chi))}$ for every single
combination of $i$ and $j$ - that is, we need a dot product between
every single pair of columns from $M$. Incidentally, "dot product of
every pair of columns" happens to be almost exactly matrix
multiplication; note that for matrices $A$ and $B$, element $(x,y)$ of
the matrix product $AB$ is just the dot product of /row/ $x$ of $A$
and /column/ $y$ of $B$ - and that matrix product as a whole has this
dot product between every row of $A$ and every column of $B$.
We wanted the dot product of every column of $M$ with every column of
$M$, which is easy: just transpose $M$ for one operand. Then, we can
compute our count matrix like this:
$$C=M^\top M$$
Thus $C_{i,j}$ is the dot product of column $i$ of $M$ and column $j$
of $M$ - or, the number of users who rated both movies $i$ and $j$.
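To give this some actual matrices (per the TODO above), here's the
count-matrix trick checked numerically on the small hypothetical mask
from earlier:
#+BEGIN_SRC python
import numpy as np

M = np.array([[1, 1, 0],   # the mask from the earlier example
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

C = M.T @ M  # C[i, j] = number of users who rated both movies i and j
print(C)
# [[2. 1. 1.]
#  [1. 2. 1.]
#  [1. 1. 2.]]
#+END_SRC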
That was the first half of what we needed for $\textrm{dev}_{j,i}$.
We still need the other half:
$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$
We can apply a similar trick here. Consider first what $\sum_{u \in
S_{j,i}(\chi)} u_j$ means: It is the sum of only those ratings of
movie $j$ that were done by a user who also rated movie $i$.
Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those
ratings of movie $i$ that were done by a user who also rated movie
$j$. (Note the symmetry: it's over the same set of users, because
it's always the users who rated both $i$ and $j$.)
# TODO: Finish that section (mostly translate from code notes)
* Implementation
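The section above still needs finishing, but here is a sketch of
where it lands, under the same assumptions as before (ratings live in
a zero-filled utility matrix, where 0 means "unrated"; the code is my
own illustration, not code from the paper). The remaining piece of the
derivation is the same trick as the count matrix: with $U$
zero-filled, $(U^\top M)_{j,i}$ is exactly $\sum_{u \in S_{j,i}(\chi)}
u_j$, so the whole deviation matrix is $D = (U^\top M - M^\top U) / C$,
taken elementwise wherever $C > 0$.
#+BEGIN_SRC python
import numpy as np

def weighted_slope_one(U):
    """Weighted Slope One from a utility matrix U, as a dense sketch.

    U[u, i] is user u's rating of movie i, with 0 meaning 'unrated'
    (a simplifying assumption).  Returns predict(u, j), an estimate
    of user u's rating of movie j, or None if no estimate is possible.
    """
    M = (U != 0).astype(float)  # mask of known ratings
    C = M.T @ M                 # C[j, i]: users who rated both j and i
    sums = U.T @ M              # sums[j, i]: sum of ratings of j by those users
    safe_C = np.where(C > 0, C, 1.0)
    D = np.where(C > 0, (sums - sums.T) / safe_C, 0.0)  # D[j, i] = dev(j, i)

    def predict(u, j):
        weights = M[u] * C[j]   # only movies u rated, weighted by co-rating counts
        weights[j] = 0.0        # never use the target movie itself
        total = weights.sum()
        if total == 0:
            return None
        return float(((D[j] + U[u]) * weights).sum() / total)

    return predict

# Smoke test on the tiny hypothetical matrix from before:
U = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 2, 4]], dtype=float)
predict = weighted_slope_one(U)
print(predict(0, 2))  # user 0's predicted rating of movie 2 -> 3.5
#+END_SRC
Plain Slope One would instead weight every usable movie equally,
e.g. ~weights = M[u] * (C[j] > 0)~.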

View File

@ -16,7 +16,7 @@
<!-- From http://travis.athougies.net/posts/2013-08-13-using-math-on-your-hakyll-blog.html -->
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png">