More updates with drafts (Slope One, modularity)
parent c7695799e6
commit 0437bb31cd

@@ -11,25 +11,35 @@ Why are old technological ideas that were "ahead of their time", but
which lost out to other ideas, worth studying?

We can see them as raw ideas that "modern" understanding never
refined - misguided fantasies or even just mistakes. The flip side of
this is that we can see them as ideas that are free of a nearly
inescapable modern context and all of the preconceptions and blinders
it carries.

In some of these visionaries is a valuable combination:

- they're detached from this modern context (by mere virtue of it not
  existing yet),
- they have considerable experience, imagination, and foresight,
- they devoted time and effort to work extensively on something and
  to communicate their thoughts, feelings, and analysis in a durable
  way.

To put it another way: They give us analysis done from a context that
is long gone. They help us think beyond our current context. They
help us answer the question, "What if we took a different path then?"

[[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][Epigram #53]] from Alan Perlis offers some relevant skepticism here:
"So many good ideas are never heard from again once they embark in a
voyage on the semantic gulf." My interpretation is that we tend to
idolize ideas, old and new, because they sound somehow different,
innovative, and groundbreaking, but attempts at analysis or practical
realization of the ideas lead to a bleaker reality: perhaps the idea
is completely meaningless (the equivalent of a [[https://en.wiktionary.org/wiki/deepity][deepity]]), wildly
impractical, or a mere facade over what is already established.

* Examples

* Scratch

- Douglas Engelbart is perhaps one of the canonical examples of a person
@ -37,7 +47,4 @@ then?"
|
|||||||
another. Alan Turing is an early example widely regarded for his
|
another. Alan Turing is an early example widely regarded for his
|
||||||
foresight.
|
foresight.
|
||||||
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
|
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
|
||||||
- However, to quote [[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][epigram #53]] from Alan Perlis, "So many good ideas
|
|
||||||
are never heard from again once they embark in a voyage on the
|
|
||||||
semantic gulf."
|
|
||||||
- "Do you remember a time when..." only goes so far.
|
- "Do you remember a time when..." only goes so far.
|
||||||
|
|||||||

@@ -39,8 +39,8 @@ bits... It is not only necessary to make sure your own system is
designed to be made of modular parts. It is also necessary to realize
that your own system, no matter how big and wonderful it seems now,
should always be designed to be a part of another larger system." Les
Hatton in [[http://www.leshatton.org/TAIC2008-29-08-2008.html][The role of empiricism in improving the reliability of future software]]
even did an interesting derivation tying the defect density in
software to how it is broken into pieces.

"Abstraction" doesn't have quite the same consensus. In software, it's

@@ -255,3 +255,7 @@ underneath, and this makes me wonder why it needs explicit support for
- https://www.reddit.com/r/programming/comments/4bjss2/an_11_line_npm_package_called_leftpad_with_only/
- http://www.freecode.com/articles/editorial-the-two-edged-sword
- https://en.wikipedia.org/wiki/Essential_complexity

- GObject framework: an object system that sits outside of any
  particular language (though this is nothing particularly new)
- libgreen

@@ -3,6 +3,8 @@
#+DATE: December 12, 2017
#+TAGS: technobabble

I don't know if there's actually anything to write here.

There is a sort of parallel between the declarative nature of
computational graphs in TensorFlow, and functional programming
(possibly function-level - think of the J language and how important

@@ -29,3 +31,12 @@ abstractions (RDD, DataFrame, Dataset)."

Spark does this with a database. TensorFlow does it with numerical
calculations. Node-RED does it with irregular, asynchronous data.
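
As a tiny illustration of that declarative style, here's a sketch
(mine, assuming the TensorFlow 1.x graph-mode API) in which the first
lines merely /describe/ a computation, and nothing runs until the
graph is explicitly executed:

#+BEGIN_SRC python
import tensorflow as tf  # assumes the TensorFlow 1.x graph-mode API

# Declarative: these lines only build a computation graph.
x = tf.placeholder(tf.float32, shape=[None])
y = tf.reduce_mean(tf.square(x))

# Nothing is computed until the graph is actually run.
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))
#+END_SRC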

- [[https://mxnet.incubator.apache.org/how_to/visualize_graph.html][mxnet: How to visualize Neural Networks as computation graph]]
- [[https://medium.com/intuitionmachine/pytorch-dynamic-computational-graphs-and-modular-deep-learning-7e7f89f18d1][PyTorch, Dynamic Computational Graphs and Modular Deep Learning]]
- [[https://github.com/WarBean/hyperboard][HyperBoard: A web-based dashboard for Deep Learning]]
- [[https://www.postgresql.org/docs/current/static/sql-explain.html][EXPLAIN in PostgreSQL]]
- http://tatiyants.com/postgres-query-plan-visualization/
- https://en.wikipedia.org/wiki/Dataflow_programming
- Pure Data!
- [[https://en.wikipedia.org/wiki/Orange_(software)][Orange]]?

@@ -21,8 +21,8 @@ references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation]]
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:

# TODO:
# Define mAP

@@ -143,10 +143,9 @@ explores). The paper is fairly concise in describing FPNs; it only
takes it around 3 pages to explain their purpose, related work, and
their entire design. The remainder shows experimental results and
specific applications of FPNs. While it shows FPNs implemented on a
particular underlying network (ResNet, mentioned below), they were
purposely made very simple and adaptable to nearly any kind of CNN.
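
To give a sense of that simplicity, here's a rough sketch of the
top-down pathway (my own illustration in Keras, /not/ code from the
paper or from keras-retinanet; =C3=, =C4=, =C5= stand for feature
maps already extracted from some backbone):

#+BEGIN_SRC python
from keras.layers import Add, Conv2D, Input, UpSampling2D

def fpn(C3, C4, C5, channels=256):
    # 1x1 lateral connections bring each backbone stage to a fixed
    # depth (the paper fixes this at 256 channels).
    P5 = Conv2D(channels, 1)(C5)
    P4 = Add()([UpSampling2D()(P5), Conv2D(channels, 1)(C4)])
    P3 = Add()([UpSampling2D()(P4), Conv2D(channels, 1)(C3)])
    # 3x3 convolutions smooth out the upsampling artifacts.
    return [Conv2D(channels, 3, padding='same')(P) for P in (P3, P4, P5)]

# Toy usage, with feature maps of the usual relative sizes:
C3, C4, C5 = Input((64, 64, 128)), Input((32, 32, 256)), Input((16, 16, 512))
P3, P4, P5 = fpn(C3, C4, C5)
#+END_SRC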

To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_(image_processing)][image pyramids]]. The below
diagram illustrates an image pyramid:

@@ -225,6 +224,16 @@ connections.

# Note C=256 and such

# TODO: Link to some good explanations

For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that the paper claims that the underlying network is largely
interchangeable anyway.

[[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
[[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]

* Anchors & Region Proposals

Recall from the last section what was said about feature maps, and that the

@@ -339,3 +348,21 @@ is implemented with bog-standard convolutional networks...
* Inference

# Top N results

* References

# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere

1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id=SJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/

@@ -1,7 +1,13 @@
---
title: Collaborative Filtering with Slope One Predictors
author: Chris Hodapp
date: January 30, 2018
tags: technobabble, machine learning
---

# Needs a brief intro

# Needs a summary at the end

Suppose you have a large number of users, and a large number of
movies. Users have watched movies, and they've provided ratings for

@@ -10,61 +16,178 @@ However, they've all watched different movies, and for any given user,
it's only a tiny fraction of the total movies.

Now, you want to predict how some user will rate some movie they
haven't rated, based on what they (and other users) have rated.

That's a common problem, especially when generalized from 'movies' to
anything else, and one with many approaches. (To put some technical
terms to it, this is the [[https://en.wikipedia.org/wiki/Collaborative_filtering][collaborative filtering]] approach to
[[https://en.wikipedia.org/wiki/Recommender_system][recommender systems]]. [[http://www.mmds.org/][Mining of Massive Datasets]] is an excellent free
text in which to read more in depth on this, particularly chapter 9.)

Slope One Predictors are one such approach to collaborative filtering,
described in the paper [[https://arxiv.org/pdf/cs/0702144v1.pdf][Slope One Predictors for Online Rating-Based
Collaborative Filtering]]. Despite the complex-sounding name, they are
wonderfully simple to understand and implement, and very fast.

I'll give a contrived example below to explain them.

Consider a user Bob. Bob is enthusiastic, but has rather simple
tastes: he mostly just watches Clint Eastwood movies. In fact, he's
watched and rated nearly all of them, and basically nothing else.

Now, suppose we want to predict how much Bob will like something
completely different and unheard of (to him at least), like... I don't
know... /Citizen Kane/.

Here's Slope One in a nutshell (a code sketch of both variants
follows below):

1. First, find the users who rated both /Citizen Kane/ *and* any of
   the Clint Eastwood movies that Bob rated.
2. Now, for each movie that comes up above, compute a *deviation*
   which tells us: On average, how differently (i.e. how much higher
   or lower) did users rate /Citizen Kane/ compared to this movie?
   (For instance, we'll have a number for how /Citizen Kane/ was
   rated compared to /Dirty Harry/, and perhaps it's +0.6 - meaning
   that on average, users who rated both movies rated /Citizen Kane/
   about 0.6 stars above /Dirty Harry/. We'd have another deviation
   for /Citizen Kane/ compared to /Gran Torino/, another for
   /Citizen Kane/ compared to /The Good, the Bad and the Ugly/, and
   so on - for every movie that Bob rated, provided that other users
   who rated /Citizen Kane/ also rated the movie.)
3. If that deviation between /Citizen Kane/ and /Dirty Harry/ was
   +0.6, it's reasonable that adding 0.6 to Bob's rating on /Dirty
   Harry/ would give one prediction of how Bob might rate /Citizen
   Kane/. We can then generate more predictions based on the ratings
   he gave the other movies - anything for which we could compute a
   deviation.
4. To turn this into a single prediction, we could just average all
   those predictions together.

One variant, Weighted Slope One, is nearly identical. The only
difference is in how we average those predictions in step #4. In
Slope One, every deviation counts equally, no matter how many users
had differences in ratings averaged together to produce it. In
Weighted Slope One, deviations that came from larger numbers of users
count for more (because, presumably, they are better estimates).

Or, in other words: If only one person rated both /Citizen Kane/ and
the lesser-known Eastwood classic /Revenge of the Creature/, and they
happened to think that /Revenge of the Creature/ deserved at least 3
more stars, then with Slope One, this deviation of -3 would carry
exactly as much weight as thousands of people rating /Citizen Kane/
as about 0.5 stars below /The Good, the Bad and the Ugly/. In
Weighted Slope One, that latter deviation would count for thousands
of times as much. The example makes it sound a bit more drastic than
it is.
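
Here is that nutshell as a minimal sketch in plain Python (my own
code, not the paper's; it assumes ratings stored as nested dicts,
=ratings[user][movie] = stars=):

#+BEGIN_SRC python
def slope_one_predict(ratings, user, target, weighted=True):
    """Predict `user`'s rating of `target` from nested rating dicts."""
    total, weight = 0.0, 0
    for movie, rating in ratings[user].items():
        # Deviation of `target` with respect to `movie`, averaged
        # over every user who rated both:
        diffs = [r[target] - r[movie]
                 for r in ratings.values()
                 if target in r and movie in r]
        if not diffs:
            continue
        dev = sum(diffs) / len(diffs)
        # One prediction per co-rated movie; Weighted Slope One
        # weights it by how many users were behind the deviation.
        w = len(diffs) if weighted else 1
        total += (rating + dev) * w
        weight += w
    return total / weight if weight else None
#+END_SRC

With a =ratings= dict populated, predicting Bob's rating of /Citizen
Kane/ is then =slope_one_predict(ratings, 'bob', 'Citizen Kane')=.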

The Python library [[http://surpriselib.com/][Surprise]] (a [[https://www.scipy.org/scikits.html][scikit]]) has an implementation of this
algorithm, and the Benchmarks section of that page shows its
performance compared to some other methods.

/TODO/: Show a simple Python implementation of this (Jupyter
notebook?)

* Linear Algebra Tricks

Those who aren't familiar with matrix methods or algebra can probably
skip this section. Everything I've described above, you can compute
given just some data to work with ([[https://grouplens.org/datasets/movielens/100k/][movielens 100k]], perhaps?) and some
basic arithmetic. You don't need any complicated numerical methods.

However, the entire Slope One method can be implemented in a very
fast and simple way with a couple of matrix operations.

First, we need to have our data encoded as a *utility matrix*. In a
utility matrix, each row represents one user, each column represents
one item (a movie, in our case), and each element represents a user's
rating of an item. If we have $n$ users and $m$ movies, then this is
an $n \times m$ matrix $U$ for which $U_{k,i}$ is user $k$'s rating
for movie $i$ - assuming we've numbered our users and our movies.

Users have typically rated only a fraction of movies, and so most of
the elements of this matrix are unknown. We can represent this with
another $n \times m$ matrix (specifically a binary matrix), a 'mask'
$M$ in which $M_{k,i}$ is 1 if user $k$ supplied a rating for movie
$i$, and otherwise 0.
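
As a small illustration, here's my own sketch of both matrices,
assuming NumPy and ratings given as (user, movie, stars) triples with
integer indices:

#+BEGIN_SRC python
import numpy as np

n, m = 4, 3   # toy sizes: 4 users, 3 movies
triples = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0),
           (1, 1, 2.0), (2, 2, 1.0), (3, 1, 4.0)]

U = np.zeros((n, m))   # utility matrix; 0 stands in for "unrated"
M = np.zeros((n, m))   # binary mask: 1 wherever a rating exists
for user, movie, stars in triples:
    U[user, movie] = stars
    M[user, movie] = 1.0
#+END_SRC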

I mentioned *deviation* above and gave an informal definition of it.
The paper gives a formal but rather terse definition of the average
deviation of item $i$ with respect to item $j$:

$$\textrm{dev}_{j,i} = \sum_{u \in S_{j,i}(\chi)} \frac{u_j - u_i}{card(S_{j,i}(\chi))}$$

where:
- $u_j$ and $u_i$ mean: user $u$'s ratings for movies $j$ and $i$,
  respectively
- $u \in S_{j,i}(\chi)$ means: all users $u$ who, in the dataset
  we're training on, provided a rating for both movie $i$ and movie
  $j$
- $card$ is the cardinality of that set, i.e. for
  ${card(S_{j,i}(\chi))}$ it is just how many users rated both $i$
  and $j$.

That denominator does depend on $i$ and $j$, but doesn't depend on
the summation term, so it can be pulled out, and also, we can split
up the summation as long as it is kept over the same terms:

$$\textrm{dev}_{j,i} = \frac{1}{card(S_{j,i}(\chi))} \sum_{u \in S_{j,i}(\chi)} (u_j - u_i) = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i\right)$$

# TODO: These need some actual matrices to illustrate

Let's start with computing ${card(S_{j,i}(\chi))}$, the number of
users who rated both movie $i$ and movie $j$. Consider column $i$ of
the mask $M$. For each value in this column, it equals 1 if the
respective user rated movie $i$, or 0 if they did not. Clearly,
simply summing up column $i$ would tell us how many users rated
movie $i$, and the same applies to column $j$ for movie $j$.

Now, suppose we take the element-wise logical AND of columns $i$ and
$j$. The resultant column has a 1 only where both corresponding
elements were 1 - where a user rated both $i$ and $j$. If we sum up
this column, we have exactly the number we need: the number of users
who rated both $i$ and $j$.

Some might notice that, on 0/1 values, "elementwise logical AND" is
just "elementwise multiplication", thus "sum of elementwise logical
AND" is just "sum of elementwise multiplication", which is: dot
product. That is, ${card(S_{j,i}(\chi))}=M_j \bullet M_i$ if we use
$M_i$ and $M_j$ for columns $i$ and $j$ of $M$.

However, we'd like to compute deviation as a matrix for all $i$ and
$j$, so we'll likewise need ${card(S_{j,i}(\chi))}$ for every single
combination of $i$ and $j$ - that is, we need a dot product between
every single pair of columns from $M$. Incidentally, "dot product of
every pair of columns" happens to be almost exactly matrix
multiplication; note that for matrices $A$ and $B$, element $(x,y)$
of the matrix product $AB$ is just the dot product of /row/ $x$ of
$A$ and /column/ $y$ of $B$ - and that matrix product as a whole has
this dot product between every row of $A$ and every column of $B$.

We wanted the dot product of every column of $M$ with every column
of $M$, which is easy: just transpose $M$ for one operand. Then, we
can compute our count matrix like this:

$$C=M^\top M$$

Thus $C_{i,j}$ is the dot product of column $i$ of $M$ and column
$j$ of $M$ - or, the number of users who rated both movies $i$ and
$j$.
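
Continuing the NumPy sketch from above, that count matrix is a single
line (=@= being NumPy's matrix-multiplication operator):

#+BEGIN_SRC python
C = M.T @ M   # C[i, j]: how many users rated both movie i and movie j
#+END_SRC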

That was the first half of what we needed for $\textrm{dev}_{j,i}$.
We still need the other half:

$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$

We can apply a similar trick here. Consider first what $\sum_{u \in
S_{j,i}(\chi)} u_j$ means: It is the sum of only those ratings of
movie $j$ that were done by a user who also rated movie $i$.
Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those
ratings of movie $i$ that were done by a user who also rated movie
$j$. (Note the symmetry: it's over the same set of users, because
it's always the users who rated both $i$ and $j$.)

# TODO: Finish that section (mostly translate from code notes)

* Implementation

#+BEGIN_SRC python
print("foo")
#+END_SRC
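
That block is still a placeholder. In the meantime, here's my own
sketch of where the derivation leads, using the NumPy =U=, =M=, and
=C= from earlier. Because unrated entries of =U= are stored as 0, the
two masked sums above reduce to the matrix products $U^\top M$ and
$M^\top U$:

#+BEGIN_SRC python
import numpy as np

num = U.T @ M - M.T @ U   # per (j, i): sum of u_j minus sum of u_i
with np.errstate(invalid="ignore"):
    dev = np.where(C > 0, num / C, 0.0)   # dev[j, i]; 0 where undefined

def predict(k, j):
    """Plain Slope One prediction of user k's rating of movie j."""
    # Use every movie i that k rated and that shares raters with j.
    usable = (M[k] > 0) & (C[j] > 0)
    if not usable.any():
        return None
    return float(np.mean(U[k, usable] + dev[j, usable]))
#+END_SRC

For Weighted Slope One, the only change is weighting that final mean
by =C[j, usable]=.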

@@ -16,7 +16,7 @@

<!-- From http://travis.athougies.net/posts/2013-08-13-using-math-on-your-hakyll-blog.html -->
<script type="text/javascript"
  src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

<link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png">