More updates with drafts (Slope One, modularity)

This commit is contained in:
Chris Hodapp 2018-02-06 17:52:16 -05:00
parent c7695799e6
commit 0437bb31cd
6 changed files with 240 additions and 68 deletions

View File

@ -11,25 +11,35 @@ Why are old technological ideas that were "ahead of their time", but
which lost out to other ideas, worth studying?
We can see them as raw ideas that "modern" understanding never
refined - misguided fantasies or even just mistakes. The flip side of
this is that we can see them as ideas that are free of a nearly
inescapable modern context and all of the preconceptions and blinders
it carries.
In some of these visionaries is a valuable combination:
- they're detached from this modern context (by mere virtue of it not
existing yet),
- they have considerable experience, imagination, and foresight,
- they devoted time and effort to work extensively on something and to
communicate their thoughts, feelings, and analysis in a durable way.
To put it another way: They give us analysis done from a context
that is long gone. They help us think beyond our current context.
They help us answer the question, "What if we had taken a different
path back then?"
[[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][Epigram #53]] from Alan Perlis offers some relevant skepticism here: "So
many good ideas are never heard from again once they embark in a
voyage on the semantic gulf." My interpretation is that we tend to
idolize ideas, old and new, because they sound somehow different,
innovative, and groundbreaking, but attempts to analyze or actually
realize the ideas lead to a bleaker reality: perhaps the idea is
completely meaningless (the equivalent of a [[https://en.wiktionary.org/wiki/deepity][deepity]]), wildly
impractical, or a mere facade over what is already established.
* Examples
* Scratch
- Douglas Engelbart is perhaps one of the canonical examples of a person
@ -37,7 +47,4 @@ then?"
another. Alan Turing is an early example widely regarded for his
foresight.
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
- "Do you remember a time when..." only goes so far.

View File

@ -39,8 +39,8 @@ bits... It is not only necessary to make sure your own system is
designed to be made of modular parts. It is also necessary to realize
that your own system, no matter how big and wonderful it seems now,
should always be designed to be a part of another larger system." Les
Hatton in [[http://www.leshatton.org/TAIC2008-29-08-2008.html][The role of empiricism in improving the reliability of future software]]
even did an interesting derivation tying the defect
density in software to how it is broken into pieces.
"Abstraction" doesn't have quite the same consensus. In software, it's
@ -255,3 +255,7 @@ underneath, and this makes me wonder why it needs explicit support for
- https://www.reddit.com/r/programming/comments/4bjss2/an_11_line_npm_package_called_leftpad_with_only/
- http://www.freecode.com/articles/editorial-the-two-edged-sword
- https://en.wikipedia.org/wiki/Essential_complexity
- GObject framework: an object system that sits outside of any
particular language (though this is nothing particularly new)
- libgreen

View File

@ -3,6 +3,8 @@
#+DATE: December 12, 2017
#+TAGS: technobabble
I don't know if there's actually anything to write here.
There is a sort of parallel between the declarative nature of
computational graphs in TensorFlow, and functional programming
(possibly function-level - think of the J language and how important
@ -29,3 +31,12 @@ abstractions (RDD, DataFrame, Dataset)."
Spark does this with large-scale, distributed data. TensorFlow does
it with numerical calculations. Node-RED does it with irregular,
asynchronous data.
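To make that parallel concrete, here is a minimal sketch of the
declarative style, assuming the TensorFlow 1.x graph-mode API (the
names and numbers are just for illustration): the graph is described
first, like a function definition, and only evaluated later.
#+BEGIN_SRC python
import tensorflow as tf  # assumes the TensorFlow 1.x graph-mode API

# Building the graph is purely declarative: nothing is computed here.
# We only describe how tensors relate, much like composing functions.
x = tf.placeholder(tf.float32, shape=[None], name="x")
y = tf.reduce_mean(tf.square(x))

# Evaluation is a separate step: feed concrete data through the graph.
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))  # ~4.667
#+END_SRC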
- [[https://mxnet.incubator.apache.org/how_to/visualize_graph.html][mxnet: How to visualize Neural Networks as computation graph]]
- [[https://medium.com/intuitionmachine/pytorch-dynamic-computational-graphs-and-modular-deep-learning-7e7f89f18d1][PyTorch, Dynamic Computational Graphs and Modular Deep Learning]]
- [[https://github.com/WarBean/hyperboard][HyperBoard: A web-based dashboard for Deep Learning]]
- [[https://www.postgresql.org/docs/current/static/sql-explain.html][EXPLAIN in PostgreSQL]]
- http://tatiyants.com/postgres-query-plan-visualization/
- https://en.wikipedia.org/wiki/Dataflow_programming
- Pure Data!
- [[https://en.wikipedia.org/wiki/Orange_(software)][Orange]]?

View File

@ -21,8 +21,8 @@ references, and one particular [[https://github.com/fizyr/keras-retinanet][imple
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:
# TODO:
# Define mAP
@ -143,10 +143,9 @@ explores). The paper is fairly concise in describing FPNs; it only
takes it around 3 pages to explain their purpose, related work, and
their entire design. The remainder shows experimental results and
specific applications of FPNs. While it shows FPNs implemented on a
particular underlying network (ResNet, mentioned below), they were
made purposely to be very simple and adaptable to nearly any kind of
CNN.
To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_(image_processing)][image pyramids]]. The below
diagram illustrates an image pyramid:
@ -225,6 +224,16 @@ connections.
# Note C=256 and such
# TODO: Link to some good explanations
For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that the paper claims that the underlying network
- [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
- [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
* Anchors & Region Proposals
Recall from the last section what was said about feature maps, and that the
@ -339,3 +348,21 @@ is implemented with bog-standard convolutional networks...
* Inference
# Top N results
* References
# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere
1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id=SJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/

View File

@ -1,7 +1,13 @@
---
title: Collaborative Filtering with Slope One Predictors
author: Chris Hodapp
date: January 30, 2018
tags: technobabble, machine learning
---
# Needs a brief intro
# Needs a summary at the end
Suppose you have a large number of users, and a large number of
movies. Users have watched movies, and they've provided ratings for
@ -10,61 +16,178 @@ However, they've all watched different movies, and for any given user,
it's only a tiny fraction of the total movies.
Now, you want to predict how some user will rate some movie they
haven't rated, based on what they (and other users) have rated.
That's a common problem, especially when generalized from 'movies' to
anything else, and one with many approaches. (To put some technical
terms to it, this is the [[https://en.wikipedia.org/wiki/Collaborative_filtering][collaborative filtering]] approach to
[[https://en.wikipedia.org/wiki/Recommender_system][recommender systems]]. [[http://www.mmds.org/][Mining of Massive Datasets]] is an excellent free
text in which to read more in depth on this, particularly chapter 9.)
Slope One Predictors are one such approach to collaborative filtering,
described in the paper [[https://arxiv.org/pdf/cs/0702144v1.pdf][Slope One Predictors for Online Rating-Based
Collaborative Filtering]]. Despite the complex-sounding name, they are
wonderfully simple to understand and implement, and very fast.
I'll give a contrived example below to explain them.
Consider a user Bob. Bob is enthusiastic, but has rather simple
tastes: he mostly just watches Clint Eastwood movies. In fact, he's
watched and rated nearly all of them, and basically nothing else.
Now, suppose we want to predict how much Bob will like something
completely different and unheard of (to him at least), like... I don't
know... /Citizen Kane/.
Here's Slope One in a nutshell:
1. First, find the users who rated both /Citizen Kane/ *and* any of
the Clint Eastwood movies that Bob rated.
2. Now, for each movie that comes up above, compute a *deviation*
which tells us: On average, how differently (i.e. how much higher
or lower) did users rate /Citizen Kane/ compared to this movie? (For
instance, we'll have a number for how /Citizen Kane/ was rated
compared to /Dirty Harry/, and perhaps it's +0.6 - meaning that on
average, users who rated both movies rated /Citizen Kane/ about 0.6
stars above /Dirty Harry/. We'd have another deviation for
/Citizen Kane/ compared to /Gran Torino/, another for /Citizen
Kane/ compared to /The Good, the Bad and the Ugly/, and so on - for
every movie that Bob rated, provided that other users who rated
/Citizen Kane/ also rated the movie.)
3. If that deviation between /Citizen Kane/ and /Dirty Harry/ was
   +0.6, it's reasonable that adding 0.6 to Bob's rating of /Dirty
   Harry/ would give one prediction of how Bob might rate /Citizen
   Kane/. We can then generate more predictions based on the ratings
   he gave the other movies - anything for which we could compute a
   deviation.
4. To turn these into a single prediction, we could just average all
   those predictions together.
One variant, Weighted Slope One, is nearly identical. The only
difference is in how we average those predictions in step #4. In
Slope One, every deviation counts equally, no matter how many users
had differences in ratings averaged together to produce it. In
Weighted Slope One, deviations that came from larger numbers of users
count for more (because, presumably, they are better estimates).
Or, in other words: If only one person rated both /Citizen Kane/ and
the lesser-known Eastwood classic /Revenge of the Creature/, and they
happened to think that /Revenge of the Creature/ deserved at least 3
more stars, then with Slope One, this deviation of -3 would carry
exactly as much weight as thousands of people rating /Citizen Kane/ as
about 0.5 stars below /The Good, the Bad and the Ugly/. In Weighted
Slope One, that latter deviation would count for thousands of times as
much. The example makes it sound a bit more drastic than it is.
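To put a formula to that (using notation defined formally in the next
section: $\textrm{dev}_{j,i}$ is the average deviation of movie $j$
with respect to movie $i$, $c_{j,i}$ is the number of users who rated
both, and $S(u)$ is the set of movies user $u$ rated), the Weighted
Slope One prediction of user $u$'s rating of movie $j$ comes out as
something like:
$$P(u)_j = \frac{\sum_{i \in S(u) - \{j\}} \left(\textrm{dev}_{j,i} + u_i\right) c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}}$$
Plain Slope One is the same formula with every $c_{j,i}$ replaced
by 1.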
The Python library [[http://surpriselib.com/][Surprise]] (a [[https://www.scipy.org/scikits.html][scikit]]) has an implementation of this
algorithm, and the Benchmarks section of that page shows its
performance compared to some other methods.
/TODO/: Show a simple Python implementation of this (Jupyter
notebook?)
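In the meantime, here is a minimal sketch in plain Python
(dictionaries and loops, no libraries). The data layout and names are
my own for illustration, not anything from the paper:
#+BEGIN_SRC python
from collections import defaultdict

def slope_one_predict(ratings, user, target, weighted=True):
    """Predict `user`'s rating of movie `target`.

    `ratings` maps each user to a dict of {movie: rating}.  A sketch
    of (Weighted) Slope One; returns None if no prediction is
    possible.
    """
    # dev[i]: running total of (rating(target) - rating(i)) over
    # users who rated both; count[i]: how many such users.
    dev = defaultdict(float)
    count = defaultdict(int)
    for other_user, their_ratings in ratings.items():
        if target not in their_ratings:
            continue
        for movie, r in their_ratings.items():
            if movie != target and movie in ratings[user]:
                dev[movie] += their_ratings[target] - r
                count[movie] += 1

    # One prediction per rated movie, then a (possibly weighted) average.
    num, den = 0.0, 0.0
    for movie, c in count.items():
        deviation = dev[movie] / c
        weight = c if weighted else 1
        num += (ratings[user][movie] + deviation) * weight
        den += weight
    return num / den if den else None

# Tiny example: predict Bob's rating of "Citizen Kane".
ratings = {
    "alice": {"Dirty Harry": 4, "Citizen Kane": 5},
    "carol": {"Gran Torino": 3, "Citizen Kane": 4},
    "bob":   {"Dirty Harry": 5, "Gran Torino": 4},
}
print(slope_one_predict(ratings, "bob", "Citizen Kane"))  # 5.5
#+END_SRC
With ~weighted=False~ this degrades to plain Slope One: every
deviation gets an equal say.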
* Linear Algebra Tricks
Those who aren't familiar with matrix methods or algebra can probably
skip this section. Everything I've described above, you can compute
given just some data to work with ([[https://grouplens.org/datasets/movielens/100k/][movielens 100k]], perhaps?) and some
basic arithmetic. You don't need any complicated numerical methods.
However, the entire Slope One method can be implemented in a very fast
and simple way with a couple matrix operations.
First, we need to have our data encoded as a *utility matrix*. In a
utility matrix, each row represents one user, each column represents
one item (a movie, in our case), and each element represents a user's
rating of an item. If we have $n$ users and $m$ movies, then this is
an $n \times m$ matrix $U$ for which $U_{k,i}$ is user $k$'s rating
for movie $i$ - assuming we've numbered our users and our movies.
Users have typically rated only a fraction of movies, and so most of
the elements of this matrix are unknown. We can represent this with
another $n \times m$ matrix (specifically a binary matrix), a 'mask'
$M$ in which $M_{k,i}$ is 1 if user $k$ supplied a rating for movie
$i$, and otherwise 0.
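For a concrete (hypothetical) example, here's a tiny 3-user, 3-movie
dataset in NumPy, using 0 to stand in for "no rating" - a simplifying
assumption that only works if 0 is not a legal rating:
#+BEGIN_SRC python
import numpy as np

# Rows are users, columns are movies; 0 marks a missing rating.
U = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 2, 4]], dtype=float)

M = (U != 0).astype(float)  # the mask: 1 where a rating exists
#+END_SRC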
I mentioned *deviation* above and gave an informal definition of it.
The paper gives a formal but rather terse definition below of the
average deviation of item $j$ with respect to item $i$:
$$\textrm{dev}_{j,i} = \sum_{u \in S_{j,i}(\chi)} \frac{u_j - u_i}{card(S_{j,i}(\chi))}$$
where:
- $u_j$ and $u_i$ mean: user $u$'s ratings for movies $j$ and $i$, respectively
- $u \in S_{j,i}(\chi)$ means: all users $u$ who, in the dataset we're
training on, provided a rating for both movie $i$ and movie $j$
- $card$ is the cardinality of that set, i.e. for
${card(S_{j,i}(\chi))}$ it is just how many users rated both $i$ and
$j$.
That denominator does depend on $i$ and $j$, but it doesn't depend on
the summation variable $u$, so it can be pulled out of the sum; we can
also split up the summation, as long as both sums are kept over the
same set of users:
$$\textrm{dev}_{j,i} = \frac{1}{card(S_{j,i}(\chi))} \sum_{u \in
S_{j,i}(\chi)} u_j - u_i = \frac{1}{card(S_{j,i}(\chi))}\left(\sum_{u
\in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i\right)$$
# TODO: These need some actual matrices to illustrate
Let's start with computing ${card(S_{j,i}(\chi))}$, the number of
users who rated both movie $i$ and movie $j$. Consider column $i$ of
the mask $M$. For each value in this column, it equals 1 if the
respective user rated movie $i$, or 0 if they did not. Clearly,
simply summing up column $i$ would tell us how many users rated movie
$i$, and the same applies to column $j$ for movie $j$.
Now, suppose we take element-wise logical AND of columns $i$ and $j$.
The resultant column has a 1 only where both corresponding elements
were 1 - where a user rated both $i$ and $j$. If we sum up this
column, we have exactly the number we need: the number of users who
rated both $i$ and $j$.
Some might notice that, on 0/1 values, "elementwise logical AND" is
just "elementwise multiplication", thus "sum of elementwise logical
AND" is just "sum of elementwise multiplication", which is: dot
product. That is, ${card(S_{j,i}(\chi))}=M_j \bullet M_i$ if we use
$M_i$ and $M_j$ for columns $i$ and $j$ of $M$.
However, we'd like to compute deviation as a matrix for all $i$ and
$j$, so we'll likewise need ${card(S_{j,i}(\chi))}$ for every single
combination of $i$ and $j$ - that is, we need a dot product between
every single pair of columns from $M$. Incidentally, "dot product of
every pair of columns" happens to be almost exactly matrix
multiplication; note that for matrices $A$ and $B$, element $(x,y)$ of
the matrix product $AB$ is just the dot product of /row/ $x$ of $A$
and /column/ $y$ of $B$ - and that matrix product as a whole has this
dot product between every row of $A$ and every column of $B$.
We wanted the dot product of every column of $M$ with every column of
$M$, which is easy: just transpose $M$ for one operand. Then, we can
compute our count matrix like this:
$$C=M^\top M$$
Thus $C_{i,j}$ is the dot product of column $i$ of $M$ and column $j$
of $M$ - or, the number of users who rated both movies $i$ and $j$.
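To give this some actual matrices (per the TODO above), here's the
count-matrix trick checked numerically on the small hypothetical mask
from earlier:
#+BEGIN_SRC python
import numpy as np

M = np.array([[1, 1, 0],   # the mask from the earlier example
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

C = M.T @ M  # C[i, j] = number of users who rated both movies i and j
print(C)
# [[2. 1. 1.]
#  [1. 2. 1.]
#  [1. 1. 2.]]
#+END_SRC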
That was the first half of what we needed for $\textrm{dev}_{j,i}$.
We still need the other half:
$$\sum_{u \in S_{j,i}(\chi)} u_j - \sum_{u \in S_{j,i}(\chi)} u_i$$
We can apply a similar trick here. Consider first what $\sum_{u \in
S_{j,i}(\chi)} u_j$ means: It is the sum of only those ratings of
movie $j$ that were done by a user who also rated movie $i$.
Likewise, $\sum_{u \in S_{j,i}(\chi)} u_i$ is the sum of only those
ratings of movie $i$ that were done by a user who also rated movie
$j$. (Note the symmetry: it's over the same set of users, because
it's always the users who rated both $i$ and $j$.)
# TODO: Finish that section (mostly translate from code notes)
* Implementation
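The section above still needs finishing, but here is a sketch of
where it lands, under the same assumptions as before (ratings live in
a zero-filled utility matrix, where 0 means "unrated"; the code is my
own illustration, not code from the paper). The remaining piece of the
derivation is the same trick as the count matrix: with $U$
zero-filled, $(U^\top M)_{j,i}$ is exactly $\sum_{u \in S_{j,i}(\chi)}
u_j$, so the whole deviation matrix is $D = (U^\top M - M^\top U) / C$,
taken elementwise wherever $C > 0$.
#+BEGIN_SRC python
import numpy as np

def weighted_slope_one(U):
    """Weighted Slope One from a utility matrix U, as a dense sketch.

    U[u, i] is user u's rating of movie i, with 0 meaning 'unrated'
    (a simplifying assumption).  Returns predict(u, j), an estimate
    of user u's rating of movie j, or None if no estimate is possible.
    """
    M = (U != 0).astype(float)  # mask of known ratings
    C = M.T @ M                 # C[j, i]: users who rated both j and i
    sums = U.T @ M              # sums[j, i]: sum of ratings of j by those users
    safe_C = np.where(C > 0, C, 1.0)
    D = np.where(C > 0, (sums - sums.T) / safe_C, 0.0)  # D[j, i] = dev(j, i)

    def predict(u, j):
        weights = M[u] * C[j]   # only movies u rated, weighted by co-rating counts
        weights[j] = 0.0        # never use the target movie itself
        total = weights.sum()
        if total == 0:
            return None
        return float(((D[j] + U[u]) * weights).sum() / total)

    return predict

# Smoke test on the tiny hypothetical matrix from before:
U = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 2, 4]], dtype=float)
predict = weighted_slope_one(U)
print(predict(0, 2))  # user 0's predicted rating of movie 2 -> 3.5
#+END_SRC
Plain Slope One would instead weight every usable movie equally,
e.g. ~weights = M[u] * (C[j] > 0)~.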

View File

@ -16,7 +16,7 @@
<!-- From http://travis.athougies.net/posts/2013-08-13-using-math-on-your-hakyll-blog.html -->
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png">