From cc221d9a6f23b027750b8347c911335fc4d2aa94 Mon Sep 17 00:00:00 2001
From: Chris Hodapp
Date: Sat, 16 Dec 2017 12:59:15 -0500
Subject: [PATCH] RetinaNet post: Fixed header; added a bit to anchor/subnets

---
 drafts/2017-12-13-retinanet.org | 154 +++++++++++++++++++++++++++-----
 1 file changed, 131 insertions(+), 23 deletions(-)

diff --git a/drafts/2017-12-13-retinanet.org b/drafts/2017-12-13-retinanet.org
index 5c41b62..b040bab 100644
--- a/drafts/2017-12-13-retinanet.org
+++ b/drafts/2017-12-13-retinanet.org
@@ -1,10 +1,18 @@
-#+TITLE: Explaining RetinaNet
-#+AUTHOR: Chris Hodapp
-#+DATE: December 13, 2017
-#+TAGS: technobabble
+---
+title: Explaining RetinaNet
+author: Chris Hodapp
+date: December 13, 2017
+tags: technobabble
+---

-A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
-Detection]], from one of Facebook's teams. The goal of this post is to
+# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
+# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
+# description:
+# subtitle:
+
+A paper came out in the past few months,
+[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
+Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

@@ -109,6 +117,7 @@ scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the
portion of the loss multiplied by it will likewise tend toward 0.

+
* RetinaNet architecture

The paper gives the name *RetinaNet* to the network they created which
@@ -125,7 +134,7 @@ also very important.

I go into both of these aspects below.

-** Feature Pyramid Network
+* Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
@@ -208,26 +217,125 @@ paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.

-*** Top-Down Pathway
+** Top-Down Pathway

-*** Lateral Connections
+** Lateral Connections

-*** As Applied to ResNet
+** As Applied to ResNet

# Note C=256 and such

-** Anchors & Region Proposals
+* Anchors & Region Proposals

-The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
-Proposal Networks]] explains anchors and RPNs (Region Proposal
-Networks), which RetinaNet's design also relies on heavily.

Recall from the last section what was said about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input images, and while their influence is spread out over
larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large due to each stage spreading it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.
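To make that correspondence concrete, here is a minimal sketch. It is
my own illustration, not code from the paper or from keras-retinanet,
and the assumption that pyramid level $l$ has a stride of $2^l$ input
pixels (and all the names used) are mine:

#+BEGIN_SRC python
import numpy as np

def feature_to_image_coords(x, y, level):
    """Map an (x, y) position on a pyramid-level feature map back to the
    approximate center of its receptive field in the input image.

    Assumes level `level` is downsampled from the input by a stride of
    2**level, so neighboring feature-map points sit 2**level input
    pixels apart."""
    stride = 2 ** level
    # The +0.5 centers each feature-map cell on the patch of input
    # pixels that produced it.
    return (np.asarray(x) + 0.5) * stride, (np.asarray(y) + 0.5) * stride

# Four neighboring points on a level-5 feature map land 32 pixels apart
# in the input image:
print(feature_to_image_coords(np.arange(4), np.zeros(4), level=5))
#+END_SRC

Moving one step across the feature map really does mean taking a
fixed-size step across the input image; the higher the level, the
bigger the step.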
-Recall a few sections ago what was said about feature maps, and the
-fact that the deeper stages of the CNN happen to be good for
-classifying images. While these deeper stages are lower-resolution
-than the input images, and while their influence is spread out over
-larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
-rather large due to each stage spreading it a little further), the
-features here still maintain a spatial relationship with the input
-image. That is, moving across one axis of this feature map still
-corresponds to moving across the same axis of the input image.

+# Just re-explain the above with the feature pyramid

RetinaNet's design draws heavily on RPNs (Region Proposal Networks),
and here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.

Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN. They're more a property that's used in its training and
inference.

In particular:

- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
  half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
  (i.e. it's a tensor with shape $W \times H \times 256$). Note that
  $W$ and $H$ will be larger at lower levels, and smaller at higher
  levels, but in RetinaNet, at least, the samples are always
  256-channel.
- For every point on that feature map (all $WH$ of them), we can
  identify a corresponding point in the input image. This is the
  center point of a broad region of the input image that influences
  this point in the feature map (i.e. its receptive field). Note that
  as we move up to higher levels in the feature pyramid, these regions
  grow larger, and neighboring points in the feature map correspond to
  larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
  rectangular regions associated with each point of a feature map.
  The size of the anchor depends on the scale of the feature map, or
  equivalently, what level of the feature pyramid it came from. All
  this means is that anchors in level $l+1$ are twice as large as the
  anchors of level $l$.

The view that this should paint is of a dense collection of anchors
covering the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?

My above explanation glossed over a couple of things, but nothing that
should change the fundamentals.

- Anchors are actually associated with every 3x3 window in the feature
  map, not precisely every point, but all this really means is that
  it's "every point and its immediate neighbors" rather than "every
  point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1), and each of three
  scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
  scale.
  This is just to handle objects of less-square shapes and to cover
  the gap in scale in between levels of the feature pyramid. Note
  that the scale factors are evenly-spaced exponentially, such that an
  additional step down wouldn't make sense (the largest anchors at the
  pyramid level /below/ already cover this scale), nor would an
  additional step up (the smallest anchors at the pyramid level
  /above/ already cover it). A short code sketch at the end of this
  post lays out these nine anchor shapes concretely.

Here, finally, is where actual classification and regression come in,
in the form of the *classification subnet* and the *box regression
subnet*.

** Classification Subnet

Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what's
the probability that an object of that class is inside this anchor (or
that it's just background)?

** Box Regression Subnet

The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative to the
anchor's bounds (thus specifying a different region). Note that this
subnet completely ignores the class of the object.

The classification subnet already tells us whether or not a given
anchor contains an object - which by itself gives rough bounds on it.
The box regression subnet helps tighten these bounds. (Both subnets
are sketched in code at the end of this post.)

** Other notes (?)

I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...

# Parameter sharing? How to explain?

* Training

# Ground-truth object boxes
# Intersection-over-Union thresholds

* Inference

# Top N results
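To make the anchor layout described above concrete, here is a small
sketch. It is my own illustration rather than code from the paper or
from keras-retinanet, and the 32-pixel base size for the lowest
pyramid level is just an assumption for the example:

#+BEGIN_SRC python
import numpy as np

def anchor_shapes(base_size):
    """The 9 anchor (width, height) pairs used at one pyramid level:
    3 aspect ratios x 3 scale factors, each with area (base_size * scale)**2."""
    ratios = [0.5, 1.0, 2.0]                    # height / width
    scales = [2 ** 0, 2 ** (1.0 / 3), 2 ** (2.0 / 3)]
    shapes = []
    for scale in scales:
        for ratio in ratios:
            size = base_size * scale
            w = size / np.sqrt(ratio)
            h = size * np.sqrt(ratio)
            shapes.append((w, h))
    return np.array(shapes)                     # shape (9, 2)

def tile_anchors(fm_width, fm_height, level, base_size):
    """Center the same 9 anchors on every position of a W x H feature
    map at pyramid level `level` (stride assumed to be 2**level).
    Returns boxes as (x1, y1, x2, y2) rows, shape (W*H*9, 4)."""
    stride = 2 ** level
    shapes = anchor_shapes(base_size)
    boxes = []
    for y in range(fm_height):
        for x in range(fm_width):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for w, h in shapes:
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

# A 10x10 feature map at level 5 already yields 900 anchors:
print(tile_anchors(10, 10, level=5, base_size=32).shape)
#+END_SRC

Every position of every pyramid level gets the same nine shapes, just
centered at different places and scaled by the level's base size -
which is exactly the dense, ordered, overlapping covering described
above.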
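And here is a similarly rough sketch of the two subnets as plain
convolutional stacks. Again, this is my own illustration and not the
fizyr/keras-retinanet code; the depth of 4 layers and the width of 256
filters follow the paper's description, while ~num_classes~ and
~num_anchors~ are placeholder names and values:

#+BEGIN_SRC python
from keras.layers import Conv2D, Input
from keras.models import Model

num_classes = 80   # K object classes (placeholder value)
num_anchors = 9    # A = 3 aspect ratios x 3 scale factors

# Both subnets read a 256-channel feature map from one pyramid level.
features = Input(shape=(None, None, 256))

def subnet(num_outputs, final_activation):
    """Four plain 3x3 convolutions, then a final 3x3 convolution that
    emits `num_outputs` numbers at every spatial position."""
    x = features
    for _ in range(4):
        x = Conv2D(256, 3, padding='same', activation='relu')(x)
    return Conv2D(num_outputs, 3, padding='same',
                  activation=final_activation)(x)

# Classification subnet: one probability per class per anchor, at
# every feature-map position.
classification = subnet(num_classes * num_anchors, 'sigmoid')

# Box regression subnet: 4 offsets (relative to the anchor's bounds)
# per anchor, at every position.  No class information involved.
regression = subnet(4 * num_anchors, None)

model = Model(inputs=features, outputs=[classification, regression])
model.summary()
#+END_SRC

In the real network, the same subnet weights are applied to every
level of the feature pyramid (the parameter sharing hinted at in my
note above); this sketch builds just one level's worth for clarity.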