From e14a94ee5ed4754f1e862bd6920ed3571efd980d Mon Sep 17 00:00:00 2001
From: Chris Hodapp
Date: Fri, 15 Dec 2017 19:45:50 -0500
Subject: [PATCH] Explained a little more on FPN/RPN in RetinaNet post

---
 drafts/2017-12-13-retinanet.org | 70 ++++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 14 deletions(-)

diff --git a/drafts/2017-12-13-retinanet.org b/drafts/2017-12-13-retinanet.org
index 94e6693..5c41b62 100644
--- a/drafts/2017-12-13-retinanet.org
+++ b/drafts/2017-12-13-retinanet.org
@@ -5,8 +5,8 @@
 A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense
 Object Detection]], from one of Facebook's teams. The goal of this post is to
-explain this work a bit as I work through the paper, through some of
-its references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
+explain this paper as I work through it, some of its references, and
+one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
 
 * Object Detection
 
@@ -47,6 +47,13 @@
 of many locations, many sizes, and many aspect ratios. This is
 simpler and faster - but not as accurate as the two-stage
 approaches.
 
+Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
+didn't come up with these names) merge the two models of the two-stage
+approaches into a single CNN, and exploit the possibility of sharing
+computations that would otherwise be done twice. I assume that this
+is included in the comparisons done in the paper, but I'm not entirely
+sure.
+
 * Training & Class Imbalance
 
 Briefly, the process of training these models requires minimizing some
@@ -112,7 +119,11 @@
 important not to miss that /innovations in/: they are saying that
 they didn't need to invent a new network design - not that the
 network design doesn't matter. Later in the paper, they say that it
 is in fact crucial that RetinaNet's architecture relies on FPN (Feature
-Pyramid Network) as its backbone.
+Pyramid Network) as its backbone. As far as I can tell, the
+architecture's use of a variant of RPN (Region Proposal Network) is
+also very important.
+
+I go into both of these aspects below.
 
 ** Feature Pyramid Network
 
@@ -167,25 +178,56 @@
 You may notice that this network has a structure that bears some
 resemblance to an image pyramid. This is because deep CNNs are
 already computing a sort of pyramid in their convolutional and
 subsampling stages. In a nutshell, deep CNNs used in image
-classification push an image through a cascade of feature detectors,
-and each successive stage contains a feature map that is built out of
-features in the prior stage - thus producing a *feature hierarchy*
-which already is something like a pyramid and contains multiple
-different scales.
+classification push an image through a cascade of feature detectors,
+or filters, and each successive stage contains a feature map built
+out of features in the prior stage - thus producing a *feature
+hierarchy* which is already something like a pyramid and contains
+multiple different scales. (Being able to train deep CNNs to learn
+the filters at each stage of that feature hierarchy jointly from the
+data, rather than engineering them by hand, is what sets deep
+learning apart from "shallow" machine learning.)
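+
+To make that hierarchy concrete, here is a minimal Keras sketch - my
+own toy example, not the paper's ResNet backbone and not
+keras-retinanet code - of a bottom-up cascade in which each stage
+halves the spatial resolution while building richer features out of
+the stage before it:
+
+#+BEGIN_SRC python
+# Toy bottom-up feature hierarchy - illustrative only.
+from keras.layers import Conv2D, Input, MaxPooling2D
+from keras.models import Model
+
+inputs = Input(shape=(224, 224, 3))
+x = inputs
+levels = []
+for filters in [64, 128, 256, 512]:
+    x = Conv2D(filters, 3, padding='same', activation='relu')(x)
+    x = MaxPooling2D(2)(x)  # halve the spatial resolution
+    levels.append(x)        # one feature map per stage of the hierarchy
+
+# Output shapes: (112, 112, 64), (56, 56, 128), (28, 28, 256),
+# (14, 14, 512) - coarser and coarser maps of richer features.
+model = Model(inputs, levels)
+model.summary()
+#+END_SRC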
 
 When you move through levels of a featurized image pyramid, only
 scale should change. When you move through levels of the feature
 hierarchy described here, scale changes, but so does the meaning of the
-features. This is the *semantic gap* the paper references. The
-meaning changes because each stage builds up more complex features by
+features. This is the *semantic gap* the paper references. Meaning
+changes because each stage builds up more complex features by
 combining simpler features of the last stage. The first stage, for
 instance, commonly handles pixel-level features like points, lines,
 or edges at a particular orientation. In the final stage, presumably,
 the model has learned complex enough features that things like "kite"
 and "person" can be identified.
 
-The goal of FPN was to find a way to exploit this feature hierarchy
-that is already being computed and to produce something that has
-similar power to a featurized image pyramid but without too high of a
-cost in speed, memory, or complexity.
+The goal in the paper was to find a way to exploit this feature
+hierarchy that is already being computed, and to produce something
+that has similar power to a featurized image pyramid but without too
+high a cost in speed, memory, or complexity.
+
+The paper calls everything described so far (none of which is
+specific to FPN) the *bottom-up* pathway - the feed-forward portion
+of the CNN. FPN adds to this a *top-down* pathway and some lateral
+connections.
+
+*** Top-Down Pathway
+
+*** Lateral Connections
+
+*** As Applied to ResNet
+
+# Note C=256 and such
+
+** Anchors & Region Proposals
+
+The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
+Proposal Networks]] explains anchors and RPNs (Region Proposal
+Networks), which RetinaNet's design also relies on heavily.
+
+Recall what was said a few sections ago about feature maps, and the
+fact that the deeper stages of the CNN happen to be good for
+classifying images. While these deeper stages are lower-resolution
+than the input images, and while their influence is spread out over
+larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
+rather large, since each stage spreads it a little further), the
+features here still maintain a spatial relationship with the input
+image. That is, moving across one axis of this feature map still
+corresponds to moving across the same axis of the input image.
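+
+To illustrate that correspondence, here is a toy sketch of anchor
+generation in the spirit of Faster R-CNN - my own code, not
+keras-retinanet's implementation - where each cell (i, j) of a
+stride-16 feature map is mapped back to the center of the input-image
+patch it sits over, and a set of boxes of several scales and aspect
+ratios is placed there (the scales and ratios below are Faster
+R-CNN's defaults):
+
+#+BEGIN_SRC python
+# Toy anchor generation - illustrative only.
+import itertools
+
+import numpy as np
+
+def make_anchors(feat_h, feat_w, stride=16,
+                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
+    """Anchor boxes as (x1, y1, x2, y2) in input-image coordinates."""
+    anchors = []
+    for i, j in itertools.product(range(feat_h), range(feat_w)):
+        # Center of the input-image patch this feature-map cell covers.
+        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
+        for scale, ratio in itertools.product(scales, ratios):
+            # Equal-area boxes whose width/height ratio varies.
+            w, h = scale * np.sqrt(ratio), scale / np.sqrt(ratio)
+            anchors.append((cx - w / 2, cy - h / 2,
+                            cx + w / 2, cy + h / 2))
+    return np.array(anchors)
+
+# Even a small 14x14 feature map at stride 16 (a 224x224 input)
+# yields 14 * 14 * 9 = 1764 candidate boxes - a dense set already.
+print(make_anchors(14, 14).shape)  # (1764, 4)
+#+END_SRC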