Explained a little more on FPN/RPN in RetinaNet post

This commit is contained in:
Chris Hodapp 2017-12-15 19:45:50 -05:00
parent 8240497272
commit e14a94ee5e


@@ -5,8 +5,8 @@
A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
Detection]], from one of Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
* Object Detection
@@ -47,6 +47,13 @@ of many locations, many sizes, and many aspect ratios.
This is simpler and faster - but not as accurate as the two-stage
approaches.
Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of a two-stage
approach into a single CNN, sharing computation that would otherwise
be done twice. I assume that this is included in the comparisons done
in the paper, but I'm not entirely sure.
* Training & Class Imbalance
Briefly, the process of training these models requires minimizing some
@@ -112,7 +119,11 @@ important not to miss that /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.
I go into both of these aspects below.
** Feature Pyramid Network
@@ -167,25 +178,56 @@ You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)
When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of the feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines, or
edges at a particular orientation. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.
The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed and to produce something that
has similar power to a featurized image pyramid but without too high
of a cost in speed, memory, or complexity.
Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some *lateral
connections*.
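
Before getting into the details below, here is a rough Keras sketch of
how those pieces fit together - my own reading of the FPN paper, not
the reference implementation. =C3=, =C4=, and =C5= are assumed to be
bottom-up feature maps at strides 8, 16, and 32; the choice of 256
channels follows the paper:

#+BEGIN_SRC python
# A rough sketch of FPN's merging step (my reading, not the reference
# implementation). C3, C4, C5 are assumed bottom-up feature maps at
# strides 8, 16, 32; channels=256 follows the FPN paper's choice.
from keras.layers import Conv2D, UpSampling2D, Add

def fpn_top_down(C3, C4, C5, channels=256):
    # Lateral connections: 1x1 convolutions bring each bottom-up map
    # to a common channel depth.
    P5 = Conv2D(channels, 1, padding='same')(C5)
    # Top-down pathway: upsample the coarser map (nearest neighbor)
    # and merge with the lateral connection by element-wise addition.
    P4 = Add()([UpSampling2D(size=2)(P5),
                Conv2D(channels, 1, padding='same')(C4)])
    P3 = Add()([UpSampling2D(size=2)(P4),
                Conv2D(channels, 1, padding='same')(C3)])
    # A 3x3 convolution on each merged map reduces upsampling aliasing.
    P4 = Conv2D(channels, 3, padding='same')(P4)
    P3 = Conv2D(channels, 3, padding='same')(P3)
    return P3, P4, P5
#+END_SRC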
*** Top-Down Pathway
*** Lateral Connections
*** As Applied to ResNet
# Note C=256 and such
** Anchors & Region Proposals
The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks]] explains anchors and RPNs (Region Proposal
Networks), which RetinaNet's design also relies on heavily.
Recall what was said a few sections ago about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input image, and while their influence is spread out over
larger areas of it (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large, since each stage spreads it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.
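
Put another way: if the total subsampling stride at some stage is s,
then a cell in that stage's feature map sits roughly over input pixels
centered at s times its coordinates. A toy calculation (my own
illustration, not anything from the paper):

#+BEGIN_SRC python
# A toy illustration of the spatial correspondence described above
# (my own example, not from the paper): a cell (row, col) in a feature
# map with total subsampling stride s covers input pixels centered
# near ((col + 0.5) * s, (row + 0.5) * s).
def feature_cell_to_input_center(row, col, stride):
    """Approximate input-image (x, y) center of a feature-map cell."""
    return ((col + 0.5) * stride, (row + 0.5) * stride)

# Stride 16 is typical of the stage Faster R-CNN attaches its RPN to:
print(feature_cell_to_input_center(3, 5, 16))  # (88.0, 56.0)
#+END_SRC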