Some FPN explanation in RetinaNet post
This commit is contained in: parent cc3a24565f, commit 8240497272

A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
Detection]], from one of Facebook's teams. The goal of this post is to
explain this work a bit as I work through the paper, through some of
its references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

* Object Detection

"Object detection" as it is used here refers to machine learning
|
"Object detection" as it is used here refers to machine learning
|
||||||
models that can not just identify a single object in an image, but can
|
models that can not just identify a single object in an image, but can
|
||||||
@ -19,34 +21,43 @@ Object Detection API]]:
|
|||||||
|
|
||||||
#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/2017-12-13-objdet.jpg]]

At the time of writing, the most accurate object-detection methods
were based around R-CNN and its variants, and all used two-stage
approaches:

1. One model proposes a sparse set of locations in the image that
   probably contain something. Ideally, this contains all objects in
   the image, but filters out the majority of negative locations
   (i.e. only background, not foreground).
2. Another model, typically a CNN (convolutional neural network),
   classifies each location in that sparse set as either being
   foreground and some specific object class (like "kite" or "person"
   above), or as being background.

Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
OverFeat. These simplified/approximated the two-stage approach by
replacing the first step with brute force. That is, instead of
generating a sparse set of locations that probably have something of
interest, they simply handle all locations, whether or not they likely
contain something, by blanketing the entire image in a dense sampling
of many locations, many sizes, and many aspect ratios.

This is simpler and faster - but not as accurate as the two-stage
approaches.
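
To get a feel for what "blanketing the entire image" means in
practice, here is a small sketch. The stride, sizes, and aspect
ratios below are arbitrary values for illustration, not any
particular detector's settings:

#+BEGIN_SRC python
import itertools

def dense_locations(img_w, img_h, stride=16,
                    sizes=(32, 64, 128, 256),
                    aspect_ratios=(0.5, 1.0, 2.0)):
    """Yield (cx, cy, w, h) boxes covering the image on a regular
    grid, at several sizes and aspect ratios."""
    for cx in range(stride // 2, img_w, stride):
        for cy in range(stride // 2, img_h, stride):
            for size, ratio in itertools.product(sizes, aspect_ratios):
                # Scale width/height so the box area stays size**2
                # while the w:h ratio varies.
                w = size * ratio ** 0.5
                h = size / ratio ** 0.5
                yield (cx, cy, w, h)

# Even a small 640x480 image produces a dense set of candidates:
# 40 * 30 grid positions * 4 sizes * 3 ratios = 14400 locations.
print(sum(1 for _ in dense_locations(640, 480)))
#+END_SRC

The vast majority of those 14400 boxes will cover nothing but
background, which is exactly where the class-imbalance problem below
comes from.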

* Training & Class Imbalance

Briefly, the process of training these models requires minimizing some
kind of loss function that is based on what the model misclassifies
when it is run on some training data. It's preferable to be able to
compute some loss over each individual instance, and add all of these
losses up to produce an overall loss. (Yes, far more can be said on
this, but the details aren't really important here.)

# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?

This leads to a problem in one-stage detectors: That dense set of
locations that it's classifying usually contains a small number of
[...]
loss function still adds all of them up - and even if the loss is
relatively low for each of the easy negatives, their cumulative loss
can drown out the loss from objects that are being misclassified.

That is: A large number of tiny, irrelevant losses overwhelm a smaller
number of larger, relevant losses. The paper was a bit terse on this;
it took a few re-reads to understand why "easy negatives" were an
issue, so hopefully I have this right.
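
A quick back-of-the-envelope check makes this concrete. The counts
and probabilities here are invented for illustration (not taken from
the paper), but the proportions are plausible for a dense detector:

#+BEGIN_SRC python
import numpy as np

# Say a dense detector scores ~100,000 locations, nearly all of them
# easy negatives, plus a handful of hard foreground objects.
n_easy, n_hard = 100_000, 100

p_easy = 0.99  # confidently (and correctly) classified background
p_hard = 0.10  # badly misclassified objects

# Per-instance cross-entropy is -log(p_t), where p_t is the
# probability assigned to the correct class.
ce_easy = -np.log(p_easy)  # ~0.01 each: tiny
ce_hard = -np.log(p_hard)  # ~2.3 each: large

print(n_easy * ce_easy)  # ~1005: total loss from easy negatives
print(n_hard * ce_hard)  # ~230:  total loss from hard objects
#+END_SRC

Even though each easy negative contributes almost nothing, their
sheer number makes them dominate the total loss by roughly four to
one.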

The training process is trying to minimize this loss, and so it is
mostly nudging the model to improve where it least needs it (its
ability to classify background areas that it already classifies well)
and neglecting where it most needs it (its ability to classify the
"difficult" objects that it is misclassifying).

# TODO: Visualize this. Can I?

This is *class imbalance* in a nutshell, which the paper gives as the
limiting factor for the accuracy of one-stage detectors. While
existing approaches try to tackle it with methods like bootstrapping
or hard example mining, the accuracy still falls short.

** Focal Loss

So, the point of all this is: A tweak to the loss function can fix
this issue, and retain the speed and simplicity of one-stage
approaches while surpassing the accuracy of existing two-stage ones.

At least, this is what the paper claims. Their novel loss function is
called *Focal Loss* (as the title references), and it multiplies the
normal cross-entropy by a factor, $(1-p_t)^\gamma$, where $p_t$
approaches 1 as the model predicts a higher and higher probability of
the correct classification, or 0 for an incorrect one, and $\gamma$ is
a "focusing" hyperparameter (they used $\gamma=2$). Intuitively, this
scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
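
Here is a minimal NumPy sketch of that idea for binary
(foreground/background) classification. This is my own illustration
of the formula, not the paper's reference code, and it omits the
$\alpha$-weighting term the paper also uses:

#+BEGIN_SRC python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Per-instance focal loss: (1 - p_t)**gamma * cross-entropy.

    p: predicted probability of foreground
    y: true label (1 = foreground, 0 = background)
    """
    # p_t is the probability assigned to the *correct* class, so it
    # approaches 1 as the prediction gets more confidently correct.
    p_t = np.where(y == 1, p, 1 - p)
    ce = -np.log(p_t)                # ordinary cross-entropy
    return (1 - p_t) ** gamma * ce   # down-weight easy examples

# An easy negative keeps almost none of its loss, while a hard,
# misclassified example keeps most of it:
print(focal_loss(np.array([0.01]), np.array([0])))  # ~1e-06
print(focal_loss(np.array([0.10]), np.array([1])))  # ~1.87
#+END_SRC

Rerunning the earlier back-of-the-envelope numbers with this loss,
the 100,000 easy negatives now contribute a total of about 0.1 rather
than about 1005, so the hard examples dominate instead.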

* RetinaNet architecture

The paper gives the name *RetinaNet* to the network they created which
incorporates this focal loss in its training. While it says, "We
emphasize that our simple detector achieves top results not based on
innovations in network design but due to our novel loss," it is
important not to miss that /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone.

** Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
sure, the paper shares 4 co-authors with the paper this post
explores). The paper is fairly concise in describing FPNs; it takes
only around 3 pages to explain their purpose, related work, and their
entire design. The remainder shows experimental results and specific
applications of FPNs. While it shows FPNs implemented on a particular
underlying network (ResNet), they were made purposely to be very
simple and adaptable to nearly any kind of CNN.

# TODO: Link to ResNet?

To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%2528image_processing%2529][image pyramids]]. The diagram
below illustrates an image pyramid:

#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/1024px-Image_pyramid.svg.png]]

Image pyramids have many uses, but the paper focuses on their use in
taking something that works only at a certain scale of image - for
instance, an image classification model that only identifies objects
that are around 50 pixels across - and adapting it to handle different
scales by applying it at every level of the image pyramid. If the
model has a little flexibility, some level of the image pyramid is
bound to have scaled the object to a size that the model can match.
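
A minimal sketch of that process, using Pillow and halving the image
at each level (the halving factor is a common convention, not
something the paper mandates; "kite.jpg" is a stand-in filename):

#+BEGIN_SRC python
from PIL import Image

def image_pyramid(img, levels=4):
    """Return a list of progressively half-sized copies of img."""
    pyramid = [img]
    for _ in range(levels - 1):
        w, h = pyramid[-1].size
        pyramid.append(pyramid[-1].resize((max(1, w // 2), max(1, h // 2))))
    return pyramid

# A fixed-scale detector (say, one tuned for ~50px objects) can then
# be run on every level; at some level, each object in the image
# should land near the size the detector expects.
for level in image_pyramid(Image.open("kite.jpg")):
    print(level.size)
#+END_SRC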

Typically, though, detection or classification isn't done directly on
an image; rather, the image is converted to some more useful feature
space. However, these feature spaces likewise tend to be useful only
at a specific scale. This is the rationale behind "featurized image
pyramids", or feature pyramids built upon image pyramids, created by
converting each level of an image pyramid to that feature space.
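
Continuing the sketch above, a featurized image pyramid just maps
some fixed-scale feature transform over every level. The
gradient-magnitude "feature" here is a toy stand-in for whatever the
real system uses (HOG, SIFT, a CNN stage, ...):

#+BEGIN_SRC python
import numpy as np
from PIL import Image

def featurize(img_array):
    """Toy feature transform: per-pixel gradient magnitude."""
    gray = img_array.mean(axis=-1)   # collapse RGB to grayscale
    gy, gx = np.gradient(gray)       # vertical/horizontal gradients
    return np.hypot(gx, gy)

# One feature map per pyramid level, each at its own scale
# (image_pyramid is defined in the previous sketch):
featurized = [featurize(np.asarray(level, dtype=float))
              for level in image_pyramid(Image.open("kite.jpg"))]
print([f.shape for f in featurized])
#+END_SRC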

The problem with featurized image pyramids, the paper says, is that if
you try to use them in CNNs, they drastically slow everything down,
and use so much memory as to make normal training impossible.

However, take a look below at this diagram of a generic deep CNN:

#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/Typical_cnn.png]]

You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors,
and each successive stage contains a feature map that is built out of
features in the prior stage - thus producing a *feature hierarchy*,
which already is something like a pyramid and contains multiple
different scales.
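
This shrinking-feature-map structure is easy to see by printing the
output shapes of a toy network. The model below is an arbitrary
example (written against tf.keras), not RetinaNet's backbone:

#+BEGIN_SRC python
from tensorflow import keras
from tensorflow.keras import layers

x = inputs = keras.Input(shape=(224, 224, 3))
for filters in (16, 32, 64, 128):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)  # each stage halves the spatial size
    print(x.shape)
# (None, 112, 112, 16)
# (None, 56, 56, 32)
# (None, 28, 28, 64)
# (None, 14, 14, 128)
#+END_SRC

Spatial resolution shrinks while feature depth grows: multiple
scales, computed in a single forward pass.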

When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of the feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. The
meaning changes because each stage builds up more complex features by
combining simpler features from the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines, or
edges at a particular direction. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.

The goal of FPN was to find a way to exploit this feature hierarchy
that is already being computed, and to produce something that has
similar power to a featurized image pyramid but without too high a
cost in speed, memory, or complexity.
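
Getting slightly ahead of the paper, the heart of that design fits in
a few lines of Keras: take feature maps from several backbone stages
(C3-C5 in the FPN paper's notation), project each to a common depth
with a 1x1 convolution, and merge in an upsampled copy of the next
coarser level. This is my own sketch of that idea, assuming the
backbone maps differ by exact factors of 2 in spatial size - it is
not the keras-retinanet implementation:

#+BEGIN_SRC python
from tensorflow.keras import layers

def fpn_top_down(c3, c4, c5, depth=256):
    """Build FPN feature maps P3-P5 from backbone stages C3-C5."""
    # Lateral 1x1 convs map every backbone stage to the same depth.
    p5 = layers.Conv2D(depth, 1)(c5)
    # Top-down pathway: upsample the coarser (more semantic) map and
    # add it to the lateral connection, so each finer level inherits
    # high-level semantics at its own resolution.
    p4 = layers.Add()([layers.UpSampling2D(2)(p5),
                       layers.Conv2D(depth, 1)(c4)])
    p3 = layers.Add()([layers.UpSampling2D(2)(p4),
                       layers.Conv2D(depth, 1)(c3)])
    # A 3x3 conv on each merged map smooths upsampling artifacts.
    return [layers.Conv2D(depth, 3, padding="same")(p)
            for p in (p3, p4, p5)]
#+END_SRC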