Some FPN explanation in RetinaNet post

2017-12-15 13:58:52 -05:00
parent cc3a24565f
commit 8240497272
4 changed files with 147 additions and 28 deletions
@@ -5,8 +5,10 @@

 A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
 Detection]], from one of Facebook's teams.  The goal of this post is to
-explain this work a bit as I work through the paper, and to look at
-one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
+explain this work a bit as I work through the paper, some of its
+references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
+
+* Object Detection

 "Object detection" as it is used here refers to machine learning
 models that can not just identify a single object in an image, but can
@@ -19,34 +21,43 @@ Object Detection API]]:

 #+CAPTION: TensorFlow object detection example 2.
 #+ATTR_HTML: :width 100% :height 100%
-[[../images/2017-12-13-objdet.jpg]]
+[[../images/2017-12-13-retinanet/2017-12-13-objdet.jpg]]

-The paper discusses many of the two-stage approaches, like R-CNN and
-its variants, which work in two steps:
+At the time of writing, the most accurate object-detection methods
+were based around R-CNN and its variants, and all used two-stage
+approaches:

 1. One model proposes a sparse set of locations in the image that
   probably contain something.  Ideally, this contains all objects in
   the image, but filters out the majority of negative locations
   (i.e. only background, not foreground).
-2. Another model, typically a convolutional neural network, classifies
-   each location in that sparse set as either being foreground and
-   some specific object class, or as being background.
+2. Another model, typically a CNN (convolutional neural network),
+   classifies each location in that sparse set as either being
+   foreground and some specific object class (like "kite" or "person"
+   above), or as being background.

-Additionally, it discusses some existing one-stage approaches like
-[[https://pjreddie.com/darknet/yolo/][YOLO]] and [[https://arxiv.org/abs/1512.02325][SSD]].  In essence, these run only the second step - but
-instead of starting from a sparse set of locations that are probably
-something of interest, they start from a dense set of locations which
-has blanketed the entire image on a grid of many locations, over many
-sizes, and over many aspect ratios, regardless of whether they may
-contain an object.
+Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
+OverFeat. These simplified/approximated the two-stage approach by
+replacing the first step with brute force.  That is, instead of
+generating a sparse set of locations that probably have something of
+interest, they simply handle all locations, whether or not they likely
+contain something, by blanketing the entire image in a dense sampling
+of many locations, many sizes, and many aspect ratios.

-This is simpler and faster - but not nearly as accurate.
+This is simpler and faster - but not as accurate as the two-stage
+approaches.
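
To make "blanketing the entire image" concrete, here is a minimal sketch of how such a dense set of candidate boxes could be generated.  The function name ~dense_anchors~, the stride, and the particular sizes and aspect ratios are all my own illustrative assumptions, not values taken from YOLO, SSD, OverFeat, or the paper:

```python
from itertools import product


def dense_anchors(image_w, image_h, stride, sizes, aspect_ratios):
    """Return (cx, cy, w, h) boxes for every grid location, size, and ratio.

    Hypothetical helper for illustration only: real detectors differ in how
    they lay out and parameterize these boxes.
    """
    boxes = []
    # One grid center every `stride` pixels, regardless of image content.
    xs = range(stride // 2, image_w, stride)
    ys = range(stride // 2, image_h, stride)
    for cy, cx in product(ys, xs):
        for size, ratio in product(sizes, aspect_ratios):
            # ratio is height / width; the box area stays at size ** 2.
            w = size / ratio ** 0.5
            h = size * ratio ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes


boxes = dense_anchors(640, 480, stride=16,
                      sizes=(32, 64, 128),
                      aspect_ratios=(0.5, 1.0, 2.0))
print(len(boxes), "candidate locations for one 640x480 image")
```

Even with these modest made-up settings, a single image yields thousands of candidate locations, and nearly all of them will be background.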

-Broadly, the process of training these models requires minimizing some
+* Training & Class Imbalance
+
+Briefly, the process of training these models requires minimizing some
 kind of loss function that is based on what the model misclassifies
 when it is run on some training data.  It's preferable to be able to
 compute some loss over each individual instance, and add all of these
-losses up to produce an overall loss.
+losses up to produce an overall loss.  (Yes, far more can be said on
+this, but the details aren't really important here.)
+
+# TODO: What else can I say about why loss should be additive?
+# Quote DL text? ML text?

 This leads to a problem in one-stage detectors: That dense set of
 locations that it's classifying usually contains a small number of
@@ -57,16 +68,124 @@ loss function still adds all of them up - and even if the loss is
 relatively low for each of the easy negatives, their cumulative loss
 can drown out the loss from objects that are being misclassified.

+That is: A large number of tiny, irrelevant losses overwhelm a smaller
+number of larger, relevant losses.  The paper was a bit terse on this;
+it took a few re-reads to understand why "easy negatives" were an
+issue, so hopefully I have this right.
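
A bit of arithmetic helped me convince myself of this.  The counts and probabilities below are made-up numbers chosen only for illustration (they are not from the paper), and ~cross_entropy~ is just the standard per-example loss $-\log(p_t)$, where $p_t$ is the probability the model gives to the correct class:

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for one example, given the probability
    the model assigned to the correct class."""
    return -math.log(p_t)


# Assumed, illustrative numbers: many easy negatives the model already
# classifies well, and a few hard examples it gets badly wrong.
n_easy, p_easy = 100_000, 0.99
n_hard, p_hard = 100, 0.10

easy_total = n_easy * cross_entropy(p_easy)
hard_total = n_hard * cross_entropy(p_hard)

print(f"per-example loss: easy={cross_entropy(p_easy):.4f}, "
      f"hard={cross_entropy(p_hard):.4f}")
print(f"summed loss:      easy={easy_total:.1f}, hard={hard_total:.1f}")
```

Each hard example costs a couple hundred times more than each easy one, yet with these numbers the easy negatives still contribute several times more to the summed loss, simply because there are so many of them.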
+
 The training process is trying to minimize this loss, and so it is
-mostly nudging the model to improve in the area least in need of it
-(its ability to classify background areas that it already classifies
-well) and neglecting the area most in need of it (its ability to
-classify the "difficult" objects that it is misclassifying).
-
-# TODO: What else can I say about why loss should be additive?
-# Quote DL text? ML text?
-
-This is the *class imbalance* issue in a nutshell that the paper gives
-as the limiting factor for the accuracy of one-stage detectors.
+mostly nudging the model to improve where it least needs it (its
+ability to classify background areas that it already classifies well)
+and neglecting where it most needs it (its ability to classify the
+"difficult" objects that it is misclassifying).

 # TODO: Visualize this. Can I?
+
+This is *class imbalance* in a nutshell, which the paper gives as the
+limiting factor for the accuracy of one-stage detectors.  While
+existing approaches try to tackle it with methods like bootstrapping
+or hard example mining, their accuracy is still lower than that of
+two-stage detectors.
+
+** Focal Loss
+
+So, the point of all this is: A tweak to the loss function can fix
+this issue, and retain the speed and simplicity of one-stage
+approaches while surpassing the accuracy of existing two-stage ones.
+
+At least, this is what the paper claims.  Their novel loss function is
+called *Focal Loss* (as the title references), and it multiplies the
+normal cross-entropy by a factor, $(1-p_t)^\gamma$.  Here $p_t$ is the
+probability the model assigns to the correct class, so it approaches 1
+for a confident correct prediction and 0 for a confident incorrect
+one, and $\gamma$ is a "focusing" hyperparameter (they used
+$\gamma=2$).  Intuitively, this scaling makes sense: if a
+classification is already correct (as in the "easy negatives"),
+$(1-p_t)^\gamma$ tends toward 0, and so the portion of the loss
+multiplied by it will likewise tend toward 0.
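
Written out, that is $FL(p_t) = -(1-p_t)^\gamma \log(p_t)$.  Below is a minimal per-example sketch of just that formula; the function name ~focal_loss~ and the sample probabilities are mine, and this covers only the scaling factor described above, not a full training loss:

```python
import math


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t) ** gamma for a single example.

    p_t is the probability the model assigned to the correct class.
    With gamma=0 this reduces to ordinary cross-entropy.
    """
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


for p_t in (0.99, 0.5, 0.1):
    ce = focal_loss(p_t, gamma=0.0)   # plain cross-entropy
    fl = focal_loss(p_t, gamma=2.0)   # focal loss with the paper's gamma
    print(f"p_t={p_t:.2f}  cross-entropy={ce:.5f}  focal={fl:.7f}  "
          f"scale={(1.0 - p_t) ** 2:.4f}")
```

With $\gamma=2$, an easy example at $p_t=0.99$ has its loss scaled by $0.01^2 = 0.0001$, while a badly misclassified example at $p_t=0.1$ keeps $0.9^2 = 0.81$ of its loss.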
+
+* RetinaNet architecture
+
+The paper gives the name *RetinaNet* to the network they created which
+incorporates this focal loss in its training.  While it says, "We
+emphasize that our simple detector achieves top results not based on
+innovations in network design but due to our novel loss," it is
+important not to miss that /innovations in/: they are saying that they
+didn't need to invent a new network design - not that the network
+design doesn't matter.  Later in the paper, they say that it is in
+fact crucial that RetinaNet's architecture relies on FPN (Feature
+Pyramid Network) as its backbone.
+
+** Feature Pyramid Network
+
+Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
+describes the basis of this FPN in detail (and, non-coincidentally I'm
+sure, the paper shares 4 co-authors with the paper this post
+explores).  The paper is fairly concise in describing FPNs; it takes
+only around 3 pages to explain their purpose, related work, and
+their entire design.  The remainder shows experimental results and
+specific applications of FPNs.  While it shows FPNs implemented on a
+particular underlying network (ResNet), they were made purposely to be
+very simple and adaptable to nearly any kind of CNN.
+
+# TODO: Link to ResNet?
+
+To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%28image_processing%29][image pyramids]].  The diagram
+below illustrates an image pyramid:
+
+#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
+#+ATTR_HTML: :width 100% :height 100%
+[[../images/2017-12-13-retinanet/1024px-Image_pyramid.svg.png]]
+
+Image pyramids have many uses, but the paper focuses on their use in
+taking something that works only at a certain scale of image - for
+instance, an image classification model that only identifies objects
+that are around 50 pixels across - and adapting it to handle different
+scales by applying it at every level of the image pyramid.  If the
+model has a little flexibility, some level of the image pyramid is
+bound to have scaled the object to a size that the model can match.
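
As a rough sketch of that idea, the code below builds a pyramid by repeatedly halving an image and then runs one fixed-scale model on every level.  The names ~downsample~, ~image_pyramid~ and ~detect_at_all_scales~ are hypothetical, the image is a plain list of rows of numbers, and plain 2x2 averaging is a simplification I chose to keep this short (image pyramids are often built with a smoothing filter before subsampling):

```python
def downsample(image):
    """Halve width and height by averaging each 2x2 block of pixels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def image_pyramid(image, min_size=1):
    """Return [image, half-size image, quarter-size image, ...]."""
    levels = [image]
    while (len(levels[-1]) // 2 >= min_size
           and len(levels[-1][0]) // 2 >= min_size):
        levels.append(downsample(levels[-1]))
    return levels


def detect_at_all_scales(image, detector):
    """Apply one fixed-scale detector to every level of the pyramid."""
    return [detector(level) for level in image_pyramid(image)]


# Toy 8x8 "image" whose pixel values are just 0..63.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
pyramid = image_pyramid(image)
print([(len(level), len(level[0])) for level in pyramid])

# Stand-in "detector" that only reports the size of what it was given.
print(detect_at_all_scales(image, lambda level: len(level)))
```

The detector itself never changes; only the scale of its input does, which is what lets a model tuned for one object size cover several.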
+
+Typically, though, detection or classification isn't done directly on
+an image, but rather, the image is converted to some more useful
+feature space. However, these feature spaces likewise tend to be
+useful only at a specific scale.  This is the rationale behind
+"featurized image pyramids", or feature pyramids built upon image
+pyramids, created by converting each level of an image pyramid to that
+feature space.
+
+The problem with featurized image pyramids, the paper says, is that if
+you try to use them in CNNs, they drastically slow everything down,
+and use so much memory as to make normal training impossible.
+
+However, take a look at the diagram below of a generic deep CNN:
+
+#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
+#+ATTR_HTML: :width 100% :height 100%
+[[../images/2017-12-13-retinanet/Typical_cnn.png]]
+
+You may notice that this network has a structure that bears some
+resemblance to an image pyramid.  This is because deep CNNs are
+already computing a sort of pyramid in their convolutional and
+subsampling stages.  In a nutshell, deep CNNs used in image
+classification push an image through a cascade of feature detectors,
+and each successive stage contains a feature map that is built out of
+features in the prior stage - thus producing a *feature hierarchy*
+which already is something like a pyramid and contains multiple
+different scales.
+
+When you move through levels of a featurized image pyramid, only scale
+should change.  When you move through levels of the feature hierarchy
+described here, scale changes, but so does the meaning of the
+features.  This is the *semantic gap* the paper references.  The
+meaning changes because each stage builds up more complex features by
+combining simpler features of the last stage.  The first stage, for
+instance, commonly handles pixel-level features like points, lines or
+edges in a particular direction.  In the final stage, presumably, the
+model has learned complex enough features that things like "kite" and
+"person" can be identified.
+
+The goal of FPN was to find a way to exploit this feature hierarchy
+that is already being computed and to produce something that has
+similar power to a featurized image pyramid but without too high a
+cost in speed, memory, or complexity.
+