RetinaNet post: Fixed header; added a bit to anchor/subnets
This commit is contained in: parent e14a94ee5e, commit cc221d9a6f
@@ -1,10 +1,18 @@
---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags: technobabble
---

# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:

A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
@@ -109,6 +117,7 @@ scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
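To see just how strongly this scaling suppresses easy examples, here
is a quick numeric check of my own (a sketch, using $\gamma = 2$ and
ignoring any class-balancing weight that may also be applied):

#+BEGIN_SRC python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one prediction: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# An "easy" example (p_t = 0.99) vs. a "hard" one (p_t = 0.1):
for p_t in (0.99, 0.1):
    ce = -np.log(p_t)          # ordinary cross-entropy
    fl = focal_loss(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}, FL={fl:.6f}, FL/CE={fl / ce:.6f}")
#+END_SRC

The easy example's loss is scaled down by a factor of 10,000, while
the hard example's is only modestly reduced.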
* RetinaNet architecture
The paper gives the name *RetinaNet* to the network they created which
@@ -125,7 +134,7 @@ also very important.

I go into both of these aspects below.

* Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
@@ -208,26 +217,125 @@ paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.
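As a rough preview of what those two additions look like in code,
here is a sketch of my own in Keras (it is not the keras-retinanet
code; the 256-channel width and the 2x upsampling are assumptions
taken from my reading of the FPN paper):

#+BEGIN_SRC python
from keras import layers

def merge_level(top_down, bottom_up):
    """Merge one top-down feature map with one bottom-up one.

    top_down: the coarser (already 256-channel) map from the level above.
    bottom_up: the same-resolution map from the backbone CNN.
    """
    # Top-down pathway: upsample the coarser map to this level's resolution.
    upsampled = layers.UpSampling2D(size=(2, 2))(top_down)
    # Lateral connection: a 1x1 convolution puts the backbone features
    # into the same 256-channel space, then the two are simply added.
    lateral = layers.Conv2D(256, 1, padding='same')(bottom_up)
    return layers.Add()([upsampled, lateral])
#+END_SRC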

** Top-Down Pathway

** Lateral Connections

** As Applied to ResNet

# Note C=256 and such

* Anchors & Region Proposals

Recall from the last section what was said about feature maps, and that
the deeper stages of the CNN happen to be good for classifying images.
While these deeper stages are lower-resolution than the input images,
and while their influence is spread out over larger areas of the input
image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due to each
stage spreading it a little further), the features here still maintain
a spatial relationship with the input image. That is, moving across
one axis of this feature map still corresponds to moving across the
same axis of the input image.
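To make that correspondence concrete, here is a tiny sketch of my own
(the stride of 8 pixels at the lowest pyramid level is an assumption
for illustration, not a number fixed by the paper):

#+BEGIN_SRC python
def to_image_coords(x, y, level, base_stride=8):
    """Map a point on a pyramid-level feature map to input-image pixels.

    Each level up halves the feature map's resolution, so the stride
    (pixels covered per feature-map step) doubles per level.
    """
    stride = base_stride * (2 ** level)
    # Center of the region of the image that feeds this feature-map point:
    return ((x + 0.5) * stride, (y + 0.5) * stride)

# Neighboring feature-map points are 8 image pixels apart at level 0,
# but 32 pixels apart two levels up:
print(to_image_coords(10, 10, level=0))   # (84.0, 84.0)
print(to_image_coords(10, 10, level=2))   # (336.0, 336.0)
#+END_SRC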

# Just re-explain the above with the feature pyramid

RetinaNet's design draws heavily from RPNs (Region Proposal Networks)
here, and I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.

Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN. They're more a property that's used in its training and
inference.

In particular:
- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
  half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
  (i.e. it's a tensor with shape $W \times H \times 256$). Note that
  $W$ and $H$ will be larger at lower levels, and smaller at higher
  levels, but in RetinaNet at least, always 256-channel samples.
- For every point on that feature map (all $WH$ of them), we can
  identify a corresponding point in the input image. This is the
  center point of a broad region of the input image that influences
  this point in the feature map (i.e. its receptive field). Note that
  as we move up to higher levels in the feature pyramid, these regions
  grow larger, and neighboring points in the feature map correspond to
  larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
  rectangular regions associated with each point of a feature map.
  The size of the anchor depends on the scale of the feature map, or
  equivalently, what level of the feature pyramid it came from. All
  this means is that anchors in level $l+1$ are twice as large as the
  anchors of level $l$.

The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?

My above explanation glossed over a couple things, but nothing that
should change the fundamentals.

- Anchors are actually associated with every 3x3 window in the feature
  map, not precisely every point, but all this really means is that
  it's "every point and its immediate neighbors" rather than "every
  point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1), and each of three
  scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
  scale. This is just to handle objects of less-square shapes and to
  cover the gap in scale in between levels of the feature pyramid.
  Note that the scale factors are evenly spaced exponentially, such
  that an additional step down wouldn't make sense (the largest
  anchors at the pyramid level /below/ already cover this scale), and
  nor would an additional step up (the smallest anchors at the pyramid
  level /above/ already cover it). A short sketch after this list
  shows what this dense set of anchors looks like in code.
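Here is a minimal sketch of that dense grid of anchors (again my own
illustration, not keras-retinanet's code; the stride and base anchor
size for the level are assumed values):

#+BEGIN_SRC python
import numpy as np

def anchors_for_level(fm_width, fm_height, stride, base_size):
    """Enumerate all 9 * W * H anchors for one pyramid level.

    fm_width, fm_height: spatial size of this level's feature map.
    stride: input-image pixels covered by one feature-map step.
    base_size: edge length (in pixels) of the 1:1 anchor at scale 1.
    """
    ratios = [0.5, 1.0, 2.0]                     # three aspect ratios
    scales = [2 ** (i / 3.0) for i in range(3)]  # 1, 2^(1/3), 2^(2/3)
    boxes = []
    for y in range(fm_height):
        for x in range(fm_width):
            # Center of this feature-map point, in input-image coordinates:
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for r in ratios:
                for s in scales:
                    # Hold the anchor's area fixed, skew width vs. height by r:
                    w = base_size * s / np.sqrt(r)
                    h = base_size * s * np.sqrt(r)
                    boxes.append((cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2))
    return np.array(boxes)

# e.g. a 64x64 feature map whose points sit 8 input pixels apart:
print(anchors_for_level(64, 64, stride=8, base_size=32).shape)  # (36864, 4)
#+END_SRC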

Here, finally, is where actual classification and regression come in:
the *classification subnet* and the *box regression subnet*.

** Classification Subnet

Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what's
the probability that an object of that class is inside this anchor (or
that it's just background)?
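In Keras terms, such a subnet can be sketched roughly as below (this
follows my reading of the paper - four 256-filter 3x3 convolutions,
then one more 3x3 convolution with a sigmoid output per anchor per
class - and is not the keras-retinanet implementation itself):

#+BEGIN_SRC python
from keras import layers, models

def classification_subnet(num_classes, num_anchors=9):
    # Accepts any W x H x 256 pyramid level; emits W x H x (9 * K)
    # probabilities: one per anchor per class at every position.
    inputs = layers.Input(shape=(None, None, 256))
    x = inputs
    for _ in range(4):
        x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    outputs = layers.Conv2D(num_anchors * num_classes, 3, padding='same',
                            activation='sigmoid')(x)
    return models.Model(inputs, outputs)
#+END_SRC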

** Box Regression Subnet

The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside of this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative
to the anchor's bounds (thus specifying a different region). Note
that this subnet completely ignores the class of the object.
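The post doesn't pin down the exact encoding of those 4 values here,
but one common parameterization (the one described in the Faster
R-CNN paper) treats them as a shift of the anchor's center plus a
log-scale change of its width and height. A small sketch of decoding
such offsets back into an absolute box:

#+BEGIN_SRC python
import numpy as np

def apply_offsets(anchor, offsets):
    """Turn predicted (dx, dy, dw, dh) offsets into an absolute box.

    anchor: (x1, y1, x2, y2) in input-image coordinates.
    offsets: the 4 values the box regression subnet predicts here.
    """
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1               # anchor width and height
    cxa, cya = x1 + wa / 2, y1 + ha / 2     # anchor center
    dx, dy, dw, dh = offsets
    # Shift the center proportionally to the anchor's size, and
    # rescale the width and height exponentially:
    cx, cy = cxa + dx * wa, cya + dy * ha
    w, h = wa * np.exp(dw), ha * np.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(apply_offsets((10, 10, 50, 50), (0.1, 0.0, 0.2, -0.1)))
#+END_SRC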

The classification subnet already tells us whether or not a given
anchor contains an object - which by itself gives rough bounds on
it. The box regression subnet helps tighten these bounds.

** Other notes (?)

I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...

# Parameter sharing? How to explain?

* Training

# Ground-truth object boxes
# Intersection-over-Union thresholds

* Inference

# Top N results