---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags: technobabble
---
# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:
A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
* Object Detection
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:
# TODO:
# Define mAP
#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/2017-12-13-objdet.jpg]]
At the time of writing, the most accurate object-detection methods
were based around R-CNN and its variants, and all used two-stage
approaches:
1. One model proposes a sparse set of locations in the image that
probably contain something. Ideally, this contains all objects in
the image, but filters out the majority of negative locations
(i.e. only background, not foreground).
2. Another model, typically a CNN (convolutional neural network),
classifies each location in that sparse set as either being
foreground and some specific object class (like "kite" or "person"
above), or as being background.
Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
OverFeat. These simplified/approximated the two-stage approach by
replacing the first step with brute force. That is, instead of
generating a sparse set of locations that probably have something of
interest, they simply handle all locations, whether or not they likely
contain something, by blanketing the entire image in a dense sampling
of many locations, many sizes, and many aspect ratios.
This is simpler and faster - but not as accurate as the two-stage
approaches.
Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of two-stage
approaches into a single CNN, and exploit the possibility of sharing
computations that would otherwise be done twice. I assume that this
is included in the comparisons done in the paper, but I'm not entirely
sure.
* Training & Class Imbalance
Briefly, the process of training these models requires minimizing some
kind of loss function that is based on what the model misclassifies
when it is run on some training data. It's preferable to be able to
compute some loss over each individual instance, and add all of these
losses up to produce an overall loss. (Yes, far more can be said on
this, but the details aren't really important here.)
# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?
This leads to a problem in one-stage detectors: That dense set of
locations that it's classifying usually contains a small number of
locations that actually have objects (positives), and a much larger
number of locations that are just background and can be very easily
classified as being in the background (easy negatives). However, the
loss function still adds all of them up - and even if the loss is
relatively low for each of the easy negatives, their cumulative loss
can drown out the loss from objects that are being misclassified.
That is: A large number of tiny, irrelevant losses overwhelm a smaller
number of larger, relevant losses. The paper was a bit terse on this;
it took a few re-reads to understand why "easy negatives" were an
issue, so hopefully I have this right.
The training process is trying to minimize this loss, and so it is
mostly nudging the model to improve where it least needs it (its
ability to classify background areas that it already classifies well)
and neglecting where it most needs it (its ability to classify the
"difficult" objects that it is misclassifying).
# TODO: Visualize this. Can I?
This is *class imbalance* in a nutshell, which the paper gives as the
limiting factor for the accuracy of one-stage detectors. Existing
approaches try to tackle it with methods like bootstrapping or hard
example mining, but their accuracy still lags behind the two-stage
detectors.
** Focal Loss
So, the point of all this is: A tweak to the loss function can fix
this issue, and retain the speed and simplicity of one-stage
approaches while surpassing the accuracy of existing two-stage ones.
At least, this is what the paper claims. Their novel loss function is
called *Focal Loss* (as the title references), and it multiplies the
normal cross-entropy by a factor, $(1-p_t)^\gamma$, where $p_t$
approaches 1 as the model predicts a higher and higher probability of
the correct classification, or 0 for an incorrect one, and $\gamma$ is
a "focusing" hyperparameter (they used $\gamma=2$). Intuitively, this
scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
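To check that intuition numerically, here is a minimal NumPy sketch of focal loss for a single binary prediction. This is my own paraphrase of the formula, not code from any implementation, and it omits the $\alpha$-balancing weight that the paper also uses in practice:
#+BEGIN_SRC python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive (foreground) class
    y: ground-truth label, 1 (foreground) or 0 (background)
    gamma: the focusing parameter from the paper
    """
    # p_t is the predicted probability of the *correct* class.
    p_t = p if y == 1 else 1.0 - p
    # Ordinary cross-entropy is -log(p_t); focal loss scales it by
    # (1 - p_t)^gamma, which approaches 0 for easy examples.
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# An easy negative: background predicted as background with 99% confidence.
print(focal_loss(0.01, 0))  # ~1e-6, versus cross-entropy of ~0.01
# A hard example: an object predicted with only 10% confidence.
print(focal_loss(0.1, 1))   # ~1.87, versus cross-entropy of ~2.3
#+END_SRC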
* RetinaNet architecture
The paper gives the name *RetinaNet* to the network they created which
incorporates this focal loss in its training. While it says, "We
emphasize that our simple detector achieves top results not based on
innovations in network design but due to our novel loss," it is
important not to miss that /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.
I go into both of these aspects below.
* Feature Pyramid Network
Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
sure, the paper shares 4 co-authors with the paper this post
explores). The paper is fairly concise in describing FPNs; it takes
only around three pages to explain their purpose, related work, and
their entire design. The remainder shows experimental results and
specific applications of FPNs. While it shows FPNs implemented on a
particular underlying network (ResNet, mentioned below), they were
purposely designed to be very simple and adaptable to nearly any kind
of CNN.
To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%2528image_processing%2529][image pyramids]]. The below
diagram illustrates an image pyramid:
#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/1024px-Image_pyramid.svg.png]]
Image pyramids have many uses, but the paper focuses on their use in
taking something that works only at a certain scale of image - for
instance, an image classification model that only identifies objects
that are around 50 pixels across - and adapting it to handle different
scales by applying it at every level of the image pyramid. If the
model has a little flexibility, some level of the image pyramid is
bound to have scaled the object to the correct size that the model can
match it.
Typically, though, detection or classification isn't done directly on
an image, but rather, the image is converted to some more useful
feature space. However, these feature spaces likewise tend to be
useful only at a specific scale. This is the rationale behind
"featurized image pyramids", or feature pyramids built upon image
pyramids, created by converting each level of an image pyramid to that
feature space.
The problem with featurized image pyramids, the paper says, is that if
you try to use them in CNNs, they drastically slow everything down,
and use so much memory as to make normal training impossible.
However, take a look below at this generic diagram of a generic deep
CNN:
#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/Typical_cnn.png]]
You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)
When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of a feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines or
edges at a particular direction. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.
The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed and to produce something that
has similar power to a featurized image pyramid but without too high
of a cost in speed, memory, or complexity.
Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.
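To preview what the next two subsections cover, here is a rough Keras sketch of a single FPN merge step as I understand it from the FPN paper. The function name and structure are my own illustration (the paper fixes the merged maps at 256 channels, which I keep here), not code from any of the implementations mentioned in this post:
#+BEGIN_SRC python
from tensorflow import keras
from tensorflow.keras import layers

def fpn_merge(bottom_up_feature, top_down_feature, channels=256):
    """One FPN merge step: a 1x1 lateral connection combined with an
    upsampled top-down feature map by element-wise addition."""
    # Lateral connection: a 1x1 conv brings the bottom-up feature map
    # to a fixed channel count.
    lateral = layers.Conv2D(channels, 1, padding="same")(bottom_up_feature)
    # Top-down pathway: upsample the coarser, higher-level map by 2x
    # so its spatial size matches the lateral feature map.
    upsampled = layers.UpSampling2D(2)(top_down_feature)
    merged = layers.Add()([lateral, upsampled])
    # A final 3x3 conv smooths out aliasing from the upsampling.
    return layers.Conv2D(channels, 3, padding="same")(merged)
#+END_SRC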
** Top-Down Pathway
** Lateral Connections
** As Applied to ResNet
# Note C=256 and such
# TODO: Link to some good explanations
For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that, as far as I can tell, the particular underlying network
matters less here than how FPN builds on top of it. For ResNet itself,
see [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]] and
[[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]].
* Anchors & Region Proposals
Recall from the last section what was said about feature maps, and
that the deeper stages of the CNN happen to be good for classifying
images.
While these deeper stages are lower-resolution than the input images,
and while their influence is spread out over larger areas of the input
image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due to each
stage spreading it a little further), the features here still maintain
a spatial relationship with the input image. That is, moving across
one axis of this feature map still corresponds to moving across the
same axis of the input image.
# Just re-explain the above with the feature pyramid
RetinaNet's design draws heavily from RPNs (Region Proposal Networks)
here, and here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.
Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN; they're more a property that's used in its training and
inference.
In particular:
- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
(i.e. it's a tensor with shape $W \times H \times 256$). Note that
$W$ and $H$ will be larger at lower levels, and smaller at higher
levels, but in RetinaNet at least, the channel count is always 256.
- For every point on that feature map (all $WH$ of them), we can
identify a corresponding point in the input image. This is the
center point of a broad region of the input image that influences
this point in the feature map (i.e. its receptive field). Note that
as we move up to higher levels in the feature pyramid, these regions
grow larger, and neighboring points in the feature map correspond to
larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
rectangular regions associated with each point of a feature map.
The size of the anchor depends on the scale of the feature map, or
equivalently, what level of the feature map it came from. All this
means is that anchors in level $l+1$ are twice as large as the
anchors of level $l$.
The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?
My above explanation glossed over a couple of things, but nothing that
should change the fundamentals.
- Anchors are actually associated with every 3x3 window of the feature
map, not precisely every point, but all this really means is that
it's "every point and its immediate neighbors" rather than "every
point". This doesn't really matter to anchors, but matters
elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
each of three aspect ratios (1:2, 1:1, and 2:1) combined with each of
three scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
scale. This is just to handle objects of less-square shapes and to
cover the gap in scale in between levels of the feature pyramid. Note
that the scale factors are evenly spaced exponentially, such that an
additional step down wouldn't make sense (the largest anchors at the
pyramid level /below/ already cover this scale), and nor would an
additional step up (the smallest anchors at the pyramid level /above/
already cover it). A small code sketch of this anchor layout follows
this list.
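Here is a small NumPy sketch of generating the anchors for one level of the pyramid. The stride and base size are illustrative placeholders; the aspect ratios and scale factors are the ones listed above:
#+BEGIN_SRC python
import numpy as np

def anchors_for_level(feature_w, feature_h, stride, base_size,
                      ratios=(0.5, 1.0, 2.0),
                      scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return a (feature_w * feature_h * 9, 4) array of anchors as
    (x1, y1, x2, y2) boxes in input-image coordinates."""
    boxes = []
    for j in range(feature_h):
        for i in range(feature_w):
            # Center of this feature-map cell, back in image coordinates.
            cx = (i + 0.5) * stride
            cy = (j + 0.5) * stride
            for ratio in ratios:        # ratio = height / width
                for scale in scales:
                    # Hold the anchor's area fixed at (base_size * scale)^2
                    # while varying its aspect ratio.
                    area = (base_size * scale) ** 2
                    w = np.sqrt(area / ratio)
                    h = w * ratio
                    boxes.append([cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2])
    return np.array(boxes)

# e.g. a 64x64 feature map at stride 8, with 32-pixel base anchors:
print(anchors_for_level(64, 64, stride=8, base_size=32).shape)  # (36864, 4)
#+END_SRC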
Here, finally, is where actual classification and regression come in,
in the form of the *classification subnet* and the *box regression
subnet*.
** Classification Subnet
Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what's
the probability that the anchor contains an object of that class (or
just background)?
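The paper describes this subnet concretely: four 3x3 convolutions with 256 filters and ReLU, followed by a 3x3 convolution with $KA$ sigmoid outputs (for $K$ classes and $A=9$ anchors per position), applied to every pyramid level. Here is a minimal Keras sketch of that shape, with my own names and without the initialization details:
#+BEGIN_SRC python
from tensorflow import keras
from tensorflow.keras import layers

def classification_subnet(num_classes, num_anchors=9, channels=256):
    """Classification head, applied to each pyramid level. At every
    spatial position it outputs num_anchors * num_classes sigmoid
    probabilities (one per class per anchor)."""
    inputs = keras.Input(shape=(None, None, channels))
    x = inputs
    for _ in range(4):
        x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(num_anchors * num_classes, 3,
                            padding="same", activation="sigmoid")(x)
    return keras.Model(inputs, outputs)
#+END_SRC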
** Box Regression Subnet
The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside of this anchor (assuming there is one)? More specifically, it
tries to learn to produce four values which give offsets relative
to the anchor's bounds (thus specifying a different region). Note
that this subnet completely ignores the class of the object.
The classification subnet already tells us whether or not a given
anchor contains an object - which already gives rough bounds on
it. The box regression subnet helps tighten these bounds.
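As far as I can tell, those four values use the same box parametrization as Faster R-CNN (reference 11 below gives the same parametrization): the first two shift the anchor's center as fractions of the anchor's width and height, and the last two scale the width and height in log space. Here's a small sketch of decoding one prediction back into a box under that assumption (real implementations may also normalize these values):
#+BEGIN_SRC python
import numpy as np

def decode_box(anchor, offsets):
    """Apply (dx, dy, dw, dh) regression offsets to an anchor box.

    anchor: (x1, y1, x2, y2) in image coordinates
    offsets: the four values predicted by the box regression subnet,
             assuming the R-CNN-style parametrization.
    """
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    cxa, cya = x1 + wa / 2, y1 + ha / 2
    dx, dy, dw, dh = offsets
    # Center offsets are relative to the anchor's size; width and
    # height offsets are in log space.
    cx = cxa + dx * wa
    cy = cya + dy * ha
    w = wa * np.exp(dw)
    h = ha * np.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Zero offsets give back the anchor unchanged:
print(decode_box((10, 10, 50, 30), (0, 0, 0, 0)))  # (10.0, 10.0, 50.0, 30.0)
#+END_SRC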
** Other notes (?)
I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...
# Parameter sharing? How to explain?
* Training
# Ground-truth object boxes
# Intersection-over-Union thresholds
* Inference
# Top N results
* References
# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere
1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id=SJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/