Migrate some drafts into content/posts with 'draft' flag
This commit is contained in:
BIN
content/posts/2017-12-13-retinanet/1024px-Image_pyramid.svg.png
Normal file
BIN
content/posts/2017-12-13-retinanet/1024px-Image_pyramid.svg.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 369 KiB |
BIN
content/posts/2017-12-13-retinanet/2017-12-13-objdet.jpg
Normal file
BIN
content/posts/2017-12-13-retinanet/2017-12-13-objdet.jpg
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 256 KiB |
BIN
content/posts/2017-12-13-retinanet/Typical_cnn.png
Normal file
BIN
content/posts/2017-12-13-retinanet/Typical_cnn.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 124 KiB |
373
content/posts/2017-12-13-retinanet/index.org
Normal file
373
content/posts/2017-12-13-retinanet/index.org
Normal file
@@ -0,0 +1,373 @@
|
||||
---
|
||||
title: Explaining RetinaNet
|
||||
author: Chris Hodapp
|
||||
date: December 13, 2017
|
||||
tags:
|
||||
- technobabble
|
||||
draft: true
|
||||
---
|
||||
|
||||
# TODO: The inline equations are still broken (maybe because this is
|
||||
# in org format)
|
||||
|
||||
# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
|
||||
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
|
||||
# description:
|
||||
# subtitle:
|
||||
|
||||
A paper came out in the past few months,
|
||||
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
|
||||
Facebook's teams. The goal of this post is to
|
||||
explain this paper as I work through it, through some of its
|
||||
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
|
||||
|
||||
* Object Detection
|
||||
|
||||
"Object detection" as it is used here refers to machine learning
|
||||
models that can not just identify a single object in an image, but can
|
||||
identify and *localize* multiple objects, like in the below photo
|
||||
taken from
|
||||
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:
|
||||
|
||||
# TODO:
|
||||
# Define mAP
|
||||
|
||||
#+CAPTION: TensorFlow object detection example 2.
|
||||
#+ATTR_HTML: :width 100% :height 100%
|
||||
[[./2017-12-13-objdet.jpg]]
|
||||
|
||||
At the time of writing, the most accurate object-detection methods
|
||||
were based around R-CNN and its variants, and all used two-stage
|
||||
approaches:
|
||||
|
||||
1. One model proposes a sparse set of locations in the image that
|
||||
probably contain something. Ideally, this contains all objects in
|
||||
the image, but filters out the majority of negative locations
|
||||
(i.e. only background, not foreground).
|
||||
2. Another model, typically a CNN (convolutional neural network),
|
||||
classifies each location in that sparse set as either being
|
||||
foreground and some specific object class (like "kite" or "person"
|
||||
above), or as being background.
|
||||
|
||||
Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
|
||||
OverFeat. These simplified/approximated the two-stage approach by
|
||||
replacing the first step with brute force. That is, instead of
|
||||
generating a sparse set of locations that probably have something of
|
||||
interest, they simply handle all locations, whether or not they likely
|
||||
contain something, by blanketing the entire image in a dense sampling
|
||||
of many locations, many sizes, and many aspect ratios.
|
||||
|
||||
This is simpler and faster - but not as accurate as the two-stage
|
||||
approaches.
|
||||
|
||||
Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
|
||||
didn't come up with these names) merge the two models of two-stage
|
||||
approaches into a single CNN, and exploit the possibility of sharing
|
||||
computations that would otherwise be done twice. I assume that this
|
||||
is included in the comparisons done in the paper, but I'm not entirely
|
||||
sure.
|
||||
|
||||
* Training & Class Imbalance
|
||||
|
||||
Briefly, the process of training these models requires minimizing some
|
||||
kind of loss function that is based on what the model misclassifies
|
||||
when it is run on some training data. It's preferable to be able to
|
||||
compute some loss over each individual instance, and add all of these
|
||||
losses up to produce an overall loss. (Yes, far more can be said on
|
||||
this, but the details aren't really important here.)
|
||||
|
||||
# TODO: What else can I say about why loss should be additive?
|
||||
# Quote DL text? ML text?
|
||||
|
||||
This leads to a problem in one-stage detectors: That dense set of
|
||||
locations that it's classifying usually contains a small number of
|
||||
locations that actually have objects (positives), and a much larger
|
||||
number of locations that are just background and can be very easily
|
||||
classified as being in the background (easy negatives). However, the
|
||||
loss function still adds all of them up - and even if the loss is
|
||||
relatively low for each of the easy negatives, their cumulative loss
|
||||
can drown out the loss from objects that are being misclassified.
|
||||
|
||||
That is: A large number of tiny, irrelevant losses overwhelm a smaller
|
||||
number of larger, relevant losses. The paper was a bit terse on this;
|
||||
it took a few re-reads to understand why "easy negatives" were an
|
||||
issue, so hopefully I have this right.
|
||||
|
||||
The training process is trying to minimize this loss, and so it is
|
||||
mostly nudging the model to improve where it least needs it (its
|
||||
ability to classify background areas that it already classifies well)
|
||||
and neglecting where it most needs it (its ability to classify the
|
||||
"difficult" objects that it is misclassifying).
|
||||
|
||||
# TODO: Visualize this. Can I?
|
||||
|
||||
This is *class imbalance* in a nutshell, which the paper gives as the
|
||||
limiting factor for the accuracy of one-stage detectors. While the
|
||||
existing approaches try to tackle it with methods like bootstrapping
|
||||
or hard example mining, the accuracy still is lower.
|
||||
|
||||
** Focal Loss
|
||||
|
||||
So, the point of all this is: A tweak to the loss function can fix
|
||||
this issue, and retain the speed and simplicity of one-stage
|
||||
approaches while surpassing the accuracy of existing two-stage ones.
|
||||
|
||||
At least, this is what the paper claims. Their novel loss function is
|
||||
called *Focal Loss* (as the title references), and it multiplies the
|
||||
normal cross-entropy by a factor, \( (1-p_t)^\gamma \), where \( p_t \)
|
||||
approaches 1 as the model predicts a higher and higher probability of
|
||||
the correct classification, or 0 for an incorrect one, and \( \gamma \) is
|
||||
a "focusing" hyperparameter (they used \( \gamma=2 \)). Intuitively, this
|
||||
scaling makes sense: if a classification is already correct (as in the
|
||||
"easy negatives"), \( (1-p_t)^\gamma \) tends toward 0, and so the portion
|
||||
of the loss multiplied by it will likewise tend toward 0.
|
||||
|
||||
|
||||
* RetinaNet architecture
|
||||
|
||||
The paper gives the name *RetinaNet* to the network they created which
|
||||
incorporates this focal loss in its training. While it says, "We
|
||||
emphasize that our simple detector achieves top results not based on
|
||||
innovations in network design but due to our novel loss," it is
|
||||
important not to miss that /innovations in/: they are saying that they
|
||||
didn't need to invent a new network design - not that the network
|
||||
design doesn't matter. Later in the paper, they say that it is in
|
||||
fact crucial that RetinaNet's architecture relies on FPN (Feature
|
||||
Pyramid Network) as its backbone. As far as I can tell, the
|
||||
architecture's use of a variant of RPN (Region Proposal Network) is
|
||||
also very important.
|
||||
|
||||
I go into both of these aspects below.
|
||||
|
||||
* Feature Pyramid Network
|
||||
|
||||
Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
|
||||
describes the basis of this FPN in detail (and, non-coincidentally I'm
|
||||
sure, the paper shares 4 co-authors with the paper this post
|
||||
explores). The paper is fairly concise in describing FPNs; it only
|
||||
takes it around 3 pages to explain their purpose, related work, and
|
||||
their entire design. The remainder shows experimental results and
|
||||
specific applications of FPNs. While it shows FPNs implemented on a
|
||||
particular underlying network (ResNet, mentioned below), they were
|
||||
made purposely to be very simple and adaptable to nearly any kind of
|
||||
CNN.
|
||||
|
||||
To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%2528image_processing%2529][image pyramids]]. The below
|
||||
diagram illustrates an image pyramid:
|
||||
|
||||
#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
|
||||
#+ATTR_HTML: :width 100% :height 100%
|
||||
[[./1024px-Image_pyramid.svg.png]]
|
||||
|
||||
Image pyramids have many uses, but the paper focuses on their use in
|
||||
taking something that works only at a certain scale of image - for
|
||||
instance, an image classification model that only identifies objects
|
||||
that are around 50 pixels across - and adapting it to handle different
|
||||
scales by applying it at every level of the image pyramid. If the
|
||||
model has a little flexibility, some level of the image pyramid is
|
||||
bound to have scaled the object to the correct size that the model can
|
||||
match it.
|
||||
|
||||
Typically, though, detection or classification isn't done directly on
|
||||
an image, but rather, the image is converted to some more useful
|
||||
feature space. However, these feature spaces likewise tend to be
|
||||
useful only at a specific scale. This is the rationale behind
|
||||
"featurized image pyramids", or feature pyramids built upon image
|
||||
pyramids, created by converting each level of an image pyramid to that
|
||||
feature space.
|
||||
|
||||
The problem with featurized image pyramids, the paper says, is that if
|
||||
you try to use them in CNNs, they drastically slow everything down,
|
||||
and use so much memory as to make normal training impossible.
|
||||
|
||||
However, take a look below at this generic diagram of a generic deep
|
||||
CNN:
|
||||
|
||||
#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
|
||||
#+ATTR_HTML: :width 100% :height 100%
|
||||
[[./Typical_cnn.png]]
|
||||
|
||||
You may notice that this network has a structure that bears some
|
||||
resemblance to an image pyramid. This is because deep CNNs are
|
||||
already computing a sort of pyramid in their convolutional and
|
||||
subsampling stages. In a nutshell, deep CNNs used in image
|
||||
classification push an image through a cascade of feature detectors or
|
||||
filters, and each successive stage contains a feature map that is
|
||||
built out of features in the prior stage - thus producing a *feature
|
||||
hierarchy* which already is something like a pyramid and contains
|
||||
multiple different scales. (Being able to train deep CNNs to jointly
|
||||
learn the filters at each stage of that feature hierarchy from the
|
||||
data, rather than engineering them by hand, is what sets deep learning
|
||||
apart from "shallow" machine learning.)
|
||||
|
||||
When you move through levels of a featurized image pyramid, only scale
|
||||
should change. When you move through levels of a feature hierarchy
|
||||
described here, scale changes, but so does the meaning of the
|
||||
features. This is the *semantic gap* the paper references. Meaning
|
||||
changes because each stage builds up more complex features by
|
||||
combining simpler features of the last stage. The first stage, for
|
||||
instance, commonly handles pixel-level features like points, lines or
|
||||
edges at a particular direction. In the final stage, presumably, the
|
||||
model has learned complex enough features that things like "kite" and
|
||||
"person" can be identified.
|
||||
|
||||
The goal in the paper was to find a way to exploit this feature
|
||||
hierarchy that is already being computed and to produce something that
|
||||
has similar power to a featurized image pyramid but without too high
|
||||
of a cost in speed, memory, or complexity.
|
||||
|
||||
Everything described so far (none of which is specific to FPNs), the
|
||||
paper calls the *bottom-up* pathway - the feed-forward portion of the
|
||||
CNN. FPN adds to this a *top-down* pathway and some lateral
|
||||
connections.
|
||||
|
||||
** Top-Down Pathway
|
||||
|
||||
** Lateral Connections
|
||||
|
||||
** As Applied to ResNet
|
||||
|
||||
# Note C=256 and such
|
||||
|
||||
# TODO: Link to some good explanations
|
||||
|
||||
For two reasons, I don't explain much about ResNet here. The first is
|
||||
that residual networks, like the ResNet used here, have seen lots of
|
||||
attention and already have many good explanations online. The second
|
||||
is that the paper claims that the underlying network
|
||||
|
||||
[[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
|
||||
[[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
|
||||
|
||||
* Anchors & Region Proposals
|
||||
|
||||
Recall last section what was said about feature maps, and the that the
|
||||
deeper stages of the CNN happen to be good for classifying images.
|
||||
While these deeper stages are lower-resolution than the input images,
|
||||
and while their influence is spread out over larger areas of the input
|
||||
image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due to each
|
||||
stage spreading it a little further), the features here still maintain
|
||||
a spatial relationship with the input image. That is, moving across
|
||||
one axis of this feature map still corresponds to moving across the
|
||||
same axis of the input image.
|
||||
|
||||
# Just re-explain the above with the feature pyramid
|
||||
|
||||
RetinaNet's design draws heavily from RPNs (Region Proposal Networks)
|
||||
here, and here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
|
||||
R-CNN: Towards Real-Time Object Detection with Region Proposal
|
||||
Networks]]. I find the explanations in terms of "proposals", of
|
||||
focusing the "attention" of the neural network, or of "telling the
|
||||
neural network where to look" to be needlessly confusing and
|
||||
misleading. I'd rather explain very plainly how they work.
|
||||
|
||||
Central to RPNs is *anchors*. Anchors aren't exactly a feature of the
|
||||
CNN. They're more a property that's used in its training and
|
||||
inference.
|
||||
|
||||
In particular:
|
||||
|
||||
- Say that the feature pyramid has \( L \) levels, and that level \( l+1 \) is
|
||||
half the resolution (thus double the scale) of level \( l \).
|
||||
- Say that level \( l \) is a 256-channel feature map of size \( W \times H \)
|
||||
(i.e. it's a tensor with shape \( W \times H \times 256 \)). Note that
|
||||
\( W \) and \( H \) will be larger at lower levels, and smaller at higher
|
||||
levels, but in RetinaNet at least, always 256-channel samples.
|
||||
- For every point on that feature map (all \( WH \) of them), we can
|
||||
identify a corresponding point in the input image. This is the
|
||||
center point of a broad region of the input image that influences
|
||||
this point in the feature map (i.e. its receptive field). Note that
|
||||
as we move up to higher levels in the feature pyramid, these regions
|
||||
grow larger, and neighboring points in the feature map correspond to
|
||||
larger and larger jumps across the input image.
|
||||
- We can make these regions explicit by defining *anchors* - specific
|
||||
rectangular regions associated with each point of a feature map.
|
||||
The size of the anchor depends on the scale of the feature map, or
|
||||
equivalently, what level of the feature map it came from. All this
|
||||
means is that anchors in level \( l+1 \) are twice as large as the
|
||||
anchors of level \( l \).
|
||||
|
||||
The view that this should paint is that a dense collection of anchors
|
||||
covers the entire input image at different sizes - still in a very
|
||||
ordered pattern, but with lots of overlap. Remember how I mentioned
|
||||
at the beginning of this post that one-stage object detectors use a
|
||||
very "brute force" method?
|
||||
|
||||
My above explanation glossed over a couple things, but nothing that
|
||||
should change the fundamentals.
|
||||
|
||||
- Anchors are actually associated with every 3x3 window in the anchor
|
||||
map, not precisely every point, but all this really means is that
|
||||
it's "every point and its immediate neighbors" rather than "every
|
||||
point". This doesn't really matter to anchors, but matters
|
||||
elsewhere.
|
||||
- It's not a single anchor per 3x3 window, but 9 anchors - one for
|
||||
each of three aspect ratios (1:2, 1:1, and 2:1), and each of three
|
||||
scale factors (\( 1, 2^{1/3}, and 2^{2/3} \)) on top of its base scale.
|
||||
This is just to handle objects of less-square shapes and to cover
|
||||
the gap in scale in between levels of the feature pyramid. Note
|
||||
that the scale factors are evenly-spaced exponentially, such that an
|
||||
additional step down wouldn't make sense (the largest anchors at the
|
||||
pyramid level /below/ already cover this scale), and nor would an
|
||||
additional step up (the smallest anchors at the pyramid level
|
||||
/above/ already cover it).
|
||||
|
||||
Here, finally, is where actual classification and regression come in.
|
||||
The *classification subnet* and *box regression subnet* are here.
|
||||
|
||||
** Classification Subnet
|
||||
|
||||
Every anchor associates an image region with a 3x3 window (i.e. a
|
||||
3x3x256 section - it's still 256-channel). The classification subnet
|
||||
is responsible for learning: do the features in this 3x3 window,
|
||||
produced from some input, image indicate that an object is inside this
|
||||
anchor? Or, more accurately: For each of \( K \) object classes, what's
|
||||
the probability of each object (or just of it being background)?
|
||||
|
||||
** Box Regression Subnet
|
||||
|
||||
The box regression subnet takes the same input as the classification
|
||||
subnet, but tries to learn the answer to a different question. It is
|
||||
responsible for learning: what are the coordinates to the object
|
||||
inside of this anchor (assuming there is one)? More specifically, it
|
||||
tries to learn to produce 4 numbers values which give offsets relative
|
||||
to the anchor's bounds (thus specifying a different region). Note
|
||||
that this subnet completely ignores the class of the object.
|
||||
|
||||
The classification subnet already tells us whether or not a given
|
||||
anchor contains an object - which already gives rough bounds on
|
||||
it. The box regression subnet helps tighten these bounds.
|
||||
|
||||
** Other notes (?)
|
||||
|
||||
I've glossed over a few details here. Everything I've described above
|
||||
is implemented with bog-standard convolutional networks...
|
||||
|
||||
# Parameter sharing? How to explain?
|
||||
|
||||
* Training
|
||||
|
||||
# Ground-truth object boxes
|
||||
# Intersection-over-Union thresholds
|
||||
|
||||
* Inference
|
||||
|
||||
# Top N results
|
||||
|
||||
* References
|
||||
|
||||
# Does org-mode have a way to make a special section for references?
|
||||
# I know I saw this somewhere
|
||||
|
||||
1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
|
||||
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
|
||||
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
|
||||
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
|
||||
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
|
||||
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
|
||||
7. [[https://openreview.net/pdf?id%3DSJAr0QFxe][Demystifying ResNet]]
|
||||
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
|
||||
9. https://github.com/KaimingHe/deep-residual-networks
|
||||
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
|
||||
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
|
||||
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/
|
||||
Reference in New Issue
Block a user