RetinaNet post: Fixed header; added a bit to anchor/subnets
This commit is contained in:
parent e14a94ee5e
commit cc221d9a6f

@@ -1,10 +1,18 @@
#+TITLE: Explaining RetinaNet
#+AUTHOR: Chris Hodapp
#+DATE: December 13, 2017
#+TAGS: technobabble
---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags: technobabble
---

# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:

A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

@@ -109,6 +117,7 @@ scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
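
To make that scaling concrete, here is a minimal NumPy sketch of the
focal-loss weighting. The $\gamma = 2$ default is the paper's; the
probabilities below are made up purely to show the effect, and the
$\alpha$ balancing factor is left as a plain parameter:

#+BEGIN_SRC python
import numpy as np

def focal_weight(p_t, gamma=2.0):
    """The factor focal loss multiplies cross-entropy by: (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma

def focal_loss(p_t, gamma=2.0, alpha=1.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), per the paper."""
    return -alpha * focal_weight(p_t, gamma) * np.log(p_t)

# An "easy negative" with p_t = 0.99 contributes almost nothing; a hard
# example with p_t = 0.1 keeps nearly all of its cross-entropy loss.
for p_t in [0.99, 0.9, 0.5, 0.1]:
    print(p_t, focal_weight(p_t), focal_loss(p_t))
#+END_SRC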

* RetinaNet architecture

The paper gives the name *RetinaNet* to the network they created which

@@ -125,7 +134,7 @@ also very important.

I go into both of these aspects below.

* Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm

@@ -208,26 +217,125 @@ paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.

** Top-Down Pathway

** Lateral Connections

** As Applied to ResNet

# Note C=256 and such

* Anchors & Region Proposals
Recall a few sections ago what was said about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input images, and while their influence is spread out over
larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large due to each stage spreading it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.

# Just re-explain the above with the feature pyramid
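
To make that correspondence concrete: if a given feature map is
downsampled from the input by some stride $s$, then neighboring cells
of the feature map sit roughly $s$ pixels apart in the input image.
Here is a minimal sketch of that mapping; the strides and sizes are
just typical powers of two for a feature pyramid, not numbers taken
from this post, and the half-cell offset is one common convention:

#+BEGIN_SRC python
import numpy as np

def feature_cell_centers(fm_height, fm_width, stride):
    """Map every (i, j) cell of a feature map back to an (x, y) location in
    the input image, assuming the feature map was downsampled by `stride`."""
    ys = (np.arange(fm_height) + 0.5) * stride
    xs = (np.arange(fm_width) + 0.5) * stride
    return np.stack(np.meshgrid(xs, ys), axis=-1)  # shape (H, W, 2)

# A stride-8 level of a 512x512 input is a 64x64 grid of centers spaced 8
# pixels apart; the stride-16 level above it is 32x32, spaced 16 apart.
p3 = feature_cell_centers(64, 64, stride=8)
p4 = feature_cell_centers(32, 32, stride=16)
print(p3[0, 0], p3[0, 1], p4[0, 0])
#+END_SRC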
RetinaNet's design draws heavily from RPNs (Region Proposal Networks),
and here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.

Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN; they're more a property used in its training and inference.

In particular:

- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
  half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
  (i.e. it's a tensor with shape $W \times H \times 256$). Note that
  $W$ and $H$ will be larger at lower levels and smaller at higher
  levels, but in RetinaNet, at least, the feature maps are always
  256-channel.
- For every point on that feature map (all $WH$ of them), we can
  identify a corresponding point in the input image. This is the
  center point of a broad region of the input image that influences
  this point in the feature map (i.e. its receptive field). Note that
  as we move up to higher levels in the feature pyramid, these regions
  grow larger, and neighboring points in the feature map correspond to
  larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
  rectangular regions associated with each point of a feature map.
  The size of the anchor depends on the scale of the feature map, or
  equivalently, on which level of the feature pyramid it came from.
  All this means is that anchors in level $l+1$ are twice as large as
  the anchors of level $l$ (see the sketch below).
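
Here is a rough sketch of the base anchors this produces: one square
anchor per feature-map cell, with the size doubling from one level to
the next. The specific strides and base sizes are assumptions for
illustration only, not numbers from this post or the paper:

#+BEGIN_SRC python
import numpy as np

def base_anchors_for_level(fm_shape, stride, base_size):
    """One square base anchor per feature-map cell, as (x1, y1, x2, y2)
    boxes centered on each cell's corresponding input-image location."""
    h, w = fm_shape
    ys = (np.arange(h) + 0.5) * stride
    xs = (np.arange(w) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    half = base_size / 2.0
    boxes = np.stack([cx - half, cy - half, cx + half, cy + half], axis=-1)
    return boxes.reshape(-1, 4)

# A hypothetical 3-level pyramid on a 512x512 input: each level halves the
# resolution and doubles both the stride and the anchor size.
levels = [((64, 64), 8, 32), ((32, 32), 16, 64), ((16, 16), 32, 128)]
anchors = [base_anchors_for_level(s, st, sz) for s, st, sz in levels]
print([a.shape[0] for a in anchors])  # [4096, 1024, 256] anchors, densely tiled
#+END_SRC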
The picture this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?

My above explanation glossed over a couple of things, but nothing that
should change the fundamentals.

- Anchors are actually associated with every 3x3 window in the feature
  map, not precisely every point, but all this really means is that
  it's "every point and its immediate neighbors" rather than "every
  point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each combination of three aspect ratios (1:2, 1:1, and 2:1) and three
  scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
  scale. This is just to handle objects of less-square shapes and to
  cover the gap in scale between levels of the feature pyramid. Note
  that the scale factors are evenly spaced exponentially, such that an
  additional step down wouldn't make sense (the largest anchors at the
  pyramid level /below/ already cover this scale), nor would an
  additional step up (the smallest anchors at the pyramid level
  /above/ already cover it). The sketch below generates these 9 shapes
  at one location.
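
Here is a small sketch of those 9 shapes at a single location: every
aspect ratio combined with every scale factor, all keeping roughly the
area of a square of some base size. The base size itself is just an
assumed number for illustration:

#+BEGIN_SRC python
import numpy as np

def anchor_shapes(base_size,
                  ratios=(0.5, 1.0, 2.0),
                  scales=(1.0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0))):
    """Return 9 (width, height) pairs: each aspect ratio (h/w) combined with
    each scale factor, preserving the area of a base_size x base_size square."""
    shapes = []
    for ratio in ratios:
        for scale in scales:
            size = base_size * scale
            w = size / np.sqrt(ratio)
            h = size * np.sqrt(ratio)
            shapes.append((w, h))
    return np.array(shapes)  # shape (9, 2)

print(anchor_shapes(32.0).round(1))
# The largest factor here, 2^(2/3), is one step short of 2; the next pyramid
# level's smallest anchor (factor 1 at double the base size) picks up there.
#+END_SRC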
Here, finally, is where actual classification and regression come in:
this is the job of the *classification subnet* and the *box
regression subnet*.

** Classification Subnet

Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what is
the probability that an object of that class is inside this anchor (or
that it's just background)?
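
Here is a rough Keras sketch of what such a subnet could look like.
This is not lifted from the keras-retinanet code; the four 3x3
convolutional layers of 256 filters follow the paper's description of
the subnet, $K$ is the number of classes, and the 9 anchors per
location are as above. The key points are that it is purely
convolutional (so it runs at every feature-map position at once) and
that it ends in one sigmoid per class per anchor:

#+BEGIN_SRC python
from keras import layers, models

def classification_subnet(num_classes, num_anchors=9, channels=256):
    """Small fully-convolutional head: at every feature-map position,
    predict num_classes probabilities for each of num_anchors anchors."""
    inputs = layers.Input(shape=(None, None, channels))  # any W x H, 256-channel
    x = inputs
    for _ in range(4):
        x = layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
    # One sigmoid output per (anchor, class) at each spatial position.
    outputs = layers.Conv2D(num_anchors * num_classes, 3, padding='same',
                            activation='sigmoid')(x)
    return models.Model(inputs, outputs)

subnet = classification_subnet(num_classes=80)  # e.g. the 80 COCO classes
#+END_SRC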
** Box Regression Subnet

The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative
to the anchor's bounds (thus specifying a different region). Note
that this subnet completely ignores the class of the object.

The classification subnet already tells us whether or not a given
anchor contains an object - which by itself gives rough bounds on
it. The box regression subnet helps tighten these bounds.
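
For concreteness, here is one common way those 4 offsets can be turned
back into a box relative to an anchor's bounds. This is the
parameterization from the Faster R-CNN paper (shift the center by a
fraction of the anchor size, scale the width and height
exponentially); a given implementation may encode the offsets
somewhat differently:

#+BEGIN_SRC python
import numpy as np

def decode_box(anchor, offsets):
    """Apply regression offsets (dx, dy, dw, dh) to an anchor given as
    (x1, y1, x2, y2), returning the adjusted box."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = offsets
    wa, ha = x2 - x1, y2 - y1                # anchor width and height
    cxa, cya = x1 + wa / 2.0, y1 + ha / 2.0  # anchor center
    cx, cy = cxa + dx * wa, cya + dy * ha    # shifted center
    w, h = wa * np.exp(dw), ha * np.exp(dh)  # rescaled width and height
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

# Zero offsets give back the anchor itself; small offsets nudge and resize it.
print(decode_box((100, 100, 164, 164), (0.0, 0.0, 0.0, 0.0)))
print(decode_box((100, 100, 164, 164), (0.1, -0.05, 0.2, 0.0)))
#+END_SRC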
** Other notes (?)
I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...

# Parameter sharing? How to explain?
* Training
# Ground-truth object boxes
# Intersection-over-Union thresholds
* Inference
# Top N results