RetinaNet post: Fixed header; added a bit to anchor/subnets
This commit is contained in: parent e14a94ee5e, commit cc221d9a6f
@@ -1,10 +1,18 @@
---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags: technobabble
---

# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:

A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
@@ -109,6 +117,7 @@ scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
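To see just how strongly this scaling suppresses easy examples, here
is a quick numeric check of my own (a sketch, using $\gamma = 2$ and
ignoring any class-balancing weight that may also be applied):

#+BEGIN_SRC python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one prediction: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# An "easy" example (p_t = 0.99) vs. a "hard" one (p_t = 0.1):
for p_t in (0.99, 0.1):
    ce = -np.log(p_t)          # ordinary cross-entropy
    fl = focal_loss(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}, FL={fl:.6f}, FL/CE={fl / ce:.6f}")
#+END_SRC

The easy example's loss is scaled down by a factor of 10,000, while
the hard example's is only modestly reduced.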
* RetinaNet architecture
The paper gives the name *RetinaNet* to the network they created which
@@ -125,7 +134,7 @@ also very important.

I go into both of these aspects below.

* Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
@@ -208,26 +217,125 @@ paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.
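As a rough preview of what those two additions look like in code,
here is a sketch of my own in Keras (it is not the keras-retinanet
code; the 256-channel width and the 2x upsampling are assumptions
taken from my reading of the FPN paper):

#+BEGIN_SRC python
from keras import layers

def merge_level(top_down, bottom_up):
    """Merge one top-down feature map with one bottom-up one.

    top_down: the coarser (already 256-channel) map from the level above.
    bottom_up: the same-resolution map from the backbone CNN.
    """
    # Top-down pathway: upsample the coarser map to this level's resolution.
    upsampled = layers.UpSampling2D(size=(2, 2))(top_down)
    # Lateral connection: a 1x1 convolution puts the backbone features
    # into the same 256-channel space, then the two are simply added.
    lateral = layers.Conv2D(256, 1, padding='same')(bottom_up)
    return layers.Add()([upsampled, lateral])
#+END_SRC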

** Top-Down Pathway

** Lateral Connections

** As Applied to ResNet

# Note C=256 and such

* Anchors & Region Proposals

Recall from the last section what was said about feature maps, and that
the deeper stages of the CNN happen to be good for classifying images.
While these deeper stages are lower-resolution than the input images,
and while their influence is spread out over larger areas of the input
image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due to each
stage spreading it a little further), the features here still maintain
a spatial relationship with the input image. That is, moving across
one axis of this feature map still corresponds to moving across the
same axis of the input image.
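To make that correspondence concrete, here is a tiny sketch of my own
(the stride of 8 pixels at the lowest pyramid level is an assumption
for illustration, not a number fixed by the paper):

#+BEGIN_SRC python
def to_image_coords(x, y, level, base_stride=8):
    """Map a point on a pyramid-level feature map to input-image pixels.

    Each level up halves the feature map's resolution, so the stride
    (pixels covered per feature-map step) doubles per level.
    """
    stride = base_stride * (2 ** level)
    # Center of the region of the image that feeds this feature-map point:
    return ((x + 0.5) * stride, (y + 0.5) * stride)

# Neighboring feature-map points are 8 image pixels apart at level 0,
# but 32 pixels apart two levels up:
print(to_image_coords(10, 10, level=0))   # (84.0, 84.0)
print(to_image_coords(10, 10, level=2))   # (336.0, 336.0)
#+END_SRC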

# Just re-explain the above with the feature pyramid

RetinaNet's design draws heavily from RPNs (Region Proposal Networks)
here, and I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.

Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN. They're more a property that's used in its training and
inference.

In particular:
- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
  half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
  (i.e. it's a tensor with shape $W \times H \times 256$). Note that
  $W$ and $H$ will be larger at lower levels, and smaller at higher
  levels, but in RetinaNet at least, always 256-channel samples.
- For every point on that feature map (all $WH$ of them), we can
  identify a corresponding point in the input image. This is the
  center point of a broad region of the input image that influences
  this point in the feature map (i.e. its receptive field). Note that
  as we move up to higher levels in the feature pyramid, these regions
  grow larger, and neighboring points in the feature map correspond to
  larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
  rectangular regions associated with each point of a feature map.
  The size of the anchor depends on the scale of the feature map, or
  equivalently, what level of the feature pyramid it came from. All
  this means is that anchors in level $l+1$ are twice as large as the
  anchors of level $l$.

The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?

My above explanation glossed over a couple things, but nothing that
should change the fundamentals.

- Anchors are actually associated with every 3x3 window in the feature
  map, not precisely every point, but all this really means is that
  it's "every point and its immediate neighbors" rather than "every
  point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1), and each of three
  scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
  scale. This is just to handle objects of less-square shapes and to
  cover the gap in scale in between levels of the feature pyramid.
  Note that the scale factors are evenly spaced exponentially, such
  that an additional step down wouldn't make sense (the largest
  anchors at the pyramid level /below/ already cover this scale), and
  nor would an additional step up (the smallest anchors at the pyramid
  level /above/ already cover it). A short sketch after this list
  shows what this dense set of anchors looks like in code.
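Here is a minimal sketch of that dense grid of anchors (again my own
illustration, not keras-retinanet's code; the stride and base anchor
size for the level are assumed values):

#+BEGIN_SRC python
import numpy as np

def anchors_for_level(fm_width, fm_height, stride, base_size):
    """Enumerate all 9 * W * H anchors for one pyramid level.

    fm_width, fm_height: spatial size of this level's feature map.
    stride: input-image pixels covered by one feature-map step.
    base_size: edge length (in pixels) of the 1:1 anchor at scale 1.
    """
    ratios = [0.5, 1.0, 2.0]                     # three aspect ratios
    scales = [2 ** (i / 3.0) for i in range(3)]  # 1, 2^(1/3), 2^(2/3)
    boxes = []
    for y in range(fm_height):
        for x in range(fm_width):
            # Center of this feature-map point, in input-image coordinates:
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for r in ratios:
                for s in scales:
                    # Hold the anchor's area fixed, skew width vs. height by r:
                    w = base_size * s / np.sqrt(r)
                    h = base_size * s * np.sqrt(r)
                    boxes.append((cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2))
    return np.array(boxes)

# e.g. a 64x64 feature map whose points sit 8 input pixels apart:
print(anchors_for_level(64, 64, stride=8, base_size=32).shape)  # (36864, 4)
#+END_SRC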

Here, finally, is where actual classification and regression come in:
the *classification subnet* and the *box regression subnet*.

** Classification Subnet

Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what's
the probability that an object of that class is inside this anchor (or
that it's just background)?
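In Keras terms, such a subnet can be sketched roughly as below (this
follows my reading of the paper - four 256-filter 3x3 convolutions,
then one more 3x3 convolution with a sigmoid output per anchor per
class - and is not the keras-retinanet implementation itself):

#+BEGIN_SRC python
from keras import layers, models

def classification_subnet(num_classes, num_anchors=9):
    # Accepts any W x H x 256 pyramid level; emits W x H x (9 * K)
    # probabilities: one per anchor per class at every position.
    inputs = layers.Input(shape=(None, None, 256))
    x = inputs
    for _ in range(4):
        x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    outputs = layers.Conv2D(num_anchors * num_classes, 3, padding='same',
                            activation='sigmoid')(x)
    return models.Model(inputs, outputs)
#+END_SRC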

** Box Regression Subnet

The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside of this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative
to the anchor's bounds (thus specifying a different region). Note
that this subnet completely ignores the class of the object.
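The post doesn't pin down the exact encoding of those 4 values here,
but one common parameterization (the one described in the Faster
R-CNN paper) treats them as a shift of the anchor's center plus a
log-scale change of its width and height. A small sketch of decoding
such offsets back into an absolute box:

#+BEGIN_SRC python
import numpy as np

def apply_offsets(anchor, offsets):
    """Turn predicted (dx, dy, dw, dh) offsets into an absolute box.

    anchor: (x1, y1, x2, y2) in input-image coordinates.
    offsets: the 4 values the box regression subnet predicts here.
    """
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1               # anchor width and height
    cxa, cya = x1 + wa / 2, y1 + ha / 2     # anchor center
    dx, dy, dw, dh = offsets
    # Shift the center proportionally to the anchor's size, and
    # rescale the width and height exponentially:
    cx, cy = cxa + dx * wa, cya + dy * ha
    w, h = wa * np.exp(dw), ha * np.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(apply_offsets((10, 10, 50, 50), (0.1, 0.0, 0.2, -0.1)))
#+END_SRC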

The classification subnet already tells us whether or not a given
anchor contains an object - which by itself gives rough bounds on
it. The box regression subnet helps tighten these bounds.

** Other notes (?)

I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...

# Parameter sharing? How to explain?

* Training

# Ground-truth object boxes
# Intersection-over-Union thresholds

* Inference

# Top N results