RetinaNet post: Fixed header; added a bit to anchor/subnets

Chris Hodapp 2017-12-16 12:59:15 -05:00
parent e14a94ee5e
commit cc221d9a6f


---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags: technobabble
---
# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:
A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
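As a rough illustration of that scaling, here is a minimal numeric sketch of my own (using $\gamma = 2$ as an example and ignoring the $\alpha$-balancing weight the paper also applies), comparing plain cross-entropy with the focally-scaled version for an easy and a hard example:
#+BEGIN_SRC python
import numpy as np

def focal_scaled_ce(p_t, gamma=2.0):
    """Cross-entropy -log(p_t), scaled by the modulating factor (1 - p_t)**gamma."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# p_t is the predicted probability of the *correct* class.
for name, p_t in [("easy", 0.95), ("hard", 0.10)]:
    print(name, "plain CE:", -np.log(p_t), "focal:", focal_scaled_ce(p_t))

# At gamma=2 the easy example's loss shrinks by a factor of (1 - 0.95)**2 = 1/400,
# while the hard example's loss is only scaled by (1 - 0.10)**2 = 0.81.
#+END_SRC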
* RetinaNet architecture
The paper gives the name *RetinaNet* to the network they created which
also very important.
I go into both of these aspects below.
* Feature Pyramid Network
Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.
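As a preview of how those pieces fit together, here is a minimal Keras sketch of one merge step (my own rendition for illustration, not the keras-retinanet code; the 256-channel width and nearest-neighbor upsampling follow the FPN paper):
#+BEGIN_SRC python
from keras import layers

def merge_pyramid_level(top_down, bottom_up, channels=256):
    """Combine the coarser top-down map with the bottom-up map at this level."""
    # Lateral connection: a 1x1 conv brings the bottom-up feature map
    # to a fixed channel count.
    lateral = layers.Conv2D(channels, 1, padding="same")(bottom_up)
    # Top-down pathway: upsample the coarser map to this level's resolution.
    upsampled = layers.UpSampling2D(size=2)(top_down)
    # Merge by element-wise addition, then smooth with a 3x3 conv to get
    # this level's pyramid feature map.
    merged = layers.Add()([lateral, upsampled])
    return layers.Conv2D(channels, 3, padding="same")(merged)
#+END_SRC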
** Top-Down Pathway
** Lateral Connections
** As Applied to ResNet
# Note C=256 and such
* Anchors & Region Proposals
Recall what was said a few sections ago about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input images, and while their influence is spread out over
larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large due to each stage spreading it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.
# Just re-explain the above with the feature pyramid
RetinaNet's design draws heavily on RPNs (Region Proposal Networks)
here, and I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.
Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN. They're more a property that's used in its training and
inference.
In particular:
- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
(i.e. it's a tensor with shape $W \times H \times 256$). Note that
$W$ and $H$ will be larger at lower levels and smaller at higher
levels, but in RetinaNet, at least, the maps are always 256-channel.
- For every point on that feature map (all $WH$ of them), we can
identify a corresponding point in the input image. This is the
center point of a broad region of the input image that influences
this point in the feature map (i.e. its receptive field). Note that
as we move up to higher levels in the feature pyramid, these regions
grow larger, and neighboring points in the feature map correspond to
larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
rectangular regions associated with each point of a feature map.
The size of the anchor depends on the scale of the feature map, or
equivalently, what level of the feature map it came from. All this
means is that anchors in level $l+1$ are twice as large as the
anchors of level $l$ (see the sketch just after this list).
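To make that mapping concrete, here is a tiny sketch (my own illustration; the 32-pixel base size and the assumption that the stride at level $l$ is $2^l$ pixels come from RetinaNet's usual configuration, not from the list above):
#+BEGIN_SRC python
def anchor_center_and_size(x, y, level, base_size=32, base_level=3):
    """Map feature-map position (x, y) at a pyramid level back to the input
    image: the center pixel of its anchor and the anchor's (square) side.

    Assumes a stride of 2**level pixels at each level, so both the spacing
    and the size of anchors double from one level to the next."""
    stride = 2 ** level
    center = ((x + 0.5) * stride, (y + 0.5) * stride)
    side = base_size * 2 ** (level - base_level)
    return center, side

# The same feature-map position, one level up, sits twice as far into the
# image and owns an anchor twice as large:
print(anchor_center_and_size(10, 10, level=3))  # ((84.0, 84.0), 32)
print(anchor_center_and_size(10, 10, level=4))  # ((168.0, 168.0), 64)
#+END_SRC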
The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?
My above explanation glossed over a couple things, but nothing that
should change the fundamentals.
- Anchors are actually associated with every 3x3 window in the feature
map, not precisely every point, but all this really means is that
it's "every point and its immediate neighbors" rather than "every
point". This doesn't really matter to anchors, but matters
elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
each combination of three aspect ratios (1:2, 1:1, and 2:1) and three
scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base scale
(see the sketch after this list). This is just to handle objects of
less-square shapes and to cover the gap in scale between levels of
the feature pyramid. Note that the scale factors are evenly spaced
exponentially, such that an additional step down wouldn't make sense
(the largest anchors at the pyramid level /below/ already cover this
scale), nor would an additional step up (the smallest anchors at the
pyramid level /above/ already cover it).
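Here is a minimal sketch of those 9 shapes for a single position (my own illustration; the 32-pixel base size is just an example, and I'm assuming each aspect ratio keeps roughly the base anchor's area, which is how I read the usual anchor-generation code):
#+BEGIN_SRC python
import itertools
import math

def anchor_shapes(base_size=32.0,
                  ratios=(0.5, 1.0, 2.0),            # height:width of 1:2, 1:1, 2:1
                  scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return the 9 (width, height) pairs of anchors for one position."""
    shapes = []
    for ratio, scale in itertools.product(ratios, scales):
        size = base_size * scale
        # Stretch the square anchor to the given aspect ratio while keeping
        # (roughly) the same area.
        width = size / math.sqrt(ratio)
        height = size * math.sqrt(ratio)
        shapes.append((round(width, 1), round(height, 1)))
    return shapes

print(anchor_shapes())  # [(45.3, 22.6), ..., (32.0, 32.0), ..., (35.9, 71.8)]
#+END_SRC
With $W \times H$ positions per level and 9 shapes each, that's $9WH$ anchors per pyramid level - the dense, brute-force coverage described above.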
Here, finally, is where actual classification and regression come in,
by way of the *classification subnet* and the *box regression subnet*.
** Classification Subnet
Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what's
the probability that an object of that class is inside this anchor
(or that it's just background)?
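In Keras terms, here is a rough sketch of what that subnet could look like (my own rendition following the paper's description of four 3x3, 256-filter conv layers followed by a final 3x3 conv with $KA$ sigmoid outputs; it is not keras-retinanet's actual code):
#+BEGIN_SRC python
import keras
from keras import layers

def classification_subnet(num_classes, num_anchors=9, channels=256):
    """W x H x 256 pyramid feature map in, W x H x (K*A) class probabilities out."""
    inputs = keras.Input(shape=(None, None, channels))
    x = inputs
    # Four 3x3 convolutions that keep the spatial resolution of the feature map.
    for _ in range(4):
        x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    # One sigmoid output per class, per anchor, at every spatial position.
    outputs = layers.Conv2D(num_classes * num_anchors, 3, padding="same",
                            activation="sigmoid")(x)
    return keras.Model(inputs, outputs, name="classification_subnet")
#+END_SRC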
** Box Regression Subnet
The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative
to the anchor's bounds (thus specifying a different region). Note
that this subnet completely ignores the class of the object.
The classification subnet already tells us whether or not a given
anchor contains an object - which already gives rough bounds on
it. The box regression subnet helps tighten these bounds.
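A matching sketch for this subnet, under the same assumptions as above - the only changes are the output width ($4A$ instead of $KA$) and the lack of a sigmoid on the output:
#+BEGIN_SRC python
import keras
from keras import layers

def box_regression_subnet(num_anchors=9, channels=256):
    """Same head structure as the classification subnet, but it emits
    4 real-valued offsets per anchor at every spatial position."""
    inputs = keras.Input(shape=(None, None, channels))
    x = inputs
    for _ in range(4):
        x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    # No activation: the offsets can be positive or negative.
    outputs = layers.Conv2D(4 * num_anchors, 3, padding="same")(x)
    return keras.Model(inputs, outputs, name="box_regression_subnet")
#+END_SRC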
** Other notes (?)
I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...
# Parameter sharing? How to explain?
* Training
# Ground-truth object boxes
# Intersection-over-Union thresholds
* Inference
# Top N results