Explained a little more on FPN/RPN in RetinaNet post

This commit is contained in:
Chris Hodapp 2017-12-15 19:45:50 -05:00
parent 8240497272
commit e14a94ee5e


@@ -5,8 +5,8 @@
A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
Detection]], from one of Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
* Object Detection
@@ -47,6 +47,13 @@ of many locations, many sizes, and many aspect ratios.
This is simpler and faster - but not as accurate as the two-stage
approaches.
Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of a two-stage
approach into a single CNN, sharing computation that would otherwise
be done twice. I assume that this is included in the comparisons done
in the paper, but I'm not entirely sure.
* Training & Class Imbalance
Briefly, the process of training these models requires minimizing some
@@ -112,7 +119,11 @@ important not to miss that /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.
I go into both of these aspects below.
** Feature Pyramid Network
@@ -167,25 +178,56 @@ You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)
When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of the feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines, or
edges at a particular orientation. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.
The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed and to produce something that
has similar power to a featurized image pyramid but without too high
of a cost in speed, memory, or complexity.
Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some *lateral
connections*.
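
Before getting into the details below, here is a rough Keras sketch of
how those pieces fit together - my own reading of the FPN paper, not
the reference implementation. =C3=, =C4=, and =C5= are assumed to be
bottom-up feature maps at strides 8, 16, and 32; the choice of 256
channels follows the paper:

#+BEGIN_SRC python
# A rough sketch of FPN's merging step (my reading, not the reference
# implementation). C3, C4, C5 are assumed bottom-up feature maps at
# strides 8, 16, 32; channels=256 follows the FPN paper's choice.
from keras.layers import Conv2D, UpSampling2D, Add

def fpn_top_down(C3, C4, C5, channels=256):
    # Lateral connections: 1x1 convolutions bring each bottom-up map
    # to a common channel depth.
    P5 = Conv2D(channels, 1, padding='same')(C5)
    # Top-down pathway: upsample the coarser map (nearest neighbor)
    # and merge with the lateral connection by element-wise addition.
    P4 = Add()([UpSampling2D(size=2)(P5),
                Conv2D(channels, 1, padding='same')(C4)])
    P3 = Add()([UpSampling2D(size=2)(P4),
                Conv2D(channels, 1, padding='same')(C3)])
    # A 3x3 convolution on each merged map reduces upsampling aliasing.
    P4 = Conv2D(channels, 3, padding='same')(P4)
    P3 = Conv2D(channels, 3, padding='same')(P3)
    return P3, P4, P5
#+END_SRC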
*** Top-Down Pathway
*** Lateral Connections
*** As Applied to ResNet
# Note C=256 and such
** Anchors & Region Proposals
The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks]] explains anchors and RPNs (Region Proposal
Networks), which RetinaNet's design also relies on heavily.
Recall what was said a few sections ago about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input image, and while their influence is spread out over
larger areas of it (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large, since each stage spreads it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.
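
Put another way: if the total subsampling stride at some stage is s,
then a cell in that stage's feature map sits roughly over input pixels
centered at s times its coordinates. A toy calculation (my own
illustration, not anything from the paper):

#+BEGIN_SRC python
# A toy illustration of the spatial correspondence described above
# (my own example, not from the paper): a cell (row, col) in a feature
# map with total subsampling stride s covers input pixels centered
# near ((col + 0.5) * s, (row + 0.5) * s).
def feature_cell_to_input_center(row, col, stride):
    """Approximate input-image (x, y) center of a feature-map cell."""
    return ((col + 0.5) * stride, (row + 0.5) * stride)

# Stride 16 is typical of the stage Faster R-CNN attaches its RPN to:
print(feature_cell_to_input_center(3, 5, 16))  # (88.0, 56.0)
#+END_SRC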