Explained a little more on FPN/RPN in RetinaNet post
A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
Detection]], from one of Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

* Object Detection

of many locations, many sizes, and many aspect ratios.
This is simpler and faster - but not as accurate as the two-stage
approaches.

Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of two-stage
approaches into a single CNN, and exploit the possibility of sharing
computations that would otherwise be done twice. I assume that this
is included in the comparisons done in the paper, but I'm not entirely
sure.

* Training & Class Imbalance

Briefly, the process of training these models requires minimizing some
important not to miss that /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.

I go into both of these aspects below.

** Feature Pyramid Network

You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)

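To make those scales concrete, here is a tiny sketch (plain Python; the 224x224 input and the five stride-2 stages are my own illustrative assumptions, roughly matching a typical classification CNN) of how spatial resolution shrinks through the hierarchy:

#+BEGIN_SRC python
# Illustrative only: spatial sizes of the feature hierarchy in a
# typical classification CNN, assuming a 224x224 input and five
# stride-2 subsampling stages.
def feature_hierarchy_sizes(input_size=224, num_stages=5, stride=2):
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= stride  # each subsampling stage halves the resolution
        sizes.append(size)
    return sizes

print(feature_hierarchy_sizes())  # [112, 56, 28, 14, 7]
#+END_SRC

Each of those maps is one level of the "pyramid" the CNN is already computing for free.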
When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of a feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines or
edges at a particular direction. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.

The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed and to produce something that
has similar power to a featurized image pyramid but without too high
of a cost in speed, memory, or complexity.

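For contrast, here is what the image pyramid itself looks like; a minimal numpy sketch (my own illustration, with 2x2 average pooling standing in for proper rescaling):

#+BEGIN_SRC python
import numpy as np

def downscale_2x(img):
    # 2x2 average pooling; assumes even height and width.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def image_pyramid(img, levels=4):
    # Successive levels differ only in scale, not in meaning.
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downscale_2x(pyramid[-1]))
    return pyramid

img = np.random.rand(64, 64)
print([level.shape for level in image_pyramid(img)])
# [(64, 64), (32, 32), (16, 16), (8, 8)]
#+END_SRC

Featurizing this pyramid would mean one full forward pass of the network per level - exactly the cost FPN is trying to avoid.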
Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.

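Based on my reading of the FPN paper, one top-down merge step looks roughly like this. A numpy sketch, with random weights standing in for the learned 1x1 lateral convolution, and channel counts that are my own assumptions in the spirit of ResNet's deeper stages:

#+BEGIN_SRC python
import numpy as np

def upsample_2x(fmap):
    # Nearest-neighbor upsampling of a (channels, H, W) feature map.
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def lateral_1x1(fmap, out_channels=256):
    # A 1x1 convolution is a per-location linear map over channels;
    # random weights stand in for learned ones here.
    channels = fmap.shape[0]
    weights = np.random.rand(out_channels, channels)
    return np.einsum('oc,chw->ohw', weights, fmap)

def merge_step(top_down, bottom_up):
    # Upsample the coarser top-down map, then add the laterally
    # connected bottom-up map, element-wise.
    return upsample_2x(top_down) + lateral_1x1(bottom_up)

p5 = np.random.rand(256, 7, 7)     # coarsest pyramid level
c4 = np.random.rand(1024, 14, 14)  # bottom-up stage below it
print(merge_step(p5, c4).shape)    # (256, 14, 14)
#+END_SRC
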
*** Top-Down Pathway

*** Lateral Connections

*** As Applied to ResNet

# Note C=256 and such

** Anchors & Region Proposals

The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks]] explains anchors and RPNs (Region Proposal
Networks), which RetinaNet's design also relies on heavily.

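To make anchors concrete, here is a small sketch of generating the anchor boxes for a single feature-map location, following the 3 scales x 3 aspect ratios scheme in the Faster R-CNN paper (the particular scale values are my own placeholders):

#+BEGIN_SRC python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Boxes as (x1, y1, x2, y2), centered at (cx, cy). Each ratio
    # reshapes a scale x scale square so width/height = ratio while
    # preserving its area.
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(anchors_at(0, 0)))  # 9 anchors per location
#+END_SRC

Sliding this over every location of a feature map yields the dense grid of candidate boxes that gets classified and regressed.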
Recall a few sections ago what was said about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input images, and while their influence is spread out over
larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large due to each stage spreading it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.
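That stage-by-stage spreading of the receptive field can be tracked with a standard recurrence; a sketch (the kernel sizes and strides below are made-up examples, not any particular network's):

#+BEGIN_SRC python
def receptive_field(layers):
    # layers: sequence of (kernel_size, stride) pairs.
    # r is the receptive field in input pixels; j is the "jump",
    # i.e. how many input pixels one step in the current map spans.
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j  # each layer spreads the field a little further
        j *= s
    return r

# Three 3x3, stride-2 stages:
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
#+END_SRC
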