From e14a94ee5ed4754f1e862bd6920ed3571efd980d Mon Sep 17 00:00:00 2001
From: Chris Hodapp
Date: Fri, 15 Dec 2017 19:45:50 -0500
Subject: [PATCH] Explained a little more on FPN/RPN in RetinaNet post

---
 drafts/2017-12-13-retinanet.org | 70 ++++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 14 deletions(-)

diff --git a/drafts/2017-12-13-retinanet.org b/drafts/2017-12-13-retinanet.org
index 94e6693..5c41b62 100644
--- a/drafts/2017-12-13-retinanet.org
+++ b/drafts/2017-12-13-retinanet.org
@@ -5,8 +5,8 @@
 A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense
 Object Detection]], from one of Facebook's teams. The goal of this post is to
-explain this work a bit as I work through the paper, through some of
-its references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
+explain this paper as I work through it, some of its references, and
+one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
 
 * Object Detection
 
@@ -47,6 +47,13 @@
 of many locations, many sizes, and many aspect ratios. This is
 simpler and faster - but not as accurate as the two-stage
 approaches.
 
+Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
+didn't come up with these names) merge the two models of the two-stage
+approaches into a single CNN, and exploit the possibility of sharing
+computations that would otherwise be done twice. I assume that this
+is included in the comparisons done in the paper, but I'm not entirely
+sure.
+
 * Training & Class Imbalance
 
 Briefly, the process of training these models requires minimizing some
@@ -112,7 +119,11 @@
 important not to miss that /innovations in/: they are saying that
 they didn't need to invent a new network design - not that the
 network design doesn't matter. Later in the paper, they say that it
 is in fact crucial that RetinaNet's architecture relies on FPN (Feature
-Pyramid Network) as its backbone.
+Pyramid Network) as its backbone. As far as I can tell, the
+architecture's use of a variant of RPN (Region Proposal Network) is
+also very important.
+
+I go into both of these aspects below.
 
 ** Feature Pyramid Network
 
@@ -167,25 +178,56 @@
 You may notice that this network has a structure that bears some
 resemblance to an image pyramid. This is because deep CNNs are
 already computing a sort of pyramid in their convolutional and
 subsampling stages. In a nutshell, deep CNNs used in image
-classification push an image through a cascade of feature detectors,
-and each successive stage contains a feature map that is built out of
-features in the prior stage - thus producing a *feature hierarchy*
-which already is something like a pyramid and contains multiple
-different scales.
+classification push an image through a cascade of feature detectors,
+or filters, and each successive stage contains a feature map built
+out of features in the prior stage - thus producing a *feature
+hierarchy* which is already something like a pyramid and contains
+multiple different scales. (Being able to train deep CNNs to learn
+the filters at each stage of that feature hierarchy jointly from the
+data, rather than engineering them by hand, is what sets deep
+learning apart from "shallow" machine learning.)
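+
+To make that hierarchy concrete, here is a minimal Keras sketch - my
+own toy example, not the paper's ResNet backbone and not
+keras-retinanet code - of a bottom-up cascade in which each stage
+halves the spatial resolution while building richer features out of
+the stage before it:
+
+#+BEGIN_SRC python
+# Toy bottom-up feature hierarchy - illustrative only.
+from keras.layers import Conv2D, Input, MaxPooling2D
+from keras.models import Model
+
+inputs = Input(shape=(224, 224, 3))
+x = inputs
+levels = []
+for filters in [64, 128, 256, 512]:
+    x = Conv2D(filters, 3, padding='same', activation='relu')(x)
+    x = MaxPooling2D(2)(x)  # halve the spatial resolution
+    levels.append(x)        # one feature map per stage of the hierarchy
+
+# Output shapes: (112, 112, 64), (56, 56, 128), (28, 28, 256),
+# (14, 14, 512) - coarser and coarser maps of richer features.
+model = Model(inputs, levels)
+model.summary()
+#+END_SRC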
 
 When you move through levels of a featurized image pyramid, only
 scale should change. When you move through levels of the feature
 hierarchy described here, scale changes, but so does the meaning of the
-features. This is the *semantic gap* the paper references. The
-meaning changes because each stage builds up more complex features by
+features. This is the *semantic gap* the paper references. Meaning
+changes because each stage builds up more complex features by
 combining simpler features of the last stage. The first stage, for
 instance, commonly handles pixel-level features like points, lines,
 or edges at a particular orientation. In the final stage, presumably,
 the model has learned complex enough features that things like "kite"
 and "person" can be identified.
 
-The goal of FPN was to find a way to exploit this feature hierarchy
-that is already being computed and to produce something that has
-similar power to a featurized image pyramid but without too high of a
-cost in speed, memory, or complexity.
+The goal in the paper was to find a way to exploit this feature
+hierarchy that is already being computed, and to produce something
+that has similar power to a featurized image pyramid but without too
+high a cost in speed, memory, or complexity.
+
+The paper calls everything described so far (none of which is
+specific to FPN) the *bottom-up* pathway - the feed-forward portion
+of the CNN. FPN adds to this a *top-down* pathway and some lateral
+connections.
+
+*** Top-Down Pathway
+
+*** Lateral Connections
+
+*** As Applied to ResNet
+
+# Note C=256 and such
+
+** Anchors & Region Proposals
+
+The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
+Proposal Networks]] explains anchors and RPNs (Region Proposal
+Networks), which RetinaNet's design also relies on heavily.
+
+Recall what was said a few sections ago about feature maps, and the
+fact that the deeper stages of the CNN happen to be good for
+classifying images. While these deeper stages are lower-resolution
+than the input images, and while their influence is spread out over
+larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
+rather large, since each stage spreads it a little further), the
+features here still maintain a spatial relationship with the input
+image. That is, moving across one axis of this feature map still
+corresponds to moving across the same axis of the input image.
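+
+To illustrate that correspondence, here is a toy sketch of anchor
+generation in the spirit of Faster R-CNN - my own code, not
+keras-retinanet's implementation - where each cell (i, j) of a
+stride-16 feature map is mapped back to the center of the input-image
+patch it sits over, and a set of boxes of several scales and aspect
+ratios is placed there (the scales and ratios below are Faster
+R-CNN's defaults):
+
+#+BEGIN_SRC python
+# Toy anchor generation - illustrative only.
+import itertools
+
+import numpy as np
+
+def make_anchors(feat_h, feat_w, stride=16,
+                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
+    """Anchor boxes as (x1, y1, x2, y2) in input-image coordinates."""
+    anchors = []
+    for i, j in itertools.product(range(feat_h), range(feat_w)):
+        # Center of the input-image patch this feature-map cell covers.
+        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
+        for scale, ratio in itertools.product(scales, ratios):
+            # Equal-area boxes whose width/height ratio varies.
+            w, h = scale * np.sqrt(ratio), scale / np.sqrt(ratio)
+            anchors.append((cx - w / 2, cy - h / 2,
+                            cx + w / 2, cy + h / 2))
+    return np.array(anchors)
+
+# Even a small 14x14 feature map at stride 16 (a 224x224 input)
+# yields 14 * 14 * 9 = 1764 candidate boxes - a dense set already.
+print(make_anchors(14, 14).shape)  # (1764, 4)
+#+END_SRC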