From cc221d9a6f23b027750b8347c911335fc4d2aa94 Mon Sep 17 00:00:00 2001
From: Chris Hodapp
Date: Sat, 16 Dec 2017 12:59:15 -0500
Subject: [PATCH] RetinaNet post: Fixed header; added a bit to anchor/subnets

---
 drafts/2017-12-13-retinanet.org | 154 +++++++++++++++++++++++++++-----
 1 file changed, 131 insertions(+), 23 deletions(-)

diff --git a/drafts/2017-12-13-retinanet.org b/drafts/2017-12-13-retinanet.org
index 5c41b62..b040bab 100644
--- a/drafts/2017-12-13-retinanet.org
+++ b/drafts/2017-12-13-retinanet.org
@@ -1,10 +1,18 @@
-#+TITLE: Explaining RetinaNet
-#+AUTHOR: Chris Hodapp
-#+DATE: December 13, 2017
-#+TAGS: technobabble
+---
+title: Explaining RetinaNet
+author: Chris Hodapp
+date: December 13, 2017
+tags: technobabble
+---

-A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
-Detection]], from one of Facebook's teams. The goal of this post is to
+# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
+# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
+# description:
+# subtitle:
+
+A paper came out in the past few months,
+[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
+Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

@@ -109,6 +117,7 @@ scaling makes sense: if a classification is already correct (as in the
"easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the
portion of the loss multiplied by it will likewise tend toward 0.

+
* RetinaNet architecture

The paper gives the name *RetinaNet* to the network they created which
@@ -125,7 +134,7 @@ also very important.

I go into both of these aspects below.

-** Feature Pyramid Network
+* Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
@@ -208,26 +217,125 @@ paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.

-*** Top-Down Pathway
+** Top-Down Pathway

-*** Lateral Connections
+** Lateral Connections

-*** As Applied to ResNet
+** As Applied to ResNet

# Note C=256 and such

-** Anchors & Region Proposals
+* Anchors & Region Proposals

-The paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region
-Proposal Networks]] explains anchors and RPNs (Region Proposal
-Networks), which RetinaNet's design also relies on heavily.

Recall from the last section what was said about feature maps, and the
fact that the deeper stages of the CNN happen to be good for
classifying images. While these deeper stages are lower-resolution
than the input images, and while their influence is spread out over
larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
rather large due to each stage spreading it a little further), the
features here still maintain a spatial relationship with the input
image. That is, moving across one axis of this feature map still
corresponds to moving across the same axis of the input image.
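To make that correspondence concrete, here is a minimal sketch. It is
my own illustration, not code from the paper or from keras-retinanet,
and the assumption that pyramid level $l$ has a stride of $2^l$ input
pixels (and all the names used) are mine:

#+BEGIN_SRC python
import numpy as np

def feature_to_image_coords(x, y, level):
    """Map an (x, y) position on a pyramid-level feature map back to the
    approximate center of its receptive field in the input image.

    Assumes level `level` is downsampled from the input by a stride of
    2**level, so neighboring feature-map points sit 2**level input
    pixels apart."""
    stride = 2 ** level
    # The +0.5 centers each feature-map cell on the patch of input
    # pixels that produced it.
    return (np.asarray(x) + 0.5) * stride, (np.asarray(y) + 0.5) * stride

# Four neighboring points on a level-5 feature map land 32 pixels apart
# in the input image:
print(feature_to_image_coords(np.arange(4), np.zeros(4), level=5))
#+END_SRC

Moving one step across the feature map really does mean taking a
fixed-size step across the input image; the higher the level, the
bigger the step.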
-Recall a few sections ago what was said about feature maps, and the
-fact that the deeper stages of the CNN happen to be good for
-classifying images. While these deeper stages are lower-resolution
-than the input images, and while their influence is spread out over
-larger areas of the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is
-rather large due to each stage spreading it a little further), the
-features here still maintain a spatial relationship with the input
-image. That is, moving across one axis of this feature map still
-corresponds to moving across the same axis of the input image.

+# Just re-explain the above with the feature pyramid

RetinaNet's design draws heavily on RPNs (Region Proposal Networks),
and here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.

Central to RPNs are *anchors*. Anchors aren't exactly a feature of the
CNN. They're more a property that's used in its training and
inference.

In particular:

- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
  half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
  (i.e. it's a tensor with shape $W \times H \times 256$). Note that
  $W$ and $H$ will be larger at lower levels, and smaller at higher
  levels, but in RetinaNet, at least, the samples are always
  256-channel.
- For every point on that feature map (all $WH$ of them), we can
  identify a corresponding point in the input image. This is the
  center point of a broad region of the input image that influences
  this point in the feature map (i.e. its receptive field). Note that
  as we move up to higher levels in the feature pyramid, these regions
  grow larger, and neighboring points in the feature map correspond to
  larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
  rectangular regions associated with each point of a feature map.
  The size of the anchor depends on the scale of the feature map, or
  equivalently, what level of the feature pyramid it came from. All
  this means is that anchors in level $l+1$ are twice as large as the
  anchors of level $l$.

The view that this should paint is of a dense collection of anchors
covering the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?

My above explanation glossed over a couple of things, but nothing that
should change the fundamentals.

- Anchors are actually associated with every 3x3 window in the feature
  map, not precisely every point, but all this really means is that
  it's "every point and its immediate neighbors" rather than "every
  point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1), and each of three
  scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
  scale.
  This is just to handle objects of less-square shapes and to cover
  the gap in scale in between levels of the feature pyramid. Note
  that the scale factors are evenly-spaced exponentially, such that an
  additional step down wouldn't make sense (the largest anchors at the
  pyramid level /below/ already cover this scale), nor would an
  additional step up (the smallest anchors at the pyramid level
  /above/ already cover it). A short code sketch at the end of this
  post lays out these nine anchor shapes concretely.

Here, finally, is where actual classification and regression come in,
in the form of the *classification subnet* and the *box regression
subnet*.

** Classification Subnet

Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: for each of $K$ object classes, what's
the probability that an object of that class is inside this anchor (or
that it's just background)?

** Box Regression Subnet

The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative to the
anchor's bounds (thus specifying a different region). Note that this
subnet completely ignores the class of the object.

The classification subnet already tells us whether or not a given
anchor contains an object - which by itself gives rough bounds on it.
The box regression subnet helps tighten these bounds. (Both subnets
are sketched in code at the end of this post.)

** Other notes (?)

I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...

# Parameter sharing? How to explain?

* Training

# Ground-truth object boxes
# Intersection-over-Union thresholds

* Inference

# Top N results
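To make the anchor layout described above concrete, here is a small
sketch. It is my own illustration rather than code from the paper or
from keras-retinanet, and the 32-pixel base size for the lowest
pyramid level is just an assumption for the example:

#+BEGIN_SRC python
import numpy as np

def anchor_shapes(base_size):
    """The 9 anchor (width, height) pairs used at one pyramid level:
    3 aspect ratios x 3 scale factors, each with area (base_size * scale)**2."""
    ratios = [0.5, 1.0, 2.0]                    # height / width
    scales = [2 ** 0, 2 ** (1.0 / 3), 2 ** (2.0 / 3)]
    shapes = []
    for scale in scales:
        for ratio in ratios:
            size = base_size * scale
            w = size / np.sqrt(ratio)
            h = size * np.sqrt(ratio)
            shapes.append((w, h))
    return np.array(shapes)                     # shape (9, 2)

def tile_anchors(fm_width, fm_height, level, base_size):
    """Center the same 9 anchors on every position of a W x H feature
    map at pyramid level `level` (stride assumed to be 2**level).
    Returns boxes as (x1, y1, x2, y2) rows, shape (W*H*9, 4)."""
    stride = 2 ** level
    shapes = anchor_shapes(base_size)
    boxes = []
    for y in range(fm_height):
        for x in range(fm_width):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for w, h in shapes:
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

# A 10x10 feature map at level 5 already yields 900 anchors:
print(tile_anchors(10, 10, level=5, base_size=32).shape)
#+END_SRC

Every position of every pyramid level gets the same nine shapes, just
centered at different places and scaled by the level's base size -
which is exactly the dense, ordered, overlapping covering described
above.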
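And here is a similarly rough sketch of the two subnets as plain
convolutional stacks. Again, this is my own illustration and not the
fizyr/keras-retinanet code; the depth of 4 layers and the width of 256
filters follow the paper's description, while ~num_classes~ and
~num_anchors~ are placeholder names and values:

#+BEGIN_SRC python
from keras.layers import Conv2D, Input
from keras.models import Model

num_classes = 80   # K object classes (placeholder value)
num_anchors = 9    # A = 3 aspect ratios x 3 scale factors

# Both subnets read a 256-channel feature map from one pyramid level.
features = Input(shape=(None, None, 256))

def subnet(num_outputs, final_activation):
    """Four plain 3x3 convolutions, then a final 3x3 convolution that
    emits `num_outputs` numbers at every spatial position."""
    x = features
    for _ in range(4):
        x = Conv2D(256, 3, padding='same', activation='relu')(x)
    return Conv2D(num_outputs, 3, padding='same',
                  activation=final_activation)(x)

# Classification subnet: one probability per class per anchor, at
# every feature-map position.
classification = subnet(num_classes * num_anchors, 'sigmoid')

# Box regression subnet: 4 offsets (relative to the anchor's bounds)
# per anchor, at every position.  No class information involved.
regression = subnet(4 * num_anchors, None)

model = Model(inputs=features, outputs=[classification, regression])
model.summary()
#+END_SRC

In the real network, the same subnet weights are applied to every
level of the feature pyramid (the parameter sharing hinted at in my
note above); this sketch builds just one level's worth for clarity.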