#+TITLE: Explaining RetinaNet
#+AUTHOR: Chris Hodapp
#+DATE: December 13, 2017
#+TAGS: technobabble

A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object
Detection]], from one of Facebook's teams. The goal of this post is to
explain this work a bit as I work through the paper, and to look at
one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

"Object detection" as it is used here refers to machine learning
|
|
models that can not just identify a single object in an image, but can
|
|
identify and *localize* multiple objects, like in the below photo
|
|
taken from [[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow
|
|
Object Detection API]]:
|
|
|
|
# TODO: Define mAP

#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-objdet.jpg]]

The paper discusses many of the two-stage approaches, like R-CNN and
its variants, which work in two steps (sketched below):

1. One model proposes a sparse set of locations in the image that
   probably contain something. Ideally, this set contains all objects
   in the image but filters out the majority of negative locations
   (i.e. ones that are only background, not foreground).
2. Another model, typically a convolutional neural network, classifies
   each location in that sparse set as either foreground (and some
   specific object class) or background.

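As a rough sketch of that division of labor (hypothetical code, not
any particular library's API; =propose_regions= and =classify_region=
stand in for a proposal model and a classifier):

#+BEGIN_SRC python
# Hypothetical sketch of a two-stage pipeline; propose_regions and
# classify_region stand in for a proposal model and a CNN classifier.
def detect_two_stage(image, propose_regions, classify_region):
    detections = []
    for box in propose_regions(image):  # step 1: sparse candidate boxes
        # step 2: classify each candidate as background or a class
        label, score = classify_region(image, box)
        if label != "background":
            detections.append((box, label, score))
    return detections
#+END_SRC
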
Additionally, it discusses some existing one-stage approaches like
[[https://pjreddie.com/darknet/yolo/][YOLO]] and [[https://arxiv.org/abs/1512.02325][SSD]]. In essence, these run only the second step - but
instead of starting from a sparse set of locations that are probably
something of interest, they start from a dense set of locations that
blankets the entire image: a grid of many positions, over many sizes
and many aspect ratios, regardless of whether any of them contain an
object.

This is simpler and faster - but not nearly as accurate.

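To get a feel for how dense that set is, here is a rough sketch that
enumerates such a grid of candidate boxes; the image size, stride,
box sizes, and aspect ratios below are made-up illustrative values,
not the exact configuration of YOLO, SSD, or the paper:

#+BEGIN_SRC python
import itertools

# Rough sketch: enumerate a dense grid of candidate boxes over a
# 600x600 image. Stride, sizes, and ratios are illustrative only.
def dense_anchors(width=600, height=600, stride=16,
                  sizes=(32, 64, 128, 256, 512),
                  ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for cx in range(0, width, stride):
        for cy in range(0, height, stride):
            for size, ratio in itertools.product(sizes, ratios):
                w = size * ratio ** 0.5  # ratio here is width/height
                h = size / ratio ** 0.5
                boxes.append((cx - w / 2, cy - h / 2,
                              cx + w / 2, cy + h / 2))
    return boxes

print(len(dense_anchors()))  # over 20,000 candidate boxes per image
#+END_SRC
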
Broadly, the process of training these models requires minimizing
some kind of loss function based on what the model misclassifies when
it is run on training data. It's preferable to be able to compute a
loss over each individual instance, and to add all of these losses up
to produce an overall loss.

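For example, with cross-entropy as the per-location loss (one common
choice, not necessarily the only one), that addition is literally a
sum; a minimal sketch:

#+BEGIN_SRC python
import math

def cross_entropy(p):
    # p is the model's estimated probability of the true class at one
    # location; a confident correct prediction gives a loss near zero.
    return -math.log(p)

def total_loss(true_class_probs):
    # "Adding them up": the overall loss is just the sum of the
    # per-location losses.
    return sum(cross_entropy(p) for p in true_class_probs)
#+END_SRC
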
This leads to a problem in one-stage detectors: the dense set of
locations being classified usually contains a small number of
locations that actually have objects (positives), and a much larger
number of locations that are just background and can very easily be
classified as background (easy negatives). However, the loss function
still adds all of them up - and even if the loss is relatively low
for each of the easy negatives, their cumulative loss can drown out
the loss from the objects that are being misclassified.

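To put rough, made-up numbers on that drowning-out effect: suppose
one image yields 100,000 easy negatives that the model already
classifies at probability 0.99, and 100 hard positives that it gets
badly wrong:

#+BEGIN_SRC python
import math

def cross_entropy(p):
    return -math.log(p)

# Made-up but representative counts for one image's dense locations:
# 100,000 background boxes the model already gets right (p = 0.99)
# versus 100 objects it gets badly wrong (p = 0.30).
loss_easy_negatives = 100_000 * cross_entropy(0.99)  # roughly 1005
loss_hard_positives = 100 * cross_entropy(0.30)      # roughly 120

# The easy background dominates the total by about 8 to 1, even
# though it is the part of the model least in need of improvement.
print(loss_easy_negatives, loss_hard_positives)
#+END_SRC
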
The training process is trying to minimize this loss, and so it is
mostly nudging the model to improve in the area least in need of it
(its ability to classify background areas that it already classifies
well) and neglecting the area most in need of it (its ability to
classify the "difficult" objects that it is misclassifying).

# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?

This, in a nutshell, is the *class imbalance* issue that the paper
identifies as the limiting factor for the accuracy of one-stage
detectors.

# TODO: Visualize this. Can I?