#+TITLE: Explaining RetinaNet
#+AUTHOR: Chris Hodapp
#+DATE: December 13, 2017
#+TAGS: technobabble
A paper came out a few months ago from one of Facebook's research
teams: [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]. The goal of this post is
to explain that work a bit as I read through the paper, and to look at
one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from [[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow
Object Detection API]]:
# TODO:
# Define mAP
#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-objdet.jpg]]
The paper discusses several of the two-stage approaches, like R-CNN
and its variants, which work in two steps (a rough sketch in code
follows the list):
1. One model proposes a sparse set of locations in the image that
   probably contain something. Ideally, this set contains all the
   objects in the image, but filters out the majority of negative
   locations (i.e. only background, no foreground).
2. Another model, typically a convolutional neural network, classifies
   each location in that sparse set either as foreground of some
   specific object class, or as background.
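As a rough, self-contained sketch of that control flow (the function
names and the toy =propose= / =classify= stand-ins are mine, not from
the paper or from any real library):

#+BEGIN_SRC python
import numpy as np

BACKGROUND = 0  # reserve class index 0 for background (a common convention)

def detect_two_stage(image, propose, classify):
    """Hypothetical two-stage detector: `propose` returns candidate
    boxes, `classify` scores one image crop over all classes."""
    detections = []
    for box in propose(image):                   # stage 1: sparse proposals
        x0, y0, x1, y1 = box
        scores = classify(image[y0:y1, x0:x1])   # stage 2: classify each crop
        label = int(np.argmax(scores))
        if label != BACKGROUND:                  # keep foreground detections
            detections.append((box, label))
    return detections

# Toy stand-ins so the sketch runs end to end:
image = np.zeros((64, 64, 3))
propose = lambda img: [(0, 0, 32, 32), (16, 16, 48, 48)]
classify = lambda crop: np.array([0.1, 0.7, 0.2])  # [background, cat, dog]
print(detect_two_stage(image, propose, classify))
#+END_SRC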
Additionally, it discusses some existing one-stage approaches like
[[https://pjreddie.com/darknet/yolo/][YOLO]] and [[https://arxiv.org/abs/1512.02325][SSD]]. In essence, these run only the second step - but
instead of starting from a sparse set of locations that are probably
something of interest, they start from a dense set of locations that
blankets the entire image: a grid of many positions, over many sizes,
and over many aspect ratios, regardless of whether any of them contain
an object. This is simpler and faster - but not nearly as accurate.
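To make "dense" concrete, here is a minimal sketch (my own
illustration, not code from the paper or from keras-retinanet) of
generating anchor boxes over a grid of positions, sizes, and aspect
ratios:

#+BEGIN_SRC python
import numpy as np

def dense_anchors(img_h, img_w, stride=32,
                  sizes=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """One anchor box per (position, size, aspect ratio) combination;
    each box is (x0, y0, x1, y1) centered on a grid point."""
    boxes = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for size in sizes:
                for ratio in ratios:
                    # Width/height chosen so area = size**2 and w/h = ratio:
                    w, h = size * np.sqrt(ratio), size / np.sqrt(ratio)
                    boxes.append((cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2))
    return np.array(boxes)

# Even at a single coarse stride, the boxes blanket the image:
print(dense_anchors(512, 512).shape)  # (2304, 4): 16x16 positions, 9 anchors each
#+END_SRC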
Broadly, training these models means minimizing some loss function
based on what the model misclassifies when run on training data. It's
preferable to be able to compute a loss over each individual instance,
and to add all of these losses up to produce an overall loss.
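In symbols, with the usual cross-entropy loss over N candidate
locations (my notation, slightly simplified from the paper's), the
overall loss is just the sum of the per-location losses:

\[ L = \sum_{i=1}^{N} \mathrm{CE}(p_i, y_i) = -\sum_{i=1}^{N} \log p_{i,y_i} \]

where \(p_i\) is the vector of predicted class probabilities at
location \(i\) and \(y_i\) is its ground-truth class.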
This leads to a problem in one-stage detectors: the dense set of
locations being classified usually contains a small number of
locations that actually have objects (positives), and a much larger
number of locations that are just background and can be very easily
classified as such (easy negatives). However, the loss function still
adds all of them up - and even if the loss is relatively low for each
of the easy negatives, their cumulative loss can drown out the loss
from the objects that are being misclassified. Since training tries
to minimize this total loss, it mostly nudges the model to improve in
the area least in need of it (its ability to classify background that
it already classifies well) and neglects the area most in need of it
(its ability to classify the "difficult" objects that it is
misclassifying).
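A back-of-the-envelope example, with numbers I made up purely for
illustration: suppose one image yields 100,000 background locations
the model already classifies with probability 0.99, and 100 objects it
badly misclassifies with probability 0.1. Under cross-entropy, the
easy negatives still dominate the total:

#+BEGIN_SRC python
import numpy as np

# Per-location cross-entropy losses (hypothetical numbers):
loss_easy_neg = -np.log(0.99)   # ~0.01, background classified well
loss_hard_pos = -np.log(0.10)   # ~2.30, object badly misclassified

total_neg = 100000 * loss_easy_neg   # ~1005
total_pos = 100 * loss_hard_pos      # ~230
print(total_neg, total_pos)  # easy negatives contribute ~4x more loss
#+END_SRC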
# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?
This, in a nutshell, is the *class imbalance* issue that the paper
identifies as the limiting factor for the accuracy of one-stage
detectors.
# TODO: Visualize this. Can I?