#+TITLE: Explaining RetinaNet
#+AUTHOR: Chris Hodapp
#+DATE: December 13, 2017
#+TAGS: technobabble

A paper came out in the past few months, [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of Facebook's teams. The goal of this post is to explain this work a bit as I work through the paper, and to look at one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

"Object detection" as it is used here refers to machine learning models that can not just identify a single object in an image, but can identify and *localize* multiple objects, like in the photo below, taken from [[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:

# TODO:
# Define mAP

#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-objdet.jpg]]

The paper discusses many of the two-stage approaches, like R-CNN and its variants, which work in two steps:

1. One model proposes a sparse set of locations in the image that probably contain something. Ideally, this set contains all of the objects in the image, but filters out the majority of negative locations (i.e. only background, no foreground).
2. Another model, typically a convolutional neural network, classifies each location in that sparse set as either being foreground (some specific object class) or background.

It also discusses some existing one-stage approaches like [[https://pjreddie.com/darknet/yolo/][YOLO]] and [[https://arxiv.org/abs/1512.02325][SSD]]. In essence, these run only the second step - but instead of starting from a sparse set of locations that probably contain something of interest, they start from a dense set that blankets the entire image: a grid of many locations, over many sizes and many aspect ratios, regardless of whether any of them contain an object. This is simpler and faster - but not nearly as accurate.

Broadly, training these models requires minimizing some kind of loss function based on what the model misclassifies when it is run on some training data. It's preferable to be able to compute some loss over each individual instance, and add all of these losses up to produce an overall loss.

This leads to a problem in one-stage detectors: the dense set of locations being classified usually contains a small number of locations that actually have objects (positives), and a much larger number that are just background and can be classified very easily as background (easy negatives). However, the loss function still adds all of them up - and even if the loss is relatively low for each of the easy negatives, their cumulative loss can drown out the loss from objects that are being misclassified. Since the training process is trying to minimize this total loss, it mostly nudges the model to improve in the area least in need of it (its ability to classify background areas that it already classifies well) and neglects the area most in need of it (its ability to classify the "difficult" objects that it is misclassifying).

# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?

This, in a nutshell, is the *class imbalance* issue that the paper identifies as the limiting factor for the accuracy of one-stage detectors.

# TODO: Visualize this. Can I?
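
To get a sense of where that dense set of locations comes from and how big it gets, here is a small counting sketch. The strides, scales, and aspect ratios below are illustrative stand-ins, not the exact anchor scheme from the paper or any particular detector:

#+BEGIN_SRC python
import math

# Hypothetical dense-detection setup, just to count candidate locations.
image_size = 800                  # square input image, in pixels
strides = [8, 16, 32, 64, 128]    # one grid of locations per feature-map level
scales_per_cell = 3               # box sizes tried at each grid cell
ratios_per_cell = 3               # box shapes tried at each grid cell

total = 0
for stride in strides:
    # The grid at this level has one cell every `stride` pixels.
    cells = math.ceil(image_size / stride) ** 2
    total += cells * scales_per_cell * ratios_per_cell

print(f"Candidate locations to classify: {total}")   # ~120,000
#+END_SRC

Even with these made-up numbers, a single image ends up with on the order of a hundred thousand boxes to classify, nearly all of which are background.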
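
And here is a rough back-of-the-envelope sketch of why that matters for an additive loss. The counts and probabilities are made up, but the shape of the result is the point: even when each easy negative contributes almost nothing, their sheer number lets them dominate the total.

#+BEGIN_SRC python
import numpy as np

def cross_entropy(p_correct):
    """Standard cross-entropy loss for one location: -log(p of the true class)."""
    return -np.log(p_correct)

# Hypothetical numbers for one image of a dense detector:
num_easy_negatives = 100_000   # background locations the model already handles
num_hard_positives = 50        # actual objects the model is getting wrong

p_easy_negative = 0.99         # model is ~99% sure these are background (correct)
p_hard_positive = 0.30         # model gives the true class only 30% probability

loss_easy = num_easy_negatives * cross_entropy(p_easy_negative)
loss_hard = num_hard_positives * cross_entropy(p_hard_positive)

print(f"Total loss from easy negatives: {loss_easy:8.1f}")
print(f"Total loss from hard positives: {loss_hard:8.1f}")
print(f"Fraction of loss from easy negatives: "
      f"{loss_easy / (loss_easy + loss_hard):.0%}")
#+END_SRC

With these numbers, the easy negatives account for roughly 94% of the total loss, so gradient descent spends most of its effort on locations the model already classifies well.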