Migrate some drafts into content/posts with 'draft' flag

This commit is contained in:
Chris Hodapp
2020-04-30 19:00:38 -04:00
parent fba8a611e3
commit 129bfeb3e7
8 changed files with 37 additions and 5195 deletions


@@ -0,0 +1,56 @@
---
title: Retrospect on Foresight
author: Chris Hodapp
date: January 8, 2018
tags:
- technobabble
- rambling
draft: true
---
/(Spawned from some idle thoughts around the summer of 2015.)/
Why are old technological ideas that were "ahead of their time", but
which lost out to other ideas, worth studying?
We can see them as raw ideas that "modern" understanding never
refined - misguided fantasies or even just mistakes. The flip side of
this is that we can see them as ideas that are free of a nearly
inescapable modern context and all of the preconceptions and blinders
it carries.
Some of these visionaries offer a valuable combination:
- they're detached from this modern context (by mere virtue of it not
existing yet),
- they have considerable experience, imagination, and foresight,
- they devoted time and effort to work extensively on something and to
communicate their thoughts, feelings, and analysis in a durable way.
To put it another way: They give us analysis done from a context
that is long gone. They help us think beyond our current context.
They help us answer a question, "What if we took a different path
then?"
[[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][Epigram #53]] from Alan Perlis offers some relevant skepticism here: "So
many good ideas are never heard from again once they embark in a
voyage on the semantic gulf." My interpretation of it is that we tend
to idolize ideas, old and new, because they sound somehow different,
innovative, and groundbreaking, but attempts at analysis or practical
realization of the ideas lead to a bleaker reality: the idea may be
completely meaningless (the equivalent of a [[https://en.wiktionary.org/wiki/deepity][deepity]]),
wildly impractical, or a mere facade over what is already established.
* Examples
* Scratch
- Douglas Engelbart is perhaps one of the canonical examples of a person
who was an endless source of these ideas. Ted Nelson arguably is
another. Alan Turing is an early example widely regarded for his
foresight.
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
- "Do you remember a time when..." only goes so far.
- Buckminster Fuller
# Tools For Thought


@@ -0,0 +1,381 @@
---
title: Modularity & Abstraction (working title)
author: Chris Hodapp
date: April 20, 2017
tags:
- technobabble
- rambling
draft: true
---
# Why don't I turn this into a paper for arXiv too? It can still be
# posted to the blog (just also make it exportable to LaTeX perhaps)
_Modularity_ and _abstraction_ feature prominently wherever computers
are involved. This is meant very broadly: it applies to designing
software, using software, integrating software, and to a lot of
hardware as well. It applies elsewhere, and almost certainly
originated elsewhere first, however, it appears especially crucial
around software.
Definitions, though, are a bit vague (including anything in this
post). My goal in this post isn't to try to (re)define them, but to
explain their essence and expand on a few theses:
- Modularity arises naturally in a wide array of places.
- Modularity and abstraction are intrinsically connected.
- Both are for the benefit of people. This usually doesn't need to be
  stated, but to echo Paul Graham and probably others: to the
  computer, it is all the same.
- More specifically, both are there to manage *complexity* by
assigning meaningful information and boundaries which allow people
to match a problem to what they can actually think about.
# - Whether a given modularization makes sense depends strongly on
# meaning and relevance of *information* inside and outside of
# modules, and broad context matters to those.
* Why?
People generally agree that "modularity" is good. The idea that
something complex can be designed and understood in terms of smaller,
simpler pieces comes naturally to anyone who has built something out
of smaller pieces or taken something apart. (This isn't to say that
reductionism is the best way to understand everything, but that's
another matter.) It runs very deep in the Unix philosophy, which ESR
gives a good overview of in [[http://www.catb.org/~esr/writings/taoup/html/ch01s06.html][The Art of Unix Programming]] - or, listen
to it from [[https://youtu.be/tc4ROCJYbm0?t%3D248][Kernighan himself]] at Bell Labs in
1982.
Tim Berners-Lee gives some practical limitations in [[https://www.w3.org/DesignIssues/Principles.html][Principles of
Design]] and in [[https://www.w3.org/DesignIssues/Modularity.html][Modularity]]: "Modular design hinges on the simplicity and
abstract nature of the interface definition between the modules. A
design in which the insides of each module need to know all about each
other is not a modular design but an arbitrary partitioning of the
bits... It is not only necessary to make sure your own system is
designed to be made of modular parts. It is also necessary to realize
that your own system, no matter how big and wonderful it seems now,
should always be designed to be a part of another larger system." Les
Hatton in [[http://www.leshatton.org/TAIC2008-29-08-2008.html][The role of empiricism in improving the reliability of
future software]] even did an interesting derivation tying the defect
density in software to how it is broken into pieces. The 1972 paper
[[https://www.cs.virginia.edu/~eos/cs651/papers/parnas72.pdf][On the Criteria to be Used in Decomposing Systems into Modules]] cites a
1970 textbook on why modularity is important in systems programming,
but also notes that nothing is said on how to divide a system into
modules.
"Abstraction" doesn't have quite the same consensus. In software, it's
generally understood that decoupled or loosely-coupled is better than
tightly-coupled, but at the same time, "abstraction" can have the
connotation of something that gets in the way, adds overhead, and
confuses things. Dijkstra, in one of few instances of not being
snarky, allegedly said, "Being abstract is something profoundly
different from being vague. The purpose of abstraction is not to be
vague, but to create a new semantic level in which one can be
absolutely precise." Joel Spolsky, in one of few instances of me
actually caring what he said, also has a blog post from 2002 on the
[[https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/][Law of Leaky Abstractions]] ("All non-trivial abstractions, to some
degree, are leaky.") The [[https://en.wikipedia.org/wiki/Principle_of_least_privilege][principle of least privilege]] is likewise a
thing. So, abstraction too has its practical and theoretical
limitations.
* How They Relate
I bring these up together because: *abstractions* are the boundaries
between *modules*, and the communication channels (APIs, languages,
interfaces, protocols) through which they talk. It need not
necessarily be a standardized interface or a well-documented boundary,
though that helps.
Available abstractions vary. They vary by, for instance:
- ...what language you choose. Consider, for instance, that a language
like Haskell contains various abstractions done largely within the
type system that cannot be expressed in many other languages.
Languages like Python, Ruby, or JavaScript might have various
abstractions meaningful only in the context of dynamic typing. Some
languages more readily permit the creation of new abstractions, and
this might lead to a broader range of abstractions implemented in
libraries.
- ...the operating system and its standard library. What is a
process? What is a thread? What is a dynamic library? What is a
filesystem? What is a file? What is a block device? What is a
socket? What is a virtual machine? What is a bus? What is a
commandline?
- ...the time period. How many of the abstractions named above were
around or viable in 1970, 1980, 1990, 2000? In the opposite
direction, when did you last use that lovely standardized protocol,
[[https://en.wikipedia.org/wiki/Common_Gateway_Interface][CGI]], to let your web application and your web server communicate,
use [[https://en.wikipedia.org/wiki/PHIGS][PHIGS]] to render graphics, or access a large multiuser system
via hard-wired terminals?
As such: possible ways to modularize things vary as well. Certain
ways of modularizing something may not even make sense (or be
feasible) until the same thing has been done other ways hundreds or
thousands of times.
Other terms are related too. "Loosely-coupled" (or loose coupling)
and "tightly-coupled" refer to the sort of abstractions sitting
between modules, or whether or not there even are separate modules.
"Decoupling" involves changing the relationship between modules
(sometimes, creating them in the first place), typically splitting
something into more sensible pieces separated by a more sensible
abstraction. "Factoring out" is really a form of decoupling in which
smaller parts of something are turned into a module which the original
thing then interfaces with (one canonical example is taking some bits
of code, often that are very similar or identical in many places, and
moving them into a single function). To say one has "abstracted over"
some details implies that a module is handling those details, that the
details shouldn't matter, and what does matter is the abstraction one
is using.
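A tiny, hypothetical Python sketch of "factoring out" (all names here
are mine, invented purely for illustration):

```python
# Before factoring out: near-identical validation duplicated at two sites.
def register(name):
    if not name or len(name) > 64:
        raise ValueError("bad name")
    return {"user": name}

def rename(old, new):
    if not new or len(new) > 64:
        raise ValueError("bad name")
    return {"user": new, "was": old}

# After: the duplicated bits move into one function; both call sites
# now interface with that small module instead of repeating its details.
def validate_name(name):
    if not name or len(name) > 64:
        raise ValueError("bad name")

def register_v2(name):
    validate_name(name)
    return {"user": name}

def rename_v2(old, new):
    validate_name(new)
    return {"user": new, "was": old}
```

The abstraction here is just "a valid name"; the call sites no longer
need to know what validity means internally.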
One of Rich Hickey's favorite topics is *composition*, and with good
reason (and you should check out [[http://www.infoq.com/presentations/Simple-Made-Easy/][Simple Made Easy]] regardless). This
relates as well: to *compose* things together effectively into bigger
parts requires that they support some common abstraction.
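A one-line illustration in Python (my example, not Hickey's): iterator
functions compose freely precisely because they all speak one common
abstraction, "an iterable of values":

```python
# map, filter, and sum compose without knowing anything about each
# other; each consumes and/or produces the same abstraction.
nums = range(10)
evens = filter(lambda n: n % 2 == 0, nums)
squares = map(lambda n: n * n, evens)
total = sum(squares)
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```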
In the same area, [[https://clojurefun.wordpress.com/2012/08/17/composition-over-convention/][Composition over convention]] is a good read on how
/frameworks/ run counter to modularity: they aren't built to behave
like modules of a larger system.
# -----
It has a very pragmatic reason behind it: When something is a module
unto itself, presumably it is relying on specific abstractions, and it
is possible to freely change this module's internal details (provided
that it still respects the same abstractions), to move this module to
other contexts (anywhere that provides the same abstractions), and to
replace it with other modules (anything that respects the same
abstractions).
It also has a more abstract reason: When something is a module unto
itself, the way it is designed and implemented usually presents more
insight into the fundamentals of the problem it is solving. It
contains fewer incidental details, and more essential details.
# -------
* Information
I referred earlier to the abstractions themselves as both boundaries
and communications channels. Another common view is that abstractions
are *contracts* with a communicated and agreed purpose, and I think
this is a useful definition too: it conveys the notion that there are
multiple parties involved and that they are free to behave as needed
provided that they fulfill some obligation.
Some definitions refer directly to information, like the [[https://en.wikipedia.org/wiki/Abstraction_principle_(computer_programming)][abstraction
principle]], which aims to reduce duplication of information. This fits
with [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][don't repeat yourself]], so that "a modification of any single
element of a system does not require a change in other logically
unrelated elements".
# ----- FIXME
Consider the information this module deals in, in essence.
What is the most general form this information could be expressed in,
without being so general as to encompass other things that are
irrelevant or so low-level as to needlessly constrain the possible
contexts?
(Aristotle's theory of definitions?)
* Less-Conventional Examples
One thing I've watched with some interest is when new abstractions
emerge (or, perhaps, old ones become more widespread) to solve
problems that I wasn't even aware existed.
[[https://circleci.com/blog/it-really-is-the-future/][It really is the future]] talks about a lot of more recent forms of
modularity from the land of devops, most of which were completely
unheard-of in, say, 2010. [[https://www.functionalgeekery.com/episode-75-eric-b-merritt/][Functional Geekery episode 75]] talks about
many similar things.
[[https://jupyter.org/][Jupyter Notebook]] is one of my favorites here. It provides a notebook
interface (similar to something like Maple or Mathematica) which:
- allows the notebook to use various different programming languages
underneath,
- decouples where the notebook is used and where it is running, due to
being implemented as a web application accessed through the browser,
- decouples the presentation of a stored notebook from Jupyter itself
by using a [[https://nbformat.readthedocs.io/en/latest/][JSON-based file format]] which can be rendered without
Jupyter (like GitHub does if you commit a .ipynb file).
I love notebook interfaces already because they simplify experimenting
by handling a lot of things I'd otherwise have to do manually - like
saving results and keeping them lined up with the exact code that
produced them. Jupyter adds some other use-cases I find marvelous -
for instance, I can let the interpreter run on my workstation which
has all of the computing power, but I can access it across the
Internet from my laptop.
[[https://zeppelin.apache.org/][Apache Zeppelin]] does similar things with different languages; I've
just used it much less.
Another favorite of mine is [[https://nixos.org/nix/][Nix]]. One excellent article, [[http://blog.ezyang.com/2014/08/the-fundamental-problem-of-programming-language-package-management/][The
fundamental problem of programming language package management]],
doesn't ever mention Nix but explains very well the problems it sets
out to solve. To be able to combine nearly all of the
programming-language-specific package managers into a single module is
a very lofty goal, but Nix appears to do a decent job of it (among
other things).
The [[https://www.lua.org/][Lua]] programming language is noteworthy here. It's written in
clean C with minimal dependencies, so it runs nearly anywhere that a C
or C++ compiler targets. It's purposely very easy both to *embed*
(i.e. to put inside of a program and use as an extension language,
such as for plugins or scripting) and to *extend* (i.e. to connect
with libraries to allow their functionality to be used from Lua). [[https://www.gnu.org/software/guile/][GNU
Guile]] has many of the same properties, I'm told.
We ordinarily think of object systems as something living in the
programming language. However, the object system is sometimes made a
module that is outside of the programming language, and languages just
interact with it. [[https://en.wikipedia.org/wiki/GObject][GObject]], [[https://en.wikipedia.org/wiki/Component_Object_Model][COM]], and [[https://en.wikipedia.org/wiki/XPCOM][XPCOM]] do this, and to some
extent, so does [[https://en.wikipedia.org/wiki/Meta-object_System][Qt & MOC]] - and there are probably hundreds of others,
particularly if you allow dead ones created during the object-oriented
hype of the '90s. This seems to happen in systems where the object
hierarchy is in effect "bigger" than the language.
[[https://zeromq.org/][ZeroMQ]] is another example: a set of cross-language abstractions for
communication patterns in a distributed system. I know it's likely
not unique, but it is one of the better-known and the first I thought
of, and I think their [[http://zguide.zeromq.org/page:all][guide]] is excellent.
Interestingly, iMatix, the company behind ZeroMQ, also created [[https://github.com/imatix/gsl][GSL]] and
explained its value in [[https://imatix-legacy.github.io/mop/introduction.html][Model-Oriented Programming]], in which
abstraction features heavily. I've not used GSL, and am skeptical of
its stated usefulness, but it looks like it is meant to help create
compile-time abstractions that likewise sit outside of any particular
programming language.
# TODO: Expand on this.
[[https://web.hypothes.is/][hypothes.is]] is a curious one that I find fascinating. They're trying
to factor out annotation and commenting from something that is handled
on a per-webpage basis and turn it into its own module, and I really
like what I've seen. However, it does not seem to have caught on
much.
The Unix tradition lives on in certain modern tools. [[https://stedolan.github.io/jq/][jq]] has proven
very useful anytime I've had to mess with JSON data. [[http://www.dest-unreach.org/socat/][socat]] and [[http://netcat.sourceforge.net/][netcat]]
have saved me numerous times. I'm sure certain people love the fact
that [[https://neovim.io/][Neovim]] is designed to be seamlessly embedded and to extend with
plugins. [[https://suckless.org/philosophy][suckless]] perhaps takes it too far, but gets an honorary
mention...
# ???
# Also, TCP/IP and the entire notion of packet-switched networks.
# And the entire OSI 7-layer model.
# Also, caches - of all types. (CPU, disk...)
# One key is how the above let you *reason* about things without
# knowing their specifics.
People know that I love Emacs, but I also believe many of the
complaints about how large it is. Although it is basically its own
operating system, /within this/ it has considerable modularity. The
same applies somewhat to Blender, I suppose.
Consider [[https://research.google.com/pubs/pub43146.html][Machine Learning: The High Interest Credit Card of Technical Debt]],
a paper that anyone working around machine learning should read and
re-read regularly. Large parts of the paper are about ways in which
machine learning conflicts with proper modularity and abstraction.
(However, [[https://colah.github.io/posts/2015-09-NN-Types-FP/][Neural Networks, Types, and Functional Programming]] is a
good post and shows some sorts of abstraction that do still exist, at
least in neural networks.)
Even DOS had useful abstractions. Things like
DriveSpace/DoubleSpace/Stacker worked well enough because most
software that needed files relied on DOS's normal abstractions to
access them - so it did not matter to them that the underlying
filesystem was actually compressed, or was actually a RAM disk, or was
on some obscure SCSI interface. Likewise, for the silliness known as
[[https://en.wikipedia.org/wiki/Expanded_memory][EMS]], applications that accessed memory through the EMS abstraction
could disregard whether it was a "real" EMS board providing access to
that memory, or an expanded memory manager providing indirect access
to some other memory - or even to a hard disk pretending to be
memory.
Even more abstractly: emulators work because so much software
respected the abstraction of some specific CPU and hardware platform.
Submitted without further comment:
https://github.com/stevemao/left-pad/issues/4
* Fragments
- Abstracting over...
- Multiple applications
- Multiple users
- Multiple CPUs
- Multiple hosts
- [[Notes - Paper, 2016-11-13]]
- Tanenbaum vs. Linus war & microkernels
- TBL: "The choice of language is a common design choice. The low
power end of the scale is typically simpler to design, implement and
use, but the high power end of the scale has all the attraction of
being an open-ended hook into which anything can be placed: a door
to uses bounded only by the imagination of the programmer. Computer
Science in the 1960s to 80s spent a lot of effort making languages
which were as powerful as possible. Nowadays we have to appreciate
the reasons for picking not the most powerful solution but the least
powerful. The reason for this is that the less powerful the
language, the more you can do with the data stored in that
language. If you write it in a simple declarative form, anyone can
write a program to analyze it in many ways." (Languages are a kind
of abstraction - one that influences how a module is written, and
what contexts it is useful in.)
- "Self" paper & structural reification?
- I'm still not sure how this relates, but it may perhaps relate to
how *not* to make things modular (structural reification is a sort
of check on the scope of objects/classes)
- What by Rich Hickey?
- Simple Made Easy?
- The Value of Values?
- SICP: [[https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-19.html#%25_chap_3][Modularity, Objects, and State]]
- [[https://www.cs.utexas.edu/~wcook/Drafts/2009/essay.pdf][On Understanding Data Abstraction, Revisited]]
- http://www.catb.org/~esr/writings/taoup/html/apb.html#Baldwin-Clark -
Carliss Baldwin and Kim Clark. Design Rules, Vol 1: The Power of
Modularity. 2000. MIT Press. ISBN 0-262-024667.
- Brooks, No Silver Bullet?
- https://en.wikipedia.org/wiki/Essential_complexity
- https://twitter.com/fchollet/status/962074070513631232
- [[https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-9.html#%25_chap_1][From SICP chapter 1 intro]]: "The acts of the mind, wherein it exerts
its power over simple ideas, are chiefly these three: 1. Combining
several simple ideas into one compound one, and thus all complex
ideas are made. 2. The second is bringing two ideas, whether simple
or complex, together, and setting them by one another so as to take
a view of them at once, without uniting them into one, by which it
gets all its ideas of relations. 3. The third is separating them
from all other ideas that accompany them in their real existence:
this is called abstraction, and thus all its general ideas are
made." -John Locke, An Essay Concerning Human Understanding (1690)
- One point I have ignored (maybe): You clearly separate the 'inside'
of a module (its implementation) from the 'outside' (that is - its
boundaries, the abstractions that it interfaces with or that it
implements) so that the 'inside' can change more or less freely
without having any effect on the outside.
- Abstractions as a way of reducing the work required to add
functionality (changes can be made just in the relevant modules, and
other modules do not need to change to conform)
- What is more key? Communication, information content, contracts,
details?
- [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][Don't repeat yourself]]
- [[https://simplyphilosophy.org/study/aristotles-definitions/][Aristotle & theory of definitions]]
- this isn't right. I need to find the quote in the Durant book
(which will probably have an actual source) that pertains to how
specific and how general a definition must be
- [[https://en.wikipedia.org/wiki/SOLID][SOLID]]
- [[https://en.wikipedia.org/wiki/Cross-cutting_concern][Cross-cutting concerns]] and [[https://en.wikipedia.org/wiki/Aspect-oriented_programming][Aspect-oriented programming]]
- [[https://en.wikipedia.org/wiki/Separation_of_concerns][Separation of Concerns]]
- [[https://en.wikipedia.org/wiki/Abstraction_principle_(computer_programming)][Abstraction principle]]
- [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][Don't repeat yourself]]

Binary file not shown (after: 369 KiB)

Binary file not shown (after: 256 KiB)

Binary file not shown (after: 124 KiB)


@@ -0,0 +1,373 @@
---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags:
- technobabble
draft: true
---
# TODO: The inline equations are still broken (maybe because this is
# in org format)
# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:
A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
* Object Detection
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:
# TODO:
# Define mAP
#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[./2017-12-13-objdet.jpg]]
At the time of writing, the most accurate object-detection methods
were based around R-CNN and its variants, and all used two-stage
approaches:
1. One model proposes a sparse set of locations in the image that
probably contain something. Ideally, this contains all objects in
the image, but filters out the majority of negative locations
(i.e. only background, not foreground).
2. Another model, typically a CNN (convolutional neural network),
classifies each location in that sparse set as either being
foreground and some specific object class (like "kite" or "person"
above), or as being background.
Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
OverFeat. These simplified/approximated the two-stage approach by
replacing the first step with brute force. That is, instead of
generating a sparse set of locations that probably have something of
interest, they simply handle all locations, whether or not they likely
contain something, by blanketing the entire image in a dense sampling
of many locations, many sizes, and many aspect ratios.
This is simpler and faster - but not as accurate as the two-stage
approaches.
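For a sense of scale, here's a toy sketch of that dense sampling (the
grid size, scales, and aspect ratios are numbers I made up, not from
any specific detector):

```python
import itertools

def dense_anchors(grid_w, grid_h, scales, aspect_ratios):
    # One candidate box per (grid position x scale x aspect ratio);
    # a one-stage detector classifies every single one of these.
    anchors = []
    for cx, cy in itertools.product(range(grid_w), range(grid_h)):
        for s, ar in itertools.product(scales, aspect_ratios):
            w = s * ar ** 0.5   # width/height chosen so that
            h = s / ar ** 0.5   # w/h == ar and area == s*s
            anchors.append((cx, cy, w, h))
    return anchors

a = dense_anchors(64, 64, scales=[32, 64, 128], aspect_ratios=[0.5, 1.0, 2.0])
print(len(a))  # 64 * 64 * 3 * 3 = 36864 locations to classify
```

Even for these modest numbers, tens of thousands of locations get
classified, nearly all of which are background.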
Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of two-stage
approaches into a single CNN, and exploit the possibility of sharing
computations that would otherwise be done twice. I assume that this
is included in the comparisons done in the paper, but I'm not entirely
sure.
* Training & Class Imbalance
Briefly, the process of training these models requires minimizing some
kind of loss function that is based on what the model misclassifies
when it is run on some training data. It's preferable to be able to
compute some loss over each individual instance, and add all of these
losses up to produce an overall loss. (Yes, far more can be said on
this, but the details aren't really important here.)
# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?
This leads to a problem in one-stage detectors: That dense set of
locations that it's classifying usually contains a small number of
locations that actually have objects (positives), and a much larger
number of locations that are just background and can be very easily
classified as being in the background (easy negatives). However, the
loss function still adds all of them up - and even if the loss is
relatively low for each of the easy negatives, their cumulative loss
can drown out the loss from objects that are being misclassified.
That is: A large number of tiny, irrelevant losses overwhelm a smaller
number of larger, relevant losses. The paper was a bit terse on this;
it took a few re-reads to understand why "easy negatives" were an
issue, so hopefully I have this right.
The training process is trying to minimize this loss, and so it is
mostly nudging the model to improve where it least needs it (its
ability to classify background areas that it already classifies well)
and neglecting where it most needs it (its ability to classify the
"difficult" objects that it is misclassifying).
# TODO: Visualize this. Can I?
This is *class imbalance* in a nutshell, which the paper gives as the
limiting factor for the accuracy of one-stage detectors. Existing
approaches try to tackle it with methods like bootstrapping or hard
example mining, but their accuracy still falls short.
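To illustrate with made-up numbers: suppose 100,000 easy negatives are
each classified as background with probability 0.99, while 20 objects
are badly misclassified (probability 0.1 on the correct class). The
easy negatives still dominate the total loss:

```python
import math

def cross_entropy(p_t):
    # Cross-entropy on the probability assigned to the true class.
    return -math.log(p_t)

# 100,000 easy negatives, each contributing only a tiny loss...
easy_total = 100_000 * cross_entropy(0.99)

# ...versus 20 hard objects the model gets badly wrong.
hard_total = 20 * cross_entropy(0.1)

# The cumulative loss from easy negatives dwarfs the hard examples:
print(round(easy_total), round(hard_total))  # ~1005 vs ~46
```

Minimizing the sum mostly means shaving tiny amounts off the easy
negatives, exactly the behavior described above.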
** Focal Loss
So, the point of all this is: A tweak to the loss function can fix
this issue, and retain the speed and simplicity of one-stage
approaches while surpassing the accuracy of existing two-stage ones.
At least, this is what the paper claims. Their novel loss function is
called *Focal Loss* (as the title references), and it multiplies the
normal cross-entropy by a factor, \( (1-p_t)^\gamma \), where \( p_t \)
approaches 1 as the model predicts a higher and higher probability of
the correct classification, or 0 for an incorrect one, and \( \gamma \) is
a "focusing" hyperparameter (they used \( \gamma=2 \)). Intuitively, this
scaling makes sense: if a classification is already correct (as in the
"easy negatives"), \( (1-p_t)^\gamma \) tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
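To make that concrete, here is a minimal sketch of the focal loss term
in plain Python (my own toy version, not the paper's reference
implementation; I'm also leaving out the α-balancing weight the paper
combines it with):

```python
import math

def cross_entropy(p_t):
    # Standard cross-entropy on the probability of the true class.
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    # Cross-entropy scaled by the (1 - p_t)**gamma factor described above.
    return ((1.0 - p_t) ** gamma) * cross_entropy(p_t)

# An easy negative (p_t = 0.99) is down-weighted by (1-0.99)^2 = 1e-4;
# a hard example (p_t = 0.1) keeps (1-0.1)^2 = 81% of its loss.
print(focal_loss(0.99) / cross_entropy(0.99))  # ~1e-4
print(focal_loss(0.1) / cross_entropy(0.1))    # ~0.81
```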
* RetinaNet architecture
The paper gives the name *RetinaNet* to the network they created which
incorporates this focal loss in its training. While it says, "We
emphasize that our simple detector achieves top results not based on
innovations in network design but due to our novel loss," it is
important not to miss the phrase /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.
I go into both of these aspects below.
* Feature Pyramid Network
Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
sure, the paper shares 4 co-authors with the paper this post
explores). The paper is fairly concise in describing FPNs; it takes
only around 3 pages to explain their purpose, related work, and
their entire design. The remainder shows experimental results and
specific applications of FPNs. While it shows FPNs implemented on a
particular underlying network (ResNet, mentioned below), they were
made purposely to be very simple and adaptable to nearly any kind of
CNN.
To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%2528image_processing%2529][image pyramids]]. The below
diagram illustrates an image pyramid:
#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
#+ATTR_HTML: :width 100% :height 100%
[[./1024px-Image_pyramid.svg.png]]
Image pyramids have many uses, but the paper focuses on their use in
taking something that works only at a certain scale of image - for
instance, an image classification model that only identifies objects
that are around 50 pixels across - and adapting it to handle different
scales by applying it at every level of the image pyramid. If the
model has a little flexibility, some level of the image pyramid is
bound to have scaled the object to the correct size that the model can
match it.
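As a toy sketch of the construction (plain Python over a nested list
standing in for a grayscale image; real code would use something like
OpenCV's pyrDown):

```python
def downsample2x(img):
    # Halve each dimension by averaging non-overlapping 2x2 blocks.
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x+1] + img[y+1][x] + img[y+1][x+1]) / 4.0
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def image_pyramid(img, levels=3):
    # Each level is half the size of the one below it.
    pyramid = [img]
    for _ in range(levels - 1):
        img = downsample2x(img)
        pyramid.append(img)
    return pyramid

img = [[float(x * y % 7) for x in range(8)] for y in range(8)]
pyr = image_pyramid(img)
print([(len(p), len(p[0])) for p in pyr])  # [(8, 8), (4, 4), (2, 2)]
```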
Typically, though, detection or classification isn't done directly on
an image, but rather, the image is converted to some more useful
feature space. However, these feature spaces likewise tend to be
useful only at a specific scale. This is the rationale behind
"featurized image pyramids", or feature pyramids built upon image
pyramids, created by converting each level of an image pyramid to that
feature space.
The problem with featurized image pyramids, the paper says, is that if
you try to use them in CNNs, they drastically slow everything down,
and use so much memory as to make normal training impossible.
However, take a look below at this generic diagram of a deep
CNN:
#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
#+ATTR_HTML: :width 100% :height 100%
[[./Typical_cnn.png]]
You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)
When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of a feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines or
edges at a particular direction. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.
The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed and to produce something that
has similar power to a featurized image pyramid but without too high
of a cost in speed, memory, or complexity.
Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.
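Jumping ahead slightly, the two additions combine roughly as in this
minimal numpy sketch. The random matrices stand in for the learned 1x1
convolutions of the lateral connections, the upsampling is
nearest-neighbor, and the 3x3 convolution the paper applies to each
merged map is omitted; this is an illustration of the wiring, not the
actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels."""
    return x @ w  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)

def upsample2x(x):
    """Nearest-neighbor 2x spatial upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(bottom_up, d=256):
    """bottom_up: feature maps from fine to coarse, each (H, W, C_l).
    Returns merged maps at the same resolutions, all d channels.
    The 1x1 weights here are random stand-ins for learned parameters."""
    laterals = [conv1x1(c, rng.standard_normal((c.shape[-1], d)))
                for c in bottom_up]
    merged = [laterals[-1]]                 # start at the coarsest level
    for lat in reversed(laterals[:-1]):     # walk back toward finer levels
        merged.append(lat + upsample2x(merged[-1]))
    return merged[::-1]                     # fine-to-coarse again

# Hypothetical bottom-up maps with ResNet-like channel counts:
c3 = rng.standard_normal((8, 8, 512))
c4 = rng.standard_normal((4, 4, 1024))
c5 = rng.standard_normal((2, 2, 2048))
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)
```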
** Top-Down Pathway
** Lateral Connections
** As Applied to ResNet
# Note C=256 and such
# TODO: Link to some good explanations
For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that the paper largely uses the underlying network as described in
its original papers:
- [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
- [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
* Anchors & Region Proposals
Recall what was said last section about feature maps, and that the
deeper stages of the CNN happen to be good for classifying images.
While these deeper stages are lower-resolution than the input images,
and while their influence is spread out over larger areas of the input
image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due to each
stage spreading it a little further), the features here still maintain
a spatial relationship with the input image. That is, moving across
one axis of this feature map still corresponds to moving across the
same axis of the input image.
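How the receptive field "spreads a little further" at each stage can
be computed directly from each stage's kernel size and stride. A
minimal sketch, assuming a hypothetical cascade of 3x3, stride-2
stages (not the actual ResNet configuration):

```python
def receptive_field(stages):
    """stages: list of (kernel_size, stride). Returns the receptive
    field (in input pixels) and the stride (jump) of the final stage."""
    rf, jump = 1, 1
    for k, s in stages:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) jumps
        jump *= s              # and multiplies the spacing between outputs
    return rf, jump

# A hypothetical cascade of five 3x3, stride-2 stages:
rf, jump = receptive_field([(3, 2)] * 5)
print(rf, jump)  # 63 32
```

The second number is also the "jump": moving one point across this
feature map corresponds to moving 32 pixels across the input image.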
# Just re-explain the above with the feature pyramid
RetinaNet's design draws heavily from RPNs (Region Proposal
Networks); here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.
Central to RPNs are *anchors*. Anchors aren't exactly a feature of
the CNN; they're more a property used in its training and inference.
In particular:
- Say that the feature pyramid has \( L \) levels, and that level \( l+1 \) is
half the resolution (thus double the scale) of level \( l \).
- Say that level \( l \) is a 256-channel feature map of size \( W \times H \)
  (i.e. it's a tensor with shape \( W \times H \times 256 \)). Note that
  \( W \) and \( H \) will be larger at lower levels and smaller at higher
  levels, but in RetinaNet, at least, always with 256-channel samples.
- For every point on that feature map (all \( WH \) of them), we can
identify a corresponding point in the input image. This is the
center point of a broad region of the input image that influences
this point in the feature map (i.e. its receptive field). Note that
as we move up to higher levels in the feature pyramid, these regions
grow larger, and neighboring points in the feature map correspond to
larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
rectangular regions associated with each point of a feature map.
The size of the anchor depends on the scale of the feature map, or
equivalently, what level of the feature map it came from. All this
means is that anchors in level \( l+1 \) are twice as large as the
anchors of level \( l \).
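The correspondence in the list above can be sketched directly: every
feature-map point maps to an input-image center, and each level's
anchors are twice the size of the level below. The strides and base
sizes here are hypothetical, chosen only to obey that doubling rule:

```python
import numpy as np

def anchor_centers(W, H, stride):
    """Centers (in input-image pixels) for every point of a W x H
    feature map whose points are `stride` input pixels apart."""
    xs = (np.arange(W) + 0.5) * stride
    ys = (np.arange(H) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    return np.stack([cx.ravel(), cy.ravel()], axis=1)  # (W*H, 2)

# Hypothetical pyramid: level l has stride 2**l and base anchor size
# 4 * stride, so anchors at level l+1 are twice as large as at level l.
for level in (3, 4, 5):
    stride = 2 ** level
    print(level, stride, 4 * stride, len(anchor_centers(8, 8, stride)))
```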
The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?
My above explanation glossed over a couple things, but nothing that
should change the fundamentals.
- Anchors are actually associated with every 3x3 window of the
  feature map, not precisely every point, but all this really means
  is that it's "every point and its immediate neighbors" rather than
  "every point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1) crossed with each
  of three scale factors (\( 1 \), \( 2^{1/3} \), and \( 2^{2/3} \)) on top of its base
  scale.
This is just to handle objects of less-square shapes and to cover
the gap in scale in between levels of the feature pyramid. Note
that the scale factors are evenly-spaced exponentially, such that an
additional step down wouldn't make sense (the largest anchors at the
pyramid level /below/ already cover this scale), and nor would an
additional step up (the smallest anchors at the pyramid level
/above/ already cover it).
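The 9 anchor shapes per location can be generated as below. This is a
sketch assuming the common area-preserving convention, where each
aspect ratio reshapes the box without changing its area;
implementations differ in exactly how the ratio is applied:

```python
import itertools

def anchor_shapes(base_size):
    """The 9 (width, height) pairs per location: 3 aspect ratios
    (1:2, 1:1, 2:1) crossed with 3 scale factors (1, 2**(1/3), 2**(2/3)).
    Each anchor keeps an area of (base_size * scale)**2."""
    shapes = []
    for ratio, scale in itertools.product([0.5, 1.0, 2.0],
                                          [2 ** (i / 3) for i in range(3)]):
        area = (base_size * scale) ** 2
        w = (area / ratio) ** 0.5   # ratio = height / width
        shapes.append((w, w * ratio))
    return shapes

shapes = anchor_shapes(32)
print(len(shapes))  # 9
```

Note the largest scale factor is \( 2^{2/3} \approx 1.59 \); one more
step would reach \( 2 \), the base scale of the next pyramid level up,
which is exactly the overlap the post describes avoiding.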
Here, finally, is where actual classification and regression come in,
via the *classification subnet* and the *box regression subnet*.
** Classification Subnet
Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside
this anchor? Or, more accurately: for each of \( K \) object classes,
what is the probability that this anchor contains an object of that
class (or just background)?
** Box Regression Subnet
The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative to the
anchor's bounds (thus specifying a different region). Note that this
subnet completely ignores the class of the object.
The classification subnet already tells us whether or not a given
anchor contains an object - which already gives rough bounds on
it. The box regression subnet helps tighten these bounds.
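Concretely, assuming the standard parametrization from the Faster
R-CNN paper: the first two offsets shift the anchor's center in units
of its width and height, and the last two scale the width and height
exponentially. A minimal decoding sketch:

```python
def apply_box_offsets(anchor, t):
    """Decode (tx, ty, tw, th) offsets against an anchor (x1, y1, x2, y2),
    using the parametrization from the Faster R-CNN paper."""
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    tx, ty, tw, th = t
    cx, cy = cx + tx * w, cy + ty * h          # shift center, in anchor units
    w, h = w * 2.718281828459045 ** tw, h * 2.718281828459045 ** th  # exp scale
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Zero offsets leave the anchor unchanged:
print(apply_box_offsets((0, 0, 32, 32), (0.0, 0.0, 0.0, 0.0)))
```

The exponential on the size terms keeps predicted widths and heights
positive no matter what the subnet outputs.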
** Other notes (?)
I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...
# Parameter sharing? How to explain?
* Training
# Ground-truth object boxes
# Intersection-over-Union thresholds
* Inference
# Top N results
* References
# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere
1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id=SJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/