Migrate some drafts into content/posts with 'draft' flag
@@ -1,52 +0,0 @@
---
title: Retrospect on Foresight
author: Chris Hodapp
date: January 8, 2018
tags: technobabble, rambling
---

/(Spawned from some idle thoughts around the summer of 2015.)/

Why are old technological ideas that were "ahead of their time", but
which lost out to other ideas, worth studying?

We can see them as raw ideas that "modern" understanding never
refined - misguided fantasies or even just mistakes. The flip side of
this is that we can see them as ideas that are free of a nearly
inescapable modern context and all of the preconceptions and blinders
it carries.

In some of these visionaries is a valuable combination:

- they're detached from this modern context (by mere virtue of it not
  existing yet),
- they have considerable experience, imagination, and foresight,
- they devoted time and effort to work extensively on something and to
  communicate their thoughts, feelings, and analysis in a durable way.

To put it another way: they give us analysis done from a context that
is long gone. They help us think beyond our current context. They
help us answer the question, "What if we had taken a different path
back then?"

[[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][Epigram #53]] from Alan Perlis offers some relevant skepticism here: "So
many good ideas are never heard from again once they embark in a
voyage on the semantic gulf." My interpretation is that we tend to
idolize ideas, old and new, because they sound somehow different,
innovative, and groundbreaking, but attempts at analysis or practical
realization of those ideas lead to a bleaker reality: perhaps the idea
is completely meaningless (the equivalent of a [[https://en.wiktionary.org/wiki/deepity][deepity]]), wildly
impractical, or a mere facade over what is already established.

* Examples

* Scratch

- Douglas Engelbart is perhaps one of the canonical examples of a
  person who was an endless source of these ideas. Ted Nelson
  arguably is another. Alan Turing is an early example widely
  regarded for his foresight.
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
- "Do you remember a time when..." only goes so far.

# Tools For Thought
@@ -1,376 +0,0 @@
#+TITLE: Modularity & Abstraction (working title)
#+AUTHOR: Chris Hodapp
#+DATE: April 20, 2017
#+TAGS: technobabble

# Why don't I turn this into a paper for arXiv too? It can still be
# posted to the blog (just also make it exportable to LaTeX perhaps)

_Modularity_ and _abstraction_ feature prominently wherever computers
are involved. This is meant very broadly: it applies to designing
software, using software, integrating software, and to a lot of
hardware as well. It applies elsewhere, and almost certainly
originated elsewhere first; however, it appears especially crucial
around software.

Definitions, though, are a bit vague (including anything in this
post). My goal in this post isn't to try to (re)define them, but to
explain their essence and expand on a few theses:

- Modularity arises naturally in a wide array of places.
- Modularity and abstraction are intrinsically connected.
- Both are for the benefit of people. This usually doesn't need to be
  stated, but to echo Paul Graham and probably others: to the
  computer, it is all the same.
- More specifically, both are there to manage *complexity* by
  assigning meaningful information and boundaries which allow people
  to match a problem to what they can actually think about.

# - Whether a given modularization makes sense depends strongly on
#   meaning and relevance of *information* inside and outside of
#   modules, and broad context matters to those.

* Why?

People generally agree that "modularity" is good. The idea that
something complex can be designed and understood in terms of smaller,
simpler pieces comes naturally to anyone who has built something out
of smaller pieces or taken something apart. (This isn't to say that
reductionism is the best way to understand everything, but that's
another matter.) It runs very deep in the Unix philosophy, which ESR
gives a good overview of in [[http://www.catb.org/~esr/writings/taoup/html/ch01s06.html][The Art of Unix Programming]] - or, listen
to it from [[https://youtu.be/tc4ROCJYbm0?t%3D248][Kernighan himself]] at Bell Labs in 1982.

Tim Berners-Lee gives some practical limitations in [[https://www.w3.org/DesignIssues/Principles.html][Principles of
Design]] and in [[https://www.w3.org/DesignIssues/Modularity.html][Modularity]]: "Modular design hinges on the simplicity and
abstract nature of the interface definition between the modules. A
design in which the insides of each module need to know all about each
other is not a modular design but an arbitrary partitioning of the
bits... It is not only necessary to make sure your own system is
designed to be made of modular parts. It is also necessary to realize
that your own system, no matter how big and wonderful it seems now,
should always be designed to be a part of another larger system." Les
Hatton in [[http://www.leshatton.org/TAIC2008-29-08-2008.html][The role of empiricism in improving the reliability of
future software]] even did an interesting derivation tying the defect
density in software to how it is broken into pieces. The 1972 paper
[[https://www.cs.virginia.edu/~eos/cs651/papers/parnas72.pdf][On the Criteria to Be Used in Decomposing Systems into Modules]] cites a
1970 textbook on why modularity is important in systems programming,
but also notes that nothing is said on how to divide a system into
modules.

"Abstraction" doesn't have quite the same consensus. In software, it's
generally understood that decoupled or loosely-coupled is better than
tightly-coupled, but at the same time, "abstraction" can have the
connotation of something that gets in the way, adds overhead, and
confuses things. Dijkstra, in one of few instances of not being
snarky, allegedly said, "Being abstract is something profoundly
different from being vague. The purpose of abstraction is not to be
vague, but to create a new semantic level in which one can be
absolutely precise." Joel Spolsky, in one of few instances of me
actually caring what he said, also has a blog post from 2002 on the
[[https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/][Law of Leaky Abstractions]] ("All non-trivial abstractions, to some
degree, are leaky."). The [[https://en.wikipedia.org/wiki/Principle_of_least_privilege][principle of least privilege]] is likewise
relevant. So, abstraction too has its practical and theoretical
limitations.

* How They Relate

I bring these up together because *abstractions* are the boundaries
between *modules*, and the communication channels (APIs, languages,
interfaces, protocols) through which they talk. The boundary need not
be a standardized interface or a well-documented one, though that
helps.
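To make the boundary-between-modules idea concrete, here is a minimal Python sketch of my own (all names hypothetical, not from any library): two modules honor the same small contract, and the caller depends only on that contract.

```python
# Two "modules" honoring one abstraction: put(key, value) / get(key).
# Callers written against the abstraction work with either module.

class MemoryStore:
    """One module: keeps key/value pairs verbatim in a dict."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class UpperCaseStore:
    """A different module honoring the same contract, with different
    internal behavior (it normalizes values to upper case)."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value.upper()
    def get(self, key):
        return self._data[key]

def greet(store):
    # Written against the abstraction, not against either module:
    store.put("greeting", "hello")
    return store.get("greeting")

print(greet(MemoryStore()))     # hello
print(greet(UpperCaseStore()))  # HELLO
```

Because =greet= never looks inside either store, swapping one module for the other requires no change to the caller - which is the whole point of the boundary.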

Available abstractions vary. They vary by, for instance:

- ...what language you choose. Consider, for instance, that a language
  like Haskell contains various abstractions done largely within the
  type system that cannot be expressed in many other languages.
  Languages like Python, Ruby, or JavaScript might have various
  abstractions meaningful only in the context of dynamic typing. Some
  languages more readily permit the creation of new abstractions, and
  this might lead to a broader range of abstractions implemented in
  libraries.
- ...the operating system and its standard library. What is a
  process? What is a thread? What is a dynamic library? What is a
  filesystem? What is a file? What is a block device? What is a
  socket? What is a virtual machine? What is a bus? What is a
  command line?
- ...the time period. How many of the abstractions named above were
  around or viable in 1970, 1980, 1990, 2000? In the opposite
  direction, when did you last use that lovely standardized protocol,
  [[https://en.wikipedia.org/wiki/Common_Gateway_Interface][CGI]], to let your web application and your web server communicate,
  use [[https://en.wikipedia.org/wiki/PHIGS][PHIGS]] to render graphics, or access a large multiuser system
  via hard-wired terminals?

As such, the possible ways to modularize things vary too. A certain
way of modularizing something may not even make sense until it has
been done other ways hundreds or thousands of times.

Other terms are related too. "Loosely-coupled" (or loose coupling)
and "tightly-coupled" refer to the sort of abstractions sitting
between modules, or whether there even are separate modules.
"Decoupling" involves changing the relationship between modules
(sometimes, creating them in the first place), typically splitting
things into two more sensible pieces that a more sensible abstraction
separates. "Factoring out" is really a form of decoupling in which
smaller parts of something are turned into a module which the
original thing then interfaces with (one canonical example is taking
some bits of code, often very similar or identical in many places,
and moving them into a single function). To say one has "abstracted
over" some details implies that a module is handling those details,
that the details shouldn't matter, and that what does matter is the
abstraction one is using.
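The canonical "factoring out" case above can be sketched in a few lines of Python (the example and its names are hypothetical, mine rather than from any particular codebase):

```python
# Before: the same rounding-and-formatting logic repeated inline in
# two places (imagine many more):
subtotal_label = "$" + "{:.2f}".format(19.994)
shipping_label = "$" + "{:.2f}".format(4.5)

# After: the duplicated bits are moved into a single function - a tiny
# module that both call sites now interface with.
def as_price(amount):
    """Format a number as a dollar amount with two decimal places."""
    return "${:.2f}".format(amount)

subtotal_label = as_price(19.994)
shipping_label = as_price(4.5)
print(subtotal_label, shipping_label)  # $19.99 $4.50
```

The payoff is the usual one for decoupling: a change to price formatting now happens in exactly one place.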

One of Rich Hickey's favorite topics is *composition*, and with good
reason (and you should check out [[http://www.infoq.com/presentations/Simple-Made-Easy/][Simple Made Easy]] regardless). This
relates as well: composing things effectively into bigger parts
requires that they support some common abstraction.
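A small sketch of that requirement, in my own Python (not Hickey's Clojure): each step shares one common abstraction - a function from one value to the next - and that shared shape is exactly what lets the steps chain into a bigger part.

```python
from functools import reduce

def compose(*fns):
    """Compose single-argument functions, applied left to right."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

# Three independent pieces that all fit the same shape (str -> value):
strip = str.strip
lower = str.lower
words = str.split

tokenize = compose(strip, lower, words)
print(tokenize("  Modularity AND Abstraction "))
# ['modularity', 'and', 'abstraction']
```

None of the three pieces knows about the others; the common abstraction does all the connecting.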

In the same area, [[https://clojurefun.wordpress.com/2012/08/17/composition-over-convention/][Composition over convention]] is a good read on how
/frameworks/ run counter to modularity: they aren't built to behave
like modules of a larger system.

# -----

It has a very pragmatic reason behind it: when something is a module
unto itself, presumably it is relying on specific abstractions, and it
is possible to freely change this module's internal details (provided
that it still respects the same abstractions), to move this module to
other contexts (anywhere that provides the same abstractions), and to
replace it with other modules (anything that respects the same
abstractions).

It also has a more abstract reason: when something is a module unto
itself, the way it is designed and implemented usually presents more
insight into the fundamentals of the problem it is solving. It
contains fewer incidental details, and more essential details.

# -------

* Information

I referred earlier to the abstractions themselves as both boundaries
and communication channels. Another common view is that abstractions
are *contracts* with a communicated and agreed purpose, and I think
this is a useful definition too: it conveys the notion that there are
multiple parties involved and that they are free to behave as needed
provided that they fulfill some obligation.

Some definitions refer directly to information, like the [[https://en.wikipedia.org/wiki/Abstraction_principle_(computer_programming)][abstraction
principle]], which aims to reduce duplication of information. This fits
with [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][don't repeat yourself]], so that "a modification of any single
element of a system does not require a change in other logically
unrelated elements".

# ----- FIXME
Consider the information this module deals in, in essence.

What is the most general form this information could be expressed in,
without being so general as to encompass other things that are
irrelevant, or so low-level as to needlessly constrain the possible
contexts?

(Aristotle's theory of definitions?)

* Less-Conventional Examples

One thing I've watched with some interest is when new abstractions
emerge (or, perhaps, old ones become more widespread) to solve
problems that I wasn't even aware existed.

[[https://circleci.com/blog/it-really-is-the-future/][It really is the future]] talks about a lot of more recent forms of
modularity from the land of devops, most of which were completely
unheard-of in, say, 2010. [[https://www.functionalgeekery.com/episode-75-eric-b-merritt/][Functional Geekery episode 75]] talks about
many similar things.

[[https://jupyter.org/][Jupyter Notebook]] is one of my favorites here. It provides a notebook
interface (similar to something like Maple or Mathematica) which:

- allows the notebook to use various different programming languages
  underneath,
- decouples where the notebook is used from where it is running, due
  to being implemented as a web application accessed through the
  browser,
- decouples the presentation of a stored notebook from Jupyter itself
  by using a [[https://nbformat.readthedocs.io/en/latest/][JSON-based file format]] which can be rendered without
  Jupyter (like GitHub does if you commit a .ipynb file).
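That last decoupling is easy to demonstrate: any tool that understands JSON can inspect or render a notebook without Jupyter being involved. The structure below is my minimal approximation of the nbformat layout (real notebooks carry more metadata), so treat the exact fields as illustrative:

```python
import json

# A hand-built notebook document, serialized the way a .ipynb file is:
minimal_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# A heading\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["print(2 + 2)\n"]},
    ],
})

# A renderer needs nothing from Jupyter itself, just JSON:
nb = json.loads(minimal_notebook)
code_cells = [c for c in nb["cells"] if c["cell_type"] == "code"]
print(len(code_cells))  # 1
```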

I love notebook interfaces already because they simplify experimenting
by handling a lot of things I'd otherwise have to do manually - like
saving results and keeping them lined up with the exact code that
produced them. Jupyter adds some other use-cases I find marvelous -
for instance, I can let the interpreter run on my workstation, which
has all of the computing power, but access it across the Internet from
my laptop.

[[https://zeppelin.apache.org/][Apache Zeppelin]] does similar things with different languages; I've
just used it much less.

Another favorite of mine is [[https://nixos.org/nix/][Nix]]. One excellent article, [[http://blog.ezyang.com/2014/08/the-fundamental-problem-of-programming-language-package-management/][The
fundamental problem of programming language package management]],
never mentions Nix but explains very well the problems it sets out to
solve. To be able to subsume nearly all of the
programming-language-specific package managers into a single module is
a very lofty goal, but Nix appears to do a decent job of it (among
other things).

The [[https://www.lua.org/][Lua]] programming language is noteworthy here. It's written in
clean C with minimal dependencies, so it runs nearly anywhere that a C
or C++ compiler targets. It's purposely very easy both to *embed*
(i.e. to put inside of a program and use as an extension language,
such as for plugins or scripting) and to *extend* (i.e. to connect
with libraries to allow their functionality to be used from Lua). [[https://www.gnu.org/software/guile/][GNU
Guile]] has many of the same properties, I'm told.

We ordinarily think of object systems as something living in the
programming language. However, the object system is sometimes made a
module that sits outside of the programming language, and languages
just interact with it. [[https://en.wikipedia.org/wiki/GObject][GObject]], [[https://en.wikipedia.org/wiki/Component_Object_Model][COM]], and [[https://en.wikipedia.org/wiki/XPCOM][XPCOM]] do this, and to some
extent, so does [[https://en.wikipedia.org/wiki/Meta-object_System][Qt & MOC]] - and there are probably hundreds of others,
particularly if you count dead ones created during the object-oriented
hype of the '90s. This seems to happen in systems where the object
hierarchy is in effect "bigger" than the language.

[[https://zeromq.org/][ZeroMQ]] is another example: a set of cross-language abstractions for
communication patterns in a distributed system. It's likely not
unique, but it is one of the better-known examples and the first I
thought of, and I think their [[http://zguide.zeromq.org/page:all][guide]] is excellent.

Interestingly, the same iMatix behind ZeroMQ also created [[https://github.com/imatix/gsl][GSL]] and
explained its value in [[https://imatix-legacy.github.io/mop/introduction.html][Model-Oriented Programming]], in which
abstraction features heavily. I've not used GSL, and am skeptical of
its stated usefulness, but it looks like it is meant to help create
compile-time abstractions that likewise sit outside of any particular
programming language.

# TODO: Expand on this.

[[https://web.hypothes.is/][hypothes.is]] is a curious one that I find fascinating. They're trying
to factor out annotation and commenting from something that is handled
on a per-webpage basis and turn it into its own module, and I really
like what I've seen. However, it does not seem to have caught on
much.

The Unix tradition lives on in certain modern tools. [[https://stedolan.github.io/jq/][jq]] has proven
very useful anytime I've had to mess with JSON data. [[http://www.dest-unreach.org/socat/][socat]] and [[http://netcat.sourceforge.net/][netcat]]
have saved me numerous times. I'm sure certain people love the fact
that [[https://neovim.io/][Neovim]] is designed to be seamlessly embedded and extended with
plugins. [[https://suckless.org/philosophy][suckless]] perhaps takes it too far, but gets an honorary
mention...

# ???

# Also, TCP/IP and the entire notion of packet-switched networks.
# And the entire OSI 7-layer model.

# Also, caches - of all types. (CPU, disk...)

# One key is how the above let you *reason* about things without
# knowing their specifics.

People know that I love Emacs, but I also believe many of the
complaints about how large it is. Despite being basically its own
operating system, /within this/ it has considerable modularity. The
same applies somewhat to Blender, I suppose.

Consider [[https://research.google.com/pubs/pub43146.html][Machine Learning: The High Interest Credit Card of Technical Debt]],
a paper that anyone working around machine learning should read and
re-read regularly. Large parts of the paper are about ways in which
machine learning conflicts with proper modularity and abstraction.
(However, [[https://colah.github.io/posts/2015-09-NN-Types-FP/][Neural Networks, Types, and Functional Programming]] is still
a good post and shows some sorts of abstraction that still exist, at
least in neural networks.)

Even DOS had useful abstractions. Things like
DriveSpace/DoubleSpace/Stacker worked well enough because most
software that needed files relied on DOS's normal abstractions to
access them - so it did not matter to that software whether the
underlying filesystem was actually compressed, was actually a RAM
disk, or was on some obscure SCSI interface. Likewise, for the
silliness known as [[https://en.wikipedia.org/wiki/Expanded_memory][EMS]], applications that accessed memory through
the EMS abstraction could disregard whether it was a "real" EMS board
providing access to that memory, or an expanded memory manager
providing indirect access to some other memory or even to a hard disk
pretending to be memory.

Even more abstractly: emulators work because so much software
respected the abstraction of some specific CPU and hardware platform.

Submitted without further comment:
https://github.com/stevemao/left-pad/issues/4

* Fragments

- Abstracting over...
  - Multiple applications
  - Multiple users
  - Multiple CPUs
  - Multiple hosts

- [[Notes - Paper, 2016-11-13]]
- Tanenbaum vs. Linus war & microkernels
- TBL: "The choice of language is a common design choice. The low
  power end of the scale is typically simpler to design, implement and
  use, but the high power end of the scale has all the attraction of
  being an open-ended hook into which anything can be placed: a door
  to uses bounded only by the imagination of the programmer. Computer
  Science in the 1960s to 80s spent a lot of effort making languages
  which were as powerful as possible. Nowadays we have to appreciate
  the reasons for picking not the most powerful solution but the least
  powerful. The reason for this is that the less powerful the
  language, the more you can do with the data stored in that
  language. If you write it in a simple declarative form, anyone can
  write a program to analyze it in many ways." (Languages are a kind
  of abstraction - one that influences how a module is written, and
  what contexts it is useful in.)
- "Self" paper & structural reification?
  - I'm still not sure how this relates, but it may perhaps relate to
    how *not* to make things modular (structural reification is a sort
    of check on the scope of objects/classes)
- What by Rich Hickey?
  - Simple Made Easy?
  - The Value of Values?
- SICP: [[https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-19.html#%25_chap_3][Modularity, Objects, and State]]
- [[https://www.cs.utexas.edu/~wcook/Drafts/2009/essay.pdf][On Understanding Data Abstraction, Revisited]]
- http://www.catb.org/~esr/writings/taoup/html/apb.html#Baldwin-Clark -
  Carliss Baldwin and Kim Clark. Design Rules, Vol 1: The Power of
  Modularity. 2000. MIT Press. ISBN 0-262-02466-7.
- Brooks, No Silver Bullet?

- https://en.wikipedia.org/wiki/Essential_complexity

- https://twitter.com/fchollet/status/962074070513631232

- [[https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-9.html#%25_chap_1][From SICP chapter 1 intro]]: "The acts of the mind, wherein it exerts
  its power over simple ideas, are chiefly these three: 1. Combining
  several simple ideas into one compound one, and thus all complex
  ideas are made. 2. The second is bringing two ideas, whether simple
  or complex, together, and setting them by one another so as to take
  a view of them at once, without uniting them into one, by which it
  gets all its ideas of relations. 3. The third is separating them
  from all other ideas that accompany them in their real existence:
  this is called abstraction, and thus all its general ideas are
  made." -John Locke, An Essay Concerning Human Understanding (1690)
- One point I have ignored (maybe): you clearly separate the 'inside'
  of a module (its implementation) from the 'outside' (that is, its
  boundaries, the abstractions that it interfaces with or that it
  implements) so that the 'inside' can change more or less freely
  without having any effect on the outside.
- Abstractions as a way of reducing the work required to add
  functionality (changes can be made just in the relevant modules, and
  other modules do not need to change to conform)
- What is more key? Communication, information content, contracts,
  details?
- [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][Don't repeat yourself]]
- [[https://simplyphilosophy.org/study/aristotles-definitions/][Aristotle & theory of definitions]]
  - this isn't right. I need to find the quote in the Durant book
    (which will probably have an actual source) that pertains to how
    specific and how general a definition must be

- [[https://en.wikipedia.org/wiki/SOLID][SOLID]]
- [[https://en.wikipedia.org/wiki/Cross-cutting_concern][Cross-cutting concerns]] and [[https://en.wikipedia.org/wiki/Aspect-oriented_programming][Aspect-oriented programming]]
- [[https://en.wikipedia.org/wiki/Separation_of_concerns][Separation of Concerns]]
- [[https://en.wikipedia.org/wiki/Abstraction_principle_(computer_programming)][Abstraction principle]]
- [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][Don't repeat yourself]]
@@ -1,368 +0,0 @@

---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags: technobabble
---

# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:

A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to explain this paper as I
work through it, through some of its references, and through one
particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].

* Object Detection

"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:

# TODO:
# Define mAP

#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/2017-12-13-objdet.jpg]]

At the time of writing, the most accurate object-detection methods
were based around R-CNN and its variants, and all used two-stage
approaches:

1. One model proposes a sparse set of locations in the image that
   probably contain something. Ideally, this set contains all objects
   in the image, but filters out the majority of negative locations
   (i.e. only background, not foreground).
2. Another model, typically a CNN (convolutional neural network),
   classifies each location in that sparse set as either being
   foreground and some specific object class (like "kite" or "person"
   above), or as being background.

Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
OverFeat. These simplified/approximated the two-stage approach by
replacing the first step with brute force. That is, instead of
generating a sparse set of locations that probably have something of
interest, they simply handle all locations, whether or not they likely
contain something, by blanketing the entire image in a dense sampling
of many locations, many sizes, and many aspect ratios.

This is simpler and faster - but not as accurate as the two-stage
approaches.

Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of two-stage
approaches into a single CNN, and exploit the possibility of sharing
computations that would otherwise be done twice. I assume that this
is included in the comparisons done in the paper, but I'm not entirely
sure.

* Training & Class Imbalance

Briefly, the process of training these models requires minimizing some
kind of loss function that is based on what the model misclassifies
when it is run on some training data. It's preferable to be able to
compute some loss over each individual instance, and add all of these
losses up to produce an overall loss. (Yes, far more can be said on
this, but the details aren't really important here.)
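As a concrete sketch of a per-instance loss that is summed into an overall loss (my own toy example, not from the paper): cross-entropy on the probability a model assigned to the correct class of each instance.

```python
import math

def cross_entropy(p_true):
    """Loss for one instance, given the probability the model
    predicted for its correct class. Approaches 0 as the prediction
    approaches certainty (p_true -> 1)."""
    return -math.log(p_true)

# Probabilities a (hypothetical) model assigned to the correct label
# of each training instance:
predictions = [0.9, 0.6, 0.99]

# The overall loss is just the sum of per-instance losses:
total_loss = sum(cross_entropy(p) for p in predictions)
```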

# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?

This leads to a problem in one-stage detectors: the dense set of
locations being classified usually contains a small number of
locations that actually have objects (positives), and a much larger
number of locations that are just background and can be very easily
classified as such (easy negatives). However, the loss function still
adds all of them up - and even if the loss is relatively low for each
of the easy negatives, their cumulative loss can drown out the loss
from objects that are being misclassified.

That is: a large number of tiny, irrelevant losses overwhelms a
smaller number of larger, relevant losses. The paper was a bit terse
on this; it took a few re-reads to understand why "easy negatives"
were an issue, so hopefully I have this right.
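The drowning-out effect is easy to see with made-up numbers (mine, not the paper's):

```python
import math

def cross_entropy(p_true):
    return -math.log(p_true)

# 100,000 background locations the model already classifies well
# (p = 0.99 for "background")...
easy_negative_loss = 100_000 * cross_entropy(0.99)

# ...versus 10 objects it is badly misclassifying (p = 0.01 for the
# correct class).
hard_positive_loss = 10 * cross_entropy(0.01)

# Each easy negative contributes only ~0.01 loss versus ~4.6 per hard
# positive, yet in total the easy negatives dominate:
print(easy_negative_loss > hard_positive_loss)  # True
```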

The training process is trying to minimize this loss, and so it is
mostly nudging the model to improve where it least needs it (its
ability to classify background areas that it already classifies well)
and neglecting where it most needs it (its ability to classify the
"difficult" objects that it is misclassifying).

# TODO: Visualize this. Can I?

This is *class imbalance* in a nutshell, which the paper gives as the
limiting factor for the accuracy of one-stage detectors. While
existing approaches try to tackle it with methods like bootstrapping
or hard example mining, their accuracy still lags.

** Focal Loss

So, the point of all this is: a tweak to the loss function can fix
this issue, and retain the speed and simplicity of one-stage
approaches while surpassing the accuracy of existing two-stage ones.

At least, this is what the paper claims. Their novel loss function is
called *Focal Loss* (as the title references), and it multiplies the
normal cross-entropy by a factor, $(1-p_t)^\gamma$, where $p_t$
approaches 1 as the model predicts a higher and higher probability of
the correct classification (or 0 for an incorrect one), and $\gamma$
is a "focusing" hyperparameter (they used $\gamma=2$). Intuitively,
this scaling makes sense: if a classification is already correct (as
in the "easy negatives"), $(1-p_t)^\gamma$ tends toward 0, and so the
portion of the loss multiplied by it will likewise tend toward 0.
|
||||
|
||||
|
||||
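To make that concrete, here's focal loss for a single binary
object-vs-background decision in plain NumPy. This is my own sketch,
not the paper's code, and I'm leaving out the $\alpha$-balancing
weight the paper also applies:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    # p: predicted probability of the object class; y: 1 (object) or
    # 0 (background).
    p_t = np.where(y == 1, p, 1.0 - p)   # probability of the true class
    ce = -np.log(p_t)                    # ordinary cross-entropy
    return (1.0 - p_t) ** gamma * ce     # down-weight the easy, confident cases
```

An easy negative predicted at $p = 0.01$ contributes roughly $10^4$
times less than it would under plain cross-entropy, which is exactly
the rebalancing described above.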
* RetinaNet architecture

The paper gives the name *RetinaNet* to the network they created which
incorporates this focal loss in its training. While it says, "We
emphasize that our simple detector achieves top results not based on
innovations in network design but due to our novel loss," it is
important not to miss the phrase /innovations in/: they are saying
that they didn't need to invent a new network design - not that the
network design doesn't matter. Later in the paper, they say that it
is in fact crucial that RetinaNet's architecture relies on FPN
(Feature Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.

I go into both of these aspects below.

* Feature Pyramid Network

Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
sure, it shares 4 co-authors with the paper this post explores). The
paper is fairly concise in describing FPNs; it takes only around 3
pages to explain their purpose, related work, and their entire design.
The remainder shows experimental results and specific applications of
FPNs. While it shows FPNs implemented on a particular underlying
network (ResNet, mentioned below), they were purposely made very
simple and adaptable to nearly any kind of CNN.

To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%2528image_processing%2529][image pyramids]]. The below
diagram illustrates an image pyramid:

#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/1024px-Image_pyramid.svg.png]]

Image pyramids have many uses, but the paper focuses on their use in
taking something that works only at a certain scale of image - for
instance, an image classification model that only identifies objects
that are around 50 pixels across - and adapting it to handle different
scales by applying it at every level of the image pyramid. If the
model has a little flexibility, some level of the image pyramid is
bound to have scaled the object to a size that the model can match.

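A minimal sketch of this idea in NumPy (my own toy construction -
real pyramids typically blur each level before downsampling to avoid
aliasing):

```python
import numpy as np

def image_pyramid(img, levels):
    # Build a pyramid by repeated 2x downsampling. Nearest-neighbor
    # subsampling is used here for brevity; a Gaussian blur would
    # normally precede each step.
    pyramid = [img]
    for _ in range(levels - 1):
        img = img[::2, ::2]          # halve the resolution
        pyramid.append(img)
    return pyramid
```

A fixed-scale detector run on every level then effectively sees the
object at many different sizes.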
Typically, though, detection or classification isn't done directly on
an image, but rather, the image is converted to some more useful
feature space. However, these feature spaces likewise tend to be
useful only at a specific scale. This is the rationale behind
"featurized image pyramids", or feature pyramids built upon image
pyramids, created by converting each level of an image pyramid to that
feature space.

The problem with featurized image pyramids, the paper says, is that if
you try to use them in CNNs, they drastically slow everything down,
and use so much memory as to make normal training impossible.

However, take a look at this diagram of a generic deep CNN:

#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
#+ATTR_HTML: :width 100% :height 100%
[[../images/2017-12-13-retinanet/Typical_cnn.png]]

You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)

When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of the feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines, or
edges at a particular orientation. In the final stage, presumably,
the model has learned complex enough features that things like "kite"
and "person" can be identified.

The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed, and to produce something
that has similar power to a featurized image pyramid but without too
high of a cost in speed, memory, or complexity.

Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.

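To make those two additions concrete, here is a toy NumPy sketch of a
single merge step, following the FPN paper's description (the function
names, and using a plain channel-wise matrix multiply to stand in for
the real 1x1 convolution, are my simplifications):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (H, W, C) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def merge_level(top_down, lateral, w_lateral):
    # One top-down merge: upsample the coarser, semantically stronger
    # map; project the bottom-up map's channels (a 1x1 conv is just a
    # matrix multiply over the channel axis); add them elementwise.
    return upsample2x(top_down) + lateral @ w_lateral
```

Repeating this from the top of the bottom-up pathway downward yields a
pyramid whose every level carries high-level semantics.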
** Top-Down Pathway

** Lateral Connections

** As Applied to ResNet

# Note C=256 and such

# TODO: Link to some good explanations

For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that the paper claims that the underlying network isn't what's
responsible for its results anyway.

[[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
[[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]

* Anchors & Region Proposals

Recall from the last section what was said about feature maps, and
that the deeper stages of the CNN happen to be good for classifying
images. While these deeper stages are lower-resolution than the input
images, and while their influence is spread out over larger areas of
the input image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due
to each stage spreading it a little further), the features here still
maintain a spatial relationship with the input image. That is, moving
across one axis of this feature map still corresponds to moving across
the same axis of the input image.

# Just re-explain the above with the feature pyramid

RetinaNet's design draws heavily from RPNs (Region Proposal Networks),
and here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster R-CNN:
Towards Real-Time Object Detection with Region Proposal Networks]]. I
find the explanations in terms of "proposals", of focusing the
"attention" of the neural network, or of "telling the neural network
where to look" to be needlessly confusing and misleading. I'd rather
explain very plainly how they work.

Central to RPNs are *anchors*. Anchors aren't exactly a feature of
the CNN. They're more a property that's used in its training and
inference.

In particular:

- Say that the feature pyramid has $L$ levels, and that level $l+1$ is
  half the resolution (thus double the scale) of level $l$.
- Say that level $l$ is a 256-channel feature map of size $W \times H$
  (i.e. it's a tensor with shape $W \times H \times 256$). Note that
  $W$ and $H$ will be larger at lower levels, and smaller at higher
  levels, but in RetinaNet at least, always 256-channel samples.
- For every point on that feature map (all $WH$ of them), we can
  identify a corresponding point in the input image. This is the
  center point of a broad region of the input image that influences
  this point in the feature map (i.e. its receptive field). Note that
  as we move up to higher levels in the feature pyramid, these regions
  grow larger, and neighboring points in the feature map correspond to
  larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
  rectangular regions associated with each point of a feature map.
  The size of the anchor depends on the scale of the feature map, or
  equivalently, what level of the feature pyramid it came from. All
  this means is that anchors in level $l+1$ are twice as large as the
  anchors of level $l$.

The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?

My above explanation glossed over a couple things, but nothing that
should change the fundamentals.

- Anchors are actually associated with every 3x3 window of the feature
  map, not precisely every point, but all this really means is that
  it's "every point and its immediate neighbors" rather than "every
  point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1), and each of three
  scale factors ($1$, $2^{1/3}$, and $2^{2/3}$) on top of its base
  scale. This is just to handle objects of less-square shapes and to
  cover the gap in scale in between levels of the feature pyramid.
  Note that the scale factors are evenly spaced exponentially, such
  that an additional step down wouldn't make sense (the largest
  anchors at the pyramid level /below/ already cover this scale), and
  nor would an additional step up (the smallest anchors at the pyramid
  level /above/ already cover it).

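The layout described above can be sketched in NumPy. Everything here
(the function name, the base size, the centering convention, the
area-preserving aspect-ratio math) is my own assumption for
illustration; only the 3 ratios x 3 scales = 9 anchors per position
comes from the paper:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, base_size):
    # Dense anchors for one pyramid level. `stride` maps feature-map
    # coordinates back to input-image pixels; `base_size` is this
    # level's base anchor side length.
    ratios = [0.5, 1.0, 2.0]                   # 1:2, 1:1, 2:1
    scales = [2 ** (i / 3) for i in range(3)]  # 1, 2^(1/3), 2^(2/3)
    boxes = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature-map point in input-image coords.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for r in ratios:
                for s in scales:
                    # Change aspect ratio while preserving area.
                    w = base_size * s * np.sqrt(1 / r)
                    h = base_size * s * np.sqrt(r)
                    boxes.append((cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2))
    return np.array(boxes)   # shape: (feat_h * feat_w * 9, 4)
```

Even this tiny example makes the "brute force" flavor obvious: a
single level already produces thousands of overlapping candidate
regions.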
Here, finally, is where actual classification and regression come in,
via the *classification subnet* and the *box regression subnet*.

** Classification Subnet

Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside this
anchor? Or, more accurately: For each of $K$ object classes, what's
the probability that an object of that class is inside (versus it
being just background)?

** Box Regression Subnet

The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside of this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 numbers which give offsets relative to the
anchor's bounds (thus specifying a different region). Note that this
subnet completely ignores the class of the object.

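As a sketch of what "offsets relative to the anchor's bounds" can mean
in practice, here is the box parametrization used in the R-CNN line of
papers, applied in reverse to decode a prediction (the function name
and tuple conventions are mine):

```python
import numpy as np

def apply_offsets(anchor, t):
    # anchor: (x1, y1, x2, y2); t: predicted offsets (tx, ty, tw, th).
    # tx, ty shift the center in units of the anchor's width/height;
    # tw, th rescale the width/height exponentially.
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    cx = x1 + wa / 2 + t[0] * wa
    cy = y1 + ha / 2 + t[1] * ha
    w = wa * np.exp(t[2])
    h = ha * np.exp(t[3])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

All-zero offsets return the anchor unchanged; small offsets nudge and
rescale it toward the object's true bounds.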
The classification subnet already tells us whether or not a given
anchor contains an object - which already gives rough bounds on it.
The box regression subnet helps tighten these bounds.

** Other notes (?)

I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...

# Parameter sharing? How to explain?

* Training

# Ground-truth object boxes
# Intersection-over-Union thresholds

* Inference

# Top N results

* References

# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere

1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id%3DSJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/