Migrate some drafts into content/posts with 'draft' flag

This commit is contained in:
Chris Hodapp
2020-04-30 19:00:38 -04:00
parent fba8a611e3
commit 129bfeb3e7
8 changed files with 37 additions and 5195 deletions


@@ -0,0 +1,56 @@
---
title: Retrospect on Foresight
author: Chris Hodapp
date: January 8, 2018
tags:
- technobabble
- rambling
draft: true
---
/(Spawned from some idle thoughts around the summer of 2015.)/
Why are old technological ideas that were "ahead of their time", but
which lost out to other ideas, worth studying?
We can see them as raw ideas that "modern" understanding never
refined - misguided fantasies or even just mistakes. The flip side of
this is that we can see them as ideas that are free of a nearly
inescapable modern context and all of the preconceptions and blinders
it carries.
Some of these visionaries offer a valuable combination:
- they're detached from this modern context (by mere virtue of it not
existing yet),
- they have considerable experience, imagination, and foresight,
- they devoted time and effort to work extensively on something and to
communicate their thoughts, feelings, and analysis in a durable way.
To put it another way: They give us analysis done from a context
that is long gone. They help us think beyond our current context.
They help us answer a question, "What if we took a different path
then?"
[[http://www.cs.yale.edu/homes/perlis-alan/quotes.html][Epigram #53]] from Alan Perlis offers some relevant skepticism here: "So
many good ideas are never heard from again once they embark in a
voyage on the semantic gulf." My interpretation of it is that we tend
to idolize ideas, old and new, because they sound somehow different,
innovative, and groundbreaking, but attempts at analysis or practical
realization of the ideas lead to a bleaker reality: the idea may be
completely meaningless (the equivalent of a [[https://en.wiktionary.org/wiki/deepity][deepity]]),
wildly impractical, or a mere facade over what is already established.
* Examples
* Scratch
- Douglas Engelbart is perhaps one of the canonical examples of a person
who was an endless source of these ideas. Ted Nelson arguably is
another. Alan Turing is an early example widely regarded for his
foresight.
- [[https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/][As We May Think (Vannevar Bush)]]
- "Do you remember a time when..." only goes so far.
- Buckminster Fuller
# Tools For Thought


@@ -0,0 +1,381 @@
---
title: Modularity & Abstraction (working title)
author: Chris Hodapp
date: April 20, 2017
tags:
- technobabble
- rambling
draft: true
---
# Why don't I turn this into a paper for arXiv too? It can still be
# posted to the blog (just also make it exportable to LaTeX perhaps)
_Modularity_ and _abstraction_ feature prominently wherever computers
are involved. This is meant very broadly: it applies to designing
software, using software, integrating software, and to a lot of
hardware as well. It applies elsewhere, and almost certainly
originated elsewhere first, however, it appears especially crucial
around software.
Definitions, though, are a bit vague (including anything in this
post). My goal in this post isn't to try to (re)define them, but to
explain their essence and expand on a few theses:
- Modularity arises naturally in a wide array of places.
- Modularity and abstraction are intrinsically connected.
- Both are for the benefit of people. This usually doesn't need to be
  stated, but to echo Paul Graham and probably others: to the
  computer, it is all the same.
- More specifically, both are there to manage *complexity* by
assigning meaningful information and boundaries which allow people
to match a problem to what they can actually think about.
# - Whether a given modularization makes sense depends strongly on
# meaning and relevance of *information* inside and outside of
# modules, and broad context matters to those.
* Why?
People generally agree that "modularity" is good. The idea that
something complex can be designed and understood in terms of smaller,
simpler pieces comes naturally to anyone who has built something out
of smaller pieces or taken something apart. (This isn't to say that
reductionism is the best way to understand everything, but that's
another matter.) It runs very deep in the Unix philosophy, which ESR
gives a good overview of in [[http://www.catb.org/~esr/writings/taoup/html/ch01s06.html][The Art of Unix Programming]] - or, listen
to it from [[https://youtu.be/tc4ROCJYbm0?t%3D248][Kernighan himself]] at Bell Labs in
1982.
Tim Berners-Lee gives some practical limitations in [[https://www.w3.org/DesignIssues/Principles.html][Principles of
Design]] and in [[https://www.w3.org/DesignIssues/Modularity.html][Modularity]]: "Modular design hinges on the simplicity and
abstract nature of the interface definition between the modules. A
design in which the insides of each module need to know all about each
other is not a modular design but an arbitrary partitioning of the
bits... It is not only necessary to make sure your own system is
designed to be made of modular parts. It is also necessary to realize
that your own system, no matter how big and wonderful it seems now,
should always be designed to be a part of another larger system." Les
Hatton in [[http://www.leshatton.org/TAIC2008-29-08-2008.html][The role of empiricism in improving the reliability of
future software]] even did an interesting derivation tying the defect
density in software to how it is broken into pieces. The 1972 paper
[[https://www.cs.virginia.edu/~eos/cs651/papers/parnas72.pdf][On the Criteria to be Used in Decomposing Systems into Modules]] cites a
1970 textbook on why modularity is important in systems programming,
but also notes that nothing is said on how to divide a system into
modules.
"Abstraction" doesn't have quite the same consensus. In software, it's
generally understood that decoupled or loosely-coupled is better than
tightly-coupled, but at the same time, "abstraction" can have the
connotation of something that gets in the way, adds overhead, and
confuses things. Dijkstra, in one of few instances of not being
snarky, allegedly said, "Being abstract is something profoundly
different from being vague. The purpose of abstraction is not to be
vague, but to create a new semantic level in which one can be
absolutely precise." Joel Spolsky, in one of few instances of me
actually caring what he said, also has a blog post from 2002 on the
[[https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/][Law of Leaky Abstractions]] ("All non-trivial abstractions, to some
degree, are leaky.") The [[https://en.wikipedia.org/wiki/Principle_of_least_privilege][principle of least privilege]] is likewise a
thing. So, abstraction too has its practical and theoretical
limitations.
* How They Relate
I bring these up together because: *abstractions* are the boundaries
between *modules*, and the communication channels (APIs, languages,
interfaces, protocols) through which they talk. It need not
necessarily be a standardized interface or a well-documented boundary,
though that helps.
Available abstractions vary. They vary by, for instance:
- ...what language you choose. Consider, for instance, that a language
like Haskell contains various abstractions done largely within the
type system that cannot be expressed in many other languages.
Languages like Python, Ruby, or JavaScript might have various
abstractions meaningful only in the context of dynamic typing. Some
languages more readily permit the creation of new abstractions, and
this might lead to a broader range of abstractions implemented in
libraries.
- ...the operating system and its standard library. What is a
process? What is a thread? What is a dynamic library? What is a
filesystem? What is a file? What is a block device? What is a
socket? What is a virtual machine? What is a bus? What is a
commandline?
- ...the time period. How many of the abstractions named above were
around or viable in 1970, 1980, 1990, 2000? In the opposite
direction, when did you last use that lovely standardized protocol,
[[https://en.wikipedia.org/wiki/Common_Gateway_Interface][CGI]], to let your web application and your web server communicate,
use [[https://en.wikipedia.org/wiki/PHIGS][PHIGS]] to render graphics, or access a large multiuser system
via hard-wired terminals?
As such: possible ways to modularize things vary as well. Certain
ways of modularizing something may not even make sense (or be
feasible) until the same thing has been done other ways hundreds or
thousands of times.
Other terms are related too. "Loosely-coupled" (or loose coupling)
and "tightly-coupled" refer to the sort of abstractions sitting
between modules, or whether or not there even are separate modules.
"Decoupling" involves changing the relationship between modules
(sometimes, creating them in the first place), typically splitting
something into more sensible pieces separated by a more sensible
abstraction. "Factoring out" is really a form of decoupling in which
smaller parts of something are turned into a module which the original
thing then interfaces with (one canonical example is taking some bits
of code, often that are very similar or identical in many places, and
moving them into a single function). To say one has "abstracted over"
some details implies that a module is handling those details, that the
details shouldn't matter, and what does matter is the abstraction one
is using.
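A tiny, hypothetical Python sketch of "factoring out" (all names here
are mine, invented purely for illustration):

```python
# Before factoring out: near-identical validation duplicated at two sites.
def register(name):
    if not name or len(name) > 64:
        raise ValueError("bad name")
    return {"user": name}

def rename(old, new):
    if not new or len(new) > 64:
        raise ValueError("bad name")
    return {"user": new, "was": old}

# After: the duplicated bits move into one function; both call sites
# now interface with that small module instead of repeating its details.
def validate_name(name):
    if not name or len(name) > 64:
        raise ValueError("bad name")

def register_v2(name):
    validate_name(name)
    return {"user": name}

def rename_v2(old, new):
    validate_name(new)
    return {"user": new, "was": old}
```

The abstraction here is just "a valid name"; the call sites no longer
need to know what validity means internally.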
One of Rich Hickey's favorite topics is *composition*, and with good
reason (and you should check out [[http://www.infoq.com/presentations/Simple-Made-Easy/][Simple Made Easy]] regardless). This
relates as well: to *compose* things together effectively into bigger
parts requires that they support some common abstraction.
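A one-line illustration in Python (my example, not Hickey's): iterator
functions compose freely precisely because they all speak one common
abstraction, "an iterable of values":

```python
# map, filter, and sum compose without knowing anything about each
# other; each consumes and/or produces the same abstraction.
nums = range(10)
evens = filter(lambda n: n % 2 == 0, nums)
squares = map(lambda n: n * n, evens)
total = sum(squares)
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```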
In the same area, [[https://clojurefun.wordpress.com/2012/08/17/composition-over-convention/][Composition over convention]] is a good read on how
/frameworks/ run counter to modularity: they aren't built to behave
like modules of a larger system.
# -----
It has a very pragmatic reason behind it: When something is a module
unto itself, presumably it is relying on specific abstractions, and it
is possible to freely change this module's internal details (provided
that it still respects the same abstractions), to move this module to
other contexts (anywhere that provides the same abstractions), and to
replace it with other modules (anything that respects the same
abstractions).
It also has a more abstract reason: When something is a module unto
itself, the way it is designed and implemented usually presents more
insight into the fundamentals of the problem it is solving. It
contains fewer incidental details, and more essential details.
# -------
* Information
I referred earlier to the abstractions themselves as both boundaries
and communications channels. Another common view is that abstractions
are *contracts* with a communicated and agreed purpose, and I think
this is a useful definition too: it conveys the notion that there are
multiple parties involved and that they are free to behave as needed
provided that they fulfill some obligation.
Some definitions refer directly to information, like the [[https://en.wikipedia.org/wiki/Abstraction_principle_(computer_programming)][abstraction
principle]], which aims to reduce duplication of information. This fits
with [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][don't repeat yourself]], so that "a modification of any single
element of a system does not require a change in other logically
unrelated elements".
# ----- FIXME
Consider the information this module deals in, in essence.
What is the most general form this information could be expressed in,
without being so general as to encompass other things that are
irrelevant or so low-level as to needlessly constrain the possible
contexts?
(Aristotle's theory of definitions?)
* Less-Conventional Examples
One thing I've watched with some interest is when new abstractions
emerge (or, perhaps, old ones become more widespread) to solve
problems that I wasn't even aware existed.
[[https://circleci.com/blog/it-really-is-the-future/][It really is the future]] talks about a lot of more recent forms of
modularity from the land of devops, most of which were completely
unheard-of in, say, 2010. [[https://www.functionalgeekery.com/episode-75-eric-b-merritt/][Functional Geekery episode 75]] talks about
many similar things.
[[https://jupyter.org/][Jupyter Notebook]] is one of my favorites here. It provides a notebook
interface (similar to something like Maple or Mathematica) which:
- allows the notebook to use various different programming languages
underneath,
- decouples where the notebook is used and where it is running, due to
being implemented as a web application accessed through the browser,
- decouples the presentation of a stored notebook from Jupyter itself
by using a [[https://nbformat.readthedocs.io/en/latest/][JSON-based file format]] which can be rendered without
Jupyter (like GitHub does if you commit a .ipynb file).
I love notebook interfaces already because they simplify experimenting
by handling a lot of things I'd otherwise have to do manually - like
saving results and keeping them lined up with the exact code that
produced them. Jupyter adds some other use-cases I find marvelous -
for instance, I can let the interpreter run on my workstation which
has all of the computing power, but I can access it across the
Internet from my laptop.
[[https://zeppelin.apache.org/][Apache Zeppelin]] does similar things with different languages; I've
just used it much less.
Another favorite of mine is [[https://nixos.org/nix/][Nix]]. One excellent article, [[http://blog.ezyang.com/2014/08/the-fundamental-problem-of-programming-language-package-management/][The
fundamental problem of programming language package management]],
doesn't ever mention Nix but explains very well the problems it sets
out to solve. To be able to combine nearly all of the
programming-language-specific package managers into a single module is
a very lofty goal, but Nix appears to do a decent job of it (among
other things).
The [[https://www.lua.org/][Lua]] programming language is noteworthy here. It's written in
clean C with minimal dependencies, so it runs nearly anywhere that a C
or C++ compiler targets. It's purposely very easy both to *embed*
(i.e. to put inside of a program and use as an extension language,
such as for plugins or scripting) and to *extend* (i.e. to connect
with libraries to allow their functionality to be used from Lua). [[https://www.gnu.org/software/guile/][GNU
Guile]] has many of the same properties, I'm told.
We ordinarily think of object systems as something living in the
programming language. However, the object system is sometimes made a
module that is outside of the programming language, and languages just
interact with it. [[https://en.wikipedia.org/wiki/GObject][GObject]], [[https://en.wikipedia.org/wiki/Component_Object_Model][COM]], and [[https://en.wikipedia.org/wiki/XPCOM][XPCOM]] do this, and to some
extent, so does [[https://en.wikipedia.org/wiki/Meta-object_System][Qt & MOC]] - and there are probably hundreds of others,
particularly if you allow dead ones created during the object-oriented
hype of the '90s. This seems to happen in systems where the object
hierarchy is in effect "bigger" than the language.
[[https://zeromq.org/][ZeroMQ]] is another example: a set of cross-language abstractions for
communication patterns in a distributed system. I know it's likely
not unique, but it is one of the better-known and the first I thought
of, and I think their [[http://zguide.zeromq.org/page:all][guide]] is excellent.
Interestingly, iMatix, the company behind ZeroMQ, also created [[https://github.com/imatix/gsl][GSL]] and
explained its value in [[https://imatix-legacy.github.io/mop/introduction.html][Model-Oriented Programming]], in which
abstraction features heavily. I've not used GSL, and am skeptical of
its stated usefulness, but it looks like it is meant to help create
compile-time abstractions that likewise sit outside of any particular
programming language.
# TODO: Expand on this.
[[https://web.hypothes.is/][hypothes.is]] is a curious one that I find fascinating. They're trying
to factor out annotation and commenting from something that is handled
on a per-webpage basis and turn it into its own module, and I really
like what I've seen. However, it does not seem to have caught on
much.
The Unix tradition lives on in certain modern tools. [[https://stedolan.github.io/jq/][jq]] has proven
very useful anytime I've had to mess with JSON data. [[http://www.dest-unreach.org/socat/][socat]] and [[http://netcat.sourceforge.net/][netcat]]
have saved me numerous times. I'm sure certain people love the fact
that [[https://neovim.io/][Neovim]] is designed to be seamlessly embedded and to extend with
plugins. [[https://suckless.org/philosophy][suckless]] perhaps takes it too far, but gets an honorary
mention...
# ???
# Also, TCP/IP and the entire notion of packet-switched networks.
# And the entire OSI 7-layer model.
# Also, caches - of all types. (CPU, disk...)
# One key is how the above let you *reason* about things without
# knowing their specifics.
People know that I love Emacs, but I also believe many of the
complaints about how large it is. Although it is basically its own
operating system, /within this/ it has considerable modularity. The
same applies somewhat to Blender, I suppose.
Consider [[https://research.google.com/pubs/pub43146.html][Machine Learning: The High Interest Credit Card of Technical Debt]],
a paper that anyone working around machine learning should read and
re-read regularly. Large parts of the paper are about ways in which
machine learning conflicts with proper modularity and abstraction.
(However, [[https://colah.github.io/posts/2015-09-NN-Types-FP/][Neural Networks, Types, and Functional Programming]] is a
good post and shows some sorts of abstraction that do still exist, at
least in neural networks.)
Even DOS had useful abstractions. Things like
DriveSpace/DoubleSpace/Stacker worked well enough because most
software that needed files relied on DOS's normal abstractions to
access them - so it did not matter to them that the underlying
filesystem was actually compressed, or was actually a RAM disk, or was
on some obscure SCSI interface. Likewise, for the silliness known as
[[https://en.wikipedia.org/wiki/Expanded_memory][EMS]], applications that accessed memory through the EMS abstraction
could disregard whether it was a "real" EMS board providing access to
that memory, or an expanded memory manager providing indirect access
to some other memory - or even to a hard disk pretending to be
memory.
Even more abstractly: emulators work because so much software
respected the abstraction of some specific CPU and hardware platform.
Submitted without further comment:
https://github.com/stevemao/left-pad/issues/4
* Fragments
- Abstracting over...
- Multiple applications
- Multiple users
- Multiple CPUs
- Multiple hosts
- [[Notes - Paper, 2016-11-13]]
- Tanenbaum vs. Linus war & microkernels
- TBL: "The choice of language is a common design choice. The low
power end of the scale is typically simpler to design, implement and
use, but the high power end of the scale has all the attraction of
being an open-ended hook into which anything can be placed: a door
to uses bounded only by the imagination of the programmer. Computer
Science in the 1960s to 80s spent a lot of effort making languages
which were as powerful as possible. Nowadays we have to appreciate
the reasons for picking not the most powerful solution but the least
powerful. The reason for this is that the less powerful the
language, the more you can do with the data stored in that
language. If you write it in a simple declarative form, anyone can
write a program to analyze it in many ways." (Languages are a kind
of abstraction - one that influences how a module is written, and
what contexts it is useful in.)
- "Self" paper & structural reification?
- I'm still not sure how this relates, but it may perhaps relate to
how *not* to make things modular (structural reification is a sort
of check on the scope of objects/classes)
- What by Rich Hickey?
- Simple Made Easy?
- The Value of Values?
- SICP: [[https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-19.html#%25_chap_3][Modularity, Objects, and State]]
- [[https://www.cs.utexas.edu/~wcook/Drafts/2009/essay.pdf][On Understanding Data Abstraction, Revisited]]
- http://www.catb.org/~esr/writings/taoup/html/apb.html#Baldwin-Clark -
Carliss Baldwin and Kim Clark. Design Rules, Vol 1: The Power of
Modularity. 2000. MIT Press. ISBN 0-262-024667.
- Brooks, No Silver Bullet?
- https://en.wikipedia.org/wiki/Essential_complexity
- https://twitter.com/fchollet/status/962074070513631232
- [[https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-9.html#%25_chap_1][From SICP chapter 1 intro]]: "The acts of the mind, wherein it exerts
its power over simple ideas, are chiefly these three: 1. Combining
several simple ideas into one compound one, and thus all complex
ideas are made. 2. The second is bringing two ideas, whether simple
or complex, together, and setting them by one another so as to take
a view of them at once, without uniting them into one, by which it
gets all its ideas of relations. 3. The third is separating them
from all other ideas that accompany them in their real existence:
this is called abstraction, and thus all its general ideas are
made." -John Locke, An Essay Concerning Human Understanding (1690)
- One point I have ignored (maybe): You clearly separate the 'inside'
of a module (its implementation) from the 'outside' (that is - its
boundaries, the abstractions that it interfaces with or that it
implements) so that the 'inside' can change more or less freely
without having any effect on the outside.
- Abstractions as a way of reducing the work required to add
functionality (changes can be made just in the relevant modules, and
other modules do not need to change to conform)
- What is more key? Communication, information content, contracts,
details?
- [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][Don't repeat yourself]]
- [[https://simplyphilosophy.org/study/aristotles-definitions/][Aristotle & theory of definitions]]
- this isn't right. I need to find the quote in the Durant book
(which will probably have an actual source) that pertains to how
specific and how general a definition must be
- [[https://en.wikipedia.org/wiki/SOLID][SOLID]]
- [[https://en.wikipedia.org/wiki/Cross-cutting_concern][Cross-cutting concerns]] and [[https://en.wikipedia.org/wiki/Aspect-oriented_programming][Aspect-oriented programming]]
- [[https://en.wikipedia.org/wiki/Separation_of_concerns][Separation of Concerns]]
- [[https://en.wikipedia.org/wiki/Abstraction_principle_(computer_programming)][Abstraction principle]]
- [[https://en.wikipedia.org/wiki/Don%2527t_repeat_yourself][Don't repeat yourself]]

Binary file not shown (after: 369 KiB)

Binary file not shown (after: 256 KiB)

Binary file not shown (after: 124 KiB)


@@ -0,0 +1,373 @@
---
title: Explaining RetinaNet
author: Chris Hodapp
date: December 13, 2017
tags:
- technobabble
draft: true
---
# TODO: The inline equations are still broken (maybe because this is
# in org format)
# Above uses style from https://github.com/turboMaCk/turboMaCk.github.io/blob/develop/posts/2016-12-21-org-mode-in-hakyll.org
# and https://turbomack.github.io/posts/2016-12-21-org-mode-in-hakyll.html
# description:
# subtitle:
A paper came out in the past few months,
[[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]], from one of
Facebook's teams. The goal of this post is to
explain this paper as I work through it, through some of its
references, and one particular [[https://github.com/fizyr/keras-retinanet][implementation in Keras]].
* Object Detection
"Object detection" as it is used here refers to machine learning
models that can not just identify a single object in an image, but can
identify and *localize* multiple objects, like in the below photo
taken from
[[https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html][Supercharge your Computer Vision models with the TensorFlow Object Detection API]]:
# TODO:
# Define mAP
#+CAPTION: TensorFlow object detection example 2.
#+ATTR_HTML: :width 100% :height 100%
[[./2017-12-13-objdet.jpg]]
At the time of writing, the most accurate object-detection methods
were based around R-CNN and its variants, and all used two-stage
approaches:
1. One model proposes a sparse set of locations in the image that
probably contain something. Ideally, this contains all objects in
the image, but filters out the majority of negative locations
(i.e. only background, not foreground).
2. Another model, typically a CNN (convolutional neural network),
classifies each location in that sparse set as either being
foreground and some specific object class (like "kite" or "person"
above), or as being background.
Single-stage approaches were also developed, like [[https://pjreddie.com/darknet/yolo/][YOLO]], [[https://arxiv.org/abs/1512.02325][SSD]], and
OverFeat. These simplified/approximated the two-stage approach by
replacing the first step with brute force. That is, instead of
generating a sparse set of locations that probably have something of
interest, they simply handle all locations, whether or not they likely
contain something, by blanketing the entire image in a dense sampling
of many locations, many sizes, and many aspect ratios.
This is simpler and faster - but not as accurate as the two-stage
approaches.
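For a sense of scale, here's a toy sketch of that dense sampling (the
grid size, scales, and aspect ratios are numbers I made up, not from
any specific detector):

```python
import itertools

def dense_anchors(grid_w, grid_h, scales, aspect_ratios):
    # One candidate box per (grid position x scale x aspect ratio);
    # a one-stage detector classifies every single one of these.
    anchors = []
    for cx, cy in itertools.product(range(grid_w), range(grid_h)):
        for s, ar in itertools.product(scales, aspect_ratios):
            w = s * ar ** 0.5   # width/height chosen so that
            h = s / ar ** 0.5   # w/h == ar and area == s*s
            anchors.append((cx, cy, w, h))
    return anchors

a = dense_anchors(64, 64, scales=[32, 64, 128], aspect_ratios=[0.5, 1.0, 2.0])
print(len(a))  # 64 * 64 * 3 * 3 = 36864 locations to classify
```

Even for these modest numbers, tens of thousands of locations get
classified, nearly all of which are background.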
Methods like [[https://arxiv.org/abs/1506.01497][Faster R-CNN]] (not to be confused with Fast R-CNN... no, I
didn't come up with these names) merge the two models of two-stage
approaches into a single CNN, and exploit the possibility of sharing
computations that would otherwise be done twice. I assume that this
is included in the comparisons done in the paper, but I'm not entirely
sure.
* Training & Class Imbalance
Briefly, the process of training these models requires minimizing some
kind of loss function that is based on what the model misclassifies
when it is run on some training data. It's preferable to be able to
compute some loss over each individual instance, and add all of these
losses up to produce an overall loss. (Yes, far more can be said on
this, but the details aren't really important here.)
# TODO: What else can I say about why loss should be additive?
# Quote DL text? ML text?
This leads to a problem in one-stage detectors: That dense set of
locations that it's classifying usually contains a small number of
locations that actually have objects (positives), and a much larger
number of locations that are just background and can be very easily
classified as being in the background (easy negatives). However, the
loss function still adds all of them up - and even if the loss is
relatively low for each of the easy negatives, their cumulative loss
can drown out the loss from objects that are being misclassified.
That is: A large number of tiny, irrelevant losses overwhelm a smaller
number of larger, relevant losses. The paper was a bit terse on this;
it took a few re-reads to understand why "easy negatives" were an
issue, so hopefully I have this right.
The training process is trying to minimize this loss, and so it is
mostly nudging the model to improve where it least needs it (its
ability to classify background areas that it already classifies well)
and neglecting where it most needs it (its ability to classify the
"difficult" objects that it is misclassifying).
# TODO: Visualize this. Can I?
This is *class imbalance* in a nutshell, which the paper gives as the
limiting factor for the accuracy of one-stage detectors. Existing
approaches try to tackle it with methods like bootstrapping or hard
example mining, but their accuracy still falls short.
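To illustrate with made-up numbers: suppose 100,000 easy negatives are
each classified as background with probability 0.99, while 20 objects
are badly misclassified (probability 0.1 on the correct class). The
easy negatives still dominate the total loss:

```python
import math

def cross_entropy(p_t):
    # Cross-entropy on the probability assigned to the true class.
    return -math.log(p_t)

# 100,000 easy negatives, each contributing only a tiny loss...
easy_total = 100_000 * cross_entropy(0.99)

# ...versus 20 hard objects the model gets badly wrong.
hard_total = 20 * cross_entropy(0.1)

# The cumulative loss from easy negatives dwarfs the hard examples:
print(round(easy_total), round(hard_total))  # ~1005 vs ~46
```

Minimizing the sum mostly means shaving tiny amounts off the easy
negatives, exactly the behavior described above.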
** Focal Loss
So, the point of all this is: A tweak to the loss function can fix
this issue, and retain the speed and simplicity of one-stage
approaches while surpassing the accuracy of existing two-stage ones.
At least, this is what the paper claims. Their novel loss function is
called *Focal Loss* (as the title references), and it multiplies the
normal cross-entropy by a factor, \( (1-p_t)^\gamma \), where \( p_t \)
approaches 1 as the model predicts a higher and higher probability of
the correct classification, or 0 for an incorrect one, and \( \gamma \) is
a "focusing" hyperparameter (they used \( \gamma=2 \)). Intuitively, this
scaling makes sense: if a classification is already correct (as in the
"easy negatives"), \( (1-p_t)^\gamma \) tends toward 0, and so the portion
of the loss multiplied by it will likewise tend toward 0.
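To make that concrete, here is a minimal sketch of the focal loss term
in plain Python (my own toy version, not the paper's reference
implementation; I'm also leaving out the α-balancing weight the paper
combines it with):

```python
import math

def cross_entropy(p_t):
    # Standard cross-entropy on the probability of the true class.
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    # Cross-entropy scaled by the (1 - p_t)**gamma factor described above.
    return ((1.0 - p_t) ** gamma) * cross_entropy(p_t)

# An easy negative (p_t = 0.99) is down-weighted by (1-0.99)^2 = 1e-4;
# a hard example (p_t = 0.1) keeps (1-0.1)^2 = 81% of its loss.
print(focal_loss(0.99) / cross_entropy(0.99))  # ~1e-4
print(focal_loss(0.1) / cross_entropy(0.1))    # ~0.81
```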
* RetinaNet architecture
The paper gives the name *RetinaNet* to the network they created which
incorporates this focal loss in its training. While it says, "We
emphasize that our simple detector achieves top results not based on
innovations in network design but due to our novel loss," it is
important not to miss the phrase /innovations in/: they are saying that they
didn't need to invent a new network design - not that the network
design doesn't matter. Later in the paper, they say that it is in
fact crucial that RetinaNet's architecture relies on FPN (Feature
Pyramid Network) as its backbone. As far as I can tell, the
architecture's use of a variant of RPN (Region Proposal Network) is
also very important.
I go into both of these aspects below.
* Feature Pyramid Network
Another recent paper, [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]],
describes the basis of this FPN in detail (and, non-coincidentally I'm
sure, the paper shares 4 co-authors with the paper this post
explores). The paper is fairly concise in describing FPNs; it takes
only around 3 pages to explain their purpose, related work, and
their entire design. The remainder shows experimental results and
specific applications of FPNs. While it shows FPNs implemented on a
particular underlying network (ResNet, mentioned below), they were
made purposely to be very simple and adaptable to nearly any kind of
CNN.
To begin understanding this, start with [[https://en.wikipedia.org/wiki/Pyramid_%2528image_processing%2529][image pyramids]]. The below
diagram illustrates an image pyramid:
#+CAPTION: Source: https://en.wikipedia.org/wiki/File:Image_pyramid.svg
#+ATTR_HTML: :width 100% :height 100%
[[./1024px-Image_pyramid.svg.png]]
Image pyramids have many uses, but the paper focuses on their use in
taking something that works only at a certain scale of image - for
instance, an image classification model that only identifies objects
that are around 50 pixels across - and adapting it to handle different
scales by applying it at every level of the image pyramid. If the
model has a little flexibility, some level of the image pyramid is
bound to have scaled the object to the correct size that the model can
match it.
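As a toy sketch of the construction (plain Python over a nested list
standing in for a grayscale image; real code would use something like
OpenCV's pyrDown):

```python
def downsample2x(img):
    # Halve each dimension by averaging non-overlapping 2x2 blocks.
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x+1] + img[y+1][x] + img[y+1][x+1]) / 4.0
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def image_pyramid(img, levels=3):
    # Each level is half the size of the one below it.
    pyramid = [img]
    for _ in range(levels - 1):
        img = downsample2x(img)
        pyramid.append(img)
    return pyramid

img = [[float(x * y % 7) for x in range(8)] for y in range(8)]
pyr = image_pyramid(img)
print([(len(p), len(p[0])) for p in pyr])  # [(8, 8), (4, 4), (2, 2)]
```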
Typically, though, detection or classification isn't done directly on
an image, but rather, the image is converted to some more useful
feature space. However, these feature spaces likewise tend to be
useful only at a specific scale. This is the rationale behind
"featurized image pyramids", or feature pyramids built upon image
pyramids, created by converting each level of an image pyramid to that
feature space.
The problem with featurized image pyramids, the paper says, is that if
you try to use them in CNNs, they drastically slow everything down,
and use so much memory as to make normal training impossible.
However, take a look below at this generic diagram of a deep
CNN:
#+CAPTION: Source: https://commons.wikimedia.org/wiki/File:Typical_cnn.png
#+ATTR_HTML: :width 100% :height 100%
[[./Typical_cnn.png]]
You may notice that this network has a structure that bears some
resemblance to an image pyramid. This is because deep CNNs are
already computing a sort of pyramid in their convolutional and
subsampling stages. In a nutshell, deep CNNs used in image
classification push an image through a cascade of feature detectors or
filters, and each successive stage contains a feature map that is
built out of features in the prior stage - thus producing a *feature
hierarchy* which already is something like a pyramid and contains
multiple different scales. (Being able to train deep CNNs to jointly
learn the filters at each stage of that feature hierarchy from the
data, rather than engineering them by hand, is what sets deep learning
apart from "shallow" machine learning.)
When you move through levels of a featurized image pyramid, only scale
should change. When you move through levels of a feature hierarchy
described here, scale changes, but so does the meaning of the
features. This is the *semantic gap* the paper references. Meaning
changes because each stage builds up more complex features by
combining simpler features of the last stage. The first stage, for
instance, commonly handles pixel-level features like points, lines or
edges at a particular direction. In the final stage, presumably, the
model has learned complex enough features that things like "kite" and
"person" can be identified.
The goal in the paper was to find a way to exploit this feature
hierarchy that is already being computed and to produce something that
has similar power to a featurized image pyramid but without too high
of a cost in speed, memory, or complexity.
Everything described so far (none of which is specific to FPNs), the
paper calls the *bottom-up* pathway - the feed-forward portion of the
CNN. FPN adds to this a *top-down* pathway and some lateral
connections.
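Jumping ahead slightly, the two additions combine roughly as in this
minimal numpy sketch. The random matrices stand in for the learned 1x1
convolutions of the lateral connections, the upsampling is
nearest-neighbor, and the 3x3 convolution the paper applies to each
merged map is omitted; this is an illustration of the wiring, not the
actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels."""
    return x @ w  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)

def upsample2x(x):
    """Nearest-neighbor 2x spatial upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(bottom_up, d=256):
    """bottom_up: feature maps from fine to coarse, each (H, W, C_l).
    Returns merged maps at the same resolutions, all d channels.
    The 1x1 weights here are random stand-ins for learned parameters."""
    laterals = [conv1x1(c, rng.standard_normal((c.shape[-1], d)))
                for c in bottom_up]
    merged = [laterals[-1]]                 # start at the coarsest level
    for lat in reversed(laterals[:-1]):     # walk back toward finer levels
        merged.append(lat + upsample2x(merged[-1]))
    return merged[::-1]                     # fine-to-coarse again

# Hypothetical bottom-up maps with ResNet-like channel counts:
c3 = rng.standard_normal((8, 8, 512))
c4 = rng.standard_normal((4, 4, 1024))
c5 = rng.standard_normal((2, 2, 2048))
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)
```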
** Top-Down Pathway
** Lateral Connections
** As Applied to ResNet
# Note C=256 and such
# TODO: Link to some good explanations
For two reasons, I don't explain much about ResNet here. The first is
that residual networks, like the ResNet used here, have seen lots of
attention and already have many good explanations online. The second
is that the paper largely uses the underlying network as described in
its original papers:
- [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
- [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
* Anchors & Region Proposals
Recall what was said last section about feature maps, and that the
deeper stages of the CNN happen to be good for classifying images.
While these deeper stages are lower-resolution than the input images,
and while their influence is spread out over larger areas of the input
image (that is, their [[https://en.wikipedia.org/wiki/Receptive_field#In_the_context_of_neural_networks][receptive field]] is rather large due to each
stage spreading it a little further), the features here still maintain
a spatial relationship with the input image. That is, moving across
one axis of this feature map still corresponds to moving across the
same axis of the input image.
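How the receptive field "spreads a little further" at each stage can
be computed directly from each stage's kernel size and stride. A
minimal sketch, assuming a hypothetical cascade of 3x3, stride-2
stages (not the actual ResNet configuration):

```python
def receptive_field(stages):
    """stages: list of (kernel_size, stride). Returns the receptive
    field (in input pixels) and the stride (jump) of the final stage."""
    rf, jump = 1, 1
    for k, s in stages:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) jumps
        jump *= s              # and multiplies the spacing between outputs
    return rf, jump

# A hypothetical cascade of five 3x3, stride-2 stages:
rf, jump = receptive_field([(3, 2)] * 5)
print(rf, jump)  # 63 32
```

The second number is also the "jump": moving one point across this
feature map corresponds to moving 32 pixels across the input image.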
# Just re-explain the above with the feature pyramid
RetinaNet's design draws heavily from RPNs (Region Proposal
Networks); here I follow the explanation given in the paper [[https://arxiv.org/abs/1506.01497][Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks]]. I find the explanations in terms of "proposals", of
focusing the "attention" of the neural network, or of "telling the
neural network where to look" to be needlessly confusing and
misleading. I'd rather explain very plainly how they work.
Central to RPNs are *anchors*. Anchors aren't exactly a feature of
the CNN; they're more a property used in its training and inference.
In particular:
- Say that the feature pyramid has \( L \) levels, and that level \( l+1 \) is
half the resolution (thus double the scale) of level \( l \).
- Say that level \( l \) is a 256-channel feature map of size \( W \times H \)
  (i.e. it's a tensor with shape \( W \times H \times 256 \)). Note that
  \( W \) and \( H \) will be larger at lower levels and smaller at higher
  levels, but in RetinaNet, at least, always with 256-channel samples.
- For every point on that feature map (all \( WH \) of them), we can
identify a corresponding point in the input image. This is the
center point of a broad region of the input image that influences
this point in the feature map (i.e. its receptive field). Note that
as we move up to higher levels in the feature pyramid, these regions
grow larger, and neighboring points in the feature map correspond to
larger and larger jumps across the input image.
- We can make these regions explicit by defining *anchors* - specific
rectangular regions associated with each point of a feature map.
The size of the anchor depends on the scale of the feature map, or
equivalently, what level of the feature map it came from. All this
means is that anchors in level \( l+1 \) are twice as large as the
anchors of level \( l \).
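The correspondence in the list above can be sketched directly: every
feature-map point maps to an input-image center, and each level's
anchors are twice the size of the level below. The strides and base
sizes here are hypothetical, chosen only to obey that doubling rule:

```python
import numpy as np

def anchor_centers(W, H, stride):
    """Centers (in input-image pixels) for every point of a W x H
    feature map whose points are `stride` input pixels apart."""
    xs = (np.arange(W) + 0.5) * stride
    ys = (np.arange(H) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    return np.stack([cx.ravel(), cy.ravel()], axis=1)  # (W*H, 2)

# Hypothetical pyramid: level l has stride 2**l and base anchor size
# 4 * stride, so anchors at level l+1 are twice as large as at level l.
for level in (3, 4, 5):
    stride = 2 ** level
    print(level, stride, 4 * stride, len(anchor_centers(8, 8, stride)))
```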
The view that this should paint is that a dense collection of anchors
covers the entire input image at different sizes - still in a very
ordered pattern, but with lots of overlap. Remember how I mentioned
at the beginning of this post that one-stage object detectors use a
very "brute force" method?
My above explanation glossed over a couple things, but nothing that
should change the fundamentals.
- Anchors are actually associated with every 3x3 window of the
  feature map, not precisely every point, but all this really means
  is that it's "every point and its immediate neighbors" rather than
  "every point". This doesn't really matter to anchors, but matters
  elsewhere.
- It's not a single anchor per 3x3 window, but 9 anchors - one for
  each of three aspect ratios (1:2, 1:1, and 2:1) crossed with each
  of three scale factors (\( 1 \), \( 2^{1/3} \), and \( 2^{2/3} \)) on top of its base
  scale.
This is just to handle objects of less-square shapes and to cover
the gap in scale in between levels of the feature pyramid. Note
that the scale factors are evenly-spaced exponentially, such that an
additional step down wouldn't make sense (the largest anchors at the
pyramid level /below/ already cover this scale), and nor would an
additional step up (the smallest anchors at the pyramid level
/above/ already cover it).
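The 9 anchor shapes per location can be generated as below. This is a
sketch assuming the common area-preserving convention, where each
aspect ratio reshapes the box without changing its area;
implementations differ in exactly how the ratio is applied:

```python
import itertools

def anchor_shapes(base_size):
    """The 9 (width, height) pairs per location: 3 aspect ratios
    (1:2, 1:1, 2:1) crossed with 3 scale factors (1, 2**(1/3), 2**(2/3)).
    Each anchor keeps an area of (base_size * scale)**2."""
    shapes = []
    for ratio, scale in itertools.product([0.5, 1.0, 2.0],
                                          [2 ** (i / 3) for i in range(3)]):
        area = (base_size * scale) ** 2
        w = (area / ratio) ** 0.5   # ratio = height / width
        shapes.append((w, w * ratio))
    return shapes

shapes = anchor_shapes(32)
print(len(shapes))  # 9
```

Note the largest scale factor is \( 2^{2/3} \approx 1.59 \); one more
step would reach \( 2 \), the base scale of the next pyramid level up,
which is exactly the overlap the post describes avoiding.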
Here, finally, is where actual classification and regression come in,
via the *classification subnet* and the *box regression subnet*.
** Classification Subnet
Every anchor associates an image region with a 3x3 window (i.e. a
3x3x256 section - it's still 256-channel). The classification subnet
is responsible for learning: do the features in this 3x3 window,
produced from some input image, indicate that an object is inside
this anchor? Or, more accurately: for each of \( K \) object classes,
what is the probability that this anchor contains an object of that
class (or just background)?
** Box Regression Subnet
The box regression subnet takes the same input as the classification
subnet, but tries to learn the answer to a different question. It is
responsible for learning: what are the coordinates of the object
inside this anchor (assuming there is one)? More specifically, it
tries to learn to produce 4 values which give offsets relative to the
anchor's bounds (thus specifying a different region). Note that this
subnet completely ignores the class of the object.
The classification subnet already tells us whether or not a given
anchor contains an object - which already gives rough bounds on
it. The box regression subnet helps tighten these bounds.
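Concretely, assuming the standard parametrization from the Faster
R-CNN paper: the first two offsets shift the anchor's center in units
of its width and height, and the last two scale the width and height
exponentially. A minimal decoding sketch:

```python
def apply_box_offsets(anchor, t):
    """Decode (tx, ty, tw, th) offsets against an anchor (x1, y1, x2, y2),
    using the parametrization from the Faster R-CNN paper."""
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    tx, ty, tw, th = t
    cx, cy = cx + tx * w, cy + ty * h          # shift center, in anchor units
    w, h = w * 2.718281828459045 ** tw, h * 2.718281828459045 ** th  # exp scale
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Zero offsets leave the anchor unchanged:
print(apply_box_offsets((0, 0, 32, 32), (0.0, 0.0, 0.0, 0.0)))
```

The exponential on the size terms keeps predicted widths and heights
positive no matter what the subnet outputs.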
** Other notes (?)
I've glossed over a few details here. Everything I've described above
is implemented with bog-standard convolutional networks...
# Parameter sharing? How to explain?
* Training
# Ground-truth object boxes
# Intersection-over-Union thresholds
* Inference
# Top N results
* References
# Does org-mode have a way to make a special section for references?
# I know I saw this somewhere
1. [[https://arxiv.org/abs/1708.02002][Focal Loss for Dense Object Detection]]
2. [[https://arxiv.org/abs/1612.03144][Feature Pyramid Networks for Object Detection]]
3. [[https://arxiv.org/abs/1506.01497][Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks]]
4. [[https://arxiv.org/abs/1504.08083][Fast R-CNN]]
5. [[https://arxiv.org/abs/1512.03385][Deep Residual Learning for Image Recognition]]
6. [[https://arxiv.org/abs/1603.05027][Identity Mappings in Deep Residual Networks]]
7. [[https://openreview.net/pdf?id=SJAr0QFxe][Demystifying ResNet]]
8. [[https://vision.cornell.edu/se3/wp-content/uploads/2016/10/nips_camera_ready_draft.pdf][Residual Networks Behave Like Ensembles of Relatively Shallow Networks]]
9. https://github.com/KaimingHe/deep-residual-networks
10. https://github.com/broadinstitute/keras-resnet (keras-retinanet uses this)
11. [[https://arxiv.org/abs/1311.2524][Rich feature hierarchies for accurate object detection and semantic segmentation]] (contains the same parametrization as in the Faster R-CNN paper)
12. http://deeplearning.csail.mit.edu/instance_ross.pdf and http://deeplearning.csail.mit.edu/