Efficiency is Key: Lessons from the Human Brain.

January 13, 2016

The human brain is intensely complicated.  Memories, motor sequences, emotions, language, and more are all maintained and enacted solely through the fleeting transfer of energy between neurons: the slow release of neurotransmitters across synapses, dendritic integration, and finally the somatic spike.  A single spike (somatic action potential) lasts only about a millisecond, and yet somehow we are able to swing a baseball bat, compose a symphony, and recall memories from decades past.  How can our brain be built on signals of such short duration, yet operate on abstract concepts stretching across vast time scales?

In this blog, I hope to lay out some core foundations and research in computational neuroscience and machine learning which I feel will comprise the core components of an eventual artificially intelligent system.  I’ll argue that rate-based artificial neural networks (ANNs) have limited power, partially due to the removal of the important fourth dimension: time.  I also hope to highlight some important areas of research which could help bridge the gap from “useful tool” to “intelligent machine”.  I will not give a complete list of citations, as that would probably take me longer to compile than writing this blog, but I will occasionally mention references which I feel are important contributions or convey a concept well.

These views are my own opinion, formed after studying many related areas in computational neuroscience, deep learning, reservoir computing, neuronal dynamics, computer vision, and more.  This personal study is heavily complemented by my background in statistics and optimization, and 25 years of experience with computer programming, design, and algorithms.  Recently I have contributed MIT-licensed software for the awesome Julia programming language.  For those in New York City, we hope to see you at the next meetup!

See my bio for more information.


What is intelligence?

What does it mean to be intelligent?  Are dogs intelligent?  Mice?  What about a population of ants, working together toward a common goal?  I won’t give a definitive answer, and this is a topic which easily creates heated disagreement.  However, I will roughly assume that intelligence involves robust predictive extrapolation and generalization into new environments and patterns using historical context.  As an example, an intelligent agent would predict that it would sink slowly through mud, having only experienced the properties of dirt and water independently, while a Weak AI system would likely say “I don’t know… I’ve never seen that substance” or worse: “It’s brown, so I expect it will behave the same as dirt”.

Intelligence need not be human-like, though that is the easiest kind to understand.  I foresee intelligent agents sensing traffic patterns throughout a city and controlling stoplight timings, or financial regulatory agents monitoring transaction flow across markets and continents to pinpoint criminal activity.  In my eyes, these are sufficiently similar to a human brain which senses visual and auditory inputs and acts on the world through body mechanics, learning from the environment and experience as necessary.  While the sensorimotor components are obviously very different between these examples and humans, the core underlying (intelligent) algorithms may be surprisingly similar.


Some background: Neural Network Generations

I assume you have some basic understanding about artificial neural networks going forward, though a quick Google search will give you plenty of background for most of the research areas mentioned.

There is no clear consensus on the generational classification of neural networks.  Here I will take the following views:

- First generation: networks of binary threshold units (perceptrons and McCulloch-Pitts neurons), whose outputs are discrete values.
- Second generation: networks of units with continuous activation functions (sigmoid, tanh, ReLU, etc.), typically trained with gradient-based methods such as backpropagation; this covers most of modern deep learning (MLPs, CNNs, RNNs).
- Third generation: spiking neural networks (SNNs) and related models such as liquid state machines (LSMs), in which information is carried by the timing of individual spikes.

The first two generations of networks are static, in the sense that there is no explicit time component.  Of course, they can be made to represent time through additional structure (such as in RNNs) or transformed inputs (such as appending lagged inputs as in ARIMA-type models).  Network dynamics can be changed through learning, but that structure must be explicitly represented by the network designer.
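To make the “transformed inputs” idea concrete, here is a minimal sketch of unrolling a time series into lagged columns, so that a network with no explicit time dimension can still see recent history.  The lagged_inputs helper is hypothetical, written in Julia purely for this example.

```julia
# Hypothetical helper: turn a time series into a matrix of lagged inputs,
# so a static (first/second generation) network can be fed recent history.
function lagged_inputs(x::AbstractVector, nlags::Int)
    T = length(x)
    X = zeros(eltype(x), T - nlags, nlags + 1)
    for t in (nlags + 1):T, k in 0:nlags
        X[t - nlags, k + 1] = x[t - k]   # each row holds [x[t], x[t-1], ..., x[t-nlags]]
    end
    return X
end

x = sin.(0.1 .* (1:100))     # a toy time series
X = lagged_inputs(x, 3)      # 97×4 matrix: current value plus 3 lags per row
```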

In the last few years, there have been incredible advances in the expressive power of second generation networks.  Networks have been built which can approach or surpass human ability in object recognition, language translation, pattern recognition, and even creativity.  While this is impressive, most second generation networks have problems of fragility and scalability.  A network with hundreds of millions of parameters (or more) requires enormous amounts of computing power and labeled training samples to effectively learn its goal (such as this awesome network from OpenAI’s Andrej Karpathy).  This means that massive labeled data sets and compute power are required to create useful networks (which is why Google, Facebook, Apple, etc. are the companies currently winning this game).

I should note that none of the network types that I’ve listed are “brain-like”.  Most have only abstract similarities to a real network of cortical neurons.  First and second generation networks roughly approximate a “rate-based” model of neural activity, which means the instantaneous mean firing rate of a neuron is the only output, and the specific timings of neuronal firings are ignored.  Research areas like Deep Reinforcement Learning are worthwhile extensions to ANNs, as they get closer to the required brain functionality of an agent which learns through sensorimotor interaction with an environment; however, the current attempts do not come close to the varied dynamics found in real brains.

SNN and LSM networks incorporate specific spike timing as a core piece of their models; however, they still lack the functional expressiveness of the brain: dendritic computation, chemical energy propagation, synaptic delays, and more (which I hope to cover in more detail another time).  In addition, the added complexity makes interpretation of the network dynamics difficult.  HTM networks get closer to “brain-like” dynamics than many other models, however the choice of binary integration and binary outputs is a questionable trade-off for many real-world tasks, and it’s easy to wonder if they will beat finely tuned continuously differentiable networks in practical tasks.
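For readers unfamiliar with spiking models, below is a bare-bones leaky integrate-and-fire neuron in Julia.  The parameters are arbitrary illustrative values, not taken from any particular SNN or LSM formulation, but the sketch shows the essential shift: the output is a list of spike times rather than a single firing rate.

```julia
# A minimal leaky integrate-and-fire neuron (illustrative parameters only).
# The membrane potential leaks toward rest, integrates the input current,
# and emits a spike (then resets) whenever it crosses threshold.
function lif_spike_times(input_current::AbstractVector; dt=1e-3, τ=0.02,
                         v_rest=0.0, v_thresh=1.0, v_reset=0.0)
    v = v_rest
    spikes = Int[]                            # time steps at which spikes occur
    for (t, I) in enumerate(input_current)
        v += dt / τ * (v_rest - v) + dt * I   # leak + integrate
        if v >= v_thresh
            push!(spikes, t)                  # the spike time carries the information
            v = v_reset
        end
    end
    return spikes
end

spikes = lif_spike_times(fill(60.0, 1000))    # constant drive over 1000 time steps
```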


More background: Methods of Learning

There are two core types of learning: supervised and unsupervised.  In supervised learning, there is a “teacher” or “critic” which gives you input, compares your output to some “correct answer”, and gives you a numerical quantity representing your error.  The classic approach in second generation networks is to use backpropagation to project that error backwards through the network, updating each parameter based on its contribution to the resulting error.  The requirement of a (mostly) continuously differentiable error function and network topology is critical for backpropagation, as it uses a simple calculus trick known as the Chain Rule to update network weights.  This method works amazingly well when you have an accurate teacher with lots of noise-free examples.
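As a toy illustration of the chain rule at work, here is a single backpropagation step through one hidden sigmoid layer with a squared-error loss.  The function name, sizes, and learning rate are my own choices for this example.

```julia
# One backpropagation step for a tiny network: x -> h = σ(W1*x) -> ŷ = W2*h,
# with squared-error loss E = (ŷ - y)² / 2.  Gradients follow the chain rule.
σ(z) = 1 / (1 + exp(-z))

function backprop_step!(W1, W2, x, y; η=0.1)
    h   = σ.(W1 * x)                  # hidden activations
    ŷ   = (W2 * h)[1]                 # scalar output
    δ   = ŷ - y                       # ∂E/∂ŷ
    gW2 = δ .* h'                     # ∂E/∂W2
    δh  = vec(W2') .* δ               # error pushed back to the hidden layer
    gW1 = (δh .* h .* (1 .- h)) * x'  # chain rule through σ, since σ' = σ(1-σ)
    W1 .-= η .* gW1                   # gradient descent updates
    W2 .-= η .* gW2
    return (ŷ - y)^2 / 2              # loss before the update
end

W1, W2 = randn(3, 2), randn(1, 3)
loss = backprop_step!(W1, W2, [0.5, -1.0], 1.0)
```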

However, with sufficient noise in the data or the error signal, or too few training samples, ANNs are prone to overfitting (or worse).  Techniques such as Early Stopping or Dropout go a long way toward avoiding overfitting, but they may also restrict the expressive power of neural nets in the process.  Much research has gone into improving gradient-based learning rules, and advancements like AdaGrad, RMSProp, AdaDelta, Adam, and (my personal favorite) AdaMax have helped considerably in speeding up the learning process.  Finally, the relatively recent technique of Batch Normalization has improved the ability to train very deep networks.
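Since AdaMax gets a special mention, here is a sketch of its update rule as described in the Adam paper (Kingma & Ba, 2014).  The struct and function names are invented for this example, and the hyperparameters are the commonly cited defaults.

```julia
# AdaMax: the infinity-norm variant of Adam.  m tracks a decaying average of
# gradients; u tracks an exponentially weighted infinity norm of gradients.
mutable struct AdaMax
    α::Float64            # step size
    β1::Float64           # decay for the first-moment estimate
    β2::Float64           # decay for the infinity norm
    m::Vector{Float64}
    u::Vector{Float64}
    t::Int                # time step, used for bias correction of m
end

AdaMax(n::Int; α=0.002, β1=0.9, β2=0.999) = AdaMax(α, β1, β2, zeros(n), zeros(n), 0)

function update!(opt::AdaMax, θ::Vector{Float64}, g::Vector{Float64})
    opt.t += 1
    opt.m .= opt.β1 .* opt.m .+ (1 - opt.β1) .* g
    opt.u .= max.(opt.β2 .* opt.u, abs.(g))
    θ    .-= (opt.α / (1 - opt.β1^opt.t)) .* opt.m ./ (opt.u .+ 1e-8)   # small constant avoids division by zero
    return θ
end

θ, g = randn(5), randn(5)
update!(AdaMax(5), θ, g)    # one AdaMax step on a toy parameter vector
```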

With too few (or zero) “correct answers” to compare to, how does one learn?  How does a network know that a picture of a cat and its mirror image represent the same cat?  In unsupervised learning, we ask our network to compress the input stream to a reduced (and ideally invariant) representation.  Thus the mirror image of a cat could be represented as “cat + mirror”, avoiding a duplicate representation without throwing away important information.  In addition, the transformed input data will likely require a much smaller model to fit properly, as correlated or causal inputs can be reduced to fewer dimensions.  As a result, the reduced representation may require fewer training examples to train an effective model.

For linear models, statisticians and machine learning practitioners will frequently employ Principal Component Analysis (PCA) as a data preprocessing step, in an attempt to reduce the model complexity and the available degrees of freedom.  This is an example of simple and naive unsupervised learning, where relationships within the input data are revealed and exploited in order to extract a dataset which is easier to model.  In more advanced models, unsupervised learning may take the form of Restricted Boltzmann Machines or sparse autoencoders.  Convolution and pooling layers in CNNs could be seen as a type of unsupervised learning, as they strive to create partially translation-invariant representations of the input data.  Concepts like Spatial Transformer Networks, Ripple Pond Networks, and Geoff Hinton’s “Capsules” are similar transformative models which promise to be interesting areas of further research.
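Here is a minimal version of PCA-as-preprocessing, computed directly from the SVD of the centered data rather than any particular library routine; the helper name and matrix sizes are illustrative.

```julia
# Project centered data onto its top-k principal directions.
using LinearAlgebra, Statistics

function pca_reduce(X::Matrix{Float64}, k::Int)
    μ  = mean(X, dims=1)                # column means
    Xc = X .- μ                         # center the data
    F  = svd(Xc)                        # Xc = U * Diagonal(S) * V'
    V  = F.V[:, 1:k]                    # top-k principal directions (loadings)
    return Xc * V, V, μ                 # scores, loadings, means
end

X = randn(200, 10) * randn(10, 10)      # 200 samples with correlated features
Z, V, μ = pca_reduce(X, 3)              # reduced 200×3 representation
```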

After transforming the inputs, typically a smaller and simpler model can be used to fit the data (possibly a linear or logistic regression).  It has become common practice to combine these steps, for example by using Partial Least Squares (PLS) as an alternative to PCA + Regression.  In ANNs, weight initialization using sparse autoencoders has helped to speed learning and avoid local minima.  In reservoir computing, inputs are accumulated and aggregated over time in the reservoir, which allows for relatively simple readout models on complex time-varying data streams.
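To illustrate the reservoir computing idea, here is a tiny echo-state-style sketch: a fixed random recurrent reservoir accumulates the input history, and only a linear readout is fit by least squares.  The reservoir size, spectral radius, and input scaling are arbitrary choices for the example.

```julia
# A fixed random reservoir integrates the input over time; only the linear
# readout is trained.  All constants below are illustrative.
using LinearAlgebra

function reservoir_states(u::Vector{Float64}; n=50, ρ=0.9)
    W   = randn(n, n)
    W .*= ρ / maximum(abs.(eigvals(W)))     # rescale spectral radius below 1
    Win = 0.1 .* randn(n)                   # input weights
    X   = zeros(n, length(u))
    x   = zeros(n)
    for t in eachindex(u)
        x = tanh.(W * x .+ Win .* u[t])     # the reservoir state carries history
        X[:, t] = x
    end
    return X
end

u    = sin.(0.2 .* (1:300))                 # input signal
y    = circshift(u, -5)                     # target: the input 5 steps ahead
X    = reservoir_states(u)
Wout = X' \ y                               # least-squares linear readout
ŷ    = X' * Wout                            # readout prediction per time step
```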


Back to the (efficient) Future

With some background out of the way, we can continue to explore why the third generation of neural networks holds more promise than the current state of the art: efficiency.  Algorithms of the future will not have the pleasure of sitting in a server farm and crunching calculations through a static image dataset, sorting cats from dogs.  They will be expected to be worn on your wrist in remote jungles monitoring for poisonous species, or to guide an autonomous space probe through a distant asteroid field, or to swim through your bloodstream taking vital measurements and administering precise amounts of localized medicine to maintain homeostasis through illness.

Algorithms of the future must perform brain-like feats, extrapolating and generalizing from experience, while consuming minimal power, sporting a minimal hardware footprint, and making complex decisions continuously in real time.  Compared to the general-purpose computing power of the brain, current state-of-the-art methods fall far short in generalized performance, and require much more space, time, and energy.  Advances in data and hardware will only improve the situation slightly.  Incremental improvements in algorithms can have a great impact on performance, but we’re unlikely to see the gains in efficiency we need without drastic alterations to our methods.

The human brain is super-efficient for a few reasons:

- Communication happens through sparse, discrete spikes, with much of the information carried by their timing rather than by dense numerical values.
- Memory and computation are co-located in the same physical substrate (synapses, dendrites, and neurons), so there is no separate memory store or memory bus to traverse.
- Processing is massively parallel and largely event-driven, so energy is spent mostly where and when there is signal to process.
- The whole system runs on roughly 20 watts of power.

Side note: I highly recommend the book “Principles of Neural Design” by Peter Sterling and Simon Laughlin, as they reverse-engineer neural design, while keeping the topics easily digestible.


The efficiency of time

Morse Code was developed in the 1800s as a means of communicating over distance using only a single bit of data.  At a given moment, the bit is either ON (1) or OFF (0).  However, when the fourth dimension (time) is added to the equation, that single bit can be used to express arbitrary language and mathematics.  Theoretically, that bit could represent anything in existence (given infinitesimally small intervals and/or an infinite amount of time).

The classic phrase “SOS”, the international distress signal, is represented in Morse Code by a sequence of “3 dots, 3 dashes, 3 dots”, which could be compactly represented as a binary sequence (here a dot is “1”, a dash is “11”, and zeros separate elements and letters):

10101 00 11011011 00 10101

Here we see that we can represent “SOS” with 1 bit over 22 time-steps, or equivalently as a static binary sequence with 22 bits.  By moving storage and computation from space into time, we drastically change the scope of our problem.  Single-bit technologies (for example, a ship’s smokestack, or a flashlight) can now produce complex language when viewed through time.  For a given length of time T and time interval dT, a single bit can represent N = T / dT bits when viewed as a sequence through time.
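The encoding above is easy to reproduce in code.  The sketch below uses the same simplified convention as the sequence shown (dot = “1”, dash = “11”, a single “0” between the elements of a letter, “00” between letters); it is a toy, not a faithful Morse timing implementation.

```julia
# Encode "SOS" as a single bit unfolded through time, using the simplified
# convention above: dot = "1", dash = "11", "0" between elements, "00" between letters.
const MORSE = Dict('S' => ["1", "1", "1"], 'O' => ["11", "11", "11"])

encode_letter(c) = join(MORSE[c], "0")
encode_word(w)   = join(encode_letter.(collect(w)), "00")

seq = encode_word("SOS")
println(seq)            # 1010100110110110010101
println(length(seq))    # 22 time-steps for a single bit
```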

Moving storage and representation from space into time will allow for equivalent representations of data at a fraction of the resources.


Brain vs Computer

Computers (the Von Neumann type) are, in principle, universal: given enough time and memory they can compute anything computable, owing to the fact that they are Turing-complete.  However, they have major flaws of inefficiency:

- Memory is physically separated from processing, so data must constantly shuttle back and forth across a memory bus (the infamous Von Neumann bottleneck).
- Values are stored and manipulated in fixed-width words, wasting bits whenever fewer would suffice.
- Execution is driven by a global clock at a fixed speed, burning power whether or not there is useful work to do.
- Parallelism must be engineered explicitly, and remains tiny compared to billions of neurons operating concurrently.

It quickly becomes clear that we need specialized hardware to approach brain-like efficiency (and thus specialized algorithms that can take advantage of this hardware).  This hardware must be incredibly flexible, allowing for minimal bit usage, variable processing speed, and highly parallel computations.  Memory should be holistically interleaved through the core processing elements, eliminating the need for external memory stores (as well as the bottleneck that is the memory bus).  In short, the computers we need look nothing like the computers we have.


Sounds impossible… I give up

Not so fast!  There is a long way to go until true AGI is developed; however, technological progress tends to be exponential in ways that are impossible to predict.  I feel that we’re close to identifying the core algorithms that drive generalized intelligence.  Once identified, alternatives can be developed which make better use of our (inadequate) hardware.  Once intelligent (but still inefficient) algorithms can be demonstrated to outperform the current swath of “weak AI” tools in a robust and generalized way, specialized hardware will follow.  We live in a Renaissance for AI, and I expect exponential improvements and ground-breaking discoveries in the years to come.


What next?

There are several important areas of research which I feel will contribute to identifying those core algorithms comprising intelligence.  I’ll highlight the importance of:

- spiking networks and the information carried by precise spike timing
- richer neuron models, including dendritic computation and synaptic delays
- unsupervised learning of compact, invariant representations
- agents that learn through continuous sensorimotor interaction with an environment (as in deep reinforcement learning)
- reservoir computing and other ways of embedding memory in network dynamics
- specialized, massively parallel hardware with memory interleaved through the processing elements


Summary

The human brain is complex and powerful, and the neural networks of today are too inefficient to be the model of future intelligent systems.  We must focus energy on new algorithms and network structures, incorporating efficiency through time in novel ways.  I plan to expand on some of these topics in future posts, and begin to discuss the components and connectivity that could compose the next generation of neural networks.