Learning without Backpropagation: Intuition and Ideas (Part 1)

November 22, 2016

For the last 30 years, artificial neural networks have overwhelmingly been trained by a technique called backpropagation. This method is correct, intuitive, and easy to implement in both software and hardware (with specialized routines available for GPU computing). However, there are downsides to the method. It can cause practical instabilities in the learning process due to vanishing and exploding gradients. It is inherently sequential in design: a full forward pass must finish before the loss can be computed, and only then can the equally sequential backward pass begin. This sequential requirement makes parallelizing large networks (in space and/or time) difficult. Finally, it may be too conservative a method to achieve our true goal: optimizing out-of-sample global loss. For these reasons, we explore the possibility of learning networks without backpropagation.


Backpropagation is fairly straightforward, making use of some basic calculus (partial derivatives and the chain rule) and matrix algebra. Its goal: to measure the effect of internal parameters of a model on the final loss (or cost) of our objective. This value is called the gradient of the loss with respect to our parameters. For input $x$, target $y$, model $f$, loss $L$, and model parameters $\theta$:

\[\nabla_\theta = \frac{\partial L(f(x,\theta), y)}{\partial \theta} = \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial \theta}\]


The symmetry of the forward and backward pass

For a commonly used linear transformation (where $y$ now denotes the layer's output rather than the target):

\[\begin{align} y &= w x + b \nonumber \\ \theta &= \{w, b\} \nonumber \end{align}\]

we typically multiply the backpropagated gradient by the transpose of the weight matrix: $w^T$. This is the mathematically correct approach to computing the exact gradient:

\[\begin{align} \nabla_w &= \nabla_y x^T \nonumber \\ \nabla_b &= \nabla_y \nonumber \\ \nabla_x &= w^T \nabla_y \nonumber \end{align}\]
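The three gradient formulas above can be checked numerically. The following is a minimal NumPy sketch, not code from the original post: the half squared error loss, the layer sizes, and the helper names `linear_forward` and `linear_backward` are assumptions made for illustration. It compares the analytic $\nabla_x = w^T \nabla_y$ against a central finite-difference estimate.

```python
import numpy as np

def linear_forward(w, b, x):
    """y = w x + b for a single input column vector x."""
    return w @ x + b

def linear_backward(w, x, grad_y):
    """Gradients of the loss w.r.t. w, b and x, given grad_y = dL/dy."""
    grad_w = grad_y @ x.T      # nabla_w = nabla_y x^T
    grad_b = grad_y            # nabla_b = nabla_y
    grad_x = w.T @ grad_y      # nabla_x = w^T nabla_y
    return grad_w, grad_b, grad_x

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))
b = rng.normal(size=(3, 1))
x = rng.normal(size=(4, 1))
target = rng.normal(size=(3, 1))

def loss(w, b, x):
    # Assumed loss for this check: half squared error against a fixed target.
    return 0.5 * np.sum((linear_forward(w, b, x) - target) ** 2)

grad_y = linear_forward(w, b, x) - target   # dL/dy for the loss above
grad_w, grad_b, grad_x = linear_backward(w, x, grad_y)

# Central finite-difference estimate of dL/dx, one coordinate at a time.
eps = 1e-6
numeric_grad_x = np.zeros_like(x)
for i in range(x.shape[0]):
    step = np.zeros_like(x)
    step[i] = eps
    numeric_grad_x[i] = (loss(w, b, x + step) - loss(w, b, x - step)) / (2 * eps)

max_error = float(np.max(np.abs(grad_x - numeric_grad_x)))
print("max |analytic - numeric| for grad_x:", max_error)
```

The same finite-difference loop works for `grad_w` and `grad_b` if you perturb those arrays instead of `x`.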

All was right with the world, until I read "Random feedback weights support learning in deep neural networks". In it, Lillicrap et al. describe how you can replace $w^T$ with a random and fixed matrix $B$ during the backward pass so that:

\[\nabla_x = B \nabla_y\]

And the network will still learn. WTF?! How could you possibly learn something from random feedback? It was as if I was no longer Mr. Anderson, but instead some dude sitting in goo with tubes coming out of my body. Why did I just choose the red pill?

To gain intuition about how random feedback can still support learning, I had to step back and consider the function of neural nets. First, there are many exactly equivalent formulations of a neural net. Just reorder the nodes of a hidden layer (along with the corresponding rows of its weight matrix and bias vector, and the columns of the downstream weights), and you’ll get the exact same output. If you also allow small offsetting perturbations in the weights and biases, the final output will be relatively close (in a non-rigorous sense). There are countless ways to alter the parameters in order to produce effectively equivalent models.
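The reordering argument is easy to verify. Here is a small NumPy sketch, with an assumed one-hidden-layer `tanh` network and made-up sizes, that permutes the hidden units and compares outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 2

# A tiny network with one hidden layer (sizes chosen arbitrarily).
w1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=(n_hidden, 1))
w2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=(n_out, 1))

def net(x, w1, b1, w2, b2):
    h = np.tanh(w1 @ x + b1)
    return w2 @ h + b2

# Reorder the hidden units: rows of w1 and b1, columns of w2.
perm = rng.permutation(n_hidden)
w1_p, b1_p = w1[perm], b1[perm]
w2_p = w2[:, perm]

x = rng.normal(size=(n_in, 1))
out_original = net(x, w1, b1, w2, b2)
out_permuted = net(x, w1_p, b1_p, w2_p, b2)

# Equal up to floating-point summation order.
print(np.allclose(out_original, out_permuted))
```

With $n$ hidden units there are $n!$ such reorderings per layer, all of which describe the same function with different parameter vectors.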

Intuitively we may think that we are optimizing a function like:

But in reality, we are optimizing a function like:

Vanilla backpropagation optimizes towards the surface valley (local minimum) closest to the initial random weights. Random feedback first randomly picks a valley and instead optimizes towards that. In the words of Lillicrap et al.:

The network learns how to learn – it gradually discovers how to use B, which then allows effective modification of the hidden units. At first, the updates to the hidden layer are not helpful, but they quickly improve by an implicit feedback process that alters W so that $e^T W B e > 0$.

We implicitly learn how to bring $W^T$ into approximate alignment with $B$ before searching for the local minimum. Wow.
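This alignment can be watched happening in a toy setting. The sketch below is an illustration under my own assumptions, not a reproduction of the paper's experiments: a two-layer linear network, a synthetic regression target generated by a random linear map, full-batch gradient descent, and arbitrary sizes and learning rate. The only change from backpropagation is the line that computes `grad_H` with the fixed matrix `B` instead of `W2.T`. The cosine between `B` and `W2.T` (both flattened) is used as a rough proxy for the alignment described in the quote.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_samples = 10, 20, 5, 200

# Synthetic regression task: targets come from a random linear map (assumed).
X = rng.normal(size=(n_in, n_samples))
T = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
Y = T @ X

# Forward weights start near zero; B is random and never updated.
W1 = 0.01 * rng.normal(size=(n_hidden, n_in))
W2 = 0.01 * rng.normal(size=(n_out, n_hidden))
B = rng.normal(size=(n_hidden, n_out)) / np.sqrt(n_hidden)

def cosine(a, b):
    """Cosine of the angle between two matrices viewed as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_loss(W1, W2):
    err = W2 @ (W1 @ X) - Y
    return float(0.5 * np.mean(np.sum(err ** 2, axis=0)))

initial_loss = mean_loss(W1, W2)
initial_alignment = cosine(B, W2.T)

lr = 0.05
for step in range(2000):
    H = W1 @ X                    # hidden activations (linear units)
    err = W2 @ H - Y              # dL/d(output) for half squared error
    grad_W2 = err @ H.T / n_samples
    grad_H = B @ err              # random feedback in place of W2.T @ err
    grad_W1 = grad_H @ X.T / n_samples
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

final_loss = mean_loss(W1, W2)
final_alignment = cosine(B, W2.T)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"cos(B, W2^T): {initial_alignment:.3f} -> {final_alignment:.3f}")
```

Swapping the `grad_H` line back to `W2.T @ err` gives ordinary backpropagation, which makes for an easy side-by-side comparison on the same data.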

Relationship to Dropout

Trying to intuit random feedback felt similar to my process of understanding Dropout. In a nutshell, Dropout simulates an ensemble model of $2^n$ sub-models, each composed of a subset of the nodes of the original network. It avoids overfitting by spreading the responsibility. Except it’s not a real ensemble (where models are typically learned independently), but instead some sort of co-learned ensemble. The net effect is a less powerful, but usually more general, network. The explanation I most frequently see for why Dropout is beneficial is: “it prevents co-adaptation of latent features”. Put another way, it prevents a network from acquiring weights that force a dependence between latent (hidden) features.
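For readers who have not seen Dropout in code, here is a minimal NumPy sketch. It assumes the common "inverted dropout" variant, where surviving activations are rescaled by `1 / keep_prob` during training so nothing needs to change at test time; the function name and arguments are illustrative, not taken from any particular library.

```python
import numpy as np

def dropout(h, keep_prob, rng, training=True):
    """Randomly zero hidden activations (inverted dropout, assumed variant)."""
    if not training:
        return h                              # full network at test time
    mask = rng.random(h.shape) < keep_prob    # each unit kept with prob keep_prob
    return h * mask / keep_prob               # rescale to preserve the expected value

rng = np.random.default_rng(0)
h = np.ones((6, 1))

dropped = dropout(h, keep_prob=0.5, rng=rng)
print(dropped.ravel())                        # entries are either 0.0 or 2.0

# Every distinct mask selects one sub-model; n units give 2**n possible masks.
n_submodels = 2 ** h.shape[0]
print("possible sub-models:", n_submodels)
```

Each training step samples a fresh mask, so each step trains a different sub-model that shares its weights with all the others.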

In backpropagation, the gradients of early layers can vanish or explode unless the weights and gradients of later layers are highly controlled. In a sense, we see co-adaptation of weights across layers. The weights of earlier layers are highly dependent on the weights of later layers (and vice versa). Dropout reduces co-adaptation among intra-layer neurons, and random feedback may help reduce co-adaptation among inter-layer weights.

Benefits of breaking the symmetry

Some will claim that random feedback is more biologically plausible, and this is true. However, the real benefit comes from breaking the fragile dependence of early-layer weight updates on later-layer weights. The construction of the fixed feedback matrices $\{B_1, B_2, \ldots, B_M\}$ can be managed as additional hyperparameters and has the potential to stabilize the learning process.
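One way to read "managed as hyperparameters" is that the seed, scale, and distribution of each $B_i$ are chosen up front and then frozen. The helper below is a hypothetical sketch along those lines; the function name, the Gaussian distribution, the $1/\sqrt{n}$ scaling, and the example layer sizes are all assumptions for illustration.

```python
import numpy as np

def make_feedback_matrices(layer_sizes, scale=1.0, seed=0):
    """Return one fixed random matrix B_i per weight layer.

    layer_sizes = [n_0, n_1, ..., n_M]; the forward weight W_i has shape
    (n_i, n_{i-1}), so B_i takes the shape of W_i^T: (n_{i-1}, n_i).
    """
    rng = np.random.default_rng(seed)
    feedback = []
    for n_prev, n_next in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Dividing by sqrt(n_next) is an assumed choice that keeps the
        # fed-back signal at roughly the same magnitude across layers.
        feedback.append(scale * rng.normal(size=(n_prev, n_next)) / np.sqrt(n_next))
    return feedback

Bs = make_feedback_matrices([784, 256, 64, 10], scale=0.5, seed=42)
for i, B in enumerate(Bs, start=1):
    print(f"B_{i}: shape {B.shape}, std {B.std():.3f}")
```

Because the matrices depend only on `layer_sizes`, `scale`, and `seed`, they can be searched over like any other hyperparameter and regenerated exactly for a repeat run.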


Random feedback is just the tip of the iceberg when it comes to backprop-free learning methods (though random feedback is more of a slight modification of backprop). In part 2, I’ll review more research directions that have been explored, and highlight some particularly interesting recent methods which I intend to extend and incorporate into the JuliaML ecosystem.