<p><a href="http://www.breloff.com/">Tom Breloff</a>: Thoughts from my quest for Artificial General Intelligence</p>
<h1 id="no-backprop-part2">Learning without Backpropagation: Intuition and Ideas (Part 2)</h1>
<p><em>2016-12-03</em> · <a href="http://www.breloff.com/no-backprop-part2">http://www.breloff.com/no-backprop-part2</a></p>
<p>In <a href="/no-backprop">part one</a>, we peeked into the rabbit hole of backprop-free network training with asymmetric random feedback. In this post, we’ll jump into the rabbit hole with both feet. First I’ll demonstrate how it is possible to learn by “gradient” descent with <strong>zero-derivative activations</strong>, where <strong>learning by backpropagation is impossible</strong>. The technique is a modification of Direct Feedback Alignment. Then I’ll review several different (but unexpectedly related) research directions: targetprop, e-prop, and synthetic gradients, which set up my ultimate goal: efficient training of arbitrary recurrent networks.</p>
<h2 id="direct-feedback-alignment">Direct Feedback Alignment</h2>
<p>In recent research from <a href="https://arxiv.org/abs/1609.01596" title="Direct Feedback Alignment Provides Learning in Deep Neural Networks (Nøkland 2016)">Arild Nøkland</a>, he explores extensions to random feedback (see <a href="/no-backprop">part one</a>) that avoid backpropagating error signals sequentially through the network. Instead, he proposes Direct Feedback Alignment (DFA) and Indirect Feedback Alignment (IFA) which connect the final error layer directly to earlier hidden layers through random feedback connections. Not only are they more convenient for error distribution, but they are more biologically plausible as there is no need for weight symmetry <strong>or</strong> feedback paths that match forward connectivity. A quick tutorial on the method:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/AZ0emAUIkw0" frameborder="0" allowfullscreen=""></iframe>
<h2 id="learning-through-flat-activations">Learning through flat activations</h2>
<p>In this post, we’re curious whether we can find a surrogate-gradient algorithm that handles threshold activations. <a href="https://arxiv.org/abs/1609.01596" title="Direct Feedback Alignment Provides Learning in Deep Neural Networks (Nøkland 2016)">Nøkland</a> connects the direct feedback paths from the output error gradient to each “layer output”, which in this case is the <strong>output of the activation functions</strong>. However, we want to use activation functions with zero derivative, so even with direct feedback the gradients would be zeroed as they pass through the activations.</p>
<p>To get around this issue, we modify DFA to connect the error layer directly to the <strong>inputs of the activations</strong> rather than the outputs. The result is that we have affine transformations which can learn to connect latent input ($h_{i-1}$ from earlier layers) to a projection of output error ($B_i \nabla y$) into the space of $h_i$, <strong>before</strong> applying the threshold nonlinearity. The effect of applying a nonlinear activation is “handled” by the progressive re-learning of later network layers. Effectively, each layer <strong>learns how to align its inputs with a fixed projection of the error</strong>. The hope is that, by aligning layer input with final error gradients, we can <strong>project the inputs to a space that is useful for later layers</strong>. Learning happens in parallel, and later layers eventually adjust to the learning that happens in the earlier layers.</p>
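<p>The modified update can be sketched in a few lines. This is a hypothetical plain-Python illustration (the post’s actual implementation uses JuliaML; all names here are made up): the hidden layer uses a step activation with zero derivative everywhere, yet its weights are still updated by projecting the output error directly to the activation <em>input</em> through a fixed random matrix.</p>

```python
import random

random.seed(1)

def threshold(v):
    # Step activation: derivative is zero almost everywhere, so backprop stops here.
    return [1.0 if vi > 0.0 else 0.0 for vi in v]

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def rand_mat(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

n_in, n_hid, n_out = 4, 6, 2
W1, W2 = rand_mat(n_hid, n_in), rand_mat(n_out, n_hid)
B1 = rand_mat(n_hid, n_out)  # fixed random feedback: output error -> layer-1 pre-activation

def train_step(x, t, lr=0.05):
    a1 = matvec(W1, x)        # pre-activation (the *input* of the activation)
    h1 = threshold(a1)        # flat activation
    y = matvec(W2, h1)        # linear output layer
    e = [yi - ti for yi, ti in zip(y, t)]  # output error gradient (squared loss, up to 2x)
    da1 = matvec(B1, e)       # project the error directly to the activation input
    for i in range(n_out):
        for j in range(n_hid):
            W2[i][j] -= lr * e[i] * h1[j]
    for i in range(n_hid):
        for j in range(n_in):
            W1[i][j] -= lr * da1[i] * x[j]
    return sum(ei * ei for ei in e)

x, t = [1.0, -1.0, 0.5, 0.0], [1.0, 0.0]
first = train_step(x, t)
for _ in range(200):
    last = train_step(x, t)
print(first > last)  # the squared error shrinks despite the flat activation
```

<p>Note that the zero derivative of <code>threshold</code> is never consulted: the search direction for <code>W1</code> comes entirely from the fixed projection <code>B1</code>.</p>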
<iframe width="560" height="315" src="https://www.youtube.com/embed/KwHG3O8ttcc" frameborder="0" allowfullscreen=""></iframe>
<h2 id="mnist-with-modified-dfa">MNIST with Modified DFA</h2>
<p>Reusing the approach in <a href="/JuliaML-and-Plots">an earlier post on JuliaML</a>, we will attempt to learn neural network parameters both with backpropagation and our modified DFA method. The combination of Plots and JuliaML makes digging into network internals and building custom learning algorithms super-simple, and the DFA learning algorithm was fairly quick to implement. The full notebook can be found <a href="https://github.com/tbreloff/notebooks/blob/master/juliaml_mnist_nobackprop.ipynb">here</a>. To ease understanding, I’ve created a video to review the notebook, method, and preliminary results:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/CCcFznBBElA" frameborder="0" allowfullscreen=""></iframe>
<p>Nice animations can be built using the super-convenient animation facilities of <a href="https://juliaplots.github.io">Plots</a>:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/cFOmmMKn_Gc" frameborder="0" allowfullscreen=""></iframe>
<h2 id="target-propagation">Target Propagation</h2>
<p>The concept of Target Propagation (targetprop) goes back to <a href="http://link.springer.com/chapter/10.1007/978-3-642-82657-3_24" title="Learning Process in an Asymmetric Threshold (LeCun 1987)">LeCun 1987</a>, but has recently been explored in depth in <a href="https://arxiv.org/abs/1407.7906" title="How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation (Bengio 2014)">Bengio 2014</a>, <a href="https://arxiv.org/abs/1412.7525" title="Difference Target Propagation (Lee et al 2014)">Lee et al 2014</a>, and <a href="https://arxiv.org/abs/1502.04156" title="Towards Biologically Plausible Deep Learning (Bengio et al 2015)">Bengio et al 2015</a>. The intuition is simple: instead of focusing solely on the “forward-direction” model ($y = f(x)$), we also try to fit the “backward-direction” model ($x = g(y)$). $f$ and $g$ form an auto-encoding relationship; $f$ is the <strong>encoder</strong>, creating a latent representation and predicted outputs given inputs $x$, and $g$ is the <strong>decoder</strong>, generating input representations/samples from latent/output variables.</p>
<p><a href="https://arxiv.org/abs/1407.7906" title="How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation (Bengio 2014)">Bengio 2014</a> iteratively adjusts weights to push latent outputs $h_i$ towards the <strong>targets</strong>. The final layer adjusts towards useful final targets using the output gradients as a guide:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{h}_i &= g_i (\hat{h}_{i+1}) \nonumber \\
\Delta \hat{h}_M &= -\eta \frac{\partial L}{\partial y_M} \nonumber
\end{align} %]]></script>
<p><a href="https://arxiv.org/abs/1412.7525" title="Difference Target Propagation (Lee et al 2014)">Difference Target Propagation</a> makes a slight adjustment to the update, and attempts to learn auto-encoders which fulfill:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{h}_i - h_i &= g_i (\hat{h}_{i+1}) - g_i (h_{i+1}) \nonumber
\end{align} %]]></script>
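<p>Rearranged, the difference target propagation condition gives the layer target $\hat{h}_i = h_i + g_i(\hat{h}_{i+1}) - g_i(h_{i+1})$. A small plain-Python sketch (toy decoder and names are illustrative, not from the papers) shows the key property: when the upper target equals the upper activation, the lower target reduces to $h_i$ itself, even if $g_i$ is an imperfect inverse of the forward mapping.</p>

```python
# Difference target propagation: the layer-i target adds the *difference* of
# decoded values, h_hat_i = h_i + g(h_hat_next) - g(h_next).
def dtp_target(h_i, h_next, h_hat_next, g):
    return [hi + a - b for hi, a, b in zip(h_i, g(h_hat_next), g(h_next))]

# Toy (imperfect) decoder, for illustration only:
g = lambda v: [0.5 * vi + 0.25 for vi in v]

h_i = [0.25, -0.5]
h_next = [1.0, 0.5]

# No change requested upstairs -> no change requested here:
print(dtp_target(h_i, h_next, h_next, g))        # [0.25, -0.5]
# An upstairs target shifts the lower target by the decoded difference:
print(dtp_target(h_i, h_next, [1.5, 0.5], g))    # [0.5, -0.5]
```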
<p>Finally, <a href="https://arxiv.org/abs/1502.04156" title="Towards Biologically Plausible Deep Learning (Bengio et al 2015)">Bengio et al 2015</a> extend targetprop to a Bayesian/generative setting, in which they attempt to reduce divergence between generating distributions p and q, such that the pair of conditionals form a denoising auto-encoder:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h_i &\sim p (h_i | \hat{h}_{i+1}) \nonumber \\
h_i &\sim q (h_i | \hat{h}_{i-1}) \nonumber
\end{align} %]]></script>
<p>Targetprop (and its variants/extensions) is a nice alternative to backpropagation. There are still sequential forward and backward passes through the layers; however, we:</p>
<ul>
<li>avoid the issues of vanishing and exploding gradients, and</li>
<li>focus on the role of intermediate layers: creating latent representations of the input which are useful in the context of the target values.</li>
</ul>
<h2 id="equilibrium-propagation">Equilibrium Propagation</h2>
<p><a href="https://arxiv.org/abs/1602.05179" title="Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation (Scellier and Bengio 2016)">Equilibrium Propagation</a> (e-prop) is a relatively new approach which (I’m not shy to admit) I’m still trying to get my head around. As I understand it, it uses an iterative process of perturbing components towards improved values and allowing the network dynamics to settle into a new equilibrium. The proposed algorithm alternates between phases of “learning” in a forward and backward direction, though it is a departure from the simplicity of backprop and optimization.</p>
<p>The concepts are elegant, and the approach offers many potential advantages for efficient learning of very complex networks. However, it will be a long time before those efficiencies are realized, given the trend towards massively parallel GPU computation. I’ll follow this line of research with great interest, but I don’t expect it to be used in a production setting in the near future.</p>
<h2 id="synthetic-gradients">Synthetic Gradients</h2>
<p>A <a href="https://arxiv.org/abs/1608.05343" title="Decoupled Neural Interfaces using Synthetic Gradients (Jaderberg et al 2016)">recent paper</a> from DeepMind takes an interesting approach: what if we use complex models to estimate useful surrogate gradients for our layers? Their focus is primarily on “unlocking” (i.e. parallelizing) the forward, backward, and update steps of a typical backpropagation algorithm. However, they also offer the possibility of estimating (un-truncated) Backpropagation Through Time (BPTT) gradients, which would be a big win.</p>
<p>Each layer outputs to a local model, called a Decoupled Neural Interface, which estimates the backpropagated gradient that would be used to update that layer’s parameters, using only the layer outputs and the target vectors. If you’re like me, you noticed the similarity to DFA: both model the relationship between the local layer and the final targets in order to choose a search direction which is useful for improving the final network output.</p>
<h2 id="what-next">What next?</h2>
<p>I think the path forward will be combinations and extensions of the ideas presented here. In the spirit of synthetic gradients and direct feedback, I think we should be searching for reliable alternatives to backpropagation which are:</p>
<ul>
<li>Highly parallel</li>
<li>Asymmetric</li>
<li>Local in time and space</li>
</ul>
<p>Obviously they must still enable learning, and efficient/simple solutions are preferred. I like the concept of synthetic gradients, but wonder if they are optimizing the wrong objective. I like direct feedback, but wonder if there are alternate ways to initialize or update the projection matrices ($B_1, B_2, …$). Combining the concepts, can we add non-linearities to the error projections (direct feedback) and learn a more complex (and hopefully more useful) layer?</p>
<p>There is a lot to explore, and I think we’re just at the beginning. I, for one, am happy that I chose the red pill.</p>
<h1 id="no-backprop">Learning without Backpropagation: Intuition and Ideas (Part 1)</h1>
<p><em>2016-11-22</em> · <a href="http://www.breloff.com/no-backprop">http://www.breloff.com/no-backprop</a></p>
<p>For the last 30 years, artificial neural networks have overwhelmingly been trained by a technique called backpropagation. This method is correct, intuitive, and easy to implement in both software and hardware (with specialized routines available for GPU computing). However, there are downsides to the method. It can cause practical instabilities in the learning process due to vanishing and exploding gradients. It is inherently sequential in design; one must complete a full sequential forward pass before computing a loss, after which you can begin your sequential backward pass. This sequential requirement makes parallelizing large networks (in space and/or time) difficult. Finally, it may be too conservative a method to achieve our true goal: optimizing out-of-sample global loss. For these reasons, we explore the possibility of learning networks without backpropagation.</p>
<h2 id="backpropagation">Backpropagation</h2>
<p>Backpropagation is fairly straightforward, making use of basic calculus (partial derivatives and the chain rule) and matrix algebra. Its goal is to measure the effect of the internal parameters of a model on the final loss (or cost) of our objective. This value is called the <strong>gradient</strong> of our parameters with respect to the <strong>loss</strong>. For input x, target y, model f, and model parameters $\theta$:</p>
<script type="math/tex; mode=display">\nabla_\theta = \frac{\partial L(f(x,\theta), y)}{\partial \theta} = \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial \theta}</script>
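<p>To make the chain rule above concrete, here is a scalar worked example as a hypothetical plain-Python sketch (not from the post, which uses Julia): $f(x, \theta) = \theta x$ with squared loss, so $\nabla_\theta = (f - y)\,x$, verified against a finite difference.</p>

```python
# Scalar instance of the chain rule: L(f(x, theta), y) with
# f(x, theta) = theta * x and L(f, y) = 0.5 * (f - y)^2.
def f(x, theta):
    return theta * x

def loss(fx, y):
    return 0.5 * (fx - y) ** 2

def grad_theta(x, theta, y):
    # dL/dtheta = (dL/df) * (df/dtheta) = (f - y) * x
    return (f(x, theta) - y) * x

x, theta, y, eps = 2.0, 0.7, 1.0, 1e-6
analytic = grad_theta(x, theta, y)
numeric = (loss(f(x, theta + eps), y) - loss(f(x, theta - eps), y)) / (2 * eps)
print(analytic, abs(analytic - numeric) < 1e-6)  # analytic ≈ 0.8; the check prints True
```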
<p>For a quick review of the backpropagation method, see this short video:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/qD4YR8MgOs8" frameborder="0" allowfullscreen=""></iframe>
<h2 id="the-symmetry-of-the-forward-and-backward-pass">The symmetry of the forward and backward pass</h2>
<p>For a commonly used linear transformation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y &= w x + b \nonumber \\
\theta &= \{w, b\} \nonumber
\end{align} %]]></script>
<p>we typically multiply the backpropagated gradient by the <strong>transpose of the weight matrix</strong>: $w^T$. This is the mathematically correct approach to computing the exact gradient:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_w &= \nabla_y x^T \nonumber \\
\nabla_b &= \nabla_y \nonumber \\
\nabla_x &= w^T \nabla_y \nonumber
\end{align} %]]></script>
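<p>In code, the three gradients above fall out directly. This is a hypothetical plain-Python sketch with lists standing in for matrices (the surrounding posts use Julia):</p>

```python
# Backward pass for y = w x + b, following the equations above.
def affine_backward(w, x, grad_y):
    n_out, n_in = len(w), len(x)
    grad_w = [[grad_y[i] * x[j] for j in range(n_in)] for i in range(n_out)]        # grad_y x^T
    grad_b = list(grad_y)                                                           # grad_y
    grad_x = [sum(w[i][j] * grad_y[i] for i in range(n_out)) for j in range(n_in)]  # w^T grad_y
    return grad_w, grad_b, grad_x

w = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
grad_y = [0.5, -0.5]
gw, gb, gx = affine_backward(w, x, grad_y)
print(gw)  # [[0.5, -0.5], [-0.5, 0.5]]
print(gb)  # [0.5, -0.5]
print(gx)  # [-1.0, -1.0]
```

<p>The last line, <code>grad_x</code>, is the multiplication by $w^T$ that random feedback will replace.</p>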
<p>All was right with the world, until I read <a href="https://arxiv.org/abs/1411.0247?" title="Lillicrap et al: Random feedback weights support learning in deep neural networks (2014)">Random feedback weights support learning in deep neural networks</a>. In it, Lillicrap et al describe how you can replace $w^T$ with <strong>a random and fixed matrix</strong> $B$ during the backward pass so that:</p>
<script type="math/tex; mode=display">\nabla_x = B \nabla_y</script>
<p>And the network will still learn. WTF?! How could you possibly learn something from random feedback? It was as if I was no longer Mr. Anderson, but instead some dude sitting in goo with tubes coming out of my body. Why did I just choose the red pill?</p>
<p><img src="https://media.licdn.com/mpr/mpr/shrinknp_800_800/AAEAAQAAAAAAAAQSAAAAJDJmMjgwNjU1LWFmYzMtNDM5Ny1hZjUwLWQ5NjBlNWUzYzc4Nw.jpg" alt="" /></p>
<p>To gain intuition about how random feedback can still support learning, I had to step back and consider the function of neural nets. First there are many <strong>exactly equivalent</strong> formulations of a neural net. Just reorder the nodes of the hidden layers (and the corresponding rows of the weight matrix and bias vector and the columns of downstream weights), and you’ll get the exact same output. If you also allow small offsetting perturbations in the weights and biases, the final output will be relatively close (in the non-rigorous sense). There are countless ways to alter the parameters in order to produce effectively equivalent models.</p>
<p>Intuitively we may think that we are optimizing a function like:</p>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20537722/2c43a9f0-b0bc-11e6-90ad-8a14d2714b12.gif" alt="" /></p>
<p>But in reality, we are optimizing a function like:</p>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20537708/1560c1a0-b0bc-11e6-947c-7600cfc96446.gif" alt="" /></p>
<p>Vanilla backpropagation optimizes towards the surface valley (local minimum) closest to the initial random weights. Random feedback first randomly picks a valley and instead optimizes towards that. In the words of <a href="https://arxiv.org/abs/1411.0247?" title="Lillicrap et al: Random feedback weights support learning in deep neural networks (2014)">Lillicrap et al</a>:</p>
<blockquote>
<p>The network learns how to learn – it gradually discovers how to use B, which then allows effective modification of the hidden units. At first, the updates to the hidden layer are not helpful, but they quickly improve by an implicit feedback process that alters W so that $e^T W B e > 0$.</p>
</blockquote>
<p>We implicitly learn how to make W approximately symmetric with B before searching for the local minimum. Wow.</p>
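<p>The effect is easy to reproduce on a toy problem. Below is a hypothetical plain-Python sketch (illustrative only, not the paper’s setup): a two-layer <em>linear</em> network where the hidden layer’s update uses a fixed random matrix <code>B</code> in place of $W_2^T$, and the loss still falls.</p>

```python
import random

random.seed(0)

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def rand_mat(n, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(n)] for _ in range(n)]

n = 3
W1, W2, B = rand_mat(n), rand_mat(n), rand_mat(n)  # B is fixed, never updated
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
targets = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]  # a permutation

def total_loss():
    return sum(sum((yi - ti) ** 2 for yi, ti in zip(matvec(W2, matvec(W1, x)), t))
               for x, t in zip(inputs, targets))

lr = 0.05
start = total_loss()
for _ in range(300):
    for x, t in zip(inputs, targets):
        h = matvec(W1, x)
        y = matvec(W2, h)
        e = [yi - ti for yi, ti in zip(y, t)]  # dL/dy (up to a factor of 2)
        dh = matvec(B, e)                      # random feedback instead of W2^T e
        for i in range(n):
            for j in range(n):
                W2[i][j] -= lr * e[i] * h[j]
                W1[i][j] -= lr * dh[i] * x[j]
print(start, total_loss())  # the loss still falls with random fixed feedback
```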
<h2 id="relationship-to-dropout">Relationship to Dropout</h2>
<p>Trying to intuit random feedback felt similar to my process of understanding <a href="https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf" title="Srivastava et al: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014)">Dropout</a>. In a nutshell, Dropout simulates an ensemble of $2^n$ sub-models, each composed of a subset of the nodes of the original network. It <strong>avoids overfitting by spreading the responsibility</strong>. Except it’s not a real ensemble (where models are typically learned independently), but instead some sort of co-learned ensemble. The net effect is a less powerful, but usually more general, network. The explanation frequently given for why Dropout is beneficial is: “<strong>it prevents co-adaptation of latent features</strong>”. Put another way, it prevents a network from acquiring weights that force a dependence between latent (hidden) features.</p>
<p>In backpropagation, the gradients of early layers can vanish or explode unless the weights and gradients of later layers are highly controlled. In a sense, we see <strong>co-adaptation of weights across layers</strong>. The weights of earlier layers are highly dependent on the weights of later layers (and vice-versa). Dropout reduces co-adaptation among <strong>intra-layer neurons</strong>, and random feedback may help reduce co-adaptation among <strong>inter-layer weights</strong>.</p>
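<p>The Dropout mechanism itself is tiny. A hypothetical plain-Python sketch of the common “inverted” variant: each unit is kept with probability $p$ and scaled by $1/p$, so activations are unchanged in expectation and no rescaling is needed at test time.</p>

```python
import random

random.seed(2)

def dropout(h, p=0.5):
    # Keep each unit with probability p, scaled by 1/p (inverted dropout).
    return [hi / p if random.random() < p else 0.0 for hi in h]

h = [1.0, 2.0, 3.0, 4.0]
n = 20000
mean = [0.0] * len(h)
for _ in range(n):
    for i, di in enumerate(dropout(h)):
        mean[i] += di / n
print(mean)  # each entry is close to the corresponding entry of h
```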
<h2 id="benefits-of-breaking-the-symmetry">Benefits of breaking the symmetry</h2>
<p>Some will claim that random feedback is more biologically plausible, and this is true. However, the real benefit comes from breaking the fragile dependence of early-layer weight updates on later-layer weights. The construction of the fixed feedback matrices ${B_1, B_2, …, B_M}$ can be managed as additional hyperparameters, with the potential to stabilize the learning process.</p>
<h2 id="next">Next</h2>
<p>Random feedback is just the tip of the iceberg when it comes to backprop-free learning methods (though random feedback is more a slight modification of backprop). In part 2, I’ll review more research directions that have been explored, and highlight some particularly interesting recent methods which I intend to extend and incorporate into the <a href="/JuliaML-and-Plots">JuliaML</a> ecosystem.</p>
<h1 id="transformations-video-internals">JuliaML Transformations: Internal Design</h1>
<p><em>2016-11-21</em> · <a href="http://www.breloff.com/transformations-video-internals">http://www.breloff.com/transformations-video-internals</a></p>
<p>In this video post, I expand on my <a href="/transformations">introduction to Transformations</a> and show the core idea behind the design: namely that each transformation has a black-box representation of input, output, and (optionally) parameters which are vectors in contiguous storage. Julia’s excellent type system and efficient array views allow for very convenient and intuitive structures.</p>
<hr />
<iframe width="560" height="315" src="https://www.youtube.com/embed/yscT_P0k-Bs" frameborder="0" allowfullscreen=""></iframe>
<p>For questions, comments, or if you’re interested in collaborating, please join the <a href="https://gitter.im/JuliaML/chat">JuliaML Gitter chat</a>.</p>
<h1 id="plots-video">Plots Tutorial: Ecosystem and Pipeline</h1>
<p><em>2016-11-21</em> · <a href="http://www.breloff.com/plots-video">http://www.breloff.com/plots-video</a></p>
<p>Plots is a complex and powerful piece of software, with features and functionality that many probably don’t realize. In this video tutorial, I try to explain where Plots fits into the Julia landscape and how Plots turns a simple command into a beautiful visualization.</p>
<hr />
<iframe width="560" height="315" src="https://www.youtube.com/embed/Iof7Ccm8UiM" frameborder="0" allowfullscreen=""></iframe>
<p>If you have questions or want to request video tutorials on other topics, <a href="https://gitter.im/tbreloff/Plots.jl">come chat</a>.</p>
<h1 id="layernorm">Online Layer Normalization: Derivation of Analytical Gradients</h1>
<p><em>2016-11-15</em> · <a href="http://www.breloff.com/layernorm">http://www.breloff.com/layernorm</a></p>
<p><a href="https://arxiv.org/abs/1607.06450">Layer Normalization</a> is a technique developed by Ba, Kiros, and Hinton for normalizing neural network layers as a whole (as opposed to Batch Normalization and variants which normalize per-neuron). In this post I’ll show my derivation of analytical gradients for Layer Normalization using an online/incremental weighting of the estimated moments for the layer.</p>
<hr />
<h3 id="background-and-notation">Background and Notation</h3>
<p>Training deep neural networks (and likewise recurrent networks, which are deep through time) with gradient descent has been a difficult problem, partially (mostly) due to the issue of <a href="http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf">vanishing and exploding gradients</a>. One solution is to normalize layer activations, and learn the bias (b) and gain (g) as part of the learning algorithm. Online layer normalization can be summed up as learning parameter arrays <strong>g</strong> and <strong>b</strong> in the learnable transformation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
y &= g \odot z + b \\
\text{where } z_o &= \frac{a_o - \mu_t}{\sigma_t} \\
a_o &= \sum_i{w_{oi} x_i} \\
\mu_t &= \alpha_t (\frac{1}{D} \sum_p{a_p}) + (1-\alpha_t) \mu_{t-1} \\
\sigma_t &= \alpha_t \sqrt{\frac{1}{D-1} \sum_p{(a_p - \mu_t)^2}} + (1-\alpha_t) \sigma_{t-1}
\end{align} %]]></script>
<p>The vector <strong>a</strong> is the input to our <strong>LayerNorm</strong> layer and the result of a <strong>Linear</strong> transformation of <strong>x</strong>. We keep a running mean ($\mu_t$) and standard deviation ($\sigma_t$) of <strong>a</strong> using a time-varying weighting factor ($\alpha_t$).</p>
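<p>The forward computation above is compact enough to sketch directly. A hypothetical plain-Python illustration (the post’s derivation targets the Julia Transformations package; names here are made up), with $\alpha_t = 1$ so the running moments reduce to plain batch statistics:</p>

```python
import math

# Running-moment update: mu_t = alpha*mean(a) + (1-alpha)*mu_{t-1}, and
# similarly for sigma_t, matching the equations above (note the 1/(D-1)).
def update_moments(a, mu_prev, sigma_prev, alpha):
    d = len(a)
    mean = sum(a) / d
    var = sum((ai - mean) ** 2 for ai in a) / (d - 1)
    mu = alpha * mean + (1.0 - alpha) * mu_prev
    sigma = alpha * math.sqrt(var) + (1.0 - alpha) * sigma_prev
    return mu, sigma

def layernorm(a, mu, sigma, g, b):
    # y = g .* z + b with z = (a - mu) / sigma
    return [gi * (ai - mu) / sigma + bi for ai, gi, bi in zip(a, g, b)]

a = [2.0, 4.0, 6.0]
mu, sigma = update_moments(a, 0.0, 1.0, alpha=1.0)  # alpha = 1: plain batch stats
print(mu, sigma)  # 4.0 2.0
print(layernorm(a, mu, sigma, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # [-1.0, 0.0, 1.0]
```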
<h3 id="derivation">Derivation</h3>
<p>Due mostly to LaTeX-laziness, I present the derivation in scanned form. A PDF version can be found <a href="/images/layernorm/layernorm_derivation.pdf">here</a>.</p>
<p><img src="/images/layernorm/layernorm-0.png" alt="" /></p>
<p><img src="/images/layernorm/layernorm-1.png" alt="" /></p>
<p><img src="/images/layernorm/layernorm-2.png" alt="" /></p>
<p><img src="/images/layernorm/layernorm-3.png" alt="" /></p>
<p><img src="/images/layernorm/layernorm-4.png" alt="" /></p>
<p><img src="/images/layernorm/layernorm-5.png" alt="" /></p>
<h3 id="summary">Summary</h3>
<p>Layer normalization is a nice alternative to batch or weight normalization. With this derivation, we can include it as a standalone <a href="/transformations">learnable transformation</a> as part of a larger network. In fact, this is already accessible using the <code class="highlighter-rouge">nnet</code> convenience constructor in Transformations:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">using</span> <span class="n">Transformations</span>
<span class="n">nin</span><span class="x">,</span> <span class="n">nout</span> <span class="o">=</span> <span class="mi">3</span><span class="x">,</span> <span class="mi">5</span>
<span class="n">nhidden</span> <span class="o">=</span> <span class="x">[</span><span class="mi">4</span><span class="x">,</span><span class="mi">5</span><span class="x">,</span><span class="mi">4</span><span class="x">]</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">nnet</span><span class="x">(</span><span class="n">nin</span><span class="x">,</span> <span class="n">nout</span><span class="x">,</span> <span class="n">nhidden</span><span class="x">,</span> <span class="x">:</span><span class="n">relu</span><span class="x">,</span> <span class="x">:</span><span class="n">logistic</span><span class="x">,</span> <span class="n">layernorm</span> <span class="o">=</span> <span class="n">true</span><span class="x">)</span>
</code></pre>
</div>
<p>Network:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Chain{Float64}(
Linear{3-->4}
LayerNorm{n=4, mu=0.0, sigma=1.0}
relu{4}
Linear{4-->5}
LayerNorm{n=5, mu=0.0, sigma=1.0}
relu{5}
Linear{5-->4}
LayerNorm{n=4, mu=0.0, sigma=1.0}
relu{4}
Linear{4-->5}
LayerNorm{n=5, mu=0.0, sigma=1.0}
logistic{5}
)
</code></pre>
</div>
<h1 id="transformations">Transformations: Modular tensor computation graphs in Julia</h1>
<p><em>2016-11-14</em> · <a href="http://www.breloff.com/transformations">http://www.breloff.com/transformations</a></p>
<p>In this post I’ll try to summarize my design goals with the <a href="https://github.com/JuliaML/Transformations.jl">Transformations</a> package for the <a href="https://github.com/JuliaML">JuliaML ecosystem</a>. Transformations should be seen as a modular and higher-level approach to building complex tensor computation graphs, similar to those you may build in TensorFlow or Theano. The major reason for designing this package from the ground up lies in the flexibility of a pure-Julia implementation for new research paths. If you want to apply convolutional neural nets to identify cats in pictures, this is not the package for you. My focus is on complex, real-time, and incremental algorithms for learning from temporal data. I want the ability to track learning progress in real time, and to build workflows and algorithms that don’t require a GPU server farm.</p>
<hr />
<h3 id="what-is-a-tensor-computation-graph">What is a tensor computation graph?</h3>
<p>A <strong>computation graph</strong>, or data flow graph, is a representation of math equations using a directed graph of nodes and edges. Here’s a simple example using Mike Innes’ cool package <a href="https://github.com/MikeInnes/DataFlow.jl">DataFlow</a>. I’ve built a <a href="https://github.com/JuliaML/Transformations.jl/blob/master/src/scratch/flow.jl">recipe</a> for converting DataFlow graphs to <a href="/Graphs">PlotRecipes graphplot</a> calls. See the <a href="https://github.com/tbreloff/notebooks/blob/master/transformations.ipynb">full notebook</a> for complete Julia code.</p>
<p>We’ll compute $f(x) = w * x + b$:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">g</span> <span class="o">=</span> <span class="nd">@flow</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="n">w</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">b</span>
<span class="n">plot</span><span class="x">(</span><span class="n">g</span><span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20271931/657d34d8-aa5a-11e6-81f7-80973266c16e.png" alt="" /></p>
<p>The computation graph is a <strong>graphical representation of the flow of mathematical calculations</strong> to compute a function. Follow the arrows, and do the operations on the inputs. First we multiply w and x together, then we add b to the result. The result of the addition is the output of the function f.</p>
<p>When x/w/b are numbers, this computation flow is perfectly easy to follow. But when they are <a href="https://en.wikipedia.org/wiki/Tensor">tensors</a>, the graph is much more complicated. Here’s the same example for a 1D, 2-element version where w is a 2x2 weight matrix and x and b are 2x1 column vectors:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">plot</span><span class="x">(</span><span class="nd">@flow</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="n">out</span><span class="x">(</span><span class="n">w11</span><span class="o">*</span><span class="n">x1</span> <span class="o">+</span> <span class="n">w12</span><span class="o">*</span><span class="n">x2</span> <span class="o">+</span> <span class="n">b1</span><span class="x">,</span> <span class="n">w21</span><span class="o">*</span><span class="n">x1</span> <span class="o">+</span> <span class="n">w22</span><span class="o">*</span><span class="n">x2</span> <span class="o">+</span> <span class="n">b2</span><span class="x">))</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20272655/96fd0978-aa5c-11e6-9274-ce279d63960b.png" alt="" /></p>
<p>Already this computational graph is getting out of hand. A <strong>tensor computation graph</strong> simply re-imagines the vector/matrix computations as the core units, so that the first representation ($f(x) = w * x + b$) is used to represent the tensor math which does a matrix-vector multiply and a vector add.</p>
<h3 id="transformation-graphs">Transformation Graphs</h3>
<p>Making the jump from computation graph to tensor computation graph greatly reduced the complexity and improved our understanding of the underlying operations. This improvement is the core of frameworks like TensorFlow. But we can do better. In the same way a matrix-vector product</p>
<script type="math/tex; mode=display">(W*x)_i = \sum_j W_{ij} x_j</script>
<p>can be represented as simply the vector $Wx$, we can treat the <strong>tensor transformation</strong> $f(x) = wx + b$ as a black box function which takes input vector (x) and produces output vector (f(x)). Parameter nodes (w/b) are considered <strong>learnable parameters</strong> which are internal to the <strong>learnable transformation</strong>. The new <strong>transformation graph</strong> looks like:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">plot</span><span class="x">(</span><span class="nd">@flow</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="n">affine</span><span class="x">(</span><span class="n">x</span><span class="x">))</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20273571/d5014132-aa5f-11e6-8953-a17d6ebad1d2.png" alt="" /></p>
<p>Quite the improvement! We have created a modular, black-box representation of our affine transformation, which takes a vector input, multiplies by a weight matrix, adds a bias vector, and produces a vector output:</p>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20273883/edfaa236-aa60-11e6-9c6c-9e8c8945201b.png" alt="" /></p>
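<p>To make the black-box idea concrete, here is a minimal sketch of such a transformation (in Python with NumPy rather than Julia, purely for brevity; the class name and initialization are my own, not part of Transformations):</p>

```python
import numpy as np

class Affine:
    """A learnable transformation y = W*x + b, with parameters hidden inside."""
    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))  # learnable weight matrix
        self.b = np.zeros(n_out)                            # learnable bias vector

    def __call__(self, x):
        # vector in, vector out -- the surrounding graph only sees this edge
        return self.W @ x + self.b

f = Affine(2, 3)
y = f(np.ones(2))   # length-2 input vector -> length-3 output vector
```

<p>The parameters <code>W</code> and <code>b</code> never appear in the graph; only the input and output edges do.</p>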
<p>And here’s the comparison for a basic recurrent network:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">g</span> <span class="o">=</span> <span class="nd">@flow</span> <span class="k">function</span><span class="nf"> net</span><span class="x">(</span><span class="n">x</span><span class="x">)</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">relu</span><span class="x">(</span> <span class="n">Wxh</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">Whh</span><span class="o">*</span><span class="n">hidden</span> <span class="o">+</span> <span class="n">bh</span> <span class="x">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">logistic</span><span class="x">(</span> <span class="n">Why</span><span class="o">*</span><span class="n">hidden</span> <span class="o">+</span> <span class="n">Wxy</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">by</span> <span class="x">)</span>
<span class="k">end</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20276999/d2f9b1aa-aa6c-11e6-9ae6-9004f87abb8d.png" alt="" />
<img src="https://cloud.githubusercontent.com/assets/933338/20277271/ddc70276-aa6d-11e6-94d6-50d9a6921075.png" alt="" /></p>
<h3 id="but-tensorflow">But… TensorFlow</h3>
<p>The unfortunate climate of ML/AI research is: “TensorFlow is love, TensorFlow is life”. If it can’t be built into a TF graph, it’s not worth researching. I think we’re in a bit of a deep learning hype bubble at the moment. Lots of time and money is poured into hand-designed networks with (sometimes arbitrary) researcher-chosen hyperparameters and algorithms. Your average high school student can install a framework and build a neural network that produces quite complex and impressive models. But this is not the path to human-level intelligence. You <strong>could</strong> represent the human brain by a fully connected deep recurrent neural network with a billion neurons and a quintillion connections, but I don’t think NVIDIA has built that GPU yet.</p>
<p>I believe that researchers need a more flexible framework to build, test, and train complex approaches. I want to make it easy to explore spiking neural nets, dynamically changing structure, evolutionary algorithms, and anything else that may get us closer to human intelligence (and beyond). See my <a href="/Efficiency-is-key">first post on efficiency</a> for a little more background on my perspective. JuliaML is my playground for exploring the future of AI research. TensorFlow and competitors are solutions to a very specific problem: gradient-based training of static tensor graphs. We need to break the cycle that research should only focus on solutions that fit (or can be hacked into) this paradigm. Transformations (the Julia package) is a generalization of tensor computation that should be able to support other paradigms and new algorithmic approaches to learning, though the full JuliaML design is one which empowers researchers to approach the problem from any perspective they see fit.</p>Tom BreloffIn this post I’ll try to summarize my design goals with the Transformations package for the JuliaML ecosystem. Transformations should be seen as a modular and higher-level approach to building complex tensor computation graphs, similar to those you may build in TensorFlow or Theano. The major reason for designing this package from the ground up lies in the flexibility of a pure-Julia implementation for new research paths. If you want to apply convolutional neural nets to identify cats in pictures, this is not the package for you. My focus is in complex, real-time, and incremental algorithms for learning from temporal data. 
I want the ability to track learning progress in real time, and to build workflows and algorithms that don’t require a GPU server farm.Visualizing Graphs in Julia using Plots and PlotRecipes2016-11-11T00:00:00+00:002016-11-11T00:00:00+00:00http://www.breloff.com/Graphs<p>In this short post, I hope to introduce you to basic visualization of graphs (nodes connected by edges) when using <a href="https://github.com/tbreloff/Plots.jl">Plots</a> in Julia. The intention is that visualizing a graph is as simple as inputting the connectivity structure, and optionally setting a ton of attributes that define the layout, labeling, colors, and more. Nodes are markers, and edges are lines. With this understanding, we can apply common Plots attributes as we see fit.</p>
<hr />
<h3 id="setup">Setup</h3>
<p>First, you’ll want to get a working setup of <a href="https://github.com/tbreloff/Plots.jl">Plots</a> and <a href="https://github.com/JuliaPlots/PlotRecipes.jl">PlotRecipes</a>:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="c"># for pkg in ("Plots","PlotRecipes")</span>
<span class="c"># Pkg.add(pkg)</span>
<span class="c"># Pkg.checkout(pkg)</span>
<span class="c"># end</span>
<span class="n">using</span> <span class="n">PlotRecipes</span>
<span class="c"># we'll use the PyPlot backend, and set a couple defaults</span>
<span class="n">pyplot</span><span class="x">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="x">,</span> <span class="n">size</span><span class="o">=</span><span class="x">(</span><span class="mi">800</span><span class="x">,</span><span class="mi">400</span><span class="x">))</span>
</code></pre>
</div>
<h3 id="type-trees">Type Trees</h3>
<p>For our example, we’re going to build a graph of the type hierarchy for a Julia abstract type. We will look at the Integer abstraction, and view it at the center of all supertypes and subtypes of Integer. You can view this demo as <a href="https://github.com/tbreloff/notebooks/blob/master/types_demo.ipynb">a Jupyter notebook</a>.</p>
<p>First, we’ll create a vector of our chosen type (Integer) and all its supertypes (Real, Number, and Any):</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">T</span> <span class="o">=</span> <span class="n">Integer</span>
<span class="n">sups</span> <span class="o">=</span> <span class="x">[</span><span class="n">T</span><span class="x">]</span>
<span class="n">sup</span> <span class="o">=</span> <span class="n">T</span>
<span class="k">while</span> <span class="n">sup</span> <span class="o">!=</span> <span class="kt">Any</span>
<span class="n">sup</span> <span class="o">=</span> <span class="n">supertype</span><span class="x">(</span><span class="n">sup</span><span class="x">)</span>
<span class="n">unshift!</span><span class="x">(</span><span class="n">sups</span><span class="x">,</span><span class="n">sup</span><span class="x">)</span>
<span class="k">end</span>
</code></pre>
</div>
<p>Next we will build the graph connectivity and node labels. <code class="highlighter-rouge">source</code> and <code class="highlighter-rouge">destiny</code> are lists of node indices defining the edges: the i-th edge connects node <code class="highlighter-rouge">source[i]</code> to node <code class="highlighter-rouge">destiny[i]</code>.</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">n</span> <span class="o">=</span> <span class="n">length</span><span class="x">(</span><span class="n">sups</span><span class="x">)</span>
<span class="n">nodes</span><span class="x">,</span> <span class="n">source</span><span class="x">,</span> <span class="n">destiny</span> <span class="o">=</span> <span class="n">copy</span><span class="x">(</span><span class="n">sups</span><span class="x">),</span> <span class="n">collect</span><span class="x">(</span><span class="mi">1</span><span class="x">:</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="x">),</span> <span class="n">collect</span><span class="x">(</span><span class="mi">2</span><span class="x">:</span><span class="n">n</span><span class="x">)</span>
<span class="k">function</span><span class="nf"> add_subs</span><span class="o">!</span><span class="x">(</span><span class="n">T</span><span class="x">,</span> <span class="n">supidx</span><span class="x">)</span>
<span class="k">for</span> <span class="n">sub</span> <span class="k">in</span> <span class="n">subtypes</span><span class="x">(</span><span class="n">T</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">nodes</span><span class="x">,</span> <span class="n">sub</span><span class="x">)</span>
<span class="n">subidx</span> <span class="o">=</span> <span class="n">length</span><span class="x">(</span><span class="n">nodes</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">source</span><span class="x">,</span> <span class="n">supidx</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">destiny</span><span class="x">,</span> <span class="n">subidx</span><span class="x">)</span>
<span class="n">add_subs!</span><span class="x">(</span><span class="n">sub</span><span class="x">,</span> <span class="n">subidx</span><span class="x">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">add_subs!</span><span class="x">(</span><span class="n">T</span><span class="x">,</span> <span class="n">n</span><span class="x">)</span>
<span class="n">names</span> <span class="o">=</span> <span class="n">map</span><span class="x">(</span><span class="n">string</span><span class="x">,</span> <span class="n">nodes</span><span class="x">)</span>
</code></pre>
</div>
<p>Now we will use the connectivity (<code class="highlighter-rouge">source</code> and <code class="highlighter-rouge">destiny</code>) and the node labels (<code class="highlighter-rouge">names</code>) to visualize the graphs.</p>
<h3 id="graphplot">graphplot</h3>
<p>The <code class="highlighter-rouge">graphplot</code> method is a <a href="https://juliaplots.github.io/recipes/">user recipe</a> defined in <a href="https://github.com/JuliaPlots/PlotRecipes.jl">PlotRecipes</a>. It accepts many different inputs to describe the graph structure:</p>
<ul>
<li><code class="highlighter-rouge">source</code> and <code class="highlighter-rouge">destiny</code> lists, with optional <code class="highlighter-rouge">weights</code> for weighted edges</li>
<li><code class="highlighter-rouge">adjlist</code>: a vector of int-vectors, describing the connectivity from each node</li>
<li><code class="highlighter-rouge">adjmat</code>: an adjacency matrix</li>
<li><code class="highlighter-rouge">LightGraphs.Graph</code>: if LightGraphs is installed</li>
</ul>
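<p>These input forms are interchangeable descriptions of the same structure. As a quick language-agnostic illustration (a Python sketch of my own, just to show the relationship), converting a <code class="highlighter-rouge">source</code>/<code class="highlighter-rouge">destiny</code> edge list into an adjacency list is a one-pass loop:</p>

```python
def edges_to_adjlist(source, destiny, n):
    """adjlist[i] lists the nodes that node i+1 connects to (1-based node ids)."""
    adjlist = [[] for _ in range(n)]
    for s, d in zip(source, destiny):
        adjlist[s - 1].append(d)   # 1-based indices, as in the Julia example
    return adjlist

# the supertype chain Any -> Number -> Real -> Integer from the example above:
source, destiny = [1, 2, 3], [2, 3, 4]
adjlist = edges_to_adjlist(source, destiny, 4)   # [[2], [3], [4], []]
```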
<p>We will use the source/destiny/weights method in this post. Let’s see what it looks like without overriding default settings:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">graphplot</span><span class="x">(</span><span class="n">source</span><span class="x">,</span> <span class="n">destiny</span><span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20223068/c233d0b2-a805-11e6-8e0a-44f85b45f71b.png" alt="" /></p>
<hr />
<p>Cool. Now let’s add names to the nodes. Notice that the nodes take on a hexagonal shape and expand to fit the text. There are a few additional attributes you can try: <code class="highlighter-rouge">fontsize</code>, <code class="highlighter-rouge">nodeshape</code>, and <code class="highlighter-rouge">nodesize</code>.</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">graphplot</span><span class="x">(</span><span class="n">source</span><span class="x">,</span> <span class="n">destiny</span><span class="x">,</span> <span class="n">names</span><span class="o">=</span><span class="n">names</span><span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20221332/e93254ec-a7fe-11e6-8ced-0e941a5bf588.png" alt="" /></p>
<hr />
<p>We can also change the layout of the nodes by:</p>
<ul>
<li>using one of the built-in algorithms (spectral, stress, or tree)</li>
<li>extending with <a href="https://github.com/JuliaGraphs/NetworkLayout.jl">NetworkLayout</a></li>
<li>passing an arbitrary layout function to the <code class="highlighter-rouge">func</code> keyword</li>
<li>overriding the x/y/z coordinates yourself (<code class="highlighter-rouge">graphplot(..., x=x, y=y)</code>).</li>
</ul>
<p>As there’s a clear hierarchy to our graph, let’s use the tree method:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">graphplot</span><span class="x">(</span><span class="n">source</span><span class="x">,</span> <span class="n">destiny</span><span class="x">,</span> <span class="n">names</span><span class="o">=</span><span class="n">names</span><span class="x">,</span> <span class="n">method</span><span class="o">=</span><span class="x">:</span><span class="n">tree</span><span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20221351/fbce7e50-a7fe-11e6-831d-b64a63d56275.png" alt="" /></p>
<hr />
<p>The tree layout allows the additional setting of the <code class="highlighter-rouge">root</code> of the tree. Let’s make the graph flow from left to right:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">graphplot</span><span class="x">(</span><span class="n">source</span><span class="x">,</span> <span class="n">destiny</span><span class="x">,</span> <span class="n">names</span><span class="o">=</span><span class="n">names</span><span class="x">,</span> <span class="n">method</span><span class="o">=</span><span class="x">:</span><span class="n">tree</span><span class="x">,</span> <span class="n">root</span><span class="o">=</span><span class="x">:</span><span class="n">left</span><span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20221362/0601b9f0-a7ff-11e6-9f65-85368911f5c8.png" alt="" /></p>
<hr />
<p>All too easy. Finally, let’s give it some color. Remember that we’re building a generic Plots visualization, where nodes are markers and edges are line segments. For more info on Plots, please read through <a href="https://juliaplots.github.io/">the documentation</a>.</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">weights</span> <span class="o">=</span> <span class="n">linspace</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">2</span><span class="x">,</span><span class="n">length</span><span class="x">(</span><span class="n">source</span><span class="x">))</span>
<span class="n">graphplot</span><span class="x">(</span><span class="n">source</span><span class="x">,</span> <span class="n">destiny</span><span class="x">,</span> <span class="n">weights</span><span class="x">,</span>
<span class="n">names</span> <span class="o">=</span> <span class="n">names</span><span class="x">,</span> <span class="n">method</span> <span class="o">=</span> <span class="x">:</span><span class="n">tree</span><span class="x">,</span>
<span class="n">l</span> <span class="o">=</span> <span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="n">cgrad</span><span class="x">()),</span> <span class="c"># apply the default color gradient to the line (line_z values taken from edge weights)</span>
<span class="n">m</span> <span class="o">=</span> <span class="x">[</span><span class="n">node</span><span class="o">==</span><span class="n">T</span> <span class="o">?</span> <span class="x">:</span><span class="n">orange</span> <span class="x">:</span> <span class="x">:</span><span class="n">steelblue</span> <span class="k">for</span> <span class="n">node</span> <span class="k">in</span> <span class="n">nodes</span><span class="x">]</span> <span class="c"># node colors</span>
<span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/20221372/0efabb7e-a7ff-11e6-83ee-65cb176641f7.png" alt="" /></p>
<h3 id="summary">Summary</h3>
<p>Visualizing graphs with PlotRecipes is pretty simple, and it’s easy to customize to your heart’s content, thanks to the flexibility of Plots. In a future post, I’ll use this functionality to view and visualize neural networks using my (work in progress) efforts within <a href="https://github.com/JuliaML">JuliaML</a> and <a href="https://github.com/tbreloff">other projects</a>.</p>Tom BreloffIn this short post, I hope to introduce you to basic visualization of graphs (nodes connected by edges) when using Plots in Julia. The intention is that visualizing a graph is as simple as inputting the connectivity structure, and optionally setting a ton of attributes that define the layout, labeling, colors, and more. Nodes are markers, and edges are lines. With this understanding, we can apply common Plots attributes as we see fit.Deep Reinforcement Learning with Online Generalized Advantage Estimation2016-10-06T00:00:00+00:002016-10-06T00:00:00+00:00http://www.breloff.com/DeepRL-OnlineGAE<p>Deep Reinforcement Learning, or Deep RL, is a really hot field at the moment. If you haven’t heard of it, pay attention. Combining the power of reinforcement learning and deep learning, it is being used to play complex games better than humans, control driverless cars, optimize robotic decisions and limb trajectories, and much more. And we haven’t even gotten started… Deep RL has far reaching applications in business, finance, health care, and many other fields which could be improved with better decision making. It’s the closest (practical) approach we have to <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a>. Seriously… how cool is that? 
In this post, I’ll rush through the basics and terminology in standard reinforcement learning (RL) problems, then review and extend work in Policy Gradient and Actor-Critic methods to derive an online variant of <a href="https://arxiv.org/abs/1506.02438">Generalized Advantage Estimation</a> (GAE) using eligibility traces, which can be used to learn optimal policies for our Deep RL agents.</p>
<hr />
<h3 id="background-and-terminology">Background and Terminology</h3>
<div class="imgcenter">
<img src="https://cloud.githubusercontent.com/assets/933338/20276214/5dd8bc66-aa69-11e6-99c6-81e4a43b4afe.png" /><br />
<em>The RL loop: state, action, reward</em>
</div>
<p>The RL framework: An <strong>agent</strong> senses the current <strong>state</strong> (s) of its <strong>environment</strong>. The agent takes an <strong>action</strong> (a), selected from the <strong>action set</strong> (A), using <strong>policy</strong> (π), and receives an immediate <strong>reward</strong> (r), with the goal of receiving large future <strong>return</strong> (R).</p>
<p>That’s it. Everything else is an implementation detail. The above paragraph can be summed up with the equations below, where $\sim$ means that we randomly sample a value from a probability distribution, and the current discrete time-step is $t$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
a_t &\sim \pi(a_t \mid s_t) \\
a_t &\in A_t \\
r_t, s_{t+1} &\sim p(r_t, s_{t+1} \mid s_t, a_t) \\
\tau_t &= \{ s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, ~ ... \} \\
\tau &= \tau_0 \\
R(\tau_t) &= \sum_{l=0}^\infty{\gamma^l r_{t+l}}
\end{align} %]]></script>
<p>The policy is some function (usually parameterized, sometimes differentiable) which uses the full history of experience of the agent to choose an action after presentation of the new state of the environment. For simplicity, we’ll only consider discrete-time episodic environments, though most of this could be extended or approximated for continuous, infinite-horizon scenarios. We will also only consider the function approximation case, where we will approximate policies and value functions using neural networks that are parameterized by a vector of learnable weights $\theta$.</p>
<p>The difficulty of reinforcement learning, as compared to other sub-fields of machine learning, is the <strong>Credit Assignment Problem</strong>. This is the problem that there are many many actions which lead to any given reward (and many rewards resulting from a single action), and it’s not easy to pick out the “important actions” which led to the good (or bad) reward. With infinite resources and enough samples, the statistics will reveal the truth, but we’d like to make sure an algorithm converges in our lifetime.</p>
<p><img src="https://theclickercenterblog.files.wordpress.com/2015/05/rat-lever-press-cartoon.png" alt="" /></p>
<hr />
<h3 id="approaches-to-rl">Approaches to RL</h3>
<p>The goal with all reinforcement learners is to learn a good policy which will take the best actions in order to generate the highest return. This can be accomplished a few different ways.</p>
<p>Value iteration, temporal difference (TD) learning, and Q-learning approximate the value of states $V(s_t)$ or state-action pairs $Q(s_t, a_t)$, and use the “surprise” from incorrect value estimates to update a policy. These methods can be more flexible for learning off-policy, but can be difficult to apply effectively for continuous action spaces (such as robotics).</p>
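<p>As a minimal illustration of that value-learning “surprise” (a tabular Python sketch of my own, not tied to any particular framework), a TD(0) learner nudges its value estimate toward the bootstrapped target on every transition:</p>

```python
# tabular TD(0): nudge V[s] toward the bootstrapped target r + gamma * V[s_next]
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # the "surprise"
    V[s] += alpha * td_error
    return td_error

V = {"a": 0.0, "b": 0.0}
err = td0_update(V, "a", 1.0, "b")   # V["a"] moves from 0.0 to 0.1
```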
<p>An alternative class of approaches, Policy Gradient methods, models a direct mapping from states to actions. With a little math, one can compute the effect of a parameter $\theta_i$ on the future cumulative returns of the policy. Then to improve our policy, we “simply” adjust our parameters in a direction that will increase the return. In truth, what we’re doing is increasing the probability of choosing good actions, and decreasing the probability of choosing bad actions.</p>
<p>I put “simply” in quotes because, although the math is reasonably straightforward, there are a few practical hurdles to overcome to be able to learn policies effectively. Policy gradient methods can be applied to a wide range of problems, with both continuous and discrete action spaces. In the next section we’ll see a convenient theorem that makes the math of computing parameter gradients tractable.</p>
<p>We can combine value iteration and policy gradients using a framework called Actor-Critic. In this, we maintain an <strong>actor</strong> which typically uses a policy gradient method, and a <strong>critic</strong>, which typically uses some sort of value iteration. The actor never sees the actual reward. Instead, the critic intercepts the rewards from the trajectory and critiques the chosen actions of the actor. This way, the noisy (and delayed) reward stream can be smoothed and summarized for better policy updates. Below, we will assume an Actor-Critic approach, but we will focus only on how to update the parameters of the actor.</p>
<hr />
<h3 id="policy-gradient-theorem">Policy Gradient Theorem</h3>
<p>I’d like to briefly review a “trick” which makes policy gradient methods tractable: the <a href="https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf" title="Sutton et al: Policy Gradient Methods for Reinforcement Learning with Function Approximation (2000)">Policy Gradient Theorem</a>. If we assume there is some mapping of any state/action pair to a real-valued return:</p>
<script type="math/tex; mode=display">f: S \times A \rightarrow \mathbb{R} \\
x \in (S,A) \\
R = f(x)</script>
<p>then we’d like to compute the gradient of the expected total return $E_x[f(x)]$ with respect to $\theta$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta E_x[f(x)] &= \frac{\partial E_x[f(x)]}{\partial \theta} \\
& = \nabla_\theta \int f(x) ~ p(x \mid \theta) ~ dx \\
& = \int f(x) ~ \nabla_\theta p(x \mid \theta) ~ \frac{p(x \mid \theta)}{p(x \mid \theta)} ~ dx \\
& = \int f(x) ~ \nabla_\theta log ~ p(x \mid \theta) ~ p(x \mid \theta) ~ dx \\
& = E_x[f(x) ~ \nabla_\theta log ~ p(x \mid \theta)]
\end{align} %]]></script>
<p>We bring the gradient inside the integral and divide by the probability to get the gradient of log probability (grad-log-prob) term, along with a well-formed expectation. This trick means we don’t actually need to compute the integral equation in order to estimate the total gradient. Additionally, it’s much easier to take the gradient of a log as it decomposes into a sum of terms. This is important, because we only need to sample rewards and compute $\nabla_\theta log ~ p(x \mid \theta)$, which is <strong>dependent solely on our policy and the states/rewards that we see</strong>. There’s no need to understand or estimate the transition probability distribution of the underlying environment! (this is called “model-free reinforcement learning”.) Check out <a href="https://www.youtube.com/watch?v=oPGVsoBonLM">John Schulman’s lectures</a> for a great explanation and more detail.</p>
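<p>A quick numerical sanity check of this identity (an illustrative Python sketch of my own; the distribution and scores are made up): for a small categorical “policy” parameterized by a softmax, the sampled grad-log-prob estimate converges to the exact gradient of $E_x[f(x)]$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])
f = np.array([1.0, 3.0, -2.0])   # return for each of 3 discrete outcomes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)
# exact gradient: d/dtheta_j E[f] = p_j * (f_j - E[f]), written as a matrix product
exact = f @ (np.diag(p) - np.outer(p, p))

# score-function estimator: average f(x) * grad_theta log p(x | theta) over samples
samples = rng.choice(3, size=200_000, p=p)
grads = np.eye(3)[samples] - p                         # grad log p for each sample
estimate = (f[samples][:, None] * grads).mean(axis=0)  # approaches `exact`
```

<p>No model of the environment appears anywhere: only sampled outcomes and the gradient of the log probability under the policy.</p>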
<hr />
<h3 id="improvements-to-the-policy-gradient-formula">Improvements to the policy gradient formula</h3>
<p>Now we have this easy-to-calculate formula to estimate the gradient, so we can just plug in the formulas and feed it into a gradient descent optimization, and we’ll magically learn optimal parameters… right? Right?!?</p>
<p>Sadly, due to weak correlation of actions to rewards resulting from the credit assignment problem, our gradient estimates will be extremely weak and noisy (the signal to noise ratio will be small, and variance of the gradient will be high), and convergence will take a while (unless it diverges).</p>
<p>One improvement we can make is to realize that, when deciding which actions to select, we only care about the <strong>relative difference in state-action values</strong>. Suppose we’re in a really awesome state, and no matter what we do we’re destined to get a reward of at least 100, but there’s a single action which would allow us to get 101. In this case we care only about that difference: <script type="math/tex">101 - 100 = 1</script>.</p>
<p>This is the intuition behind the <strong>advantage function</strong> $A^\pi(s,a)$ for a policy $\pi$. It is defined as the difference between the <strong>state-action value function</strong> $Q^\pi(s,a)$ and the <strong>state value function</strong> $V^\pi(s)$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
V^\pi(s_t) &= E^\pi[R(\tau_t) \mid s_t] \\
Q^\pi(s_t, a_t) &= E^\pi[R(\tau_t) \mid s_t, a_t] \\
A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t)
\end{align} %]]></script>
<p>Going back to the policy gradient formula, we <a href="https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf" title="Sutton et al: Policy Gradient Methods for Reinforcement Learning with Function Approximation (2000)">can subtract a zero-expectation <strong>baseline</strong></a> $b_t(s_t)$ from our score function $f(x)$ without changing the expectation. If we choose the score function to be the state-action value function $Q^\pi(s,a)$, and the baseline to be the state value function $V^\pi(s)$, then the policy gradient $g$ is of the form:</p>
<script type="math/tex; mode=display">\begin{align}
g = E[\sum_{t=0}^\infty{A^\pi(s_t, a_t) ~ \nabla_\theta log ~ \pi^\theta(a_t \mid s_t)}]
\end{align}</script>
<p>The intuition with this formula is that we wish to increase the probability of better-than-average actions, and decrease the probability of worse-than-average actions. We use a discount factor $\gamma$ to control the impact of our value estimation. With $\gamma$ near 0, the return is dominated by the immediate reward. With $\gamma$ near 1, we will approximate the sum of all future rewards. If episodes are very long (or infinite), we may need $\gamma < 1$ for tractability/convergence.</p>
<p>This formula is an example of an Actor-Critic algorithm, where the policy $\pi$ (the <strong>actor</strong>) adjusts its parameters by using the “advice” of a <strong>critic</strong>. In this case we use an estimate of the advantage to critique our policy choices.</p>
<hr />
<h3 id="generalized-advantage-estimator-gae">Generalized Advantage Estimator (GAE)</h3>
<p><a href="https://arxiv.org/abs/1506.02438" title="Schulman et al: High-Dimensional Continuous Control using Generalized Advantage Estimation (2015)">Schulman et al</a> use a discounted sum of TD residuals:</p>
<script type="math/tex; mode=display">\begin{align}
\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)
\end{align}</script>
<p>and compute an estimator of the k-step discounted advantage:</p>
<script type="math/tex; mode=display">\begin{align}
\hat{A}_t^{(k)} = \sum_{l=0}^{k-1}{\gamma^l \delta_{t+l}^V}
\end{align}</script>
<p>Note: it seems that equation 14 from their paper has an incorrect subscript on $\delta$.</p>
<p>They define their generalized advantage estimator (GAE) as the weighted average of the advantage estimators above, which reduces to a sum of discounted TD residuals:</p>
<script type="math/tex; mode=display">\begin{align}
\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^\infty{(\gamma\lambda)^l \delta_{t+l}^V}
\end{align}</script>
<p>This generalized estimator of the advantage function allows a trade-off of bias vs variance using the parameter $0 \leq \lambda \leq 1$, similar to <a href="http://webdocs.cs.ualberta.ca/~sutton/papers/sutton-88-with-erratum.pdf" title="Sutton: Learning to Predict by the Methods of Temporal Differences (1988)">TD(λ)</a>. For $\lambda = 0$, the estimator reduces to the single TD residual: low variance, but biased whenever the value estimate $V$ is inaccurate. As we increase $\lambda$ towards 1, we reduce the bias of our estimator but increase the variance.</p>
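<p>For a finite trajectory, this estimator can be computed in a single backward pass using the recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\hat{A}_{t+1}$, which follows directly from the discounted sum. An illustrative Python sketch (my own, with made-up rewards and value estimates, not code from the paper):</p>

```python
import numpy as np

gamma, lam = 0.99, 0.95
rng = np.random.default_rng(1)
T = 50
r = rng.normal(size=T)        # rewards (made up for the demo)
V = rng.normal(size=T + 1)    # value estimates; V[T] bootstraps past the horizon

delta = r + gamma * V[1:] - V[:-1]   # TD residuals delta_t^V

# single backward pass: A_t = delta_t + (gamma * lam) * A_{t+1}
A = np.zeros(T)
acc = 0.0
for t in reversed(range(T)):
    acc = delta[t] + gamma * lam * acc
    A[t] = acc
```

<p>The backward pass computes exactly the truncated double sum $\sum_l (\gamma\lambda)^l \delta_{t+l}^V$ for every $t$ in O(T) time.</p>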
<hr />
<h3 id="online-gae">Online GAE</h3>
<p>We’ve come a long way. We now have a low(er)-variance approximation to the true policy gradient. In github speak: <code class="highlighter-rouge">:tada:</code> But for problems I care about (for example high frequency trading strategies), it’s not very practical to compute forward-looking infinite-horizon returns before performing a gradient update step.</p>
<p>In this section, I’ll derive an online formula which is equivalent to the GAE policy gradient above, but which uses <strong>eligibility traces</strong> of the inner gradient of log probabilities to <strong>compute a gradient estimation on every reward, as it arrives</strong>. Not only will this prove to be more efficient, but there will be a massive savings in memory requirements and compute resources, at the cost of a slightly more complex learning process. (Luckily, <a href="/JuliaML-and-Plots">I code in Julia</a>)</p>
<p>Notes: See <a href="http://lasa.epfl.ch/publications/uploadedFiles/Waw13.pdf" title="Wawrzyński et al: Autonomous reinforcement learning with experience replay (2013)">Wawrzyński et al</a> for an alternate derivation. These methods using eligibility traces are closely related to <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf" title="Williams: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (1992)">REINFORCE</a>.</p>
<p>Let’s get started. We are trying to reorganize the many terms of the policy gradient formula so that the gradient is of the form: $g = E^\pi[\sum_{t=0}^\infty{r_t \psi_t}]$, where $\psi_t$ can depend only on the states, actions, and rewards that occurred <strong>before</strong> (or immediately after) the arrival of $r_t$. We will solve for an online estimator of the policy gradient $\hat{g}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{g} &= \sum_{t=0}^\infty{\hat{A}_t^{GAE(\gamma,\lambda)} ~ \nabla_\theta log ~ \pi^\theta(a_t \mid s_t)} \\
& = \sum_{t=0}^\infty{ \nabla_\theta log ~ \pi^\theta(a_t \mid s_t) ~ \sum_{l=0}^\infty{(\gamma\lambda)^l \delta_{t+l}^V}}
\end{align} %]]></script>
<p>In order to simplify the derivation, I’ll introduce the following shorthand:</p>
<script type="math/tex; mode=display">\begin{align}
\nabla_t := \nabla_\theta log ~ \pi^\theta(a_t \mid s_t)
\end{align}</script>
<p>and then expand the sum:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{g} &= \nabla_0 (\delta_0^V + (\gamma\lambda)\delta_1^V + (\gamma\lambda)^2\delta_2^V + ~ ... ) \nonumber \\
& ~~ + \nabla_1 (\delta_1^V + (\gamma\lambda)\delta_2^V + (\gamma\lambda)^2\delta_3^V + ~ ... ) \\
& ~~ + \nabla_2 (\delta_2^V + (\gamma\lambda)\delta_3^V + (\gamma\lambda)^2\delta_4^V + ~ ... ) \nonumber \\
& ~~ + ~ ... \nonumber
\end{align} %]]></script>
<p>and collect the $\delta_t^V$ terms:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{g} &= \delta_0^V \nabla_0 \nonumber \\
& ~~ + \delta_1^V ( \nabla_1 + (\gamma\lambda)\nabla_0 ) \\
& ~~ + \delta_2^V ( \nabla_2 + (\gamma\lambda)\nabla_1 + (\gamma\lambda)^2\nabla_0 ) \nonumber \\
& ~~ + ~ ... \nonumber
\end{align} %]]></script>
<p>and summarize:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\hat{g} &= \sum_{t=0}^\infty{\delta_t^V \sum_{l=0}^t{(\gamma\lambda)^l \nabla_{t-l}}}
\end{align} %]]></script>
<p>If we define our eligibility trace as the inner sum in that equation:</p>
<script type="math/tex; mode=display">\begin{align}
\epsilon_t := \sum_{l=0}^t{(\gamma\lambda)^l \nabla_{t-l}}
\end{align}</script>
<p>and convert to a recursive formula:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\epsilon_0 &:= \nabla_0 \\
\epsilon_t &:= (\gamma\lambda) \epsilon_{t-1} + \nabla_t
\end{align} %]]></script>
<p>then we have our online generalized advantage estimator for the policy gradient:</p>
<script type="math/tex; mode=display">\begin{align}
\hat{g} = \sum_{t=0}^\infty{\delta_t^V \epsilon_t}
\end{align}</script>
<p>So at each time-step, we compute the gradient term $\hat{g}_t = \delta_t^V \epsilon_t$ as the product of the TD(0) error from our critic and the accumulated log-prob gradients of our policy. We could update our parameters online at the end of each episode, or at each step, or in batches, or using some sort of smoothing method.</p>
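<p>As a numeric sanity check (a standalone sketch, with scalar stand-ins for the log-probability gradient vectors $\nabla_t$), the online form $\hat{g} = \sum_t \delta_t^V \epsilon_t$ should agree exactly with the forward-looking GAE form we started from:</p>

```julia
# Online form: accumulate the eligibility trace ε_t = (γλ)ε_{t-1} + ∇_t
# and a per-step gradient contribution δ_t * ε_t.
function online_grad(δ, ∇, γλ)
    ĝ, ε = 0.0, 0.0
    for t in eachindex(δ)
        ε = γλ * ε + ∇[t]
        ĝ += δ[t] * ε
    end
    ĝ
end

# Forward-looking form: each ∇_t weighted by its discounted future residuals.
forward_grad(δ, ∇, γλ) =
    sum(∇[t] * sum(γλ^l * δ[t+l] for l in 0:length(δ)-t) for t in eachindex(δ))

δ = [0.5, -1.0, 0.3, 2.0]   # TD(0) residuals
∇ = [1.0, 2.0, -0.5, 0.25]  # stand-ins for ∇_θ log π(a_t | s_t)
online_grad(δ, ∇, 0.855) ≈ forward_grad(δ, ∇, 0.855)  # true
```

The online version touches each reward exactly once, so it needs only the running trace rather than the whole episode history.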
<hr />
<h3 id="summary">Summary</h3>
<p>In this post I covered some basic terminology, and summarized the theorems and formulas that comprise Generalized Advantage Estimation. We extended GAE to the online setting in order to give us more flexibility when computing parameter updates. The hyperparameter <strong>$\gamma$ controls the effective horizon over which rewards are accumulated</strong>, while the hyperparameter <strong>$\lambda$ trades off bias against variance by controlling how much we trust the critic’s value estimates</strong>.</p>
<p>In future posts, I intend to demonstrate how to use formulas 26-28 in practical algorithms to solve problems in robotic simulation and other complex environments. If you need help solving difficult problems with data, or if you see mutual benefit in collaboration, please don’t hesitate to get in touch.</p>Tom BreloffDeep Reinforcement Learning, or Deep RL, is a really hot field at the moment. If you haven’t heard of it, pay attention. Combining the power of reinforcement learning and deep learning, it is being used to play complex games better than humans, control driverless cars, optimize robotic decisions and limb trajectories, and much more. And we haven’t even gotten started… Deep RL has far-reaching applications in business, finance, health care, and many other fields which could be improved with better decision making. It’s the closest (practical) approach we have to AGI. Seriously… how cool is that? In this post, I’ll rush through the basics and terminology in standard reinforcement learning (RL) problems, then review and extend work in Policy Gradient and Actor-Critic methods to derive an online variant of Generalized Advantage Estimation (GAE) using eligibility traces, which can be used to learn optimal policies for our Deep RL agents.Machine Learning and Visualization in Julia2016-09-23T00:00:00+00:002016-09-23T00:00:00+00:00http://www.breloff.com/JuliaML-and-Plots<p>In this post, I’ll introduce you to the <a href="http://julialang.org/">Julia programming language</a> and a couple of long-term projects of mine: <a href="https://juliaplots.github.io/">Plots</a> for easily building complex data visualizations, and <a href="https://github.com/JuliaML">JuliaML</a> for machine learning and AI. After short introductions to each, we’ll quickly throw together some custom code to build and visualize the training of an artificial neural network. Julia is fast, but you’ll see that the real speed comes from developer productivity.</p>
<h3 id="update-nov-27-2016">Update (Nov 27, 2016)</h3>
<p>Since the original post there have been several API changes, and the code in this post has either been added to packages (such as MLPlots) or will not run due to name changes. I have created a <a href="https://github.com/tbreloff/notebooks/blob/master/juliaml_mnist.ipynb">Jupyter notebook</a> with updated code.</p>
<h3 id="introduction-to-julia">Introduction to Julia</h3>
<p>Julia is a fantastic, game-changing language. I’ve been coding for 25 years, using mainstays like C, C++, Python, Java, Matlab, and Mathematica. I’ve dabbled in many others: Go, Erlang, Haskell, VB, C#, Javascript, Lisp, etc. For every great thing about each of these languages, there’s something equally bad to offset it. I could never escape the “two-language problem”, which is when you must maintain a multi-language code base to deal with the deficiencies of each language. C can be fast to run, but it certainly isn’t fast to code. The lack of high-level interfaces means you’ll need to do most of your analysis work in another language. For me, that was usually Python. Now… Python can be great, but when it’s not good enough… <strong>ugh</strong>.</p>
<p>Python excels when you want high level and your required functionality already exists. If you want to implement a new algorithm with even minor complexity, you’ll likely need another language. (Yes… Cython is another language.) C is great when you just want to move some bytes around. But as soon as you leave the “sweet spot” of these respective languages, everything becomes prohibitively difficult.</p>
<p>Julia is amazing because you can properly <strong>abstract exactly the right amount</strong>. Write pseudocode and watch it run (and usually fast!) Easily create strongly-typed custom data manipulators. Write a macro to automate generation of your boilerplate code. Use generated functions to produce highly specialized code paths depending on input types. Create your own mini-language for domain-specificity. I often find myself designing solutions to problems that simply should not be attempted in other languages.</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">using</span> <span class="n">Plots</span>
<span class="n">pyplot</span><span class="x">()</span>
<span class="n">labs</span> <span class="o">=</span> <span class="n">split</span><span class="x">(</span><span class="s">"Julia C/C++ Python Matlab Mathematica Java Go Erlang"</span><span class="x">)</span>
<span class="n">ease</span> <span class="o">=</span> <span class="x">[</span><span class="mf">0.8</span><span class="x">,</span> <span class="mf">0.1</span><span class="x">,</span> <span class="mf">0.8</span><span class="x">,</span> <span class="mf">0.7</span><span class="x">,</span> <span class="mf">0.6</span><span class="x">,</span> <span class="mf">0.3</span><span class="x">,</span> <span class="mf">0.5</span><span class="x">,</span> <span class="mf">0.5</span><span class="x">]</span>
<span class="n">power</span> <span class="o">=</span> <span class="x">[</span><span class="mf">0.9</span><span class="x">,</span> <span class="mf">0.9</span><span class="x">,</span> <span class="mf">0.3</span><span class="x">,</span> <span class="mf">0.4</span><span class="x">,</span> <span class="mf">0.2</span><span class="x">,</span> <span class="mf">0.8</span><span class="x">,</span> <span class="mf">0.7</span><span class="x">,</span> <span class="mf">0.5</span><span class="x">]</span>
<span class="n">txts</span> <span class="o">=</span> <span class="n">map</span><span class="x">(</span><span class="n">i</span><span class="o">-></span><span class="n">text</span><span class="x">(</span><span class="n">labs</span><span class="x">[</span><span class="n">i</span><span class="x">],</span> <span class="n">font</span><span class="x">(</span><span class="n">round</span><span class="x">(</span><span class="kt">Int</span><span class="x">,</span> <span class="mi">5</span><span class="o">+</span><span class="mi">15</span><span class="o">*</span><span class="n">power</span><span class="x">[</span><span class="n">i</span><span class="x">]</span><span class="o">*</span><span class="n">ease</span><span class="x">[</span><span class="n">i</span><span class="x">]))),</span> <span class="mi">1</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">labs</span><span class="x">))</span>
<span class="n">scatter</span><span class="x">(</span><span class="n">ease</span><span class="x">,</span> <span class="n">power</span><span class="x">,</span>
<span class="n">series_annotations</span><span class="o">=</span><span class="n">txts</span><span class="x">,</span> <span class="n">ms</span><span class="o">=</span><span class="mi">0</span><span class="x">,</span> <span class="n">leg</span><span class="o">=</span><span class="n">false</span><span class="x">,</span>
<span class="n">xguide</span><span class="o">=</span><span class="s">"Productivity"</span><span class="x">,</span> <span class="n">yguide</span><span class="o">=</span><span class="s">"Power"</span><span class="x">,</span>
<span class="n">formatter</span><span class="o">=</span><span class="n">x</span><span class="o">-></span><span class="s">""</span><span class="x">,</span> <span class="n">grid</span><span class="o">=</span><span class="n">false</span><span class="x">,</span> <span class="n">lims</span><span class="o">=</span><span class="x">(</span><span class="mi">0</span><span class="x">,</span><span class="mi">1</span><span class="x">)</span>
<span class="x">)</span>
</code></pre>
</div>
<p><img src="https://cloud.githubusercontent.com/assets/933338/18838897/0e59c134-83d7-11e6-9019-1c0dd1ecf335.png" alt="" /></p>
<p>I won’t waste time going through Julia basics here. For the new users, there are many <a href="http://julialang.org/learning/">resources for learning</a>. The takeaway is: if you’re reading this post and you haven’t tried Julia, drop what you’re doing and give it a try. With services like <a href="https://juliabox.com/">JuliaBox</a>, you really don’t have an excuse.</p>
<h3 id="introduction-to-plots">Introduction to Plots</h3>
<p><a href="https://github.com/tbreloff/Plots.jl">Plots</a> (and the <a href="https://github.com/JuliaPlots">JuliaPlots</a> ecosystem) is a set of modular tools behind a cohesive interface, which lets you define and manipulate visualizations very simply.</p>
<p>One of its strengths is the varied supported <a href="https://juliaplots.github.io/backends/">backends</a>. Choose text-based plotting from a remote server or real-time 3D simulations. Fast, interactive, lightweight, or complex… all without changing your code. Massive thanks to the creators and developers of the many backend packages, and especially to <a href="https://github.com/jheinen">Josef Heinen</a> and <a href="https://github.com/SimonDanisch">Simon Danisch</a> for their work in integrating the awesome <a href="https://github.com/jheinen/GR.jl">GR</a> and <a href="https://github.com/JuliaGL/GLVisualize.jl">GLVisualize</a> frameworks.</p>
<p>However, more powerful than any individual feature is the concept of <a href="https://juliaplots.github.io/recipes/">recipes</a>. A recipe can be simply defined as <strong>a conversion with attributes</strong>. “User recipes” and “type recipes” can be defined on custom types to enable them to be “plotted” just like anything else. For example, the <code class="highlighter-rouge">Game</code> type in my <a href="https://github.com/tbreloff/AtariAlgos.jl">AtariAlgos</a> package will capture the current screen from an Atari game and display it as an image plot with the simple command <code class="highlighter-rouge">plot(game)</code>:</p>
<p><img src="https://cloud.githubusercontent.com/assets/933338/17670982/8923a2f6-62e2-11e6-943f-bd0a2a7b5c1f.gif" alt="" /></p>
<p>“Series recipes” allow you to build up complex visualizations in a modular way. For example, a histogram recipe will bin data and return a bar plot, while a bar recipe can in turn be defined as a bunch of shapes. The modularity greatly simplifies generic plot design. Using modular recipes, we are able to implement boxplots and violin plots, even when a backend only supports simple drawing of lines and shapes:</p>
<p><img src="https://juliaplots.github.io/examples/img/pyplot/pyplot_example_30.png" alt="" /></p>
<p>To see many more examples of recipes in the wild, check out <a href="https://github.com/JuliaPlots/StatPlots.jl">StatPlots</a>, <a href="https://github.com/JuliaPlots/PlotRecipes.jl">PlotRecipes</a>, and more in the <a href="https://juliaplots.github.io/ecosystem/">wider ecosystem</a>.</p>
<p>For a more complete introduction to Plots, see <a href="https://www.youtube.com/watch?v=LGB8GvAL4HA">my JuliaCon 2016 workshop</a> and read through the <a href="https://juliaplots.github.io/">documentation</a>.</p>
<h3 id="introduction-to-juliaml">Introduction to JuliaML</h3>
<p><a href="https://github.com/JuliaML">JuliaML</a> (Machine Learning in Julia) is a community organization that was formed to brainstorm and design cohesive alternatives for data science. We believe that Julia has the potential to change the way researchers approach science, enabling algorithm designers to truly “think outside the box” (because of the difficulty of implementing non-conventional approaches in other languages). Many of us have independently developed tools for machine learning before contributing. Some of my contributions to the current codebase in JuliaML are copied-from or inspired-by my work in <a href="https://github.com/tbreloff/OnlineAI.jl">OnlineAI</a>.</p>
<p>The recent initiatives in the <a href="https://github.com/JuliaML/Learn.jl">Learn</a> ecosystem (LearnBase, Losses, Transformations, PenaltyFunctions, ObjectiveFunctions, and StochasticOptimization) were spawned during the 2016 JuliaCon hackathon at MIT. Many of us, including <a href="https://github.com/joshday">Josh Day</a>, <a href="https://github.com/ahwillia">Alex Williams</a>, and <a href="https://github.com/Evizero">Christof Stocker</a> (by Skype), stood in front of a giant blackboard and hashed out the general design. Our goal was to provide fast, reliable building blocks for machine learning researchers, and to unify the existing fragmented development efforts.</p>
<ul>
<li><strong><a href="https://github.com/JuliaML/Learn.jl">Learn</a></strong>: The “meta” package for JuliaML, which imports and re-exports many of the packages in the JuliaML organization. This is the easiest way to get everything installed and loaded.</li>
<li><strong><a href="https://github.com/JuliaML/LearnBase.jl">LearnBase</a></strong>: Lightweight method stubs and abstractions. Most packages import (and re-export) the methods and abstract types from LearnBase.</li>
<li><strong><a href="https://github.com/JuliaML/Losses.jl">Losses</a></strong>: A collection of types and methods for computing loss functions for supervised learning. Both distance-based (regression/classification) and margin-based (Support Vector Machine) losses are supported. Optimized methods for working with array data are provided with both allocating and non-allocating versions. This package was originally Evizero/LearnBase.jl. Much of the development is by Christof Stocker, with contributions from Alex Williams and myself.</li>
<li><strong><a href="https://github.com/JuliaML/Transformations.jl">Transformations</a></strong>: Tensor operations with attached storage for values and gradients: activations, linear algebra, neural networks, and more. The concept is that each <code class="highlighter-rouge">Transformation</code> has both input and output <code class="highlighter-rouge">Node</code> for input and output arrays. These nodes implicitly link to storage for the current values and current gradients. Nodes can be “linked” together in order to point to the same underlying storage, which makes it simple to create complex directed graphs of modular computations and perform backpropagation of gradients. A <code class="highlighter-rouge">Chain</code> (generalization of a feedforward neural network) is just an ordered list of sub-transformations with appropriately linked nodes. A <code class="highlighter-rouge">Learnable</code> is a special type of transformation that also has parameters which can be learned. Utilizing Julia’s awesome array abstractions, we can collect params from many underlying transformations into a single vector, and avoid costly copying and memory allocations. I am the primary developer of Transformations.</li>
<li><strong><a href="https://github.com/JuliaML/PenaltyFunctions.jl">PenaltyFunctions</a></strong>: A collection of types and methods for regularization functions (penalties), which are typically part of a total model loss in learning algorithms. Josh Day (creator of the awesome <a href="https://github.com/joshday/OnlineStats.jl">OnlineStats</a>) is the primary developer of PenaltyFunctions.</li>
<li><strong><a href="https://github.com/JuliaML/ObjectiveFunctions.jl">ObjectiveFunctions</a></strong>: Combine transformations, losses, and penalties into an <code class="highlighter-rouge">objective</code>. Much of the interface is shared with Transformations, though this package allows for flexible Empirical Risk Minimization and similar optimization. I am the primary developer on the current implementation.</li>
<li><strong><a href="https://github.com/JuliaML/StochasticOptimization.jl">StochasticOptimization</a></strong>: A generic framework for optimization. The initial focus has been on gradient descent, but I have hopes that the framework design will be adopted by other classic optimization frameworks, like <a href="https://github.com/JuliaOpt/Optim.jl">Optim</a>. There are many gradient descent methods included: SGD with momentum, Adagrad, Adadelta, Adam, Adamax, and RMSProp. The flexible “Meta Learner” framework provides a modular approach to optimization algorithms, allowing developers to add convergence criteria, custom iteration traces, plotting, animation, etc. We’ll see this flexibility in the example below. We have also redesigned data iteration/sampling/splitting, and the new iteration framework is currently housed in StochasticOptimization (though it will eventually live in MLDataUtils). I am the primary developer for this package.</li>
</ul>
<h3 id="learning-mnist">Learning MNIST</h3>
<p>Time to code! I’ll walk you through some code to build, learn, and visualize a fully connected neural network for the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a>. The steps I’ll cover are:</p>
<ul>
<li>Load and initialize Learn and Plots</li>
<li>Build a special wrapper for our trace plotting</li>
<li>Load the MNIST dataset</li>
<li>Build a neural net and our objective function</li>
<li>Create custom traces for our optimizer</li>
<li>Build a learner, and learn optimal parameters</li>
</ul>
<div class="imgcenter">
<img src="/images/juliaml1/mnist_50x50.gif" /><br />
<em>Custom visualization for tracking MNIST fit</em>
</div>
<p>Disclaimers:</p>
<ul>
<li>I expect you have a basic understanding of gradient descent optimization and machine learning models. I don’t have the time or space to explain those concepts in detail, and there are plenty of other resources for that.</li>
<li>Basic knowledge of Julia syntax/concepts would be very helpful.</li>
<li>This API is subject to change, and this should be considered pre-alpha software.</li>
<li>This assumes you are using Julia 0.5.</li>
</ul>
<p>Get the software (use <code class="highlighter-rouge">Pkg.checkout</code> on a package for the latest features):</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="c"># Install Learn, which will install all the JuliaML packages</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">clone</span><span class="x">(</span><span class="s">"https://github.com/JuliaML/Learn.jl"</span><span class="x">)</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">build</span><span class="x">(</span><span class="s">"Learn"</span><span class="x">)</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">checkout</span><span class="x">(</span><span class="s">"MLDataUtils"</span><span class="x">,</span> <span class="s">"tom"</span><span class="x">)</span> <span class="c"># call Pkg.free if/when this branch is merged</span>
<span class="c"># A package to load the data</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">add</span><span class="x">(</span><span class="s">"MNIST"</span><span class="x">)</span>
<span class="c"># Install Plots and StatPlots</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">add</span><span class="x">(</span><span class="s">"Plots"</span><span class="x">)</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">add</span><span class="x">(</span><span class="s">"StatPlots"</span><span class="x">)</span>
<span class="c"># Install GR -- the backend we'll use for Plots</span>
<span class="n">Pkg</span><span class="o">.</span><span class="n">add</span><span class="x">(</span><span class="s">"GR"</span><span class="x">)</span>
</code></pre>
</div>
<p>Start up Julia, then load the packages:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">using</span> <span class="n">Learn</span>
<span class="k">import</span> <span class="n">MNIST</span>
<span class="n">using</span> <span class="n">MLDataUtils</span>
<span class="n">using</span> <span class="n">StatsBase</span>
<span class="n">using</span> <span class="n">StatPlots</span>
<span class="c"># Set up GR for plotting. x11 is uglier, but much faster</span>
<span class="n">ENV</span><span class="x">[</span><span class="s">"GKS_WSTYPE"</span><span class="x">]</span> <span class="o">=</span> <span class="s">"x11"</span>
<span class="n">gr</span><span class="x">(</span><span class="n">leg</span><span class="o">=</span><span class="n">false</span><span class="x">,</span> <span class="n">linealpha</span><span class="o">=</span><span class="mf">0.5</span><span class="x">)</span>
</code></pre>
</div>
<p>A custom type to simplify the creation of trace plots (which <del>will probably be</del> has already been added to <a href="https://github.com/JuliaML/MLPlots.jl">MLPlots</a>):</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="c"># the type, parameterized by the indices and plotting backend</span>
<span class="k">type</span><span class="nc"> TracePlot</span><span class="x">{</span><span class="n">I</span><span class="x">,</span><span class="n">T</span><span class="x">}</span>
<span class="n">indices</span><span class="o">::</span><span class="n">I</span>
<span class="n">plt</span><span class="o">::</span><span class="n">Plot</span><span class="x">{</span><span class="n">T</span><span class="x">}</span>
<span class="k">end</span>
<span class="n">getplt</span><span class="x">(</span><span class="n">tp</span><span class="o">::</span><span class="n">TracePlot</span><span class="x">)</span> <span class="o">=</span> <span class="n">tp</span><span class="o">.</span><span class="n">plt</span>
<span class="c"># construct a TracePlot for n series. note we pass through</span>
<span class="c"># any keyword arguments to the `plot` call</span>
<span class="k">function</span><span class="nf"> TracePlot</span><span class="x">(</span><span class="n">n</span><span class="o">::</span><span class="kt">Int</span> <span class="o">=</span> <span class="mi">1</span><span class="x">;</span> <span class="n">maxn</span><span class="o">::</span><span class="kt">Int</span> <span class="o">=</span> <span class="mi">500</span><span class="x">,</span> <span class="n">kw</span><span class="o">...</span><span class="x">)</span>
<span class="n">indices</span> <span class="o">=</span> <span class="k">if</span> <span class="n">n</span> <span class="o">></span> <span class="n">maxn</span>
<span class="c"># limit to maxn series, randomly sampled</span>
<span class="n">shuffle</span><span class="x">(</span><span class="mi">1</span><span class="x">:</span><span class="n">n</span><span class="x">)[</span><span class="mi">1</span><span class="x">:</span><span class="n">maxn</span><span class="x">]</span>
<span class="k">else</span>
<span class="mi">1</span><span class="x">:</span><span class="n">n</span>
<span class="k">end</span>
<span class="n">TracePlot</span><span class="x">(</span><span class="n">indices</span><span class="x">,</span> <span class="n">plot</span><span class="x">(</span><span class="n">length</span><span class="x">(</span><span class="n">indices</span><span class="x">);</span> <span class="n">kw</span><span class="o">...</span><span class="x">))</span>
<span class="k">end</span>
<span class="c"># add a y-vector for value x</span>
<span class="k">function</span><span class="nf"> add_data</span><span class="x">(</span><span class="n">tp</span><span class="o">::</span><span class="n">TracePlot</span><span class="x">,</span> <span class="n">x</span><span class="o">::</span><span class="n">Number</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="n">AbstractVector</span><span class="x">)</span>
<span class="k">for</span> <span class="x">(</span><span class="n">i</span><span class="x">,</span><span class="n">idx</span><span class="x">)</span> <span class="k">in</span> <span class="n">enumerate</span><span class="x">(</span><span class="n">tp</span><span class="o">.</span><span class="n">indices</span><span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">tp</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">series_list</span><span class="x">[</span><span class="n">i</span><span class="x">],</span> <span class="n">x</span><span class="x">,</span> <span class="n">y</span><span class="x">[</span><span class="n">idx</span><span class="x">])</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="c"># convenience: if y is a number, wrap it as a vector and call the other method</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">tp</span><span class="o">::</span><span class="n">TracePlot</span><span class="x">,</span> <span class="n">x</span><span class="o">::</span><span class="n">Number</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="n">Number</span><span class="x">)</span> <span class="o">=</span> <span class="n">add_data</span><span class="x">(</span><span class="n">tp</span><span class="x">,</span> <span class="n">x</span><span class="x">,</span> <span class="x">[</span><span class="n">y</span><span class="x">])</span>
</code></pre>
</div>
<p>Load the MNIST data and preprocess:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="c"># our data:</span>
<span class="n">x_train</span><span class="x">,</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">MNIST</span><span class="o">.</span><span class="n">traindata</span><span class="x">()</span>
<span class="n">x_test</span><span class="x">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">MNIST</span><span class="o">.</span><span class="n">testdata</span><span class="x">()</span>
<span class="c"># normalize the input data given μ/σ for the input training data</span>
<span class="n">μ</span><span class="x">,</span> <span class="n">σ</span> <span class="o">=</span> <span class="n">rescale!</span><span class="x">(</span><span class="n">x_train</span><span class="x">)</span>
<span class="n">rescale!</span><span class="x">(</span><span class="n">x_test</span><span class="x">,</span> <span class="n">μ</span><span class="x">,</span> <span class="n">σ</span><span class="x">)</span>
<span class="c"># convert class vector to "one hot" matrix</span>
<span class="n">y_train</span><span class="x">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">map</span><span class="x">(</span><span class="n">to_one_hot</span><span class="x">,</span> <span class="x">(</span><span class="n">y_train</span><span class="x">,</span> <span class="n">y_test</span><span class="x">))</span>
<span class="n">train</span> <span class="o">=</span> <span class="x">(</span><span class="n">x_train</span><span class="x">,</span> <span class="n">y_train</span><span class="x">)</span>
<span class="n">test</span> <span class="o">=</span> <span class="x">(</span><span class="n">x_test</span><span class="x">,</span> <span class="n">y_test</span><span class="x">)</span>
</code></pre>
</div>
<p>Build a neural net with <a href="http://ieeexplore.ieee.org/document/7280459/">softplus activations</a> for the inner layers and softmax output for classification:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">nin</span><span class="x">,</span> <span class="n">nh</span><span class="x">,</span> <span class="n">nout</span> <span class="o">=</span> <span class="mi">784</span><span class="x">,</span> <span class="x">[</span><span class="mi">50</span><span class="x">,</span><span class="mi">50</span><span class="x">],</span> <span class="mi">10</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">nnet</span><span class="x">(</span><span class="n">nin</span><span class="x">,</span> <span class="n">nout</span><span class="x">,</span> <span class="n">nh</span><span class="x">,</span> <span class="x">:</span><span class="n">softplus</span><span class="x">,</span> <span class="x">:</span><span class="n">softmax</span><span class="x">)</span>
</code></pre>
</div>
<p>Note: the <code class="highlighter-rouge">nnet</code> method is a very simple convenience constructor for <code class="highlighter-rouge">Chain</code> transformations. It’s pretty easy to construct the transformation yourself for more complex models. This is what is constructed on the call to <code class="highlighter-rouge">nnet</code> (note: this is the output of <code class="highlighter-rouge">Base.show</code>, not runnable code):</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">Chain</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}(</span>
<span class="n">Affine</span><span class="x">{</span><span class="mi">784</span><span class="o">--></span><span class="mi">50</span><span class="x">}</span>
<span class="n">softplus</span><span class="x">{</span><span class="mi">50</span><span class="x">}</span>
<span class="n">Affine</span><span class="x">{</span><span class="mi">50</span><span class="o">--></span><span class="mi">50</span><span class="x">}</span>
<span class="n">softplus</span><span class="x">{</span><span class="mi">50</span><span class="x">}</span>
<span class="n">Affine</span><span class="x">{</span><span class="mi">50</span><span class="o">--></span><span class="mi">10</span><span class="x">}</span>
<span class="n">softmax</span><span class="x">{</span><span class="mi">10</span><span class="x">}</span>
<span class="x">)</span>
</code></pre>
</div>
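<p>For reference, the same network could also be written out by hand. The following is an illustrative sketch only — the exact <code class="highlighter-rouge">Chain</code>/<code class="highlighter-rouge">Affine</code>/<code class="highlighter-rouge">Activation</code> constructor signatures are assumptions about the Transformations.jl API, not verified calls:</p>

```julia
# hypothetical hand-built equivalent of nnet(784, 10, [50,50], :softplus, :softmax);
# constructor names and signatures here are illustrative assumptions, not the verified API
t = Chain(Float64,
    Affine(784, 50), Activation(:softplus, 50),
    Affine(50, 50),  Activation(:softplus, 50),
    Affine(50, 10),  Activation(:softmax, 10),
)
```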
<p>Next we create an objective function to minimize, adding an elastic-net (combined L1/L2) penalty for regularization. The cross-entropy loss function is inferred automatically for us because we are using a softmax output:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">obj</span> <span class="o">=</span> <span class="n">objective</span><span class="x">(</span><span class="n">t</span><span class="x">,</span> <span class="n">ElasticNetPenalty</span><span class="x">(</span><span class="mf">1e-5</span><span class="x">))</span>
</code></pre>
</div>
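<p>The softmax/cross-entropy pairing is a natural default partly because the gradient of the combined output layer collapses to <code class="highlighter-rouge">ŷ - y</code>. A quick stand-alone check of that identity in plain Julia (independent of the JuliaML API):</p>

```julia
# stand-alone check that the softmax + cross-entropy gradient collapses to ŷ - y

function softmax(z)
    e = exp.(z .- maximum(z))   # subtract the max for numerical stability
    e ./ sum(e)
end

# cross-entropy loss against a one-hot target y
xent(ŷ, y) = -sum(y .* log.(ŷ))

z = [1.0, 2.0, 0.5]             # logits (pre-softmax layer values)
y = [0.0, 1.0, 0.0]             # one-hot target
ŷ = softmax(z)

# analytic gradient of xent(softmax(z), y) with respect to z
g = ŷ .- y
```

Here <code class="highlighter-rouge">g</code> is the exact gradient of the loss with respect to the logits — there is no need to differentiate the softmax and the log-loss separately.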
<p>Build <code class="highlighter-rouge">TracePlot</code> objects for our custom visualization:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="c"># parameter plots</span>
<span class="n">pidx</span> <span class="o">=</span> <span class="mi">1</span><span class="x">:</span><span class="mi">2</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">t</span><span class="x">)</span>
<span class="n">pvalplts</span> <span class="o">=</span> <span class="x">[</span><span class="n">TracePlot</span><span class="x">(</span><span class="n">length</span><span class="x">(</span><span class="n">params</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">i</span><span class="x">])),</span> <span class="n">title</span><span class="o">=</span><span class="s">"</span><span class="si">$</span><span class="s">(t[i])"</span><span class="x">)</span> <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="n">pidx</span><span class="x">]</span>
<span class="n">ylabel!</span><span class="x">(</span><span class="n">pvalplts</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span><span class="o">.</span><span class="n">plt</span><span class="x">,</span> <span class="s">"Param Vals"</span><span class="x">)</span>
<span class="n">pgradplts</span> <span class="o">=</span> <span class="x">[</span><span class="n">TracePlot</span><span class="x">(</span><span class="n">length</span><span class="x">(</span><span class="n">params</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">i</span><span class="x">])))</span> <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="n">pidx</span><span class="x">]</span>
<span class="n">ylabel!</span><span class="x">(</span><span class="n">pgradplts</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span><span class="o">.</span><span class="n">plt</span><span class="x">,</span> <span class="s">"Param Grads"</span><span class="x">)</span>
<span class="c"># nnet plots of values and gradients</span>
<span class="n">valinplts</span> <span class="o">=</span> <span class="x">[</span><span class="n">TracePlot</span><span class="x">(</span><span class="n">input_length</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">i</span><span class="x">]),</span> <span class="n">title</span><span class="o">=</span><span class="s">"input"</span><span class="x">,</span> <span class="n">yguide</span><span class="o">=</span><span class="s">"Layer Value"</span><span class="x">)</span> <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">1</span><span class="x">]</span>
<span class="n">valoutplts</span> <span class="o">=</span> <span class="x">[</span><span class="n">TracePlot</span><span class="x">(</span><span class="n">output_length</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">i</span><span class="x">]),</span> <span class="n">title</span><span class="o">=</span><span class="s">"</span><span class="si">$</span><span class="s">(t[i])"</span><span class="x">,</span> <span class="n">titlepos</span><span class="o">=</span><span class="x">:</span><span class="n">left</span><span class="x">)</span> <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">t</span><span class="x">)]</span>
<span class="n">gradinplts</span> <span class="o">=</span> <span class="x">[</span><span class="n">TracePlot</span><span class="x">(</span><span class="n">input_length</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">i</span><span class="x">]),</span> <span class="n">yguide</span><span class="o">=</span><span class="s">"Layer Grad"</span><span class="x">)</span> <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="mi">1</span><span class="x">]</span>
<span class="n">gradoutplts</span> <span class="o">=</span> <span class="x">[</span><span class="n">TracePlot</span><span class="x">(</span><span class="n">output_length</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">i</span><span class="x">]))</span> <span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">t</span><span class="x">)]</span>
<span class="c"># loss/accuracy plots</span>
<span class="n">lossplt</span> <span class="o">=</span> <span class="n">TracePlot</span><span class="x">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Test Loss"</span><span class="x">,</span> <span class="n">ylim</span><span class="o">=</span><span class="x">(</span><span class="mi">0</span><span class="x">,</span><span class="kt">Inf</span><span class="x">))</span>
<span class="n">accuracyplt</span> <span class="o">=</span> <span class="n">TracePlot</span><span class="x">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Accuracy"</span><span class="x">,</span> <span class="n">ylim</span><span class="o">=</span><span class="x">(</span><span class="mf">0.6</span><span class="x">,</span><span class="mi">1</span><span class="x">))</span>
</code></pre>
</div>
<p>Add a method for computing the loss and accuracy on a subsample of test data:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="k">function</span><span class="nf"> my_test_loss</span><span class="x">(</span><span class="n">obj</span><span class="x">,</span> <span class="n">testdata</span><span class="x">,</span> <span class="n">totcount</span> <span class="o">=</span> <span class="mi">500</span><span class="x">)</span>
<span class="n">totloss</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="n">totcorrect</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="x">(</span><span class="n">x</span><span class="x">,</span><span class="n">y</span><span class="x">)</span> <span class="k">in</span> <span class="n">eachobs</span><span class="x">(</span><span class="n">rand</span><span class="x">(</span><span class="n">eachobs</span><span class="x">(</span><span class="n">testdata</span><span class="x">),</span> <span class="n">totcount</span><span class="x">))</span>
<span class="n">totloss</span> <span class="o">+=</span> <span class="n">transform!</span><span class="x">(</span><span class="n">obj</span><span class="x">,</span><span class="n">y</span><span class="x">,</span><span class="n">x</span><span class="x">)</span>
<span class="c"># logistic version:</span>
<span class="c"># ŷ = output_value(obj.transformation)[1]</span>
<span class="c"># correct = (ŷ > 0.5 && y > 0.5) || (ŷ <= 0.5 && y < 0.5)</span>
<span class="c"># softmax version:</span>
<span class="n">ŷ</span> <span class="o">=</span> <span class="n">output_value</span><span class="x">(</span><span class="n">obj</span><span class="o">.</span><span class="n">transformation</span><span class="x">)</span>
<span class="n">chosen_idx</span> <span class="o">=</span> <span class="n">indmax</span><span class="x">(</span><span class="n">ŷ</span><span class="x">)</span>
<span class="n">correct</span> <span class="o">=</span> <span class="n">y</span><span class="x">[</span><span class="n">chosen_idx</span><span class="x">]</span> <span class="o">></span> <span class="mi">0</span>
<span class="n">totcorrect</span> <span class="o">+=</span> <span class="n">correct</span>
<span class="k">end</span>
<span class="n">totloss</span><span class="x">,</span> <span class="n">totcorrect</span><span class="o">/</span><span class="n">totcount</span>
<span class="k">end</span>
</code></pre>
</div>
<p>Our custom trace method, which will be called after each minibatch:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">tracer</span> <span class="o">=</span> <span class="n">IterFunction</span><span class="x">((</span><span class="n">obj</span><span class="x">,</span> <span class="n">i</span><span class="x">)</span> <span class="o">-></span> <span class="n">begin</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">mod1</span><span class="x">(</span><span class="n">i</span><span class="x">,</span><span class="n">n</span><span class="x">)</span><span class="o">==</span><span class="n">n</span> <span class="o">||</span> <span class="k">return</span> <span class="n">false</span>
<span class="c"># param trace</span>
<span class="k">for</span> <span class="x">(</span><span class="n">j</span><span class="x">,</span><span class="n">k</span><span class="x">)</span> <span class="k">in</span> <span class="n">enumerate</span><span class="x">(</span><span class="n">pidx</span><span class="x">)</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">pvalplts</span><span class="x">[</span><span class="n">j</span><span class="x">],</span> <span class="n">i</span><span class="x">,</span> <span class="n">params</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">k</span><span class="x">]))</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">pgradplts</span><span class="x">[</span><span class="n">j</span><span class="x">],</span> <span class="n">i</span><span class="x">,</span> <span class="n">grad</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">k</span><span class="x">]))</span>
<span class="k">end</span>
<span class="c"># input/output trace</span>
<span class="k">for</span> <span class="n">j</span><span class="o">=</span><span class="mi">1</span><span class="x">:</span><span class="n">length</span><span class="x">(</span><span class="n">t</span><span class="x">)</span>
<span class="k">if</span> <span class="n">j</span><span class="o">==</span><span class="mi">1</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">valinplts</span><span class="x">[</span><span class="n">j</span><span class="x">],</span> <span class="n">i</span><span class="x">,</span> <span class="n">input_value</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">j</span><span class="x">]))</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">gradinplts</span><span class="x">[</span><span class="n">j</span><span class="x">],</span> <span class="n">i</span><span class="x">,</span> <span class="n">input_grad</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">j</span><span class="x">]))</span>
<span class="k">end</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">valoutplts</span><span class="x">[</span><span class="n">j</span><span class="x">],</span> <span class="n">i</span><span class="x">,</span> <span class="n">output_value</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">j</span><span class="x">]))</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">gradoutplts</span><span class="x">[</span><span class="n">j</span><span class="x">],</span> <span class="n">i</span><span class="x">,</span> <span class="n">output_grad</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="n">j</span><span class="x">]))</span>
<span class="k">end</span>
<span class="c"># compute approximate test loss and trace it</span>
<span class="k">if</span> <span class="n">mod1</span><span class="x">(</span><span class="n">i</span><span class="x">,</span><span class="mi">500</span><span class="x">)</span><span class="o">==</span><span class="mi">500</span>
<span class="n">totloss</span><span class="x">,</span> <span class="n">accuracy</span> <span class="o">=</span> <span class="n">my_test_loss</span><span class="x">(</span><span class="n">obj</span><span class="x">,</span> <span class="n">test</span><span class="x">,</span> <span class="mi">500</span><span class="x">)</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">lossplt</span><span class="x">,</span> <span class="n">i</span><span class="x">,</span> <span class="n">totloss</span><span class="x">)</span>
<span class="n">add_data</span><span class="x">(</span><span class="n">accuracyplt</span><span class="x">,</span> <span class="n">i</span><span class="x">,</span> <span class="n">accuracy</span><span class="x">)</span>
<span class="k">end</span>
<span class="c"># build a heatmap of the total outgoing weight from each pixel</span>
<span class="n">pixel_importance</span> <span class="o">=</span> <span class="n">reshape</span><span class="x">(</span><span class="n">sum</span><span class="x">(</span><span class="n">t</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span><span class="o">.</span><span class="n">params</span><span class="o">.</span><span class="n">views</span><span class="x">[</span><span class="mi">1</span><span class="x">],</span><span class="mi">1</span><span class="x">),</span> <span class="mi">28</span><span class="x">,</span> <span class="mi">28</span><span class="x">)</span>
<span class="n">hmplt</span> <span class="o">=</span> <span class="n">heatmap</span><span class="x">(</span><span class="n">pixel_importance</span><span class="x">,</span> <span class="n">ratio</span><span class="o">=</span><span class="mi">1</span><span class="x">)</span>
<span class="c"># build a nested-grid layout for all the trace plots</span>
<span class="n">plot</span><span class="x">(</span>
<span class="n">map</span><span class="x">(</span><span class="n">getplt</span><span class="x">,</span> <span class="n">vcat</span><span class="x">(</span>
<span class="n">pvalplts</span><span class="x">,</span> <span class="n">pgradplts</span><span class="x">,</span>
<span class="n">valinplts</span><span class="x">,</span> <span class="n">valoutplts</span><span class="x">,</span>
<span class="n">gradinplts</span><span class="x">,</span> <span class="n">gradoutplts</span><span class="x">,</span>
<span class="n">lossplt</span><span class="x">,</span> <span class="n">accuracyplt</span>
<span class="x">))</span><span class="o">...</span><span class="x">,</span>
<span class="n">hmplt</span><span class="x">,</span>
<span class="n">size</span> <span class="o">=</span> <span class="x">(</span><span class="mi">1400</span><span class="x">,</span><span class="mi">1000</span><span class="x">),</span>
<span class="n">layout</span><span class="o">=</span><span class="nd">@layout</span><span class="x">([</span>
<span class="n">grid</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="n">length</span><span class="x">(</span><span class="n">pvalplts</span><span class="x">))</span>
<span class="n">grid</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span><span class="n">length</span><span class="x">(</span><span class="n">valoutplts</span><span class="x">)</span><span class="o">+</span><span class="mi">1</span><span class="x">)</span>
<span class="n">grid</span><span class="x">(</span><span class="mi">1</span><span class="x">,</span><span class="mi">3</span><span class="x">){</span><span class="mf">0.2</span><span class="n">h</span><span class="x">}</span>
<span class="x">])</span>
<span class="x">)</span>
<span class="c"># show the plot</span>
<span class="n">gui</span><span class="x">()</span>
<span class="k">end</span><span class="x">)</span>
<span class="c"># trace once before we start learning to see initial values</span>
<span class="n">tracer</span><span class="o">.</span><span class="n">f</span><span class="x">(</span><span class="n">obj</span><span class="x">,</span> <span class="mi">0</span><span class="x">)</span>
</code></pre>
</div>
<p>Finally, we build our learner and learn! We’ll use the Adadelta optimizer with a learning rate of 0.05. Notice that we simply pass our custom tracer alongside the other arguments; we could have added other sub-learners as well. The <code class="highlighter-rouge">make_learner</code> method is just a convenience that optionally constructs a <code class="highlighter-rouge">MasterLearner</code> from some common sub-learners. In this case we add a <code class="highlighter-rouge">MaxIter(50000)</code> sub-learner to stop the optimization after 50,000 iterations.</p>
<p>We will train on randomly-sampled minibatches of 5 observations at a time, and update our parameters using the average gradient:</p>
<div class="language-julia highlighter-rouge"><pre class="highlight"><code><span class="n">learner</span> <span class="o">=</span> <span class="n">make_learner</span><span class="x">(</span>
<span class="n">GradientLearner</span><span class="x">(</span><span class="mf">5e-2</span><span class="x">,</span> <span class="n">Adadelta</span><span class="x">()),</span>
<span class="n">tracer</span><span class="x">,</span>
<span class="n">maxiter</span> <span class="o">=</span> <span class="mi">50000</span>
<span class="x">)</span>
<span class="n">learn!</span><span class="x">(</span><span class="n">obj</span><span class="x">,</span> <span class="n">learner</span><span class="x">,</span> <span class="n">infinite_batches</span><span class="x">(</span><span class="n">train</span><span class="x">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="x">))</span>
</code></pre>
</div>
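<p>Averaging the gradient over a 5-observation minibatch is conceptually just a running mean of per-observation gradients. A minimal plain-Julia sketch of the idea — the real update lives inside <code class="highlighter-rouge">GradientLearner</code>/Adadelta, and <code class="highlighter-rouge">minibatch_step!</code> and <code class="highlighter-rouge">grad_fn</code> below are illustrative stand-ins, not part of the actual API:</p>

```julia
# illustrative minibatch-averaged gradient-descent step (not the real JuliaML API)
function minibatch_step!(θ, grad_fn, batch, lr)
    g = zeros(length(θ))
    for (x, y) in batch
        g .+= grad_fn(θ, x, y)   # accumulate per-observation gradients
    end
    g ./= length(batch)          # average over the minibatch
    θ .-= lr .* g                # plain gradient-descent update
    θ
end
```

For example, with loss <code class="highlighter-rouge">θ²</code> the gradient is <code class="highlighter-rouge">2θ</code>, so a step from <code class="highlighter-rouge">θ = 1.0</code> with <code class="highlighter-rouge">lr = 0.1</code> lands at <code class="highlighter-rouge">0.8</code> regardless of batch size, since every observation contributes the same gradient.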
<div class="imgcenter">
<img src="/images/juliaml1/mnist_50x50.png" /><br />
<em>A snapshot after training for 30000 iterations</em>
</div>
<p>After a little while, the model correctly predicts ~97% of the test examples. The heatmap (which represents the “importance” of each pixel according to the outgoing weights of our model) depicts the curves we have learned to use in distinguishing the digits. The performance could certainly be improved, and I might devote future posts to the many ways we could improve the model; however, model performance was not my focus here. Rather, I wanted to highlight the flexibility available for training and visualizing machine learning models.</p>
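<p>The heatmap data itself is simple: the total outgoing weight from each input pixel is the column sum of the first <code class="highlighter-rouge">Affine</code> layer’s 50×784 weight matrix, reshaped onto the 28×28 image grid. In plain Julia (using a random matrix as a stand-in for the trained weights held in <code class="highlighter-rouge">t[1].params.views[1]</code>):</p>

```julia
# W stands in for the first Affine layer's (50 × 784) weight matrix
W = randn(50, 784)

# total outgoing weight from each input pixel, arranged as a 28×28 image
pixel_importance = reshape(vec(sum(W; dims=1)), 28, 28)
```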
<h3 id="conclusion">Conclusion</h3>
<p>There are many approaches and toolsets for data science. In the future, I hope that the ease of development in Julia convinces people to move their considerable resources away from inefficient languages and towards Julia. I’d like to see Learn.jl become the generic interface for all things learning, similar to how Plots is slowly becoming the center of visualization in Julia.</p>
<p>If you have questions, or want to help out, <a href="https://gitter.im/JuliaML/chat">come chat</a>. For those in the reinforcement learning community, I’ll probably focus my next post on <a href="https://github.com/tbreloff/Reinforce.jl">Reinforce</a>, <a href="https://github.com/tbreloff/AtariAlgos.jl">AtariAlgos</a>, and <a href="https://github.com/tbreloff/OpenAIGym.jl">OpenAIGym</a>. I’m open to many types of collaboration. In addition, I can consult and/or advise on many topics in finance and data science. If you think I can help you, or you can help me, please don’t hesitate to get in touch.</p>