The Elegant Math Behind Neural Networks December 07, 2020

It's been a while since I started dabbling with neural networks. But it wasn't just side projects: Last summer I started working on my master's thesis on using neural networks with aerial imagery. As is required for a project like this, I dove deep into the mechanics of neural networks, spending hours not only on understanding different machine learning techniques but also the math behind them. What I found was that in their essence neural networks are a really beautiful concept. But what I've also found is that most resources on neural networks either skimped on the math, only mentioning a few key equations without explaining all the mechanics behind them, or they were so thorough that they were hard to follow.

Thus I worked through the math myself in order to present the math behind neural networks in my master's thesis without leaving important connections between equations as an exercise to the reader but neither diving in so deep that no one except mathematics can follow.

This article is an abridged version of what I wrote for my master's thesis. It presupposes knowledge of vector algebra and analysis as taught in As far as I know. I've attended university in Germany so to be fair that's a limited sample size.a typical engineer's math class. We'll first look at the basic idea of neural networks. Then, after discussing the notation used in this article, we'll talk about the history of neural models which will lead us to modern neural models used today. This will be the foundation for understanding multilayer perceptrons and then gradient descent and backpropagation, which is the algorithm used to train them.

Basic Idea

What is a neural network in its essence? It's a computational model that combines simple mathematical functions into complex calculation graphs. Their ability to approximate arbitrary functions makes them a frequently used tool in machine learning.

The idea of neural networks (well, technically speaking we're talking about artificial neural networks here) is based on biological neural networks found in the human brain, where more than 86 billion nerve cells are assembled into a complex network. Each neuron is connected via synapses with other neurons at the input and output of the neuron. A neuron communicates with other neurons by sending a pulsed electrical signal to other neurons if its input (stimulus) exceeds a threshold value.

Artificial neural networks are conceptually based on the behavior of such neurons. They adopt the idea that complex, intelligent behavior can be achieved with networks of simple, uniform computational units. This intelligent behavior is a property of the entire network and is not explicitly coded in individual elements of the network. Like the brain, such artificial networks therefore exhibit emergent behavior.

Despite being researched since the dawn of modern computer science in the 1940s and 50s, it wasn't until around 2006 that research made the current breakthrough, where they became the central force in machine learning.


To talk about neural networks unambiguously, we'll first need to define the notation that will be used from here on forward. As we'll use a lot of vector algebra, we'll denote a vector in bold letters:

x=(x1x2xn)\bold{x}=\begin{pmatrix} x_1 \\ x_2 \\ \ldots \\ x_n \end{pmatrix}

This denotes a vector named x\bold{x} where xix_i is the ii-th element. A transposed vector will be notated as x=(x1,x2,,xn)\bold{x}^{\intercal}=\left(x_1, x_2, \ldots, x_n\right). A matrix M=(x1,x2,,xn)\bold{M}=(\bold{x_1}, \bold{x_2, \ldots, \bold{x}_n}) will be denoted in bold uppercase letters.

As we'll use iterative algorithms a lot when talking about training neural networks, we'll have to distinguish our variables in different iteration steps of our algorithms. Thus we'll use y{i}y^{\{ i \}} to talk about the value yy in iteration step ii.

In addition, with neural networks of multiple layers, we'll have to tell apart values at different layers. We'll do so by using y(L)y^{(L)} to describe the value of yy at layer LL.

Finally, we'll use xy\bold{x} \odot \bold{y} to describe the component-wise multiplication (Hadamard product) of two vectors x\bold{x} and y\bold{y}.

History of Neural Models

We'll start our journey into neural networks by looking at the history of neural models, the fundamental building blocks of neural networks. That way we gain an understanding not only of how neural networks work but how they came along and what motivated their development.

The McCulloch-Pitts Neuron

Nautilus has an in-depth article on Walter Pitts, one of the fathers of modern computing and his pursuit of building a machine modeling the human brain and nervous system.

The idea of modeling computation as a computational graph inspired by the human brain is as old as modern computer science itself. One of the pioneers of computing in the 1940s was Walter Pitts who together with neurophysiologist Warren McCulloch developed the first computational model of a neuron. Its definition is simple:

f(x)={1,i=1Nxiθ0f (\bold{x}) = \begin{cases} 1, \hspace{0.17em} \sum_{i = 1}^N x_i \ge \theta\\ 0 \end{cases}

The McCulloch-Pitts neuron is a function that has NN input values described as input vector x\bold{x} and a binary output f(x)f(\bold{x}). If the sum of input values xi,0i<Nx_i, 0\le i < N exceeds a predefined threshold θ\theta, tie neuron activates its output, otherwise it stays silent.

In their seminal paper A Logical Calculus of Ideas Immanent in Nervous Activity (1943), McCulloch and Pitts showed that a network built out of these neurons was able to logical operations as AND, OR, and NOT. But despite their effort they weren't able to develop a method to train a neural network to approximate a given function. On the contrary: To model a function one had to set the threshold values manually.

The Perceptron

The next big step in the development of neural networks was accomplished by psychologist Frank Rosenblatt in 1958. Building upon the work of McCulloch and Pitts his contribution was twofold: He introduced individual weights to the neuron's input signals and developed a method of training a neuron. The advantage of his neural model was not only the ability to algorithmically train the neuron to model a target function but also its robustness to small deviations of the training data.

Mathematically, the perceptron is described as a function of the input values x\bold{x}, the multiplicative edge weights w\bold{w} and a linear offset bb, also called bias:

f(x,w)={1,wx+b00f (\bold{x}, \bold{w}) = \begin{cases} 1, \bold{w} \cdot \bold{x} + b \ge 0\\ 0 \end{cases}

The Perceptron fires its output whenever the weighted input signals, offset by the bias, reach a threshold of 00. Just as in the McCulloch-Pitts cell, the the Perceptron's output value is quantized to a binary value.

Rosenblatt's Perceptron training algorithm followed the basic idea of adapting the neuron's weights in a way that minimized the modeling error regarding the target function. In order to model a target function y(x)y(\bold{x}) using a training dataset X\bold{X}, one would use a sample xX\bold{x} \in \bold{X} to perform the following iteration step:

Note that we drop the bias bb in the training algorithm for simplicity. It can be viewed as an additional weight which always receives an input value of 11.
Δy{i}(x)=f{i}(x)y(x)w{i+1}=w{i}αxΔy{i}\begin{aligned} \Delta y^{\{ i \}} (\bold{x}) & = f^{\{ i \}} (\bold{x}) - y (\bold{x}) \\ \bold{w}^{\{ i + 1 \}} & = \bold{w}^{\{ i \}} - \alpha \cdot \bold{x} \cdot \Delta y^{\{ i \}} \end{aligned}

Here we first calculate the quantized modeling error Δy{1,0,1}\Delta y \in \{ - 1, 0, 1 \} for the current iteration ii. Then we update the weights using the modeling error and the training sample in a way that is similar to how Newton's method approximates function roots. Additionally we use α\alpha as the learning rate to control the tradeoff between learning quickly (by making big adjustments in each step) or learning accurately (by only making small corrections in each step).

Despite the progress made compared to the McCulloch-Pitts cell, the Perceptron was limited to the modeling of linearly separable functions. This meant that it is not able to learn functions like logical XOR which is non-linear. Furthermore, the training algorithm is only able to train a single neuron or a single layer of multiple neurons but not multiple layers of neurons.


Where Rosenblatt's Perceptron was a big improvement on the original McCulloch-Pitts neuron, ADALINE was a more iterative advancement on the Perceptron. It was developed in 1960 by electrical engineers Bernard Widrow and Ted Hoff. The neural model itself is the same that Rosenblatt already used:

f(x)={1,wx+b00f (\bold{x}) = \begin{cases} 1, \bold{w} \cdot \bold{x} + b \ge 0\\ 0 \end{cases}

But Widrow and Hoff had a key insight: Learning could be improved by not quantizing the modeling error:

Δy{i}(x)=w{i}xy(x)w{i+1}=w{i}αΔy{i}xx2\begin{aligned} \Delta y^{\{ i \}} (\bold{x}) & = \bold{w}^{\{ i \}} \cdot \bold{x} - y (\bold{x}) \\ \bold{w}^{\{ i + 1 \}} & = \bold{w}^{\{ i \}} - \alpha \cdot \Delta y^{\{ i \}} \cdot \frac{\bold{x}}{| \bold{x} |^2} \end{aligned}

In their algorithm, the neuron's weights are updated by using the modeling error to determine the size of the correction and using the training sample to determine its sign.

In their work, the authors also recognized that the training algorithm basically implements a gradient descent that minimizes the quadratic modeling error. Their training algorithm, which is known as the delta rule, is the basis for how modern neural networks are trained today.

AI Winter

As mentioned previously, the Perceptron, and therefore also ADALINE, was not able to approximate non-linearly separable functions like XOR. In the academic literature of their time this was known as the XOR affair and was widely used by critics to attack the idea of neural networks (then known as connectionism) itself. For most of the 1970s and 80s there was little research into neural networks which remained a niche topic. Only in the mid-80s academic interest returned which among other developments included backpropagation, a training algorithm we'll later discuss in detail. Yet, research on neural models played only a minor role in these developments.

Modern Neural Models

Modern neural models follow the ideas of Perceptron and ADALINE but introduce a key change: Instead of quantizing the neuron's output to a binary value they use a so-called activation function to calculate the output signal:

f(x)=g(wx+b)f (\bold{x}) = g (\bold{w}^{\intercal} \bold{x} + b)

Here, g(x)g(\bold{x}) is a vectorized activation function which is applied to each element of its argument:

g(ν)j=g(νj),0j<n,νRng (\bold{\nu})_j = g (\bold{\nu}_j), 0 \leqslant j < n, \bold{\nu} \in \mathbb{R}^n

In order to perform the gradient descent algorithm used to train neural networks, the activation function needs to be differentiable. Until 2011 typical activation functions included the sigmoid function

g(x)=11+exg (x) = \frac{1}{1 + e^{- x}}

and the hyperbolic tangent

g(x)=sinhxcoshx.g (x) = \frac{\sinh x}{\cosh x}.

Due to issues these activation functions have with vanishing gradients Glorot et al. introduced the rectified linear unit (ReLU) in 2011:

g(x)=max(0,x)g (x) = \max (0, x)

Thus, a modern neuron model will look like this:

f(x)=max(0,wx+b).f (\bold{x}) = \max(0, \bold{w} \cdot \bold{x} + b).

Multi-Layer Perceptrons

From the question of how we model a single neuron we now turn to modeling networks of neurons. While a single neuron approximates a real function, a feed-forward network can approximate vectorial functions. In other words: A neural network y(x)\bold{y}(\bold{x}) tries to approximate a vectorial function f(x)\bold{f}(\bold{x}) so that

y(x)f(x)xX. \bm{y} (\bm{x}) \approx \bm{f} (\bm{x}) \quad \forall \bm{x} \in \bm{X}.

In general, multiple neurons can be connected into an arbitrary acyclic graph. But most commonly neurons are arranged into layers where each layer only feeds into the neurons of the following layer:

y(x)=y(L)(y() (y(2)(y(1)(x))))\bm{y} (\bm{x}) = \bm{y}^{(L)} (\bm{y}^{(\ldots)}  (\bm{y}^{(2)} (\bm{y}^{(1)} (\bm{x}))))

Here, each layer y(l)(ξ)\bold{y}^{(l)}(\xi) is built from a set of neurons

y(l)(ξ)=[y1(ξ)y2(ξ)yn(ξ)].\bm{y}^{(l)} (\bm{\xi}) = \left[ \begin{array}{c} y_1 (\bm{\xi})\\ y_2 (\bm{\xi})\\ \vdots\\ y_n (\bm{\xi}) \end{array} \right].

Each neuron is connected to the output values of all nodes of the previous layer. In order to distinguish a layer's output from the output of the neural net itself, we'll call the layer output the activation value a(l)\bold{a}^{(l)}. Each layer also has a set of weights W(l)\bold{W}^{(l)} where wijw_{ij} describes the weight of the previous layer's neuron jj to the current layer's neuron ii. The set of all layer's weights can be described as weight tensor W={W1,W2,,WL}\bold{W}=\{ \bold{W}_1, \bold{W}_2, \ldots,\bold{W}_L \} where LL is the number of layers in the network.

As just mentioned, every layer uses the previous layer's activation value to calculate its output vector. We'll describe this as

y(l)=a(l)(a(l1))=g(z(l)(a(l1))).\bold{y}^{(l)}= \bm{a}^{(l)} (\bm{a}^{(l - 1)}) = g (\bm{z}^{(l)} (\bm{a}^{(l - 1)})).

Here we calculate the layer's activation value as a function of the previous layer's activation value. More in detail, to calculate the activation value we first calculate the pre-activation value z(l)\bold{z}^{(l)} and then apply the activation function g(z(l))g(\bold{z}^{(l)}). The pre-activation value is determined using

z(l)(a(l1))={W(l)a(l1)+b(l),l>1x,l=1. \bm{z}^{(l)} (\bm{a}^{(l - 1)}) = \left\{\begin{array}{ll} \bm{W}^{(l)} \cdot \bm{a}^{(l - 1)} + \bm{b}^{(l)}, & l > 1\\ \bm{x}, & l = 1 \end{array}\right..

This means that for the first layer, the pre-activation value simply is the training sample x\bold{x} while for each subsequent layer it's the previous layer's activation value weighted with weight matrix W(l)\bold{W}^{(l)} and offset with bias b(l)\bold{b}^{(l)}.

Finally, we can define the output signal of the neural network as the last layer's activation value:

y(x,W)=a(L).\bm{y} (\bm{x}, \bm{W}) = \bm{a}^{(L)}.

Gradient Descent

The most commonly used method to train a feed-forward neural network is the gradient descent algorithm. It is used to optimize the network weights W\bold{W} in a way that minimizes the modeling error described by a cost function C(X,W)C(\bold{X}, \bold{W}) regarding a dataset X\bold{X}:

minWC(X,W)\min_{\bm{W}} C (\bm{X}, \bm{W})

The cost function for the entire data set X\bold{X} with NN entries is composed of the costs for the individual training entries Cx(y(x),f(x))C_x\left(\bold{y}\left(\bold{x}), \bold{f}(\bold{x}\right)\right):

C(X,W)=12Nx XCx(y(x),f (x))C (\bm{X}, \bm{W}) = \frac{1}{2 N} \sum_{\bm{x} \in  \bm{X}} C_{\bm{x}} (\bm{y} (\bm{x}), \bm{f}  (\bm{x}))

In its essence, the gradient descent algorithm works as follows:

The gradient descent algorithm thus follows the direction of steepest descent by calculating the cost function's gradient C(X,W)\nabla C(\bold{X}, \bold{W}). As with the Perceptron training and ADALINE we use a learning rate factor α\alpha.

The choice of cost function CxC_x has a big impact on the network's training behavior. Typically, for regression tasks a mean square error (MSE) is used:

Cx(y(x),f(x))=12y(x)f(x)2.C_{\bm{x}} (\bm{y} (\bm{x}), \bm{f} (\bm{x})) = \frac{1}{2} \| \bm{y} (\bm{x}) - \bm{f} (\bm{x}) \|^2.

On the other hand, classification tasks often use cross-entropy loss:

Cx(y(x),f(x))=i[fi(x)lnyi(x)+(1fi(x))ln(1yi(x))].C_{\bm{x}} (\bm{y} (\bm{x}), \bm{f} (\bm{x})) = - \sum_i [f_i (\bm{x}) \cdot \ln y_i (\bm{x}) + (1 - f_i (\bm{x})) \cdot \ln (1 - y_i (\bm{x}))].


While gradient descent itself is already enough to train a neural network for a given dataset, it most often is very computation-intensive to determine the full gradient C(X,W)\nabla C(\bold{X}, \bold{W}). The cost function is built by composing the layer's activation functions repeatedly. This means that in order to get the gradient we have to apply partial differentiation with respect to every weight in weight tensor W\bold{W} which for modern neural networks can be in the millions of weights.

Backpropagation solves the computation problem by calculating the neural network cost function's gradient in an iterative fashion. The algorithm itself was developed in the 1960s in the field of control engineering. In 1974, mathematician Paul Werbos was the first one to apply this method to neural networks, but his discovery wasn't widely noticed at the time. Later, in 1986 psychologist David Rumelhard rediscovered the application of backpropagation to neural networks and was able to popularize this approach to neural network training.

The backpropagation algorithm works by first performing a forward propagation step that calculates the cost for the whole training set. Then, starting with the last layer the network weights are updated iteratively for each previous layer until reaching the input layer.

We'll first look into how backpropagation works for the last layer and from there we'll generalize the method to also work for the previous layers.

Backpropagation For The Last Layer LL

The basic idea of backpropagation is to adjust each weight according to its influence on the cost function:

Δwij(l)=αCwij(l)\Delta w_{i j}^{(l)} = - \alpha \cdot \frac{\partial C}{\partial w_{ij}^{(l)}}

When starting at the last layer LL we thus first need to calculate the const function's gradient. By applying the chain rule to the previous equation we get

Cwij(L)=Cai(L)ai(L)zi(L)zi(L)wij(L).\frac{\partial C}{\partial w_{ij}^{(L)}} = \frac{\partial C}{\partial a_i^{(L)}} \cdot \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} \cdot \frac{\partial z_i^{(L)}}{\partial w_{i j}^{(L)}}.

The first two terms of this equation are usually called δij\delta_{ij} and describes the error for weight wijw_{ij}:

δij=Cai(L)ai(L)zi(L)\delta_{i j} = \frac{\partial C}{\partial a_i^{(L)}} \cdot \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}}

When taking the mean square error cost function and the ReLU activation function as an example, we can see how this term could be calculated:

Cai(L)=12ai(L)(x)fi(x)2=ai(L)(x)fi(x)ai(L)zi(L)=zi(L)g(zi(L))=zi(L)max(0,zi(L))={1,zi(L)>00,zi(L)<0\begin{aligned} \frac{\partial C}{\partial a_i^{(L)}} & = \frac{1}{2} \| a_i^{(L)} (\bm{x}) - f_i (\bm{x}) \|^2 = a_i^{(L)} (\bm{x}) - f_i (\bm{x}) \\ \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} & = \frac{\partial}{\partial z_i^{(L)}} g(z_i^{(L)}) = \frac{\partial}{\partial z_i^{(L)}} \max(0, z_i^{(L)}) = \left\{\begin{array}{l} 1, z_i^{(L)} > 0\\ 0, z_i^{(L)} < 0 \end{array}\right. \end{aligned}

Note that the derivative of ai(L)a_i^{(L)} is not defined at zi(L)=0z_i^{(L)} = 0 as g(ν)=max(0,ν)g(\nu)=\max(0, \nu) isn't continuously differentiable at this location. But in practice, backpropagation can still be applied, since pre-activation values zi(L)=0z_i^{(L)} = 0 are sufficiently rare and in these cases the derivative can be replaced by the left- or right-sided derivative.

The last term of the cost function gradient derives the activation function with respect to the weight. When evaluating this we'll find that this simply resolves to the activation value of the previous layer:

zi(L)wij(L)=wij(L)wij(L)aj(L1)+bi(L)=aj(L1)\frac{\partial z_i^{(L)}}{\partial w_{i j}^{(L)}} = \frac{\partial}{\partial w_{i j}^{(L)}} w_{ij}^{(L)} \cdot a_j^{(L - 1)} + b_i^{(L)} = a_j^{(L - 1)}

This means we can express the cost function gradient as

Cwij(L)=δijaj(L1).\frac{\partial C}{\partial w_{ij}^{(L)}} = \delta_{i j} \cdot a_j^{(L - 1)}.

When expressing this in vector form we can define the error term of the cost function gradient as:

δ(L)=aCg(z(L)).\bm{\delta}^{(L)} = \nabla_{\bm{a}} C \odot g' (\bm{z}^{(L)}).

Now we can express the weight update for the last update in vector form too:

Δw(L)=αδ(L) a(L1)\Delta \bm{w}^{(L)} = - \alpha \cdot \bm{\delta}^{(L)} \odot  \bm{a}^{(L - 1)}

Backpropagation For The Remaining Layers

After updating the weights for the last layer, we now can update the weights for all previous layers too. First, we calculate the error term δij(l)\delta_{ij}^{(l)} for the previous layers:

δij(l)=Cai(l)aj(l)zj(l).\delta_{ij}^{(l)} = \frac{\partial C}{\partial a_i^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}}.

Here the terms are defined as follows:

Cai(l)=k=1nl+1wij(l+1)g(zi(l+1))Cai(l+1)aj(l)zj(l)=g(zi(l))\begin{aligned} \frac{\partial C}{\partial a_i^{(l)}} & = \sum_{k = 1}^{n_{l + 1}} w_{i j}^{(l + 1)} \cdot g' (z_i^{(l + 1)}) \cdot \frac{\partial C}{\partial a_i^{(l + 1)}} \\ \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} & = g' (z_i^{(l)}) \end{aligned}

To calculate the first part of the error term we need to know the following layer's weights and pre-activation value and the following layer's cost gradient with respect to its activation value. This means that we have to process all layers in reverse, starting with the last one and working our way forward to the first layer. We can express all of this as a vector equation in order to calculate the error term for the whole layer:

δ(l)=((W(l+1))  δ(l+1))g(z(l)).\bm{\delta}^{(l)} = ((\bm{W}^{(l + 1)})^{\top} \cdot   \bm{\delta}^{(l + 1)}) \odot g' (\bm{z}^{(l)}) .

Using this error term we now can update the weights just as we did for the last layer.

Minibatch Processing

Now we are on the finishing straight. We've learned what neural networks are, about their basic principles and the math behind them, how they work, and how they are trained. There's just one bit missing. Look closely at the calculation of the cost:

C(X,W)=12Nx XCx(y(x),f (x))C (\bm{X}, \bm{W}) = \frac{1}{2 N} \sum_{\bm{x} \in  \bm{X}} C_{\bm{x}} (\bm{y} (\bm{x}), \bm{f}  (\bm{x}))

This equation implies that we calculate the estimation error for the complete dataset for every training iteration. The trouble here is that typical datasets can grow really, really large. The ImageNet dataset, for example, which led to a breakthrough in image recognition, contains more than 14 million images. And this isn't even considered a particularly big dataset. It is no surprise that calculating the estimation error for the complete dataset becomes the bottleneck during training really fast.

So what do we do? We cheat! Instead of calculating the approximation error we just estimate it by calculating it for a subset of our dataset. We could go as far as only using a single training set entry to estimate the approximation error. This would be called Stochastic Gradient Descent (SDG). Or we could use a batch of entries in order to achieve Minibatch Gradient Descent. Which one to use? It's a tradeoff. A smaller batch size is less accurate in estimating the approximation error than a larger batch, but it's also faster. And it turns out that the loss of accuracy can actually be beneficial as it tends to prevent overfitting the model (making the model just memorize the dataset).

Closing Thoughts

If you made it this far: congratulations! It was a lot of math but hopefully you're at least a little closer to understanding the math behind neural networks. And if you're hungry for more, there's a ton of interesting ideas, algorithms, and improvements that build upon these foundations. And lots more math, of course!

If this was helpful to you or you found a typo or mathematical inconsistency, feel free to shoot me an email!


I've refrained from mentioning all sources and references throughout this article. Still, none of this is really my work but rather my summary and presentation. The real work can be found in these resources:

Historical works and background:


Other helpful resources: