Training neural networks is an art rather than a process with a fixed outcome. You don’t know whether you’ll end up with working models, and there are many aspects that may induce failure for your machine learning project.

However, over time, you’ll also learn a certain set of brush strokes which significantly improve the odds that you’ll *succeed*.

Even though this may sound weird (it did when I started my dive into machine learning theory), I think that the above description is actually true. Once you dive in, there will be a moment when all the pieces start coming together.

In modern neural network theory, Batch Normalization is likely one of the concepts that you’ll encounter during your quest for information.

It has something to do with *normalizing* based on *batches of data*… right? Yeah, but that’s actually repeating the name in different words.

Batch Normalization, in fact, helps you overcome a phenomenon called **internal covariate shift**. What is this shift, and how does Batch Normalization work? We’ll answer those questions in this blog.

To be precise: we’ll kick off by exploring the concept of an internal covariate shift. What is it? How is it caused? Why does it matter? These are the questions that we’ll answer.

It is followed by the introduction of Batch Normalization. Here, we’ll also take a look at what it is, how it works, what it does and why it matters. This way, you’ll understand how it can be used to **speed up your training**, or to even save you from situations with **non-convergence**.

Are you ready? Let’s go!

## Internal covariate shift: a possible explanation of slow training and non-convergence

Suppose that you have a neural network, for example one that has been equipped with Dropout neurons.

As you might recall from the high-level supervised machine learning process, training a neural network includes a *feedforward operation* on your training set. During this operation, the data is fed to the neural network, which generates a prediction for each sample that can be compared to the *target data*, a.k.a. the ground truth.

This results in a loss value that is computed by some loss function.

Based on the loss function, backpropagation will compute what is known as the *gradient* to improve the loss, while gradient descent or an adaptive optimizer will actually change the weights of the neurons of your neural network. Based on this change, the model is expected to perform better during the next iteration, in which the process is repeated.
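
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under strong assumptions rather than a real training setup: the "network" is a single weight `w`, the loss is the mean squared error, and the data, the learning rate and all function names are made up for this example.

```python
# Toy model: y_hat = w * x, trained with mean squared error.
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]  # ground truth follows y = 2x


def forward(w, xs):
    """Feedforward pass: one prediction per sample."""
    return [w * x for x in xs]


def mse_loss(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def gradient(w, xs, ys):
    """Derivative of the MSE loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(ys)


w = 0.0
learning_rate = 0.05
losses = []
for step in range(20):
    losses.append(mse_loss(forward(w, inputs), targets))  # loss for this iteration
    w -= learning_rate * gradient(w, inputs, targets)     # gradient descent update

print(f"learned w = {w:.4f}, first loss = {losses[0]:.2f}, last loss = {losses[-1]:.6f}")
```

Each iteration repeats the same three steps: predict, measure the loss, and nudge the weight against the gradient. A real network does the same thing for many weights at once.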

### Changing input distributions

Now, let’s change your viewpoint. Most likely, you’ll have read the previous section while visualizing the neural network as a whole. Perfectly fine, as this was intended, but now focus on the network as if it is *a collection of stacked, but individual, layers*.

Each layer takes some input, transforms this input through interaction with its weights, and outputs the result, to be consumed by the first layer downstream. Obviously, this is not true for the input layer (with the original sample as input) and the output layer (with no subsequent layer), but you get the point.

Now suppose that we feed the entire training set to the neural network. The first layer will *transform* this data into *something else*. Statistically, however, this is also a *sample*, which thus has a sample mean and a sample standard deviation. This process repeats itself for each individual layer: the input data can be represented as some statistical sample with mean \(\mu\) and standard deviation \(\sigma\).

### Internal covariate shift

Now do note two things:

- Firstly, the argument above means by consequence that the distribution of input data for some particular layer depends on *all the interactions happening in all the upstream layers*.
- Secondly, this means by consequence that *a change in how one or more of the upstream layer(s) process data* will change the *input distribution* for this layer.

…and what happens when you train your model? Indeed, you change *how the layers process data*, by changing their weights.

Ioffe & Szegedy (2015), in their paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” call this process the **“internal covariate shift”**. They define it as follows:

The change in the distribution of network activations due to the change in network parameters during training.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
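
To make this definition a bit more tangible, here is a minimal sketch in plain Python. It assumes a hypothetical one-neuron layer with a ReLU activation; the helper names `layer` and `describe` and all numbers are invented for illustration and do not come from the paper. It shows how the statistics of what the *next* layer receives move when the upstream weights change.

```python
import statistics


def layer(xs, weight, bias):
    """A one-neuron 'layer' followed by a ReLU activation."""
    return [max(0.0, weight * x + bias) for x in xs]


def describe(xs):
    """Sample mean and (population) standard deviation of a list of values."""
    return statistics.mean(xs), statistics.pstdev(xs)


data = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# Inputs seen by the second layer before and after an update to the first layer.
before = describe(layer(data, weight=1.0, bias=0.0))
after = describe(layer(data, weight=1.5, bias=0.5))

print("before update: mean=%.3f std=%.3f" % before)
print("after update:  mean=%.3f std=%.3f" % after)
```

The training data is identical in both cases; only the upstream parameters differ, yet the downstream layer is confronted with a different input distribution.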

### Why is this bad?

Put plainly and simply:

**It slows down training**.

If you were using a very strict approach towards defining a supervised machine learning model, you would for example say that machine learning produces *a function which maps some input to some output based on some learnt mapping, which ideally approximates the true, underlying mapping in your data*.

This is also true for each layer: each layer essentially is a function which learns to map some input to some output, so that the system as a whole maps the original input to the desired output.

Now imagine that you’re looking at the training process from some distance. Slowly but surely, each layer learns to represent the internal mapping and the system as a whole starts to show the desired behavior. Perfect, isn’t it?

Yes, except that you also see some oscillation during the process. Indeed, you see that the layers make *tiny* mistakes during training, because they expect the inputs to be of some kind, while they are slightly different. They do know how to handle this, as the changes are very small, but they have to readjust each time they encounter such a change. As a result, the process as a whole takes a bit longer.

The same is true for the actual machine learning process. The *internal covariate shift*, or the changing distributions of the input data for each hidden layer, means that each layer requires some extra time to learn the weights which allow the system as a whole to minimize the loss value of the entire neural network. In extreme cases, although this does not happen too often, this shift may even result in non-convergence, or the impossibility of learning the mapping as a whole. This especially occurs with datasets which have not been normalized and are by consequence a poor fit for ML.

## Introducing Batch Normalization

Speaking about such normalization: rather than leaving it to the machine learning engineer, can’t we (at least partially) fix the problem in the neural network itself?

That’s the thought process that led Ioffe & Szegedy (2015) to propose **Batch Normalization**: by normalizing the inputs to each layer to a learnt representation likely close to \((\mu = 0.0, \sigma = 1.0)\), the internal covariate shift is reduced substantially. As a result, it is expected that the speed of the training process is increased significantly.

But how does it work?

Let’s find out.

### Per-feature normalization on minibatches

The first important thing to understand about Batch Normalization is that it works on a per-feature basis.

This means that, for example, for feature vector \(\textbf{x} = [0.23, 1.26, -2.41]\), normalization is not performed equally for each dimension. Rather, each dimension is normalized individually, based on the sample parameters of the *dimension*.

The second important thing to understand about Batch Normalization is that it makes use of minibatches for performing the normalization process (Ioffe & Szegedy, 2015). It avoids the computational burden of using the entire training set, while assuming that minibatches approach the dataset’s sample distribution if sufficiently large. This is a very smart idea.
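
The assumption that a sufficiently large minibatch resembles the full dataset can be checked with a small experiment. The sketch below is only an illustration: the "dataset" is a hypothetical list of Gaussian values for a single feature, the seed and batch sizes are arbitrary, and `minibatch_mean` is a helper invented for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical values of one feature over a full training set.
dataset = [random.gauss(5.0, 2.0) for _ in range(10_000)]
full_mean = statistics.mean(dataset)


def minibatch_mean(batch_size):
    """Mean of one randomly drawn minibatch of the given size."""
    return statistics.mean(random.sample(dataset, batch_size))


for batch_size in (4, 32, 256, 2048):
    estimate = minibatch_mean(batch_size)
    print(f"batch size {batch_size:>4}: |minibatch mean - dataset mean| = "
          f"{abs(estimate - full_mean):.4f}")
```

Larger minibatches typically land closer to the dataset mean, although any single draw can deviate, which is exactly the noise that tiny batches introduce into the normalization.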

### Four-step process

Now, the algorithm. For each feature \(x_B^{(k)} \) in your feature vector \(\textbf{x}_B\) (which, for your hidden layers, doesn’t contain your features but rather the inputs for that particular layer), Batch Normalization normalizes the values with a four-step process on your minibatch \(B\) (Ioffe & Szegedy, 2015):

1. **Computing the mean of your minibatch:** \(\mu_B^{(k)} \leftarrow \frac{1}{m} \sum\limits_{i=1}^m x_{B,i}^{(k)}\).
2. **Computing the variance of your minibatch:** \({\sigma_B^{(k)}}^2 \leftarrow \frac{1}{m} \sum\limits_{i=1}^m ( x_{B,i}^{(k)} - \mu_B^{(k)})^2\).
3. **Normalizing the value:** \(\hat{x}_{B,i}^{(k)} \leftarrow \frac{x_{B,i}^{(k)} - \mu_B^{(k)}}{\sqrt{ {\sigma_B^{(k)}}^2 + \epsilon}}\).
4. **Scaling and shifting:** \(y_i^{(k)} \leftarrow \gamma^{(k)}\hat{x}_{B,i}^{(k)} + \beta^{(k)}\).
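
The four steps above can be sketched in plain Python. This is a minimal, unoptimized illustration rather than how a deep learning framework implements it: the function name `batch_norm`, the list-of-lists layout (one inner list per sample) and the default `eps=1e-5` are assumptions made for this example.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply the four steps to a minibatch given as a list of feature vectors.

    gamma and beta hold one scale and one shift value per feature.
    """
    m = len(batch)
    outputs = [[0.0] * len(batch[0]) for _ in range(m)]
    for k, feature_values in enumerate(zip(*batch)):  # one column per feature
        mean = sum(feature_values) / m                               # step 1
        variance = sum((x - mean) ** 2 for x in feature_values) / m  # step 2
        for i, x in enumerate(feature_values):
            x_hat = (x - mean) / math.sqrt(variance + eps)           # step 3
            outputs[i][k] = gamma[k] * x_hat + beta[k]               # step 4
    return outputs


# Four samples with two features that live on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])

for row in normalized:
    print([round(value, 3) for value in row])
```

Note how `zip(*batch)` walks over the features one at a time: each feature gets its own mean and variance, which is the per-feature behavior described earlier.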

#### Computing mean and variance

The first two steps are simple and form a common, required part of any normalization step: **computing the mean** \(\mu\) and **variance** \(\sigma^2\) of the \(k^{\text{th}}\) dimension of your minibatch sample \(x_B\).

#### Normalizing

These are subsequently used in the **normalization step**, after which the values have a mean of 0 and a variance of 1, as long as the samples in the minibatch have the same distribution and the value for \(\epsilon\) is neglected (Ioffe & Szegedy, 2015).

You may ask: indeed, this \(\epsilon\), why is it there?

It’s for numerical stability (Ioffe & Szegedy, 2015). If the variance \(\sigma^2\) were zero, one would get a *division by zero* error. This means that the model would become numerically unstable. The value for \(\epsilon\) resolves this by taking a very small but nonzero value to counter this effect.
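
A quick way to see this is to normalize a feature that is constant within the minibatch, so that its variance is exactly zero. The snippet below is a small sketch with made-up values and a hypothetical helper `normalize`. In plain Python the division raises a `ZeroDivisionError`; array libraries typically produce `nan` or `inf` instead, which is just as harmful.

```python
import math


def normalize(values, eps):
    """Normalize one feature of a minibatch; eps is added to the variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


constant_feature = [3.0, 3.0, 3.0, 3.0]  # variance is exactly zero

try:
    normalize(constant_feature, eps=0.0)
    failed = False
except ZeroDivisionError:
    failed = True  # 0.0 / 0.0 raises in plain Python

stable = normalize(constant_feature, eps=1e-5)
print("fails without epsilon:", failed)
print("with epsilon:", stable)
```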

#### Scaling and shifting

Now, finally, the fourth step: **scaling and shifting** the normalized input value. I can get why this is weird, as we already completed normalization in the third step.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation \(x^{(k)}\), a pair of parameters \(\gamma^{(k)}\), \(\beta^{(k)}\), which scale and shift the normalized value:

\(y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)}\)

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Linear regime of the nonlinearity? Represent the identity transform? What are these?

Let’s translate the rather academic English into plainer language.

First, the “linear regime of the nonlinearity”. Suppose that we’re using the Sigmoid activation function, which is a nonlinear activation function (a “nonlinearity”) and was still quite common in 2015, when the Ioffe & Szegedy paper was written.

It is defined as \(\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}\) and has a characteristic S-shaped curve.

Suppose that we’ve added it to some arbitrary layer.

*Without Batch Normalization*, the inputs of this layer do not have a distribution of approximately \((0, 1)\), and hence are likelier to take rather large values (e.g. \(2.5623423…\)).

Suppose that our layer does nothing but pass the data on (which makes our case simpler). The activation function then responds *nonlinearly* to those input values: for inputs to the activation function in the domain \([2, 4]\), the Sigmoid curve bends noticeably.

However, for inputs of \(\approx 0\), this is not the case: the outputs for the input domain of approximately \([-0.5, 0.5]\) barely bend and closely resemble a *linear function*. This largely removes the effect of the nonlinear activation, which can hurt the performance of our model, and might not be what we want!
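
This "linear regime" can be made visible numerically by comparing the Sigmoid with its tangent line at zero, \(0.5 + 0.25x\). The snippet is a small sketch; the helper names and the sample points are chosen purely for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def tangent_at_zero(x):
    """First-order (linear) approximation of the sigmoid around x = 0."""
    return 0.5 + 0.25 * x


for x in (-0.5, 0.0, 0.5, 2.0, 4.0):
    gap = abs(sigmoid(x) - tangent_at_zero(x))
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
          f"linear={tangent_at_zero(x):.4f}  gap={gap:.4f}")
```

Close to zero the gap between the Sigmoid and the straight line is tiny, while it grows quickly for larger inputs, where the curve bends and saturates.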

…and wait: didn’t we normalize to \((0, 1)\), meaning that the inputs to our activation function are likely in the domain \([-1, 1]\) for every layer? Oops.

This is why the authors introduce a scaling and shifting operation with some parameters \(\gamma\) and \(\beta\), with which the normalization can be adapted during training, in extreme cases even to “represent the identity transform” (a.k.a., what goes in, comes out again – entirely removing the Batch Normalization step).
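
To see how scaling and shifting can "represent the identity transform", consider what happens when \(\gamma\) equals the standard deviation of the minibatch (including \(\epsilon\)) and \(\beta\) equals its mean. The sketch below uses made-up values and a hypothetical single-feature helper `batch_norm_feature`. In practice these parameters are learnt rather than set by hand.

```python
import math


def batch_norm_feature(values, gamma, beta, eps=1e-5):
    """Batch Normalization for a single feature of a minibatch."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]


values = [0.5, 2.0, 3.5, 6.0]
eps = 1e-5
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Choosing gamma = sqrt(variance + eps) and beta = mean undoes the normalization.
recovered = batch_norm_feature(
    values, gamma=math.sqrt(variance + eps), beta=mean, eps=eps
)
print(recovered)
```

With these particular parameter values, what goes in comes out again, so the network keeps the option of not normalizing at all if that turns out to work best.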

The parameters are learnt during training, together with the other parameters (Ioffe & Szegedy, 2015).
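
Because the fourth step is a simple linear function of \(\gamma\) and \(\beta\), their gradients follow directly from the chain rule. The sketch below trains just these two parameters for a single feature with gradient descent. It assumes a mean squared error loss placed directly on the Batch Normalization output, and the targets, learning rate and step count are invented for this example; in a real network the gradient with respect to each output would arrive from the layers downstream.

```python
# Normalized activations of one feature for a minibatch, and the values that a
# (hypothetical) downstream loss would like to see instead.
x_hat = [-1.2, -0.4, 0.3, 1.3]
targets = [-1.4, 0.2, 1.6, 3.6]  # generated as 2 * x_hat + 1

gamma, beta = 1.0, 0.0
learning_rate = 0.1

for _ in range(500):
    outputs = [gamma * xh + beta for xh in x_hat]
    # Gradient of the mean squared error with respect to each output y_i.
    d_outputs = [2 * (y - t) / len(x_hat) for y, t in zip(outputs, targets)]
    # Chain rule: dL/dgamma = sum_i dL/dy_i * x_hat_i, dL/dbeta = sum_i dL/dy_i.
    d_gamma = sum(d * xh for d, xh in zip(d_outputs, x_hat))
    d_beta = sum(d_outputs)
    gamma -= learning_rate * d_gamma
    beta -= learning_rate * d_beta

print(f"gamma = {gamma:.4f}, beta = {beta:.4f}")
```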

### Continuing our small example

Now, let’s revisit our small example from above, with our feature vector \(\textbf{x} = [0.23, 1.26, -2.41]\).

Say that we use a minibatch approach with 2 samples per batch (a bit scant, I know, but it’s sufficient for the explanation), with another vector \(\textbf{x}_a = [0.56, 0.75, 1.00]\) in the set. Our Batch Normalization step would then go as follows (assuming \(\gamma = 1\) and \(\beta = 0\), and neglecting \(\epsilon\)):

Features | Mean | Variance | Input | Output |
---|---|---|---|---|
[0.23, 0.56] | 0.395 | 0.027 | 0.23 | -1.000 |
[1.26, 0.75] | 1.005 | 0.065 | 1.26 | 1.000 |
[-2.41, 1.00] | -0.705 | 2.907 | -2.41 | -1.000 |

As we can see, with \(\gamma = 1\) and \(\beta = 0\), our values are normalized to a mean of 0 and a variance of 1. With only two samples per minibatch and the \(\frac{1}{m}\) variance from the second step, each value lies exactly one standard deviation away from the mean, which is why every output is \(\pm 1\). A nonzero \(\epsilon\) term would shrink these outputs very slightly.
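
If you would like to check such numbers yourself, the computation for a two-sample minibatch fits in a few lines. The helper `normalize_pair` is invented for this sketch and assumes \(\gamma = 1\), \(\beta = 0\) and \(\epsilon = 0\). It also reports what happens when the variance is divided by \(m - 1\) instead of \(m\), since mixing up the two estimators is an easy mistake to make on batches this small.

```python
import math


def normalize_pair(a, b, eps=0.0, unbiased=False):
    """Normalize the first value of a two-sample minibatch (gamma=1, beta=0)."""
    mean = (a + b) / 2
    squared = (a - mean) ** 2 + (b - mean) ** 2
    variance = squared / (1 if unbiased else 2)  # divide by m-1 or by m
    return mean, variance, (a - mean) / math.sqrt(variance + eps)


x = [0.23, 1.26, -2.41]
x_a = [0.56, 0.75, 1.00]

for a, b in zip(x, x_a):
    mean, var, out = normalize_pair(a, b)
    _, var_u, out_u = normalize_pair(a, b, unbiased=True)
    print(f"{a:+.2f}: mean={mean:.3f}  var(1/m)={var:.4f} out={out:+.3f}  |  "
          f"var(1/(m-1))={var_u:.4f} out={out_u:+.3f}")
```

With the \(\frac{1}{m}\) variance, a two-sample minibatch always normalizes to \(\pm 1\); with the \(\frac{1}{m-1}\) estimator the same inputs come out at roughly \(\pm 0.707\).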

### The benefits of Batch Normalization

Theoretically, there are some assumed benefits when using Batch Normalization in your neural network (Ioffe & Szegedy, 2015):

- The model is less sensitive to hyperparameter choices. In particular, larger learning rates, which previously led to non-useful models, become acceptable.
- Weight initialization is a tad less important now.
- Dropout, which is used to add noise to benefit training, can in some cases be reduced or removed, since Batch Normalization itself has a regularizing effect.

### Batch Normalization during inference

While a minibatch approach speeds up the training process, it is “neither necessary nor desirable during inference” (Ioffe & Szegedy, 2015). When inferring e.g. the class for a new sample, you want the output to depend only on that sample, not on whichever other samples happen to share its batch. You therefore wish to normalize it based on the *entire* training set, as this produces better estimates and is computationally feasible.

Hence, during inference, the Batch Normalization step goes as follows:

\(\hat{x}^{(k)} \leftarrow \frac{x^{(k)} - \mu^{(k)}}{\sqrt{ {\sigma^{(k)}}^2 + \epsilon}}\)

Here, \(\mu^{(k)}\) and \({\sigma^{(k)}}^2\) are the mean and variance of feature \(k\) over \(X\), the full training data, rather than over some minibatch \(X_b\).
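
In code, the difference with training is that the statistics are computed once and then kept fixed. The sketch below follows the description above literally by taking the mean and variance over a (hypothetical, tiny) full training set for one feature; the helper names are made up. Many frameworks approximate these population statistics with moving averages collected during training instead, so treat this as an illustration of the idea rather than of any particular library.

```python
import math


def population_statistics(training_values):
    """Mean and variance of one feature over the full training data."""
    n = len(training_values)
    mean = sum(training_values) / n
    variance = sum((v - mean) ** 2 for v in training_values) / n
    return mean, variance


def batch_norm_inference(x, mean, variance, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a single new value with fixed, precomputed statistics."""
    return gamma * (x - mean) / math.sqrt(variance + eps) + beta


training_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature
mean, variance = population_statistics(training_values)

# The result depends only on the new sample, not on other samples in a batch.
result = batch_norm_inference(6.0, mean, variance)
print(f"mean={mean}, variance={variance}, normalized value={result:.4f}")
```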

## Summary

In this blog post, we’ve looked at the problem of a relatively slow or even non-converging training process, and noted that Batch Normalization may help reduce these issues with your neural network. By normalizing the distribution of the input data to \((0, 1)\), and doing so on a per-layer basis, Batch Normalization is theoretically expected to reduce what is known as the “internal covariate shift”, resulting in faster learning.

I hope you’ve learnt something from this blog post. If you did, please feel free to leave a comment in the comments box below – I’ll happily read and answer. Please do the same if you have any questions or when you have remarks. Thanks for reading MachineCurve today and happy engineering!

## References

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. *arXiv preprint arXiv:1502.03167*.

Reddit. (n.d.). Question about Batch Normalization. Retrieved from https://www.reddit.com/r/MachineLearning/comments/3k4ecb/question_about_batch_normalization/

Hi Chris,

Thank you for the clear explanation, I very much enjoy your blogs!

I have one question, is it possible to use batch normalization if my features don’t fit a Gaussian distribution? For now I have normalized them between 0 and 1 and fed them in a LSTM Model (1 LSTM and 1 dense layer atm), but can I use batch normalization between the LSTM and dense layer?

Best,

Laura

Hi Laura,

From what I understand about Batch Normalization, that should work fine. BN only changes the distribution of inputs to nodes directly downstream of the BN layers; in your case thus the _outputs_ of the LSTM layer. The distribution of your features (i.e. that of your Input layer) should therefore not matter. I’ve also performed a quick search on Google, and nothing seems to come up as to that being problematic.

Best,

Chris

Thanks!

Best of luck writing your thesis, I think! Using ML for it?

Best,

Chris