How to use Batch Normalization with Keras?

The inputs to individual layers in a neural network can be normalized to speed up training. This process, called Batch Normalization, attempts to resolve an issue in neural networks called internal covariate shift.

But how does it work? And how does it work in terms of code - with the Keras deep learning framework? We'll take a look at these questions in this blog.

Firstly, we'll provide a recap on Batch Normalization to ensure that you've gained some conceptual understanding, or that it has been revived. This includes a discussion on the problem, why it occurs during training, and how Batch Normalization may resolve it.

Then, we move on to the actual Keras part - by providing you with an example neural network using Batch Normalization to learn classification on the KMNIST dataset. Each step of the code which creates the neural network is explained so that you understand how it works.

Are you ready? Let's go! :)

Recap: about Batch Normalization

Before we start coding, let's take a brief look at Batch Normalization again. We start off with a discussion about internal covariate shift and how this affects the learning process. Subsequently, as the need for Batch Normalization will then be clear, we'll provide a recap on Batch Normalization itself to understand what it does.

Training a supervised ML model

Suppose that you have this neural network, which is composed of Dropout neurons:

Following the high-level supervised machine learning process, training such a neural network is a multi-step process:

Feeding your training data to the network in a feedforward fashion, in which each layer processes your data further.
This leads to a prediction for every sample.
This prediction can be compared to the actual target value (the "ground truth"), to see how well the model performs.
How well, or strictly speaking how bad the model performs is reflected in the loss value.
Improving the neural network means firstly, identifying the necessary change in the weights of each neuron with respect to the loss value, and possibly with respect to the intermediate layers as well.
Secondly, by means of an optimizer like gradient descent or an adaptive optimizer, the weights get updated based on these necessary changes (also called gradients).

Internal covariate shift

Now take a look at the neural network from a per-layer point of view. Each layer takes some input, performs a linear operation using the input vector and the weights vector, feeds the data into a nonlinear activation function, and passes the data to the next layer or the output.

Neural networks train fast if the distribution of the data remains the same, and especially if it is normalized to the range of \((\mu = 0, \sigma = 1)\). This is not the case when no Batch Normalization is applied: by training the network (i.e. changing the weights of the individual neurons), the outputs for every layer change, which means that the distribution of input data for every layer will change during every iteration.

We call this internal covariate shift (Ioffe & Szegedy, 2015). It is bad, because it can slow down learning. Fortunately, it can be avoided - and Batch Normalization is a way of doing so.

Batch Normalization normalizes layer inputs on a per-feature basis

As we saw before, neural networks train fast if the distribution of the input data remains similar over time. Batch Normalization helps you do this by doing two things: normalizing the input value and scaling and shifting it.

Normalizing the value: \(\hat{x}_B^{(k)} \leftarrow \frac{x_B{ ^{(k)} } - \mu_B^{(k)}}{\sqrt{ \sigma^2{ _B^{(k)} } + \epsilon}}\)

Every input \(x_B{ ^{(k)}}\) is normalized by first subtracting input sample mean \( \mu_B^{(k)} \) and then dividing by \( \sqrt{ \sigma^2{ _B^{(k)} } + \epsilon} \), which is the square root of the variance of the input sample, plus some \( \epsilon \). Do note:

Whenever we mention "sample" we mean just one dimension of the feature vectors in our minibatch, as normalization is done per dimension. This means, for e.g. the feature vector \([2.31, 5.12, 0.12]\), Batch Normalization is applied three times, so once per dimension.
Contrary to true \((0, 1)\) normalization, a small value represented by \(\epsilon\) is added to the square root, to ensure that the denominator is nonzero - avoiding division by zero errors.

Scaling and shifting: \(y_i \leftarrow \gamma\hat{x} _B ^{(k)} + \beta\).

With some activation functions (such as the Sigmoid activation function), normalizing inputs to have the \((0, 1)\) distribution may result in a different issue: they'll activate almost linearly as they primarily activate in the linear segment of the activation function.

Here, I explain this in more detail, and why this needs to be avoided.

By scaling the value with some \(\gamma\) and shifting the value with some \(\beta\), this problem can be avoided. The values for these are learnt during training.

Batch Normalization in the Keras API

In the Keras API (TensorFlow, n.d.), Batch Normalization is defined as follows:

keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)

Put simply, Batch Normalization can be added as easily as adding a BatchNormalization() layer to your model, e.g. with model.add. However, if you wish, local parameters can be tuned to steer the way in which Batch Normalization works. These parameters are as follows:

Axis: the axis of your data which you like Batch Normalization to be applied on. Usually, this is not of importance, but if you have a channels-first Conv layer, it must be set to 1.
Momentum: the momentum that is to be used on the moving mean and the moving variance.
Epsilon: the value for \(\epsilon\) that is to be used in the normalization step, to avoid division by zero (also see the Batch Normalization formula above).
Center: if True, the value for \(\beta\) is used; if False, it's ignored.
Scale: if True, the value for \(\gamma\) is used; if False, it's ignored.
Beta initializer, regularizer and constraint: these define the Keras initializer, regularizer and constraints for the beta i.e. the center factor. They give you more control over how the network learns the values during training.
Gamma initializer, regularizer and constraint: these define the Keras initializer, regularizer and constraints for the gamma i.e. the scaling factor. They give you more control over how the network learns the values during training.
Moving mean initializer, moving variance initializer: these define the Keras initializers for the moving mean and moving variance.

Why the moving mean and variance, you say?

This has to do with how Batch Normalization works during training time versus inference time.

During training time, there's a larger minibatch available which you can use to compute sample mean and sample variance.

However, during inference, the sample size is one. There's no possibility to compute an average mean and an average variance - because you have one value only, which may be an outlier. Having the moving mean and moving variance from the training process available during inference, you can use these values to normalize during inference. Smart and simple, but a great fix for this issue :)

Today's model

Let's take a look at the model we're going to create today :) First, we'll see what dataset we're going to use - being the KMNIST datset. This is followed by a discussion about the model we'll be creating in this tutorial.

The dataset

For the dataset, we're using the KMNIST dataset today:

It is a drop-in replacement for the MNIST dataset:

The training set has 60.000 samples;
The testing set has 10.000 samples;
Each sample is a 28x28 pixels image;
Each sample belongs to one of 10 target classes.

Using the `extra-keras-datasets` module

We use the extra-keras-datasets module to load our dataset. This module, which we created and discussed in a different blog post, attempts to replicate the keras.datasets way of loading data into your machine learning projects, albeit with different datasets. As we believe that making more datasets easily available boosts adoption of a framework, especially by people who are just starting out, we've been making available additional datasets for Keras through this module. It may be worthwhile to check it out separately!

Installing this module is required if you wish to run the model (if you don't, you may also replace it with keras.datasets.mnist), and can be done very easily:

pip install extra-keras-datasets

With one line, a dataset can be imported into your model:

from extra_keras_datasets import kmnist

And subsequently loading the data into the particular variables is also easy:

(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')

The model architecture

This is the architecture of today's model, which we generated with Net2Vis (Bäuerle & Ropinski, 2019):

Our model has two convolutional blocks followed by two dense layers:

Each convolutional block contains a Conv2D layer and a MaxPooling2D layer, whose outputs are normalized with BatchNormalization layers.
The convolutional blocks will learn the feature maps, and will thus learn to generate activations for certain parts within the images, such as edges.
With a Flatten layer, the contents of the feature maps are converted into a one-dimensional Tensor that can be used in the Dense layers.
The Dense layers together produce the classification. The input to the final Dense layer from the first one is also BatchNormalized.

Keras implementation

Let's now see how we can implement this model with Keras :) We'll be using the TensorFlow 2.0 approach to Keras, which is the currently preferred way of using the library. This unfortunately means that it's no longer possible to use Keras with Theano or CNTK. However, if you wish to still use it, it may still work to replace tensorflow.keras with keras, i.e. the original library.

Creating the model is a multi-step process:

First, we import everything that we need.
Then, we set the model configuration.
This is followed by loading and preparing the dataset.
Subsequently, we define the architecture of our model in line with what we defined above.
Then, we compile the model and fit the data, i.e. start the training process.
Once this finishes, we generate evaluation metrics based on our testing set.

Let's go! Open your Explorer or Finder, navigate to some folder, and create a Python file, e.g. model_batchnorm.py. Next, open this file in your code editor - so that we can start coding :)

Model imports

These are our model imports:

from extra_keras_datasets import kmnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

We can describe them as follows:

We'll import the main tensorflow library so that we can import Keras stuff next.
Then, from models, we import the Sequential API - which allows us to stack individual layers nicely and easily.
Then, from layers, we import Dense, Flatten, Conv2D, MaxPooling2D and BatchNormalization - i.e., the layers from the architecture that we specified.
Finally, we import the kmnist dataset from the extra_keras_datasets library.

Model configuration

We can then set the configuration for our model:

# Model configuration
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1

The batch size of our minibatches is set to 250, which balances well between gradient preciseness and memory requirements. We'll train for 25 epochs (which could be higher if you wish, just configure it to a different number :)) and tell the model that we have 10 classes that it can classify into - i.e., the 10 KMNIST classes. 20% of the training data will be used for validation purposes and with verbosity mode set to True, all output will be displayed on screen.

Loading & preparing data

The next step is loading the data. Given the simplicity with which extra-keras-datasets can be used, thanks to the original keras.datasets module, this is definitely easy:

# Load KMNIST dataset
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')

Subsequent processing of the data so that it is prepared for training is a bit more complex, but it is neither very difficult:

# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape

With this step, we obtain the shape of our input_train and input_test datasets, i.e. our features. We'll use the first to set the shape of our Keras input data next - which are image height (shape dim 1), image width (shape dim 2) and the number of channels (just one):

# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)

Channels have to be included because Keras expects them during training.

Next, because the data does not have yet the channels property, we'll have to reshape our data to include it there as well:

# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)

Now, the bulk of the work is done. We next convert the data to float32 format which presumably speeds up training:

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

And finally normalize the data:

# Normalize input data
input_train = input_train / 255
input_test = input_test / 255

We're now ready to define the architecture.

Defining the model architecture

Since we already discussed the architecture of our model above, its components won't be too surprising. However, what is still lacking is the actual code for our architecture - so let's write it now and explain it afterwards:

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(no_classes, activation='softmax'))

What this code does is create an instance of a model based on the Sequential API. Subsequently, the convolutional, pooling, batch normalization and Dense layers are stacked with model.add.

Some things we haven't included in the architectural discussion before:

Activation functions: for the intermediate layers: we use the ReLU activation function in our convolutional and Dense layers, except for the last one. ReLU is the de facto standard activation function used today and we hence use it in our model. Given the small size of our dataset, we omit applying He init, which is preferred over Xavier/Glorot init when using ReLU.
Activation function for the final layer: in this layer, we're using the Softmax activation function, which generates a probability distribution over the target classes, from which we can select the most likely class.

Model compilation & fitting data

The next step is model compilation:

# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])

As you can see, model compilation is essentially instantiating the model architecture we defined before with the model configuration we set before. We use sparse categorical crossentropy loss, which combines nicely with our integer target values - so that we don't have to convert these into categorical format before we start training. To optimize the model, we use the Adam optimizer, and add accuracy as an additional metric.

Then, we fit the data to our model, a.k.a. starting the training process:

# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=batch_size,
            epochs=no_epochs,
            verbose=verbosity,
            validation_split=validation_split)

We fit the input training set with its corresponding targets, and train according to the preconfigured batch_size and no_epochs, with verbosity mode set to on and the validation_split set as before (i.e., to 20%).

Note that the history object can be used for visualizing the training process / the improvements over epochs later.

Generating evaluation metrics

The final step is generating evaluation metrics with our test set, to see whether our model generalizes to unseen data:

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

Now, we can start training! :)

Full model code

Should you wish instead to obtain the full code for the model at once, here you go :)

from extra_keras_datasets import kmnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

# Model configuration
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load KMNIST dataset
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')

# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape 

# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)

# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize input data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(no_classes, activation='softmax'))

# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=batch_size,
            epochs=no_epochs,
            verbose=verbosity,
            validation_split=validation_split)

# Generate generalization metric  s
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

Results

To start training, open up a terminal which has the required software dependencies installed (i.e. tensorflow 2.0+ and the extra-keras-datasets module), cd to the folder where your Python file is located, and run it with e.g. python model_batchnorm.py.

Most likely, the training process will then begin, and you should see the test results once it finishes. Here are the results over the epochs shown visually. They were generated by means of the history object (note that you must add extra code to make this work):

As you can see, the model performs well. Obviously, for practical settings, this will be different as your data set is likely much more complex, but I'm curious whether Batch Normalization will help ensure faster convergence in your models! Please let me know in the comments section below :)

Summary

In this blog post, we've looked at how to apply Batch Normalization in your Keras models. This included a discussion about the concept of internal covariate shift and why this may slow down the learning process. Additionally, we provided a recap on the concept of Batch Normalization and how it works, and why it may reduce these issues.

This was followed by a Keras implementation using the TensorFlow 2.0 way of working. The full code was split into small blocks which contained an explanation. This way, I hope that you understood well why I coded what I coded.

A long story short: I hope you've learnt something today! If you did, I'd love to know what, and you can leave a comment below. Please do the same if you have questions left or remarks that you wish to express. Thank you for reading MachineCurve today and happy engineering! 😊

References

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Bäuerle, A., & Ropinski, T. (2019). Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations. arXiv preprint arXiv:1902.04394.

MachineCurve. (2020, January 14). What is Batch Normalization for training neural networks? Retrieved from https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D. (2018). Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718. Retrieved from https://arxiv.org/abs/1812.01718

TensorFlow. (n.d.). tf.keras.layers.BatchNormalization. Retrieved from https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.

Getting started

Foundation models

Learn how large language models and other foundation models are working and how you can train open source ones yourself.

Keras

Keras is a high-level API for TensorFlow. It is one of the most popular deep learning frameworks.

TensorFlow

TensorFlow is the most popular deep learning framework. It is is used by many companies.

PyTorch

PyTorch is a deep learning framework which is popular for its ease of use and flexibility.

Machine learning theory

Read about the fundamentals of machine learning, deep learning and artificial intelligence.

Transformer architectures

Emerging since 2017, Transformer architectures are part of the state of the art in deep learning.

Most recent articles

January 8, 2024

LLM in a Flash: improving memory requirements of large language models

January 2, 2024

What is Retrieval-Augmented Generation?

December 27, 2023

Building a zero-shot image classifier with CLIP and HuggingFace Transformers

December 27, 2023

In-Context Learning: what it is and how it works

December 22, 2023

CLIP: how it works, how it's trained and how to use it

Article tags

batch normalization

deep learning

keras

machine learning

neural network

training process

Connect on social media

Connect with me on LinkedIn

To get in touch with me, please connect with me on LinkedIn. Make sure to write me a message saying hi!

See my work on GitHub

My work is available on GitHub. Feel free to check it out and see if it can be of use to you!

Side info

The content on this website is written for educational purposes. In writing the articles, I have attempted to be as correct and precise as possible. Should you find any errors, please let me know by creating an issue or pull request in this GitHub repository.

All text on this website written by me is copyrighted and may not be used without prior permission. Creating citations using content from this website is allowed if a reference is added, including an URL reference to the referenced article.

If you have any questions or remarks, feel free to get in touch.

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation.

Montserrat and Source Sans are fonts licensed under the SIL Open Font License version 1.1.

Mathjax is licensed under the Apache License, Version 2.0.

How to use Batch Normalization with Keras?

January 15, 2020 by Chris

Recap: about Batch Normalization

Training a supervised ML model

Internal covariate shift

Batch Normalization normalizes layer inputs on a per-feature basis

Batch Normalization in the Keras API

Today's model

The dataset

Using the extra-keras-datasets module

The model architecture

Keras implementation

Model imports

Model configuration

Loading & preparing data

Defining the model architecture

Model compilation & fitting data

Generating evaluation metrics

Full model code

Results

Summary

References

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.

Getting started

Foundation models

Keras

TensorFlow

PyTorch

Machine learning theory

Transformer architectures

Most recent articles

January 8, 2024

LLM in a Flash: improving memory requirements of large language models

January 2, 2024

What is Retrieval-Augmented Generation?

December 27, 2023

Building a zero-shot image classifier with CLIP and HuggingFace Transformers

December 27, 2023

In-Context Learning: what it is and how it works

December 22, 2023

CLIP: how it works, how it's trained and how to use it

Article tags

Most popular articles

February 18, 2020

How to use K-fold Cross Validation with TensorFlow 2 and Keras?

December 28, 2020

Introduction to Transformers in Machine Learning

December 27, 2021

StyleGAN, a step-by-step introduction

July 17, 2019

This Person Does Not Exist - how does it work?

October 26, 2020

Your First Machine Learning Project with TensorFlow 2.0 and Keras

Connect on social media

Connect with me on LinkedIn

See my work on GitHub

Side info

Getting started

Foundation models

Keras

TensorFlow

PyTorch

Machine learning theory

Transformer architectures

Most popular articles

February 18, 2020

How to use K-fold Cross Validation with TensorFlow 2 and Keras?

December 28, 2020

Introduction to Transformers in Machine Learning

December 27, 2021

StyleGAN, a step-by-step introduction

July 17, 2019

This Person Does Not Exist - how does it work?

October 26, 2020

Your First Machine Learning Project with TensorFlow 2.0 and Keras

Side info

Connect with me on LinkedIn

See my work on GitHub

Using the `extra-keras-datasets` module