Leaky ReLU: improving traditional ReLU

The Leaky ReLU is a type of activation function which comes across many machine learning blogs every now and then. It is suggested that it is an improvement of traditional ReLU and that it should be used more often.

But how is it an improvement? How does Leaky ReLU work? In this blog, we'll take a look. We identify what ReLU does and why this may be problematic in some cases. We then introduce Leaky ReLU and argue why its design can help reduce the impact of the problems of traditional ReLU. Subsequently, we briefly look into whether it is actually better and why traditional ReLU is still in favor today.

After reading this tutorial, you will...

Understand how ReLU works.
See why using ReLU can be problematic at times.
How Leaky ReLU helps resolve these problems.

Let's take a look 🚀

Update 08/Feb/2021: ensure that article is up-to-date.

Brief recap: what is ReLU and how does it work?

Rectified Linear Unit, or ReLU, is one of the most common activation functions used in neural networks today. It is added to layers in neural networks to add nonlinearity, which is required to handle today's ever more complex and nonlinear datasets.

Each neuron computes a dot product and adds a bias value before the value is output to the neurons in the subsequent layer. These mathematical operations are linear in nature. This is not bad if we were training the model against a dataset that is linearly separable (in the case of classification) or where a line needs to be estimated (when regressing).

However, if data is nonlinear, we face problems. Linear neuron outputs ensure that the system as a whole, thus the entire neural network, behaves linearly. By consequence, it cannot handle such data, which is very common today: the MNIST dataset, which we used for showing how to build classifiers in Keras, is nonlinear - and it is one of the simpler ones!

Activation functions come to the rescue by adding nonlinearity. They're placed directly after the neural outputs and do nothing else but converting some input to some output. Because the mathematical functions used are nonlinear, the output is nonlinear - which is exactly what we want, since now the system behaves nonlinearly and nonlinear data is supported!

Note that although activation functions are pretty much nonlinear all the time, it's of course also possible to use the identity function \(f(x) = x\) as an activation function. It would be pointless, but it can be done.

Now ReLU. It can be expressed as follows:

\begin{equation} f(x) = \begin{cases} 0, & \text{if}\ x < 0 \\ x, & \text{otherwise} \\ \end{cases} \end{equation}

And visualized in this way:

For all values \(\geq 0\), it behaves linearly, but essentially behaves nonlinearly by outputting zeroes for all negative inputs.

Hence, it can be used as a nonlinear activation function.

It's grown very popular and may be the most popular activation used today - it is more popular than the older Sigmoid and Tanh activation functions - for the reason that it can be computed relatively inexpensively. Computing ReLU is equal to computing \(ReLU(x) = max(0, x)\), which is much less expensive than the exponents or trigonometric operations necessary otherwise.

Problems with ReLU

However, it's not the silver bullet and every time you'll run into trouble when using ReLU. It doesn't happen often - which makes it highly generalizable across machine learning domains and machine learning problems - but you may run into some issues.

Firstly, ReLU is not continuously differentiable. At \(x = 0\), the breaking point between \(x\) and 0, the gradient cannot be computed. This is not too problematic, but can very lightly impact training performance.

Secondly, and more gravely, ReLU sets all values < 0 to zero. This is beneficial in terms of sparsity, as the network will adapt to ensure that the most important neurons have values of > 0. However, this is a problem as well, since the gradient of 0 is 0 and hence neurons arriving at large negative values cannot recover from being stuck at 0. The neuron effectively dies and hence the problem is known as the dying ReLU problem. You're especially vulnerable to it when your neurons are not initialized properly or when your data is not normalized very well, causing significant weight swings during the first phases of optimizing your model. The impact of this problem may be that your network essentially stops learning and underperforms.

Introducing Leaky ReLU

What if you caused a slight but significant information leak in the left part of ReLU, i.e. the part where the output is always 0?

This is the premise behind Leaky ReLU, one of the possible newer activation functions that attempts to minimize one's sensitivity to the dying ReLU problem.

Mathematically, it is defined as follows (Maas et al., 2013):

\begin{equation} f(x) = \begin{cases} 0.01x, & \text{if}\ x < 0 \\ x, & \text{otherwise} \\ \end{cases} \end{equation}

Leaky ReLU can be visualized as follows:

If you compare this with the image for traditional ReLU above, you'll see that for all \(inputs < 0\), the outputs are slightly descending. The thesis is that these small numbers reduce the death of ReLU activated neurons. This way, you'll have to worry less about the initialization of your neural network and the normalization of your data. Although these topics remain important, they are slightly less critical.

Does Leaky ReLU really work?

Next, the question: does Leaky ReLU really work? That is, does it really reduce the likelihood that your ReLU activating network dies off?

Let's try and find out.

Nouroz Rahman isn't convinced:

However, I personally don’t think Leaky ReLU provides any advantage over ReLU, holistically, considering both training and accuracy although some papers claimed to achieve that. That’s why Leaky ReLU is trivial in deep learning and honestly speaking, I have never used it or thought of the necessity of using it.

Nouroz Rahman

In a 2018 study, Pedamonti argues that Leaky ReLU and ReLU performance on the MNIST dataset is similar. Even though the problem of dying neural networks may now be solved theoretically, it can be the case that it simply doesn't happen very often - and that in those cases, normal ReLU works as well. "It's simple, it's fast, it's standard" - someone argued. And I tend to agree.

Summary

In this blog post, we've seen what challenges ReLU-activated neural networks. We also introduced the Leaky ReLU which attempts to resolve issues with traditional ReLU that are related to dying neural networks. We can conclude that in many cases, it seems to be the case that traditional / normal ReLU is relevant, and that Leaky ReLU benefits in those cases where you suspect your neurons are dying. I'd say: use ReLU if you can, and other linear rectifiers if you need to.

Happy engineering! 😊

References

Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. Retrieved from https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc

What are the advantages of using Leaky Rectified Linear Units (Leaky ReLU) over normal ReLU in deep learning? (n.d.). Retrieved from https://www.quora.com/What-are-the-advantages-of-using-Leaky-Rectified-Linear-Units-Leaky-ReLU-over-normal-ReLU-in-deep-learning

Pedamonti, D. (2018). Comparison of non-linear activation functions for deep neural networks on MNIST classification task. arXiv preprint arXiv:1804.02763.

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.

Getting started

Foundation models

Learn how large language models and other foundation models are working and how you can train open source ones yourself.

Keras

Keras is a high-level API for TensorFlow. It is one of the most popular deep learning frameworks.

TensorFlow

TensorFlow is the most popular deep learning framework. It is is used by many companies.

PyTorch

PyTorch is a deep learning framework which is popular for its ease of use and flexibility.

Machine learning theory

Read about the fundamentals of machine learning, deep learning and artificial intelligence.

Transformer architectures

Emerging since 2017, Transformer architectures are part of the state of the art in deep learning.

Most recent articles

January 8, 2024

LLM in a Flash: improving memory requirements of large language models

January 2, 2024

What is Retrieval-Augmented Generation?

December 27, 2023

Building a zero-shot image classifier with CLIP and HuggingFace Transformers

December 27, 2023

In-Context Learning: what it is and how it works

December 22, 2023

CLIP: how it works, how it's trained and how to use it

Article tags

activation functions

deep learning

neural networks

relu

Connect on social media

Connect with me on LinkedIn

To get in touch with me, please connect with me on LinkedIn. Make sure to write me a message saying hi!

See my work on GitHub

My work is available on GitHub. Feel free to check it out and see if it can be of use to you!

Side info

The content on this website is written for educational purposes. In writing the articles, I have attempted to be as correct and precise as possible. Should you find any errors, please let me know by creating an issue or pull request in this GitHub repository.

All text on this website written by me is copyrighted and may not be used without prior permission. Creating citations using content from this website is allowed if a reference is added, including an URL reference to the referenced article.

If you have any questions or remarks, feel free to get in touch.

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation.

Montserrat and Source Sans are fonts licensed under the SIL Open Font License version 1.1.

Mathjax is licensed under the Apache License, Version 2.0.

Leaky ReLU: improving traditional ReLU

October 15, 2019 by Chris

Brief recap: what is ReLU and how does it work?

Problems with ReLU

Introducing Leaky ReLU

Does Leaky ReLU really work?

Summary

References

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.

Getting started

Foundation models

Keras

TensorFlow

PyTorch

Machine learning theory

Transformer architectures

Most recent articles

January 8, 2024

LLM in a Flash: improving memory requirements of large language models

January 2, 2024

What is Retrieval-Augmented Generation?

December 27, 2023

Building a zero-shot image classifier with CLIP and HuggingFace Transformers

December 27, 2023

In-Context Learning: what it is and how it works

December 22, 2023

CLIP: how it works, how it's trained and how to use it

Article tags

Most popular articles

February 18, 2020

How to use K-fold Cross Validation with TensorFlow 2 and Keras?

December 28, 2020

Introduction to Transformers in Machine Learning

December 27, 2021

StyleGAN, a step-by-step introduction

July 17, 2019

This Person Does Not Exist - how does it work?

October 26, 2020

Your First Machine Learning Project with TensorFlow 2.0 and Keras

Connect on social media

Connect with me on LinkedIn

See my work on GitHub

Side info

Getting started

Foundation models

Keras

TensorFlow

PyTorch

Machine learning theory

Transformer architectures

Most popular articles

February 18, 2020

How to use K-fold Cross Validation with TensorFlow 2 and Keras?

December 28, 2020

Introduction to Transformers in Machine Learning

December 27, 2021

StyleGAN, a step-by-step introduction

July 17, 2019

This Person Does Not Exist - how does it work?

October 26, 2020

Your First Machine Learning Project with TensorFlow 2.0 and Keras

Side info

Connect with me on LinkedIn

See my work on GitHub