Beyond Swish: the LiSHT activation function

Deep neural networks perform linear operations to combine weight vectors with input vectors. The values that are the outputs of these combinations are subsequently fed to activation functions which map the linear input into nonlinear output.

The Rectified Linear Unit or ReLU activation function is very popular today. It outputs zero for all inputs lower than zero, and is linear otherwise (i.e. \(f(x) = x\) for all \(x \geq 0\)).

Nevertheless, it has some challenges - to which the Swish activation function was found to be a solution. As Swish has grown in popularity, studies have emerged that empirically investigate its effectiveness. Does it really result in better model performance? If not, why is this the case? And how could Swish itself be improved?

We'll take a look at these questions in this blog post. First, we recap - based on our earlier blog post about Swish - how Swish might improve model performance compared to traditional ReLU. Subsequently, we introduce challenges that were found empirically, before introducing a new activation function called LiSHT.

Ready? Let's go!

Update 17/Mar/2021: ensured that article is up to date for 2021. Added better formatting, fixed a few spelling issues and improved article metadata.

Recap: how Swish improves ReLU

If we wish to understand the challenges of the Swish activation function, we must first investigate how Swish improves ReLU in the first place. As we have seen in our Swish-related blog post, there are multiple reasons (Ramachandran et al., 2017):

Like ReLU, it is bounded below and unbounded above. This allows Swish to introduce both sparsity and non-congestion in the training process.
It's also smooth, compared to ReLU. Because of this, the Swish loss landscape is smooth as well, which allows the optimizer to experience less oscillation. This might ensure faster convergence.
Small negative values are not zeroed out, which may help you catch certain patterns in your dataset in a better way.

How the ReLU and Swish activations activate. They are really similar, but Swish is smooth and allows the model to capture small negative inputs.
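To make the difference concrete, here is a minimal sketch in plain Python that evaluates both functions on a few inputs. It assumes the common Swish variant \(x \cdot \sigma(x)\), i.e. with the scaling parameter fixed to 1; the function names are just illustrative choices.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    """Swish with the scaling parameter assumed to be 1: x * sigmoid(x)."""
    return x * sigmoid(x)


# Compare both activations on a few negative and positive inputs.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):+.4f}")
```

For the small negative input, ReLU returns exactly zero while Swish returns a small negative value, which is the behavior described in the last point above.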

Swish challenges

This does not mean that Swish is free of challenges. On the contrary - and this has everything to do with model optimization.

While Swish reportedly improves model performance (Ramachandran et al., 2017), it still does not allow you to avoid vanishing gradients, as argued by Roy et al. (2019). In their words, "the gradient diminishing problem is still present in case of Swish function".

But why is this the case?

We'll have to take a look at neural network optimization by means of gradient descent (or similar optimizers) combined with backpropagation.

It will be fairly simple to identify why even Swish might cause you to fall prey to these vanishing gradients.

Vanishing gradients?

Let's very briefly recap the vanishing gradients problem for the unaware reader. Suppose that we create a neural network with the Sigmoid activation function. Gradient descent, which is a first-order optimizer, will then - together with backprop - use the first-order derivative to compute the gradients and to perform the optimization procedure.

The activation function and its first-order derivative can be visualized as follows:

As you can see, computed gradients for Sigmoid will never be larger than \(0.25\) (the value reached at \(x = 0\)), and in many cases the gradients will be very small.
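If you want to check this bound yourself, the following sketch evaluates the Sigmoid derivative \(\sigma(x)(1 - \sigma(x))\) on a grid of inputs. The grid range and step size are arbitrary choices made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """First-order derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Evaluate the derivative on a grid from -10 to 10 (step 0.01, an arbitrary choice).
xs = [i / 100.0 for i in range(-1000, 1001)]
largest = max(sigmoid_grad(x) for x in xs)

print(f"largest gradient on the grid: {largest}")
print(f"gradient at x=5: {sigmoid_grad(5.0):.5f}")
```

The largest value is found at \(x = 0\), and the gradient drops off quickly as the input moves away from zero in either direction.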

Since optimizing multiple layers of a neural network essentially chains together computed gradients from loss value to layer, with all intermediate layers included, the gradients for upstream layers get really small, slowing down the learning process the more upstream you get. Adding more and more layers will thus essentially create a network that learns slowly or cannot even converge anymore - say hello to the vanishing gradients problem.
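A rough back-of-the-envelope sketch shows how quickly this adds up. It makes a strong simplifying assumption: every layer contributes only its activation derivative to the chain, at the most favorable Sigmoid value of 0.25, and the effect of the weights is ignored entirely. The helper name is hypothetical.

```python
def chained_gradient_bound(num_layers: int, max_local_grad: float = 0.25) -> float:
    """Upper bound on the product of activation derivatives across layers.

    Simplification: weights are ignored and every layer is assumed to
    contribute the largest possible local gradient.
    """
    return max_local_grad ** num_layers


for n in (1, 5, 10, 20):
    print(f"{n:2d} layers -> at most {chained_gradient_bound(n):.3e}")
```

Even in this best case, the factor contributed by the activations shrinks exponentially with depth, which is why the layers furthest from the loss learn the slowest.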

While Sigmoid is one of the worst activation functions in terms of the vanishing gradients problem, we experience a similar situation when applying the Swish activation function. Let's take a look.

Swish and vanishing gradients

We can generate the same plot for the Swish activation function (Serengil, 2018; Ramachandran et al., 2017):

Even though the vanishing gradients problem is much less severe in case of Swish, only inputs of roughly \(x \geq 1.28\) result in gradients of 1 or slightly higher (the derivative peaks at roughly 1.1). In any other case, the gradient is smaller than 1 in magnitude and will still cause the chain to shrink as more layers are added.
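The numbers behind this observation can be reproduced with a small sketch. It assumes Swish with the scaling parameter fixed to 1, so that the derivative is \(\sigma(x) + x \sigma(x)(1 - \sigma(x))\); the grid used to search for the maximum is an arbitrary choice.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def swish_grad(x: float) -> float:
    """Derivative of x * sigmoid(x): s + x * s * (1 - s)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


for x in (-2.0, 0.0, 1.0, 2.0, 5.0):
    print(f"x={x:+.1f}  swish'(x)={swish_grad(x):+.4f}")

# Search a grid from -10 to 10 (step 0.01) for the largest derivative.
xs = [i / 100.0 for i in range(-1000, 1001)]
peak = max(swish_grad(x) for x in xs)
print(f"largest derivative on the grid: {peak:.4f}")
```

For negative inputs and for small positive inputs, the derivative stays well below 1 in magnitude, so those activations still shrink the chained gradient.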

Hence, indeed - as Roy et al. (2019) argue: Swish does not fully avoid the vanishing gradients problem.

Introducing LiSHT

To reduce the impact of this problem, they introduce the LiSHT activation function, or the Linearly Scaled Hyperbolic Tangent. This activation function simply uses the tanh function and scales it linearly, as follows:

\(\text{LiSHT}(x) = x \cdot \tanh(x)\)
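In code, the formula is a one-liner. The sketch below is a plain Python version written for illustration, not an implementation taken from the paper or from a deep learning framework.

```python
import math


def lisht(x: float) -> float:
    """LiSHT: the input scaled by its own hyperbolic tangent."""
    return x * math.tanh(x)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  lisht={lisht(x):.4f}")
```

Because \(x\) and \(\tanh(x)\) always share the same sign, the output is never negative, and a negative input yields the same output as its positive counterpart.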

When we compare it with traditional ReLU and Swish, we get this plot:

And when we look at LiSHT in terms of the derivatives, this is what we see:

Essentially, LiSHT looks very much like Swish in terms of the first-order derivative. However, the range is expanded into the negative as well, which means that the vanishing gradient problem is reduced even further - at least in theory.
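The sketch below compares both derivatives numerically. It uses the product rule to obtain \(\tanh(x) + x(1 - \tanh^2(x))\) for LiSHT, and again assumes Swish with the scaling parameter fixed to 1; the sample inputs are arbitrary.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def swish_grad(x: float) -> float:
    """Derivative of x * sigmoid(x)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def lisht_grad(x: float) -> float:
    """Derivative of x * tanh(x): tanh(x) + x * (1 - tanh(x)^2)."""
    t = math.tanh(x)
    return t + x * (1.0 - t * t)


for x in (-3.0, -2.0, 0.0, 2.0, 3.0):
    print(f"x={x:+.1f}  swish'={swish_grad(x):+.4f}  lisht'={lisht_grad(x):+.4f}")
```

For clearly negative inputs, the Swish derivative is close to zero, while the LiSHT derivative has a magnitude of roughly 1. Note that the LiSHT derivative is still zero at exactly \(x = 0\).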

In their work, Roy et al. (2019) report, based on empirical testing, that the vanishing gradients problem is indeed reduced compared to Swish and traditional ReLU. Additional correlations between network learning and the shape of e.g. the LiSHT loss landscape were identified.

Even though the authors empirically tested LiSHT on various datasets (Car Evaluation, Iris, MNIST, CIFAR10, CIFAR100 and Twitter140) with multiple types of architectures (MLP, CNN, LSTM), we'll have to wait to see if LiSHT will gain traction in the machine learning community. Firstly, it will be difficult to knock ReLU off the throne, as it generalizes well to most machine learning scenarios. While the authors have done their best to test LiSHT across many settings, we still don't know enough about how well it generalizes across most scenarios.

Secondly, and this has nothing to do with the technical merits, the machine learning community has been relatively slow to adopt promising activation functions like Swish. While it does improve upon ReLU in many cases, most tutorials still recommend ReLU over such new activation functions. While this partially occurs because of the first reason - i.e., that ReLU simply generalizes well, and works well in many cases - the LiSHT authors also face the inherent slowness of collective human nature to adopt new ideas.

I'm curious to see more applications of LiSHT, and you can be sure that we'll also do some testing ourselves here at MachineCurve!

Summary

In this blog post, we introduced the LiSHT activation function. It's a relatively new activation function that attempts to improve upon Swish, which itself was an improvement over traditional ReLU in terms of the loss landscape generated during optimization. We did so by taking a look at how Swish improves ReLU in the first place, why Swish is still sensitive to vanishing gradients, and how LiSHT attempts to reduce this sensitivity.

I hope you've learnt something new today, and I wish you all the best in your machine learning process. If you have any questions, please feel free to leave a comment in the comments box below 😄👇 I'd encourage you to do the same if you do not agree with elements of my blog post, since the only way to improve it is by doing so collectively. Thanks for reading MachineCurve today and happy engineering! 😎

References

Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941.

Roy, S. K., Manna, S., Dubey, S. R., & Chaudhuri, B. B. (2019). LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks. arXiv preprint arXiv:1901.05894.

Serengil, S. (2018, August 31). Swish as Neural Networks Activation Function. Retrieved from https://sefiks.com/2018/08/21/swish-as-neural-networks-activation-function/

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.

Article tags

activation function

activation functions

deep learning

lisht

machine learning

neural networks

optimizer

relu

swish

Side info

The content on this website is written for educational purposes. In writing the articles, I have attempted to be as correct and precise as possible. Should you find any errors, please let me know by creating an issue or pull request in this GitHub repository.

All text on this website written by me is copyrighted and may not be used without prior permission. Creating citations using content from this website is allowed if a reference is added, including a URL reference to the referenced article.

If you have any questions or remarks, feel free to get in touch.

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation.

Montserrat and Source Sans are fonts licensed under the SIL Open Font License version 1.1.

Mathjax is licensed under the Apache License, Version 2.0.

November 17, 2019 by Chris
