Bidirectional LSTMs with TensorFlow 2.0 and Keras

Long Short-Term Memory networks, or LSTMs, are neural networks used in Natural Language Processing, time series analysis and other sequence-related tasks. They have attracted significant attention in the past few years. Thanks to their recurrent segment, which means that LSTM output is fed back into the network itself, LSTMs can use context when predicting the next sample.

Traditionally, LSTMs have been one-way models, also called unidirectional ones. In other words, sequences such as tokens (i.e. words) are read in either a left-to-right or a right-to-left fashion. This is not necessarily the best approach, as more recent Transformer-based approaches like BERT suggest. In fact, bidirectionality, or processing the input in both a left-to-right and a right-to-left fashion, can improve the performance of your Machine Learning model.

In this tutorial, we will take a closer look at bidirectionality in LSTMs. We will first look at LSTMs in general, providing sufficient context to understand what we're going to do. We also focus on how Bidirectional LSTMs implement bidirectionality. We then implement a Bidirectional LSTM with TensorFlow and Keras, using the tf.keras.layers.Bidirectional layer for this purpose.

After reading this tutorial, you will...

Understand what Bidirectional LSTMs are and how they compare to regular LSTMs.
Know how Bidirectional LSTMs are implemented.
Be able to create a TensorFlow 2.x based Bidirectional LSTM.

Code example: using Bidirectional with TensorFlow and Keras

Here's a quick code example that illustrates how TensorFlow/Keras based LSTM models can be wrapped with Bidirectional. This converts them from unidirectional recurrent models into bidirectional ones. The merge_mode attribute is explained in the section about tf.keras.layers.Bidirectional. If you want to understand bidirectional LSTMs in more detail, or construct the rest of the model and actually run it, make sure to read the rest of this tutorial too! :)

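from tensorflow.keras.layers import Embedding, Dense, LSTM, Bidirectional
from tensorflow.keras.models import Sequential

# Define the Keras model
# (num_distinct_words, embedding_output_dims and max_sequence_length are set in the full model code)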
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(10), merge_mode='sum'))
model.add(Dense(1, activation='sigmoid'))

Bidirectional LSTMs: concepts

Before we take a look at the code of a Bidirectional LSTM, let's take a look at them in general, how unidirectionality can limit LSTMs and how bidirectionality can be implemented conceptually.

How LSTMs work

A Long Short-Term Memory network or LSTM is a type of recurrent neural network (RNN) that was developed to resolve the vanishing gradients problem. This problem, which is caused by the chaining of gradients during error backpropagation, means that the most upstream layers in a neural network learn very slowly.

It is especially problematic when your neural network is recurrent, because the type of backpropagation involved there involves unrolling the network for each input token, effectively chaining copies of the same model. The longer the sequence, the worse the vanishing gradients problem is. We therefore don't use classic or vanilla RNNs so often anymore.
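
To get a feel for why longer sequences make this worse, here is a toy calculation. It assumes, purely for illustration, that every unrolled step scales the gradient by the same factor of 0.9; in a real RNN the factor differs per step and per weight, so treat this as a sketch rather than a measurement.

```python
# Toy illustration: backpropagation through an unrolled RNN multiplies
# one gradient factor per time step. The constant 0.9 is an assumed,
# illustrative value, not something measured from a real network.
per_step_factor = 0.9


def chained_gradient(num_steps, factor=per_step_factor):
    """Return the product of `num_steps` identical gradient factors."""
    gradient = 1.0
    for _ in range(num_steps):
        gradient *= factor
    return gradient


for steps in (10, 50, 100):
    print(f"{steps:>3} steps -> gradient scale {chained_gradient(steps):.2e}")
```

The printed scale shrinks rapidly as the number of steps grows, which is the effect described above: the earliest tokens in a long sequence receive almost no learning signal.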

LSTMs fix this problem by separating memory from the hidden outputs. An LSTM consists of memory cells, one of which is visualized in the image below. The output from the previous time step \(h[t-1]\) and the output passed on to the next time step \(h[t]\) are separated from the memory, which is denoted as \(c\). Interactions between the previous output and current input with the memory take place in three segments or gates:

The forget gate, which is the first segment. It feeds both the previous output and the current input through a Sigmoid (\(\sigma\)) function and then multiplies the result with memory. It thus removes certain elements from memory.
The input or update gate, which is the second segment. It also utilizes a Sigmoid function and learns what must be added to memory, updating it based on the current input and the output from the previous step. In addition, this Sigmoid activated data is multiplied with a Tanh generated candidate computed from the previous output and the current input, normalizing the memory update and keeping memory values low.
The output gate, which is the third segment. It utilizes a Sigmoid activated combination from current input and previous output and multiplies it with a Tanh-normalized representation from memory. The output is then presented and is used in the next cell, which is a copy of the current one with the same parameters.
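
The three gates above can be sketched in a few lines of NumPy. This is a minimal illustration, not how Keras implements its LSTM layer: the helper names (sigmoid, lstm_step), the weight layout (rows stacked as forget, input, candidate, output) and the random weights are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout, not the Keras one).

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,)
    Rows are stacked in the order: forget, input, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to keep from memory
    i = sigmoid(z[1 * units:2 * units])   # input gate: how much to add to memory
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the memory update
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to expose as output
    c_t = f * c_prev + i * g              # updated memory
    h_t = o * np.tanh(c_t)                # updated hidden output
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # a made-up sequence of 5 tokens
    h, c = lstm_step(x_t, h, c, W, U, b)      # same parameters at every step
print(h.shape, c.shape)
```

Note that the same W, U and b are reused at every step, which is what "a copy of the current one with the same parameters" refers to.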

While many nonlinear operations are present within the memory cell, the memory flow from \(c[t-1]\) to \(c[t]\) is linear: it consists only of an element-wise multiplication and an addition. By consequence, the gradient along this path can be kept close to 1.0 as long as the forget gate stays close to 1, which strongly reduces the vanishing gradients problem. This aspect of the LSTM is therefore called a Constant Error Carrousel, or CEC.
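
The role of the linear memory path can be made explicit with a short derivation. It uses the common formulation of the memory update with a forget gate \(f[t]\), an input gate \(i[t]\) and candidate values \(\tilde{c}[t]\); these symbols are assumptions of this sketch and are not defined in the text above. The derivative is taken along the direct memory path only, treating the gate values as given.

```latex
c[t] = f[t] \odot c[t-1] + i[t] \odot \tilde{c}[t]
\quad\Longrightarrow\quad
\frac{\partial c[t]}{\partial c[t-1]} = \operatorname{diag}\big(f[t]\big)
```

When the forget gate outputs values close to 1, this factor is close to the identity, so the error signal travels back through many time steps without shrinking. When \(f[t]\) is exactly 1, the factor is exactly 1.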

How unidirectionality can limit your LSTM

Suppose that you are processing the sequence \(\text{I go eat now}\) through an LSTM for the purpose of translating it into French. Recall that processing such data happens on a per-token basis; each token is fed through the LSTM cell which processes the input token and passes the hidden state on to itself. When unrolled (as if you utilize many copies of the same LSTM model), this process looks as follows:

This immediately shows that LSTMs are unidirectional. In other words, the sequence is processed in one direction; here, from left to right. This makes intuitive sense, as most languages are read and written in a left-to-right fashion. For translation tasks like this one, this is therefore often not a problem, because you don't know what will be said in the future and hence have no need to know what comes after your current input word.

But unidirectionality can also limit the performance of your Machine Learning model. This is especially true in the cases where the task is language understanding rather than sequence-to-sequence modeling. For example, if you're reading a book and have to construct a summary, or understand the context with respect to the sentiment of a text and possible hints about the semantics provided later, you'll read in a back-and-forth fashion.

Yes: you will read the sentence from the left to the right, and then also approach the same sentence from the right. In other words, in some language tasks, you will perform bidirectional reading. And for these tasks, unidirectional LSTMs might not suffice.

From unidirectional to bidirectional LSTMs

In those cases, you might wish to use a Bidirectional LSTM instead. With such a network, sequences are processed in both a left-to-right and a right-to-left fashion. In other words, the phrase \(\text{I go eat now}\) is processed as \(\text{I} \rightarrow \text{go} \rightarrow \text{eat} \rightarrow \text{now}\) and as \(\text{I} \leftarrow \text{go} \leftarrow \text{eat} \leftarrow \text{now}\).

This provides more context for the tasks that require both directions for better understanding.
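
One way to picture this is as two passes over the same sequence. The sketch below uses a deliberately simple toy recurrence (run_rnn, a hypothetical helper, not an LSTM) and made-up numeric stand-ins for the four tokens; it only illustrates what each direction has seen at each position.

```python
import numpy as np


def run_rnn(sequence, weight=0.5):
    """Toy recurrent pass: each state mixes the current input with the previous state."""
    state, states = 0.0, []
    for x_t in sequence:
        state = np.tanh(x_t + weight * state)
        states.append(state)
    return np.array(states)


tokens = np.array([0.1, 0.2, 0.3, 0.4])   # stand-ins for "I go eat now"

forward = run_rnn(tokens)                  # reads I -> go -> eat -> now
backward = run_rnn(tokens[::-1])[::-1]     # reads now -> eat -> go -> I, then re-aligned

# At position 0, forward has only seen "I", while backward has seen the whole sentence.
for position, (f, b) in enumerate(zip(forward, backward)):
    print(position, round(float(f), 4), round(float(b), 4))
```

The backward states are reversed again after the pass so that index 0 in both arrays refers to the same token.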

While conceptually bidirectional LSTMs work in a bidirectional fashion, they are not bidirectional in practice. Rather, they are just two unidirectional LSTMs for which the output is combined. Outputs can be combined in multiple ways (TensorFlow, n.d.):

Vector summation. Here, the output equals \(\text{LSTM}_\rightarrow + \text{LSTM}_\leftarrow\).
Vector averaging. Here, the output equals \(\frac{1}{2}(\text{LSTM}_\rightarrow + \text{LSTM}_\leftarrow)\).
Vector multiplication. Here, the output equals the element-wise product \(\text{LSTM}_\rightarrow \times \text{LSTM}_\leftarrow\).
Vector concatenation. Here, the output vector has twice the dimensionality of each LSTM's output vector, because the two vectors are concatenated rather than combined element-wise.
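
The four ways of combining outputs can be illustrated with plain NumPy. The two vectors below are made-up values standing in for the outputs of the left-to-right and right-to-left LSTMs; the dictionary keys mirror the merge_mode names used by Keras.

```python
import numpy as np

forward_out = np.array([1.0, 2.0, 3.0])    # assumed output of the left-to-right LSTM
backward_out = np.array([4.0, 5.0, 6.0])   # assumed output of the right-to-left LSTM

merged = {
    "sum": forward_out + backward_out,
    "ave": (forward_out + backward_out) / 2,
    "mul": forward_out * backward_out,      # element-wise product
    "concat": np.concatenate([forward_out, backward_out]),
}

for mode, value in merged.items():
    print(f"{mode:>6}: {value}  (length {value.shape[0]})")
```

Only concatenation changes the length of the output; the other three keep the dimensionality of a single LSTM, which matters for the number of weights in the layer that follows.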

Implementing a Bidirectional LSTM

Now that we understand how bidirectional LSTMs work, we can take a look at implementing one. In this tutorial, we will use TensorFlow 2.x and its Keras implementation tf.keras for doing so.

tf.keras.layers.Bidirectional

Bidirectionality can be added to a recurrent Keras layer by using tf.keras.layers.Bidirectional (TensorFlow, n.d.). It is a wrapper layer that can be applied to any of the recurrent layers available within Keras, such as LSTM, GRU and SimpleRNN. It looks as follows:

tf.keras.layers.Bidirectional(
    layer, merge_mode='concat', weights=None, backward_layer=None,
    **kwargs
)

The layer attributes are as follows:

The first argument represents the layer (one of the recurrent tf.keras.layers) that must be turned into a bidirectional one.
The merge_mode represents the way that outputs are combined. Recall that results can be summed, averaged, multiplied and concatenated. By default, it's concat from the options {'sum', 'mul', 'concat', 'ave', None}. When set to None, the outputs are not combined, and they are returned as a list (TensorFlow, n.d.).
With backward_layer, a different layer can be passed for backwards processing, should left-to-right and right-to-left directionality be processed differently.
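
The effect of merge_mode is easiest to see in the output shapes. The following sketch passes a random dummy batch through the wrapper; the batch size, sequence length and feature count are arbitrary values chosen for this example, and it assumes a working TensorFlow 2.x installation.

```python
import numpy as np
import tensorflow as tf

# Dummy batch: 2 sequences, 5 time steps, 8 features per step (arbitrary sizes).
inputs = np.random.rand(2, 5, 8).astype("float32")

output_shapes = {}
for mode in ["concat", "sum", "mul", "ave"]:
    layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(10), merge_mode=mode)
    output_shapes[mode] = tuple(layer(inputs).shape)

# With merge_mode=None the forward and backward outputs are returned separately.
unmerged = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(10), merge_mode=None)(inputs)
output_shapes["None"] = [tuple(t.shape) for t in unmerged]

for mode, shape in output_shapes.items():
    print(mode, shape)
```

With 10 LSTM units, concatenation yields 20 values per sequence, whereas the other merge modes yield 10.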

Creating a regular LSTM

The first step in creating a Bidirectional LSTM is defining a regular one. This can be done with the tf.keras.layers.LSTM layer, which we have explained in another tutorial. For the sake of brevity, we won't copy the entire model here multiple times, so we'll just show the segment that represents the model. As you can see, creating a regular LSTM in TensorFlow involves initializing the model (here, using Sequential), adding a word embedding, followed by the LSTM layer. Using a final Dense layer, we perform binary classification.

# Define the Keras model
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(LSTM(10))
model.add(Dense(1, activation='sigmoid'))

Wrapping the LSTM with Bidirectional

Converting the regular or unidirectional LSTM into a bidirectional one is really simple. The only thing you have to do is to wrap it with a Bidirectional layer and specify the merge_mode as explained above. In this case, we set the merge mode to summation, which deviates from the default value of concatenation.

# Define the Keras model
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(10), merge_mode='sum'))
model.add(Dense(1, activation='sigmoid'))
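
As a rough check on what the wrapper does to model size, the number of trainable weights can be worked out by hand. The sketch below assumes the configuration values used in the full model code (5000 distinct words, 15 embedding dimensions, 10 LSTM units) and the standard LSTM layout of four gates with input weights, recurrent weights and a bias each; lstm_params is a hypothetical helper written for this example.

```python
def lstm_params(input_dim, units):
    """Trainable weights of one LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * input_dim + units * units + units)


num_distinct_words = 5000
embedding_output_dims = 15
lstm_units = 10

embedding = num_distinct_words * embedding_output_dims
unidirectional = lstm_params(embedding_output_dims, lstm_units)
bidirectional = 2 * unidirectional      # one forward LSTM plus one backward LSTM
dense = lstm_units * 1 + 1              # 'sum' keeps the merged output at 10 values

print("Embedding:", embedding)
print("LSTM:", unidirectional)
print("Bidirectional(LSTM):", bidirectional)
print("Dense:", dense)
print("Total:", embedding + bidirectional + dense)
```

Wrapping the LSTM doubles its weights because a second, independent LSTM is created for the backward direction. These numbers can be compared against the output of model.summary().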

Full model code

Of course, we will also show you the full model code for the examples above. This teaches you how to implement a full bidirectional LSTM. Let's explain how it works. Constructing a bidirectional LSTM involves the following steps...

Specifying the model imports. As you can see, we import a lot of TensorFlow modules. We're using the provided IMDB dataset for educational purposes, Embedding for learned embeddings, the Dense layer type for classification, and LSTM/Bidirectional for constructing the bidirectional LSTM. Binary crossentropy loss is used together with the Adam optimizer for optimization. With pad_sequences, we can ensure that our inputs are of equal length. Finally, we'll use Sequential - the Sequential API - for creating the initial model.
Listing the configuration options. I always think it's useful to specify all the configuration options before using them throughout the code. It simply provides the overview that we need. They are explained in more detail in the tutorial about LSTMs.
Loading and preparing the dataset. We use imdb.load_data(...) for loading the dataset given our configuration options, and use pad_sequences to ensure that sentences that are shorter than our maximum limit are padded with zeroes so that they are of equal length. The IMDB dataset can be used for sentiment analysis: we'll find out whether a review is positive or negative.
Defining the Keras model. In other words, constructing the skeleton of our model. Using Sequential, we initialize a model, and stack the Embedding, Bidirectional LSTM, and Dense layers on top of each other.
Compiling the model. This actually converts the model skeleton into a model that can be trained and used for predictions. Here, we specify the optimizer, loss function and additional metrics.
Generating a summary. This allows us to inspect the model in more detail.
Training and evaluating the model. With model.fit(...), we start the training process using our training data, with subsequent evaluation on our testing data using model.evaluate(...).

import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Embedding, Dense, LSTM, Bidirectional
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Model configuration
additional_metrics = ['accuracy']
batch_size = 128
embedding_output_dims = 15
loss_function = BinaryCrossentropy()
max_sequence_length = 300
num_distinct_words = 5000
number_of_epochs = 5
optimizer = Adam()
validation_split = 0.20
verbosity_mode = 1

# Load dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words)
print(x_train.shape)
print(x_test.shape)

# Pad all sequences
padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with <PAD>
padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with <PAD>

# Define the Keras model
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(10), merge_mode='sum'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics)

# Give a summary
model.summary()

# Train the model
history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split)

# Test the model after training
test_results = model.evaluate(padded_inputs_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
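
The padding step in the code above can be sketched in isolation. By default, pad_sequences pads at the start of a sequence and, for sequences longer than maxlen, drops tokens from the start. The helper below (pad_pre, a hypothetical name) mimics that default behaviour in plain Python for illustration; it is not the Keras implementation, and the word indices are made up.

```python
def pad_pre(sequence, maxlen, value=0):
    """Hypothetical helper mimicking pad_sequences' default 'pre' padding and truncating."""
    truncated = sequence[-maxlen:]                        # keep the last `maxlen` tokens
    return [value] * (maxlen - len(truncated)) + truncated


reviews = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]       # made-up word indices
padded = [pad_pre(review, maxlen=5) for review in reviews]
print(padded)
```

The short review is filled up with zeroes at the front, while the long review loses its first token, so both end up with exactly five entries.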

Results

We can now run our Bidirectional LSTM by running the code in a terminal that has TensorFlow 2.x installed. This is what you should see:

2021-01-11 20:47:14.079739: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/5
157/157 [==============================] - 20s 102ms/step - loss: 0.6621 - accuracy: 0.5929 - val_loss: 0.4486 - val_accuracy: 0.8226
Epoch 2/5
157/157 [==============================] - 15s 99ms/step - loss: 0.4092 - accuracy: 0.8357 - val_loss: 0.3423 - val_accuracy: 0.8624
Epoch 3/5
157/157 [==============================] - 16s 99ms/step - loss: 0.2865 - accuracy: 0.8958 - val_loss: 0.3351 - val_accuracy: 0.8680
Epoch 4/5
157/157 [==============================] - 20s 127ms/step - loss: 0.2370 - accuracy: 0.9181 - val_loss: 0.3010 - val_accuracy: 0.8768
Epoch 5/5
157/157 [==============================] - 22s 139ms/step - loss: 0.1980 - accuracy: 0.9345 - val_loss: 0.3290 - val_accuracy: 0.8686
Test results - Loss: 0.33866164088249207 - Accuracy: 86.49600148200989%

An 86.5% accuracy for such a simple model, trained for only 5 epochs - not too bad! :)
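
The 157 steps per epoch in the log follow from the configuration. A quick calculation, assuming the 25,000 training reviews of the Keras IMDB dataset and the validation split and batch size set earlier:

```python
import math

training_reviews = 25000      # size of the IMDB training split in Keras
validation_split = 0.20
batch_size = 128

samples_for_training = int(training_reviews * (1 - validation_split))
steps_per_epoch = math.ceil(samples_for_training / batch_size)
print(samples_for_training, steps_per_epoch)
```

The last batch of each epoch is smaller than 128 samples, which is why the division is rounded up.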

Summary

In this tutorial, we saw how we can use TensorFlow and Keras to create a bidirectional LSTM. Using step-by-step explanations and many Python examples, you have learned how to create such a model, which should be better when bidirectionality is naturally present within the language task that you are performing.

We saw that LSTMs can be used for sequence-to-sequence tasks and that they improve upon classic RNNs by resolving the vanishing gradients problem. However, they are unidirectional, in the sense that they process text (or other sequences) in a left-to-right or a right-to-left fashion. This can be problematic when your task requires context 'from the future', e.g. when you are using the full context of the text to generate, say, a summary.

Bidirectionality can easily be added to LSTMs with TensorFlow thanks to the tf.keras.layers.Bidirectional layer. Being a layer wrapper to all Keras recurrent layers, it can be added to your existing LSTM easily, as you have seen in the tutorial. Configuration is also easy.

I hope that you have learned something from this article! If you did, please feel free to leave a comment in the comments section 💬 Please do the same if you have any questions, remarks or suggestions for improvement. I will try to respond as soon as I can :)

Thank you for reading MachineCurve today and happy engineering! 😎

References

MachineCurve. (2020, December 29). A gentle introduction to long short-term memory networks (LSTM). https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/

TensorFlow. (n.d.). tf.keras.layers.Bidirectional. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional

January 11, 2021 by Chris
