
Easy Causal Language Modeling with Machine Learning and HuggingFace Transformers

March 3, 2021 by Chris

Machine learning in NLP is making a lot of progress. It can be used for many language tasks, primarily thanks to the so-called Transformer architecture, which was introduced back in 2017 and has been improved ever since. Text summarization, machine translation, named entity recognition and even speech-to-text are just a few examples.

But language modeling itself is also a task that can be performed by such models: using machine learning to predict the next word given the previous words. With language modeling, you can generate text and use ML for generative purposes. In fact, it underlies many of the more specialized models mentioned above.

This combination of importance and practical opportunity is why we'll take a look at language modeling in this tutorial.

After reading this tutorial, you will understand...

- What causal language modeling is and how it differs from bidirectional approaches such as BERT.
- Why masked self-attention makes models like GPT-2 suitable for text generation.
- How to implement a (Causal) Language Modeling pipeline with Python and HuggingFace Transformers.

Let's take a look! 🚀

Code example: language modeling with Python

This fully working code example shows how you can create a generative language model with Python. We use HuggingFace Transformers for this model, so make sure it is installed in your environment (pip install transformers). A recent version of PyTorch is required as well, although the example can be adapted to run with TensorFlow with a few changes.

Make sure to read the rest of the article to understand everything in more detail, but here you go 🚀

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

# Load GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# Tokenize input phrase
phrase = 'Make sure to read the rest of this '
inputs = tokenizer.encode(phrase, return_tensors='pt')

# Get logits from last layer
last_layer_logits = model(inputs).logits[:, -1, :]

# Filter the logits: keep at most the top 100 (top_p=1.0 disables nucleus filtering)
top_logits = top_k_top_p_filtering(last_layer_logits, top_k=100, top_p=1.0)

# Softmax the logits into probabilities
probabilities = F.softmax(top_logits, dim=-1)

# Generate next token
generated_next_token = torch.multinomial(probabilities, num_samples=1)
generated = torch.cat([inputs, generated_next_token], dim=-1)

# Get result
result_string = tokenizer.decode(generated.tolist()[0])

# Print string
print(result_string)

Result:

Make sure to read the rest of this ____

Causal Language Modeling and Transformers

According to HuggingFace (n.d.):

Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting for generation tasks.

Today's NLP models, which primarily rely on the Transformer architecture that we will discuss shortly, are mostly trained in a pretraining-finetuning fashion. This is a two-stage process where models are first pretrained on a very large, unlabeled set of textual data. This way, machine learning models can benefit from the vast quantities of such data without the relatively high cost of labeling. Subsequently, in a finetuning step, the pretrained models are tailored to a specific task - such as sentiment analysis or named entity recognition.

Pretraining happens with a particular task, and language modeling is one of these tasks. As you can see in the definition above and the image below, the model must compute the most likely token given the current sequence of tokens. In other words, it must learn to predict the best next word given a certain phrase from a particular context. As you can imagine, when language models do this at scale (hence the vast quantities of unlabeled data and large models), they can learn many of the patterns underlying human language.

(Causal) language modeling
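To make the language modeling objective concrete, here is a minimal sketch, using the same gpt2 checkpoint and PyTorch setup as the code later in this article and an example phrase of our own. When you pass the inputs as labels, HuggingFace Transformers computes the causal language modeling loss: at every position, the model is scored on predicting the next token from the tokens to its left.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# Encode an example phrase
inputs = tokenizer.encode('Transformers are changing NLP', return_tensors='pt')

# Passing the inputs as labels makes the model compute the causal LM loss:
# at every position, it must predict the *next* token using only left context
outputs = model(inputs, labels=inputs)
print(outputs.loss)  # cross-entropy over the next-token predictions

This is exactly the quantity that is minimized during pretraining, just computed here for a single phrase instead of a huge corpus.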

The definition from HuggingFace (n.d.) quoted above mentions that "the model only attends to the left context", meaning "tokens on the left of the mask". If we want to understand this in more detail, we must take a look at the trade-off between unidirectionality and bidirectionality. If you know Transformers, you know that the multi-head attention mechanism generates attention maps that illustrate which words are closely related.

Attention lets the model focus on the words that matter and hence play a large role in predicting, say, the next token.

Original Transformer models, such as the Vaswani et al. model and OpenAI's GPT model, include a so-called masked attention segment in their architecture. Masking provides unidirectionality: attention can only be computed in a left-to-right or right-to-left fashion, most often left-to-right. It ensures that models cannot look 'into the future' during training, which would amount to simply memorizing tokens when the goal is to predict the most likely token following a sequence.

Masking works by setting the scores for all future tokens in the attention map to minus infinity, so that they become zero after passing through the Softmax layer that is common within Transformer attention segments.

Masked self-attention
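Here is a minimal sketch of this masking trick, using a toy 4×4 matrix of random attention scores rather than scores from a real model: future positions are filled with minus infinity, and after the Softmax their attention weights are exactly zero.

import torch
from torch.nn import functional as F

# Toy attention scores for a sequence of 4 tokens (rows: queries, columns: keys)
scores = torch.randn(4, 4)

# Causal mask: everything above the diagonal corresponds to 'future' tokens
future_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future_mask, float('-inf'))

# After the Softmax, the masked (future) positions have zero attention weight
attention_weights = F.softmax(scores, dim=-1)
print(attention_weights)  # the upper triangle is all zeros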

Masking future tokens in the attention map can benefit text generation. For other tasks, such as sentiment analysis, it may be counterproductive - as was argued by the creators of the BERT model. The B in BERT stands for Bidirectional, and unsurprisingly, BERT architectures drop the causal mask altogether (by using only the encoder segment of the Transformer). This is why we see traditional (Seq2Seq) and GPT-like (decoder-only; autoregressive) models used a lot for text generation, whereas BERT-like models are more commonly used for other tasks (say, sentiment analysis or text classification).

Let's now take a look at implementing a Language Modeling model with HuggingFace Transformers and Python.

Implementing a Language Modeling model with Python

In this section, you will learn how a Python-based pipeline for (Causal) Language Modeling can be implemented. We will first take a brief look at GPT-2, the Transformer model that we will be using today. We'll then introduce HuggingFace Transformers, a library for creating such models with just a few lines of code. Finally, we show you how to code your model and take a look at some results.

Today's Transformer: GPT-2, part of the GPT family of language models

If you have been following developments within machine learning, you know that the GPT family of language models has gained a lot of traction recently. These models use the decoder segment of the original Transformer model, with some changes, and are trained on an autoregressive language modeling task - where the goal is to predict the next token given the previous ones (does that ring a bell? ;-) ).

Very briefly:

- GPT (Radford et al., 2018) showed that a decoder-only Transformer, pretrained with generative language modeling, can be finetuned for many downstream tasks.
- GPT-2 (Radford et al., 2019) scaled this approach up to roughly 1.5 billion parameters and showed impressive zero-shot behavior.
- GPT-3 scaled it up even further, to 175 billion parameters, and is only accessible through an API.

We will be using GPT-2 for our Language Modeling pipeline today. It is open source and available within the HuggingFace Model Hub, whereas GPT-3 is exclusively licensed by Microsoft (goodbye, open source?).

Vaswani et al.'s decoder segment forms the basis of GPT-2.

HuggingFace Transformers

For building our pipeline, we will be using HuggingFace Transformers, part of the HuggingFace ecosystem, which is focused on democratizing NLP models through open source. The library contains many functionalities for working with pretrained and finetuned models stored in the Model Hub, including GPT-2.
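As a quick taste of what the library offers, the minimal sketch below uses its high-level pipeline API for text generation with GPT-2 (the prompt is just an example of ours). In the rest of this article we build a similar pipeline by hand, so that you can see each individual step.

from transformers import pipeline

# High-level text generation pipeline, backed by GPT-2 from the Model Hub
generator = pipeline('text-generation', model='gpt2')

# Generate a continuation for an example prompt
print(generator('Machine learning is', max_length=20))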

Model code

Time to write some code! Ensure that you have installed HuggingFace Transformers (pip install transformers) and, in this case, PyTorch - although it will also work with a TensorFlow backend (see the Causal Language Modeling examples in the HuggingFace task documentation).

Here's what happens under the hood:

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

# Load GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# Tokenize input phrase
phrase = 'I sleep in a bed that is poorly '
inputs = tokenizer.encode(phrase, return_tensors='pt')

# Get logits from last layer
last_layer_logits = model(inputs).logits[:, -1, :]

# Filter the logits: keep at most the top 30 (top_p=1.0 disables nucleus filtering)
top_logits = top_k_top_p_filtering(last_layer_logits, top_k=30, top_p=1.0)

# Softmax the logits into probabilities
probabilities = F.softmax(top_logits, dim=-1)

# Generate next token
generated_next_token = torch.multinomial(probabilities, num_samples=1)
generated = torch.cat([inputs, generated_next_token], dim=-1)

# Get result
result_string = tokenizer.decode(generated.tolist()[0])

# Print string
print(result_string)

Results

Running the code for the first time downloads the tokenizer and model files:

Downloading: 100%|█████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 333kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.41MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 650kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:01<00:00, 843kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████| 548M/548M [01:05<00:00, 8.43MB/s]

I then get the following output:

I sleep in a bed that is poorly iced

A bit strange, but hey! 😂
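The code above samples only a single next token. If you want to extend the phrase with several tokens at once, a minimal alternative - assuming the same model, tokenizer and inputs objects created above - is to let HuggingFace's built-in model.generate() run the sampling loop for you:

# Sample multiple next tokens in one go with the built-in generation loop
sampled = model.generate(
    inputs,
    do_sample=True,   # sample from the distribution instead of greedy decoding
    top_k=30,         # same top-k filtering as above
    max_length=20     # total length of the output sequence, in tokens
)
print(tokenizer.decode(sampled[0]))

Because do_sample=True samples from the probability distribution, every run will produce a different continuation.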

Summary

In this tutorial, you have learned the following things:

- What causal language modeling is and why masked self-attention makes it work.
- How GPT-2 and the GPT family relate to the original Transformer architecture.
- How to implement a (Causal) Language Modeling pipeline with HuggingFace Transformers and Python.

I hope that you have learned a few things from this tutorial! If you did, please feel free to leave a message in the comments section below, as I'd love to hear from you 💬 Please do the same when you have any questions or remarks.

Thank you for reading MachineCurve today and happy engineering! 😎

References

HuggingFace. (n.d.). Summary of the tasks — transformers 4.3.0 documentation. Hugging Face – On a mission to solve NLP, one commit at a time. https://huggingface.co/transformers/task_summary.html

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.

Shree, P. (2020, November 10). The journey of open AI GPT models. Medium. https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2
