
DALL·E: OpenAI's GPT-3 based model can draw pictures from text

January 5, 2021 by Chris

In 2020, the GPT-3 model created by OpenAI made big headlines: it was capable of generating text that could not be distinguished from human-written text. In addition, Microsoft acquired an exclusive license to the model, possibly integrating it with its cloud services for text generation.

GPT-3 is not limited to text, however. Recently, we have seen the emergence of Transformers for Computer Vision. And today, in a blog post at OpenAI.com, DALL·E was announced. The model, which is named after Salvador Dalí and Pixar's WALL·E, is capable of generating high-quality images based on text.

We've ploughed through the blog article to understand how it works. In this article, you'll therefore find what DALL·E is capable of, how it works, and how it was trained. We've also brainstormed about a few possible applications for DALL·E.

We're still awaiting the publication of the DALL·E paper, but let's already take a look! 😎

What DALL·E does

Suppose that you need to generate images. Previously, you'd hire an artist, who would take your requirements and produce the image in return. Or if you wanted a professional-looking photograph, you'd hire a photographer, tell them what to do, and await the results.

With DALL·E, you can instead give the requirements to the Artificial Intelligence model and get the result back. For example, as shown in OpenAI's blog article (really worth reading - many more examples can be found there):

Source: OpenAI (2021)

How DALL·E works

DALL·E is based on the GPT-3 model that we have heard a lot of buzz about in the past few months. This model, an extension of GPT-2, which in turn extends the original GPT, autoregressively learns to build an understanding of natural language. This understanding can subsequently be used for downstream tasks like text summarization or question answering.
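To make the autoregressive part concrete, here is a minimal sketch using the Hugging Face transformers library with GPT-2 as a stand-in (illustrative code, not OpenAI's own): passing the input tokens as labels makes the library compute the next-token prediction loss that these models are trained with.

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Autoregressive language modeling: predict each token from the ones
# before it. Passing labels=input_ids makes the library compute exactly
# that next-token (maximum likelihood) loss under the hood.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Transformers have changed natural language processing", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # cross-entropy of predicting each token from its left context
```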

About GPT-3 and previous approaches

Previous approaches like BERT and the original GPT model followed the fine-tuning approach. Here, the model was first pretrained on massive unlabeled datasets (e.g. the BooksCorpus dataset, or the English Wikipedia dataset), which allows it to build up an unguided understanding of natural language. It could then be finetuned to a specific language task by means of a labeled, but much smaller, dataset.
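As a quick illustration of this pretrain-then-finetune flow, here is a minimal sketch with the transformers library (the model name and the sentiment task are merely examples we picked, not something from the DALL·E work): we load a pretrained BERT, attach a freshly initialized classification head, and compute a loss on one labeled example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 1: load a model that was pretrained on unlabeled text (BERT),
# with a freshly initialized classification head for our downstream task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 2: finetune on a labeled, but much smaller, dataset.
# One example shown here; a real run loops over the dataset with an optimizer.
inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive sentiment

loss = model(**inputs, labels=labels).loss
loss.backward()
```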

GPT-2 and GPT-3 recognized that even though pretraining already provided lots of benefits compared to training from scratch, so-called zero-shot learning - where the pretrained model is applied to language tasks directly, without any finetuning - could be the way forward. The creators of these successive models argued that pretrained models could build sufficient language understanding to be used in downstream applications as-is. And they succeeded: GPT-3 is capable of generating human-like language. This does however come at a cost: the models are huge - so huge that they cannot normally be used in practice. But diving into this is beyond the scope of this article. Let's get back to DALL·E now.
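To show what applying a pretrained model without finetuning looks like, here is a small prompting sketch with GPT-2 via transformers. GPT-2 is far weaker than GPT-3 at this, so treat it as an illustration of the prompting idea (with a single demonstration in the prompt) rather than a working translator:

```python
from transformers import pipeline

# No finetuning: we describe the task in the prompt itself and let the
# pretrained model simply continue the text.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French:\n"
    "cheese => fromage\n"
    "hello =>"
)
result = generator(prompt, max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])
```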

Specific implementation of GPT-3 for DALL·E

Like GPT-3, DALL·E is based on the Transformer architecture. This architecture, which was originally proposed back in 2017, has changed the field of Natural Language Processing. During pretraining, the DALL·E model receives both the text and the image as a single sequence of at most 1280 tokens (OpenAI, 2021).

It is then trained using maximum likelihood to predict all of the tokens in that sequence, one after the other - essentially a language modeling task (OpenAI, 2021).
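Conceptually, that objective looks something like the sketch below. The sequence lengths follow OpenAI's description (256 text tokens plus 1024 image tokens makes 1280); everything else - the tiny Transformer, the single shared toy vocabulary, the random tokens - is a simplified stand-in for illustration, not OpenAI's actual model.

```python
import torch
import torch.nn as nn

# DALL·E's pretraining input (OpenAI, 2021): one stream of up to 1280
# tokens - the text tokens followed by the image tokens. Sequence sizes
# follow the blog post; we use one shared toy vocabulary where the real
# model uses separate text (BPE) and image codebooks.
TEXT_LEN, IMAGE_LEN, VOCAB, DIM = 256, 1024, 8192, 64

text_tokens = torch.randint(0, VOCAB, (1, TEXT_LEN))      # dummy caption tokens
image_tokens = torch.randint(0, VOCAB, (1, IMAGE_LEN))    # dummy image tokens
sequence = torch.cat([text_tokens, image_tokens], dim=1)  # shape (1, 1280)

inputs, targets = sequence[:, :-1], sequence[:, 1:]       # next-token prediction

embed = nn.Embedding(VOCAB, DIM)
layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
to_logits = nn.Linear(DIM, VOCAB)

# Causal mask: each position may only attend to earlier positions
L = inputs.size(1)
mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

logits = to_logits(encoder(embed(inputs), mask=mask))

# Maximum likelihood over text AND image tokens alike
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
print(loss.item())
```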

As we can see in the article, DALL·E is capable of performing a variety of tasks: controlling the attributes of objects, drawing multiple objects at once, rendering scenes in perspective, inferring contextual details, and even combining unrelated concepts into plausible images.

Possible applications for DALL·E

We can come up with a wide variety of applications for the new DALL·E model. Industrial and interior design is one of them, as the example below shows:

How DALL·E can be used in industrial and interior design: an armchair in the shape of an avocado. Source: OpenAI (2021)

Summary

DALL·E is a GPT-3 based model that creates images from text. OpenAI published about the model in January 2021, spawning yet another possibility to use GPT-3 in practice.

In this article, we first looked at what DALL·E is. Named after Salvador Dalí and Pixar's WALL·E, the model can indeed be used for image creation. Then, when taking a look at how it works, we saw that it is not so different from the original GPT-3 model. Whereas the latter utilizes textual inputs only in a language modeling task, DALL·E jointly inputs text and images in a fixed-length sequence to learn how to generate the images.

OpenAI's article gives you the opportunity to create many images yourself. Go check it out. It's really awesome! 😎

References

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

OpenAI. (2021, January 5). DALL·E: Creating images from text. https://openai.com/blog/dall-e/
