What is DialoGPT and how does it work?


DialoGPT is “a tunable gigaword-scale neural network model for generation of conversational responses, trained on Reddit data” (Zhang et al., 2019). It uses a Transformer-based architecture, chosen for its strong empirical success. With it, the creators attempted to resolve the challenges of neural response generation – i.e. generating text that is relevant to the prompt. These challenges stem from the fact that conversations are informal, noisy, and full of abbreviations and errors, which translates into issues with existing approaches for neural response generation:
 
1. They can be inconsistent.
2. They have difficulty keeping information in longer-term contexts (i.e. over many conversational turns).
3. They produce bland answers.
 
Transformer-based approaches like GPT can solve these issues, and that’s why DialoGPT was born – an extension of GPT-2 specifically for neural response generation, a.k.a. chatbots. Let’s take a look at the model in more detail by summarizing the paper.
 
Model Architecture
– The model was trained on the basis of GPT-2 (Radford et al., 2019), inheriting its 12-to-48 layer Transformer with layer normalization, its initialization scheme, and byte pair encoding for the tokenizer.
– A multi-turn dialogue session is modeled as one long text, and the generation task is framed as language modeling.
– First, all dialogue turns x1, …, xN are concatenated into a long text. The source sentence (dialogue history) is denoted as S = x1, …, xm with m < N; the target sentence (the ground truth response) is T = xm+1, …, xN. The conditional probability P(T|S) is then the product of the conditional probabilities p(xn | x1, …, xn−1) for n = m+1, …, N. In other words, it multiplies the probability of each token in the target response given all preceding tokens in S, as well as those in T preceding xn.
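A minimal sketch of this setup, using the Hugging Face transformers library and the published microsoft/DialoGPT-small checkpoint (the example dialogue is made up): turns are joined into one long text with the end-of-sequence token, and the language-modeling loss is exactly the averaged negative log p(xn | x1, …, xn−1) described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Dialogue turns are concatenated into one long text, separated by the
# end-of-sequence token, so generation becomes plain language modeling.
turns = ["Does money buy happiness?", "Depends how much money you spend on it."]
text = tokenizer.eos_token.join(turns) + tokenizer.eos_token
input_ids = tokenizer(text, return_tensors="pt").input_ids

# With labels == input_ids, the model returns the language-modeling loss,
# i.e. the average negative log-probability of each token given its prefix.
outputs = model(input_ids, labels=input_ids)
print(f"per-token negative log-likelihood: {outputs.loss.item():.3f}")
```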
 
Mutual Information Maximization
– Responses of language models can sometimes be bland.
– Adding maximum mutual information (MMI) reranking to the text generation pipeline can mitigate this.
– It effectively involves predicting P(Source|Target): for the top-K sampled candidate responses, a backward model scores how likely the source text is given each predicted target.
– The candidate with the highest backward probability is likely the best response. This also penalizes bland hypotheses, because frequently occurring (bland) target responses can occur with many sources and therefore do not maximize P(Source|Target); more specific targets yield higher probabilities here.
– As we shall see, MMI is used during an evaluation step.
 
Training Dataset
– A new Reddit-based dataset was extracted from comment chains spanning 2005 to 2017.
– These are natural examples of dialogue because they are structured as tree-based reply chains.
– The dataset was created by extracting each path from a root node to a leaf node as a training instance.
– Data was filtered by removing instances where (1) the source or target contains a URL; (2) the target contains word repetitions of 3 or more words; (3) the response does not contain at least one of the top-50 most frequent English words (e.g. “the”), indicating that it is likely not written in English; (4) the response contains special markers like “[”, which can indicate code; (5) the source and target sequences exceed 200 words; or (6) the target contains offensive language. (A rough sketch of these rules in code follows after this list.)
– In addition, blandness was aggressively filtered out by removing instances where 90% of the response’s tri-grams had been seen more than 1,000 times.
– This leads to approximately 150 million dialogue instances with 1.8 billion words in total.
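The sketch below approximates the filtering rules listed above in Python. The thresholds follow the paper’s description, but the top-50 word list is abbreviated and the exact heuristics (e.g. the repetition regex) are my assumptions, not the authors’ original code.

```python
import re

# Abbreviated stand-in for the top-50 most frequent English words.
TOP_50_ENGLISH = {"the", "of", "and", "a", "to", "in", "is", "you", "that", "it"}

def keep_instance(source: str, target: str) -> bool:
    if "http" in source or "http" in target:              # (1) contains a URL
        return False
    if re.search(r"\b(\w+)(?: \1\b){2,}", target):        # (2) >= 3 repeated words
        return False
    if not set(target.lower().split()) & TOP_50_ENGLISH:  # (3) likely not English
        return False
    if "[" in target or "]" in target:                    # (4) special markers / code
        return False
    if len(source.split()) > 200 or len(target.split()) > 200:  # (5) too long
        return False
    return True  # (6) offensive-language filtering omitted; it needs a word list
```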
 
Experimental Results
Let’s take a look at the details of the experiments undertaken with DialoGPT.
 
(Details)
– 3 different model sizes were trained: 117M, 345M and 762M parameters. The model itself follows Radford et al. (2019).
– The vocabulary has 50,257 entries, and training ran on 16 Nvidia V100 machines.
– Other characteristics: the Noam learning rate scheduler with 16,000 warmup steps (sketched below), the learning rate selected based on validation loss, and training continued until validation loss stopped improving. The small and medium models thus trained for up to 5 epochs, the large model for 3 epochs.
– Training was accelerated by smartly structuring the data: all data was compressed into a lazy-loading database file and processed asynchronously. In addition, dynamic batching was applied to maximize training throughput.
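For reference, a minimal sketch of the Noam learning-rate schedule mentioned above. The 16,000 warmup steps come from the paper; the d_model value is an illustrative assumption, not a figure from the paper.

```python
def noam_lr(step: int, d_model: int = 1024, warmup: int = 16000) -> float:
    step = max(step, 1)  # avoid division by zero at step 0
    # Linear ramp-up during warmup, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(1), noam_lr(16000), noam_lr(64000))  # peak is reached at warmup
```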
 
(DSTC-7 Dialogue Generation Challenge)
– The DSTC (Dialog System Technology Challenges) 7 track is an end-to-end conversational modeling task.
– It involves conversations with no specific or predefined goal (unlike, say, making a doctor’s appointment), targeting human-like open-domain interactions.
– Contains threads from Reddit data.
– The test set was created from conversations with more than 6 responses, to make them of adequate length.
– Automatic evaluation was performed with BLEU, METEOR and NIST (a small usage example follows this list), and results were also compared against PERSONALITYCHAT, the in-house model powering Microsoft’s Azure Cognitive Service.
– DialoGPT-345M performs better than DialoGPT-117M. Beam search substantially improves BLEU and DIST, but only marginally NIST and METEOR.
– In addition, scores for DialoGPT are higher than those for humans. This likely emerges because model responses stay close to the geometric mean of all possible responses, which is a good “average” and hence rewarded by automatic evaluation.
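To illustrate two of the automatic metrics mentioned above, here is a small example using NLTK’s implementations. The sentences are made up, and the challenge used its own evaluation scripts; this only shows what the metrics compute.

```python
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.nist_score import sentence_nist

# One (or more) reference responses, and a candidate hypothesis, tokenized.
references = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["a", "cat", "sat", "on", "the", "mat"]

print("BLEU:", sentence_bleu(references, hypothesis))  # n-gram precision
print("NIST:", sentence_nist(references, hypothesis))  # weights informative n-grams
```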
 
(New Reddit based dataset)
– DialoGPT was also evaluated on a multi-reference test set with 6K examples.
– It was tested in two settings: training from scratch and fine-tuning with GPT-2 as the pretrained model.
– Here, too, larger models outperform smaller ones, and GPT-2-based pretraining gives additional performance gains. Best performing: DialoGPT-345M with beam search.
 
(Reranking responses with MMI)
– Responses can still be bland. Recall that MMI reranking was applied at evaluation time to rerank responses and improve generation (a sketch follows this list).
– 16 samples were generated for each input source using top-K sampling with K=10, where at each step the probabilities of all but the 10 most likely tokens are zeroed out before sampling. The DialoGPT-345M model was used for this.
– Reranking, i.e. scoring P(Source|Target), was performed with a DialoGPT-345M backward model.
– The response with the lowest backward loss was selected for evaluation.
– Compared with greedy generation, MMI produces more diverse responses with higher scores on NIST, METEOR and Entropy/Dist, but slightly lower BLEU.
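A hedged sketch of this reranking step: draw 16 candidates with top-K sampling (K=10) from a forward DialoGPT model, then pick the one with the lowest “backward” loss, i.e. the highest P(Source|Target). The paper uses a separately trained backward model; here the forward checkpoint stands in for it, purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
forward = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
backward = forward  # stand-in; the real pipeline uses a reverse-trained model

source = "Does money buy happiness?" + tokenizer.eos_token
src_ids = tokenizer(source, return_tensors="pt").input_ids

# 16 candidates via top-K sampling with K=10, as in the paper.
with torch.no_grad():
    outputs = forward.generate(
        src_ids, do_sample=True, top_k=10, num_return_sequences=16,
        max_new_tokens=40, pad_token_id=tokenizer.eos_token_id,
    )

def backward_loss(target_ids: torch.Tensor) -> float:
    # Score P(source | target): condition on the target and compute the
    # language-modeling loss on the source tokens only.
    ids = torch.cat([target_ids, src_ids[0]]).unsqueeze(0)
    labels = ids.clone()
    labels[0, : target_ids.shape[0]] = -100  # mask out the target positions
    with torch.no_grad():
        return backward(ids, labels=labels).loss.item()

candidates = [out[src_ids.shape[1]:] for out in outputs]
best = min(candidates, key=backward_loss)  # lowest backward loss wins
print(tokenizer.decode(best, skip_special_tokens=True))
```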
 
(Generation)
– Generation examples show that the model exhibits the ability to address commonsense questions to some extent.
– This is assumed to come from the rich amount of information that can be learned from Reddit data.
– Interestingly, the model sometimes produces a reasonable alternative answer instead of the desired one. For example, when asked “which animal has black and white stripes?”, the model answers “A black and white striped cat” instead of “zebra”. Haha 😂
– In doing so, these models significantly outperform RNN-based ones.
 
(Human Evaluation)
– 2,000 randomly sampled test sources from the Reddit 6K test dataset were evaluated using crowdsourcing.
– System outputs were presented to 3 judges, who ranked them for relevance, informativeness, and how human-like the text is.
– Strong performance can be observed for DialoGPT over PERSONALITYCHAT.
– Vanilla DialoGPT (without additions) can already produce human-like text.
– The MMI-based variant was even preferred over human responses, likely because of the “average-ness” described above.
 
Limitations and Risks
There are some limitations:
– No decoding script is provided; the decoder must be developed by the user (a minimal chat loop is sketched after this list).
– DialoGPT has the potential to generate offensive text, despite efforts to remove such text from the training data.
– Outputs may reflect gender and other historical biases implicit in the data.
– Outputs may agree with unethical, biased or offensive statements, due to biases implicit in the data.
– These are known issues.
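Since no decoding script ships with the model, a user-built decoder can be as simple as the multi-turn chat loop below. The checkpoint name is the one published on the Hugging Face Hub; everything else is an illustrative sketch, not the authors’ decoder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

history = None
for _ in range(3):  # three conversational turns
    # Append the user's message (terminated by EOS) to the running history.
    user = tokenizer(input(">> ") + tokenizer.eos_token, return_tensors="pt").input_ids
    ids = user if history is None else torch.cat([history, user], dim=-1)
    history = model.generate(ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
    # Print only the newly generated response tokens.
    print(tokenizer.decode(history[0, ids.shape[-1]:], skip_special_tokens=True))
```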
 
References
Zhang, Y., Sun, S., Galley, M., Chen, Y. C., Brockett, C., Gao, X., … & Dolan, B. (2019). DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
