GPT1 - Improving Language Understanding by Generative Pre-Training

June 11, 2018

Abstract

GPT explores a semi-supervised approach to language understanding tasks that combines unsupervised pre-training, which assumes access to a large corpus of unlabeled text, with supervised fine-tuning on datasets of manually annotated training examples.

To do so, GPT employs a two-stage training procedure (sketched in code after this list):

  1. First, it uses a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model.
  2. Subsequently, it adapts the model parameters to a target task using the corresponding supervised objective.
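As an illustration of the two stages, here is a minimal, runnable PyTorch sketch; the toy backbone, tensor shapes, and random data are placeholders and not the paper's actual model or training setup:

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, NUM_CLASSES = 100, 32, 2

# Stand-in backbone: in GPT this is the multi-layer Transformer decoder.
backbone = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, HIDDEN))
lm_head = nn.Linear(HIDDEN, VOCAB)         # predicts the next token
clf_head = nn.Linear(HIDDEN, NUM_CLASSES)  # added later for the target task

# Stage 1: unsupervised pre-training with a next-token prediction loss.
tokens = torch.randint(0, VOCAB, (8, 16))  # fake batch of unlabeled sequences
hidden = backbone(tokens[:, :-1])          # hidden states for the context
lm_loss = nn.functional.cross_entropy(
    lm_head(hidden).reshape(-1, VOCAB),    # logits over the vocabulary
    tokens[:, 1:].reshape(-1))             # shifted targets (the next tokens)

# Stage 2: supervised fine-tuning reuses the same pre-trained backbone.
labels = torch.randint(0, NUM_CLASSES, (8,))  # fake task labels
pooled = backbone(tokens)[:, -1, :]           # GPT uses the final token's state
clf_loss = nn.functional.cross_entropy(clf_head(pooled), labels)
```

In practice each stage runs its own optimization loop; the point here is only that the backbone parameters learned in stage 1 are the starting point for stage 2.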

Furthermore, this approach showcases zero-shot behaviors of the pre-trained model in different settings, demonstrating that GPT acquires useful linguistic knowledge for downstream tasks during the unsupervised pre-training phase.

Architecture

The GPT model architecture is a multi-layer causal Transformer decoder, almost identical to the decoder of the original Transformer implementation.
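The "causal" part means each position can only attend to itself and to earlier positions. A minimal sketch of the attention mask (not the full decoder block), assuming PyTorch:

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask)
```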

We can denote the number of Transformer decoder blocks as $L$, the hidden size as $H$, and the number of self-attention heads as $A$. The initial GPT model design is the following:

| Model Name | $L$ (Transformer blocks) | $H$ (Hidden size) | $A$ (Self-Attention heads) |
| --- | --- | --- | --- |
| GPT | 12 | 768 | 12 |
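For illustration, these hyperparameters can be gathered into a small configuration object; the class and field names below are our own and not part of any released implementation:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layers: int = 12      # L: number of Transformer decoder blocks
    hidden_size: int = 768  # H: dimensionality of the hidden states
    n_heads: int = 12       # A: number of self-attention heads

config = GPTConfig()
print(config)
```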

Additionally, GPT uses a byte-pair encoding (BPE) vocabulary with 40,000 merges. The authors use the ftfy library to clean the raw text in the BookCorpus dataset, standardize some punctuation and whitespace, and use the spaCy tokenizer.
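A minimal sketch of that preprocessing step, assuming ftfy and spaCy are installed (the authors' exact cleaning pipeline and options are not reproduced here):

```python
import ftfy
import spacy

# Blank English pipeline: only spaCy's rule-based tokenizer is needed here.
nlp = spacy.blank("en")

def preprocess(raw_text: str) -> list[str]:
    # ftfy repairs broken Unicode and standardizes punctuation/whitespace.
    cleaned = ftfy.fix_text(raw_text)
    # Word-level tokenization; BPE would then be applied on top of this.
    return [token.text for token in nlp(cleaned)]

print(preprocess("Itâ€™s working."))  # mojibake gets fixed to "It's working."
```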

Pre-training

Learning effectively from raw text is crucial to alleviating the dependence on supervised learning. Even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost.

Given an unsupervised corpus of tokens, GPT uses a standard language modeling objective to maximize the likelihood of the data. This task consists of predicting a token given its previous context. As in the Transformer, this can be done in an unsupervised way by taking sequences of tokens and padding the initial input with a special token, <s> in our illustration.

Figure: GPT1 pre-training.
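Following the paper, for an unlabeled corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$ this objective can be written as, with $k$ the size of the context window and $\Theta$ the model parameters:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$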

Fine-tuning

After pre-training, GPT adapts the parameters to a supervised target task. To do so, a labeled dataset $\mathcal{C}$ is used, where each instance consists of a sequence of input tokens, $x^1, \dots, x^m$, along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$ (at the <e> token), which is then fed into an added linear output layer with parameters $W_y$ to predict $y$.
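In the paper's notation, the added output layer and the supervised objective take the following form (a softmax over the final token's activation):

$$P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$$

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$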

The authors additionally found that including language modeling as an auxiliary objective during fine-tuning helped improve generalization and accelerate convergence.
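In other words, the fine-tuning objective becomes a weighted sum of the supervised loss and the language modeling loss, with a weight $\lambda$:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$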

The GPT setup does not require the fine-tuning target tasks to be in the same domain as the unlabeled corpus used during pre-training. During transfer, GPT relies on task-specific input adaptations, always processing structured text input as a single contiguous sequence of tokens. Thanks to this, only minimal changes are made to the architecture of the pre-trained model.

Task-specific input transformations

For some tasks, like text classification, we can directly fine-tune GPT as described above. For other tasks, it is possible to convert structured inputs into an ordered sequence that the pre-trained model can process. These input transformations allow GPT to avoid making extensive changes to the architecture across tasks.
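As a toy illustration, assuming whitespace tokenization and literal start/delimiter/extract strings (GPT actually applies BPE and learns embeddings for these special tokens), an entailment pair and a similarity pair could be flattened like this:

```python
# Illustrative special tokens; in GPT these are randomly initialized,
# learned embeddings rather than literal strings.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise: str, hypothesis: str) -> list[str]:
    """Flatten a (premise, hypothesis) pair into one contiguous sequence."""
    return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

def similarity_inputs(text_a: str, text_b: str) -> list[list[str]]:
    """Similarity has no inherent ordering, so both orderings are produced."""
    return [
        [START, *text_a.split(), DELIM, *text_b.split(), EXTRACT],
        [START, *text_b.split(), DELIM, *text_a.split(), EXTRACT],
    ]

print(entailment_input("a man is sleeping", "a person is asleep"))
```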
