BERT - Masking things out
Mario Parreño · #nlp #transformer #paper #encoder

Left-to-right architectures may be sub-optimal for sentence-level tasks such as sentence classification, and very harmful for token-level tasks such as question answering or named entity recognition. There are tasks where incorporating context from left and right of a token is crucial.

BERT, which stands for Bidirectional Encoder Representations from Transformers, aims to alleviate the problem of unidirectionality by using a masked language model (MLM) pre-training objective, inspired by the Cloze task. In addition, the authors use a next sentence prediction task that jointly pre-trains text-pair representations.

There are two steps in the BERT framework: pre-training and fine-tuning. BERT aims to first understand the language and then use that understanding to perform a task, instead of using the task to understand the language as previous models did. This way, the model can be fine-tuned to perform a wide variety of tasks with minimal additional training.

The BERT paper showed that a Transformer-based encoder, given a suitable language model training method, is a powerful alternative to previous language models. More importantly, the authors showed that this pre-trained language model can be transferred to a wide range of NLP tasks without designing a task-specific model architecture.

Architecture

BERT’s model architecture is a multi-layer bidirectional Transformer Encoder, almost identical to the original Transformer implementation. If you want more details about the Transformer architecture, you can check out my Transformer blog post.

BERT denotes the number of Transformer encoder blocks as L, the hidden size as H, and the number of self-attention heads as A. The initial BERT model configurations are the following:

BERT model configurations. BERT_BASE was chosen to have the same model size as OpenAI GPT for comparison purposes.

| Model Name | L (Transformer blocks) | H (Hidden size) | A (Self-Attention heads) |
| ---------- | ---------------------- | --------------- | ------------------------ |
| BERT_BASE  | 12                     | 768             | 12                       |
| BERT_LARGE | 24                     | 1024            | 16                       |

As we will see in the Next Sentence Prediction section, BERT deals with pairs of sentences. To differentiate between the two sentences, BERT separates them with a special token called [SEP]. In addition, BERT adds learned embeddings to every token indicating whether it belongs to the first or second sentence.

Finally, for sentence-level tasks, BERT adds a special token called [CLS] at the beginning of every sequence. The final hidden state of this token is used as the aggregate sequence representation for sentence-level tasks such as classification.
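
As a rough sketch of how this input representation can be built (not the paper's exact implementation: layer normalization and dropout are omitted, and the dimensions are just the BERT_BASE defaults), each token embedding is summed with a learned segment embedding (sentence A or B) and a learned position embedding:

```python
# Illustrative sketch of BERT's input representation (token + segment +
# position embeddings), assuming PyTorch. LayerNorm and dropout, which the
# real model applies afterwards, are omitted for brevity.
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)    # word pieces
        self.segment = nn.Embedding(2, hidden_size)           # sentence A / B
        self.position = nn.Embedding(max_len, hidden_size)    # learned positions

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.token(input_ids)
                + self.segment(segment_ids)
                + self.position(positions))
```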

Pre-training

The objective of pre-training is to learn a general-purpose language representation that can be used for downstream tasks. BERT focuses on two unsupervised tasks: learning from the surrounding context, through a masked language model, and learning from the relationship between two sentences, through a next sentence prediction task (helpful for some downstream tasks such as question answering).

Both tasks are trained at the same time, summing their losses. For the pre-training corpus, BERT uses the BooksCorpus (800M words) and English Wikipedia (2,500M words). Note that it is critical to use a document-level corpus rather than a shuffled sentence-level corpus, so long contiguous sequences can be used.
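
A rough sketch of how the joint objective can be computed, assuming PyTorch and hypothetical model outputs (MLM logits over the vocabulary and NSP logits over two classes), is simply the sum of the two cross-entropy losses:

```python
# Sketch of the joint pre-training loss, assuming PyTorch; the model is
# assumed to return MLM logits over the vocabulary and NSP logits over
# two classes (is-next / not-next).
import torch.nn as nn

mlm_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # skip unselected tokens
nsp_loss_fn = nn.CrossEntropyLoss()

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    # mlm_logits: (batch, seq_len, vocab_size) -> (batch, vocab_size, seq_len)
    mlm_loss = mlm_loss_fn(mlm_logits.transpose(1, 2), mlm_labels)
    # nsp_logits: (batch, 2), nsp_labels: (batch,)
    nsp_loss = nsp_loss_fn(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss
```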

BERT model architecture for pre-training. The model receives a sequence of tokens as input, and outputs a sequence of vectors, one for each input token. The vector corresponding to the `[CLS]` token is used as the aggregate sequence representation for next sentence prediction. Some tokens are masked out with `[MASK]` tokens, and the model is trained to predict the original vocabulary id of the masked word based only on its context.

Masked Language Model

Bidirectional context understanding enables the model to capture intricate dependencies and relationships among words, resulting in more robust and contextually rich representations.

Previous methods suffered from the unidirectionality constraint: a word can only attend to previous words in the self-attention layers. In order to train deep bidirectional representations, BERT simply masks some percentage of the input tokens at random, and then predicts those masked tokens. This is different from traditional language modeling, where the model is trained to predict the next word in a sequence where only the previous words are visible. BERT authors refer to this procedure as a masked language model (MLM).

The procedure is as follows:

  1. Tokenize the input sequence.
  2. Select 15% of the tokens at random and replace each selected token with:
    • 80% of the time: [MASK] token.
    • 10% of the time: a random token.
    • 10% of the time: the original token.
  3. Feed the sequence to the model.
  4. Only for the selected tokens, compute the cross-entropy loss between the model's output and the original tokens.

Although MLM allows BERT to obtain a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. This is why BERT sometimes replaces the selected tokens with random tokens or keeps the original ones instead of always using [MASK].
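
A minimal sketch of this 80/10/10 masking procedure, assuming PyTorch and hypothetical special-token ids (a real tokenizer supplies its own), could look like this:

```python
# Minimal sketch of BERT-style MLM masking, assuming PyTorch.
# MASK_ID and VOCAB_SIZE are illustrative values, not fixed constants.
import torch

MASK_ID = 103          # hypothetical [MASK] token id
VOCAB_SIZE = 30522     # hypothetical vocabulary size
IGNORE_INDEX = -100    # positions with this label are excluded from the loss

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    """Return (masked_inputs, labels) following the 80/10/10 rule."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select 15% of the positions to be predicted; loss is computed only there.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = IGNORE_INDEX

    # 80% of the selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = MASK_ID

    # 10% of the selected positions -> a random token
    # (half of the remaining 20% that were not masked)
    random_pos = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & selected & ~masked)
    input_ids[random_pos] = torch.randint(VOCAB_SIZE, input_ids.shape)[random_pos]

    # The remaining ~10% keep their original token.
    return input_ids, labels
```

The returned labels can then be fed to a cross-entropy loss with `ignore_index=-100`, so only the selected positions contribute to the loss.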

Next Sentence Prediction

Some downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between sentences.

In order to train a model that understands sentence relationships, the BERT authors pre-train the model with a simple task called next sentence prediction. Given two sentences A and B, the model is trained to predict whether B is the sentence that actually follows A in the original document. Training examples are built so that 50% of the time B is the actual next sentence, and 50% of the time it is a random sentence from the corpus.

BERT adds a special token called [SEP] between the two sentences. Finally, the model introduces a [CLS] token at the beginning of the first sentence, and the final hidden state of this token is used for next sentence prediction.
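
A minimal sketch of how such training pairs can be built, assuming the corpus is already split into documents of tokenized sentences (the function name and data layout are illustrative, not from the paper's code):

```python
# Sketch of building a next sentence prediction (NSP) example.
# `documents` is assumed to be a list of documents, each a list of
# already tokenized sentences (lists of word pieces).
import random

def build_nsp_example(documents, doc_idx, sent_idx):
    sentence_a = documents[doc_idx][sent_idx]

    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        sentence_b = documents[doc_idx][sent_idx + 1]   # actual next sentence
        is_next = 1
    else:
        rand_doc = random.randrange(len(documents))     # random sentence from the corpus
        sentence_b = random.choice(documents[rand_doc])
        is_next = 0

    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids, is_next
```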

Fine-tuning

Swapping out the appropriate inputs and outputs, BERT can be used for a wide variety of downstream tasks, whether they involve single texts or text pairs. To do so, BERT fine-tunes all the parameters end-to-end. Compared to pre-training, fine-tuning is relatively inexpensive.

At the output, the token representations are fed into an output layer for token-level tasks such as named entity recognition, and the [CLS] representation is fed into an output layer for classification tasks such as entailment or sentiment analysis.
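
For example, a sentence-level classification head can be as simple as a linear layer on top of the [CLS] representation; the sketch below assumes a generic pre-trained `bert_encoder` module that returns the final hidden states, and is an illustration rather than the paper's exact fine-tuning setup.

```python
# Sketch of a sentence-level fine-tuning head on top of the [CLS]
# representation, assuming PyTorch and a generic pre-trained encoder.
import torch.nn as nn

class BertForSentenceClassification(nn.Module):
    def __init__(self, bert_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder                      # pre-trained encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, segment_ids):
        # Assumed to return hidden states of shape (batch, seq_len, hidden)
        hidden_states = self.bert(input_ids, segment_ids)
        cls_repr = hidden_states[:, 0]                # [CLS] is the first token
        return self.classifier(cls_repr)              # logits for the task
```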

Glossary

  • L: Number of Transformer encoder blocks.
  • H: Size of the embeddings. An embedding is a learnable representation of the words of the vocabulary.
  • A: Number of self-attention heads.
  • w: Input sequence length.
