RoBERTa - A Robustly Optimized BERT Pretraining Approach
Mario Parreño · #nlp #transformer #paper #encoder

RoBERTa (a Robustly Optimized BERT pretraining Approach) presents a replication study of BERT pretraining, carefully evaluating the impact of key hyperparameters and training data size. The authors show that BERT was significantly undertrained and propose an improved recipe for training it.

In summary, the authors show that, under the right design choices, BERT's performance is competitive with methods published after it. They present a set of BERT design choices and training strategies that lead to better performance, along with a novel dataset, CC-News, confirming that using more data further improves downstream performance.

BERT Improvements

RoBERTa uses the same model architecture as BERT, but trains it with a few key differences.

Static vs. Dynamic Masking

BERT relies on randomly masking and predicting tokens. The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, the training data was duplicated 10 times so that each sequence is masked in 10 different ways.

RoBERTa compares this strategy with dynamic masking, where the mask pattern is generated every time a sequence is fed into the model. This becomes crucial when pretraining for more steps or with larger datasets. The results show that dynamic masking is comparable to or slightly better than static masking.
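
To make the contrast concrete, here is a minimal PyTorch sketch of dynamic masking (the helper name is hypothetical and special-token handling is omitted for brevity): the mask is re-sampled every time a batch is built, following BERT's 80/10/10 corruption rule, instead of being fixed once during preprocessing.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Sample a fresh MLM mask for a batch (80% [MASK], 10% random, 10% kept)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions to predict; the loss is computed only on these.
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100

    # 80% of the selected positions are replaced with [MASK].
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    # Half of the remainder (10% overall) gets a random token; the rest is kept.
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    return input_ids, labels
```

Calling a function like this inside the data collator, once per batch, is what makes the masking dynamic; the static variant would precompute these outputs and reuse them every epoch.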

Input Format and Next Sentence Prediction

Original BERT takes as input a concatenation of two segments (sequences of tokens), $x_1, \ldots, x_N$ and $y_1, \ldots, y_M$. Segments usually consist of more than one natural sentence. The two segments are packed together as a single input sequence to BERT with special tokens delimiting them: $[CLS], x_1, \ldots, x_N, [SEP], y_1, \ldots, y_M, [EOS]$, constrained such that $N + M < T$ for some maximum sequence length $T$.

In addition to the masked language modeling objective, BERT is trained to predict whether the observed document segments come from the same document or not via an auxiliary Next Sentence Prediction (NSP) loss.

However, some recent work questioned the necessity of the NSP loss, and RoBERTa compares several alternative training formats:

  • SEGMENT-PAIR+NSP: The original BERT format. The input is a pair of segments, which can contain multiple natural sentences. Max length 512 tokens. The NSP loss is used.
  • SENTENCE-PAIR+NSP: The input is a pair of natural sentences. Since the inputs are significantly shorter, the batch size is increased to achieve a similar number of total tokens. The NSP loss is used.
  • FULL-SENTENCES: The inputs are full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries: when the end of a document is reached, sampling continues from the next document, with an extra separator token added between documents. The NSP loss is removed.
  • DOC-SENTENCES: The inputs are full sentences sampled contiguously from a single document, without crossing document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so the batch size is increased dynamically to achieve a similar number of total tokens. The NSP loss is removed.

The authors find that using individual sentences hurts performance on downstream tasks, hypothesizing that the model is not able to learn long-range dependencies. They also find that removing the NSP loss matches or slightly improves performance on most tasks. With that in mind, RoBERTa uses the FULL-SENTENCES format without the NSP loss.
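
As a rough illustration of the FULL-SENTENCES format, the sketch below (the function and argument names are hypothetical) packs contiguous sentences into inputs of at most 512 tokens, crossing document boundaries and inserting an extra separator between documents.

```python
def pack_full_sentences(documents, tokenize, sep_id, max_len=512):
    """Pack contiguous sentences into inputs of at most max_len tokens.
    documents: list of documents, each a list of sentence strings.
    tokenize:  assumed callable returning a list of token ids for a sentence."""
    packed, buffer = [], []
    for doc in documents:
        for sentence in doc:
            tokens = tokenize(sentence)
            # Flush the current input when the next sentence would not fit.
            if buffer and len(buffer) + len(tokens) > max_len:
                packed.append(buffer)
                buffer = []
            buffer.extend(tokens)
        # Extra separator token between documents (inputs may cross boundaries).
        if buffer and len(buffer) < max_len:
            buffer.append(sep_id)
    if buffer:
        packed.append(buffer)
    return packed
```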

Large Batch Sizes

Training with very large batch sizes can improve both optimization speed and end-task performance when the learning rate is scaled appropriately. The authors observe that training with large batches improves perplexity on the masked language modeling objective, as well as end-task accuracy.
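
In practice, batches of this size (the paper experiments with mini-batches of up to 8K sequences) rarely fit in accelerator memory, so they are commonly emulated with gradient accumulation. A minimal sketch, assuming a Hugging Face-style model whose output exposes a loss, plus an optimizer and a data loader:

```python
accumulation_steps = 32  # effective batch = loader batch size * 32

optimizer.zero_grad()
for step, batch in enumerate(loader):
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = model(**batch).loss / accumulation_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective (large) batch
        optimizer.zero_grad()
```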

Text Encoding: Byte-Level BPE

Byte-Pair Encoding (BPE) relies on subword units, which are extracted by performing statistical analysis of the training corpus. The original BERT implementation uses a character-level BPE vocabulary of size 30K. RoBERTa instead uses a byte-level BPE vocabulary of size 50K, whose base units are bytes rather than Unicode characters, so it can encode any text without ever producing unknown tokens.

Early experiments revealed only slight differences between these encodings, with byte-level BPE performing slightly worse on some end tasks; the authors nevertheless favor the advantages of a universal encoding scheme over the minor degradation in performance.
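
For instance, a byte-level BPE vocabulary can be trained with the Hugging Face tokenizers library; the sketch below uses an illustrative file path and special tokens. Because the base alphabet is bytes, any Unicode string can be encoded without falling back to an unknown token.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a 50K byte-level BPE vocabulary from plain-text files (path is illustrative).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=50000,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Any Unicode input maps to known byte-level pieces.
encoding = tokenizer.encode("naïve café ☕")
print(encoding.tokens)
```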

Training Data

Additionally, authors investigate two other important factors: the amount of training data used for pretraining, and the number of training passes through the data.

RoBERTa combines datasets of varying sizes and domains, totaling over 160GB of uncompressed text:

  • BookCorpus + English Wikipedia: The original BERT training data.
  • CC-News: The English portion of the CommonCrawl News dataset, collected by the authors.
  • OpenWebText: Web content extracted from URLs shared on Reddit with at least 3 upvotes.
  • STORIES: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.

The authors observe a small improvement when combining all datasets and training for the same number of steps. Finally, when pretraining RoBERTa for more steps, performance keeps improving without signs of overfitting, suggesting that RoBERTa would likely benefit from even more training.
