BERT (language model)
Bidirectional Encoder Representations from Transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning, and it uses the encoder-only transformer architecture. BERT dramatically improved the state of the art for large language models and is a ubiquitous baseline in natural language processing experiments.
BERT is trained by masked token prediction and next sentence prediction. Through this training, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2. It has found applications in many natural language processing tasks, such as coreference resolution and polysemy resolution. It improved on ELMo and spawned the study of "BERTology", which attempts to interpret what is learned by BERT.
BERT was originally implemented in the English language at two model sizes, BERTBASE and BERTLARGE. Both were trained on the Toronto BookCorpus and English Wikipedia. The weights were released on GitHub. On March 11, 2020, 24 smaller models were released, the smallest being BERTTINY with just 4 million parameters.
Architecture
BERT is an "encoder-only" transformer architecture. At a high level, BERT consists of four modules:
- Tokenizer: This module converts a piece of English text into a sequence of integers ("tokens").
- Embedding: This module converts the sequence of tokens into an array of real-valued vectors representing the tokens. It represents the conversion of discrete token types into a lower-dimensional Euclidean space.
- Encoder: a stack of Transformer blocks with self-attention, but without causal masking.
- Task head: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types, or as an "un-embedding layer".
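The way these four modules compose can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch outline assuming BERTBASE-like sizes; the class name BertSketch and the use of PyTorch's built-in TransformerEncoder are choices made here for brevity and do not correspond to the original implementation (the tokenizer step, text to integer ids, happens before this module).

import torch
import torch.nn as nn

class BertSketch(nn.Module):
    """Illustrative composition of BERT's modules (not the official implementation)."""
    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12):
        super().__init__()
        # Embedding: discrete token ids -> dense vectors.
        self.embedding = nn.Embedding(vocab_size, hidden)
        # Encoder: a stack of Transformer blocks with bidirectional self-attention.
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        # Task head ("un-embedding"): hidden vectors -> logits over the token vocabulary.
        self.task_head = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):              # token_ids: (batch, seq_len), from the tokenizer
        x = self.embedding(token_ids)          # (batch, seq_len, hidden)
        x = self.encoder(x)                    # no causal mask: every token attends to all others
        return self.task_head(x)               # (batch, seq_len, vocab_size)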
Embedding
This section describes the embedding used by BERTBASE. The other one, BERTLARGE, is similar, just larger.
The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte-pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK] ("unknown").
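As a concrete illustration of sub-word tokenization, the snippet below loads a pretrained BERT tokenizer through the Hugging Face transformers library; that library is a third-party re-implementation and is assumed here only for convenience, not part of the original release.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)                  # roughly 30,000 word pieces
print(tokenizer.tokenize("my dog is cute"))  # common words map to single tokens
print(tokenizer.tokenize("electroencephalography"))
# Rare words are split into '##'-prefixed continuation pieces;
# the exact split depends on the learned vocabulary.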
The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.
- Token type: The token type is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.
- Position: The position embeddings are based on a token's position in the sequence. BERT uses learned absolute position embeddings: each position in the sequence (up to a maximum length of 512) is mapped to its own trained real-valued vector, rather than being computed by a fixed sinusoidal function as in the original transformer.
- Segment type: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the [SEP] special token. All prior tokens are type-0.
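The three embeddings are summed position-wise to form each token's initial vector. Below is a minimal sketch, assuming BERTBASE-like sizes and illustrative names; the real model additionally applies layer normalization and dropout after the sum.

import torch
import torch.nn as nn

class BertEmbeddingsSketch(nn.Module):
    """Illustrative sum of token, position, and segment embeddings (not the original code)."""
    def __init__(self, vocab_size=30522, hidden=768, max_positions=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)         # token type embedding
        self.position = nn.Embedding(max_positions, hidden)   # learned absolute positions
        self.segment = nn.Embedding(2, hidden)                 # segment id: 0 or 1

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: integer tensors of shape (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.position(positions)     # broadcasts over the batch dimension
                + self.segment(segment_ids))

ids = torch.tensor([[101, 7592, 102]])         # arbitrary example ids; 101/102 are [CLS]/[SEP]
segs = torch.zeros_like(ids)
print(BertEmbeddingsSketch()(ids, segs).shape)  # torch.Size([1, 3, 768])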
Architectural family
The encoder stack of BERT has two free parameters: L, the number of layers, and H, the hidden size. There are always H/64 self-attention heads, and the feed-forward/filter size is always 4H. By varying these two numbers, one obtains an entire family of BERT models. For BERT:
- the feed-forward size and filter size are synonymous. Both of them denote the number of dimensions in the middle layer of the feed-forward network.
- the hidden size and embedding size are synonymous. Both of them denote the number of real numbers used to represent a token.
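The published model sizes follow this pattern. The helper below is only a sketch of the arithmetic, not the official configuration code:

def bert_family(L, H):
    """Derive the dependent sizes from the two free parameters L (layers) and H (hidden)."""
    return {
        "layers": L,
        "hidden": H,
        "attention_heads": H // 64,   # always H/64 self-attention heads
        "feed_forward": 4 * H,        # feed-forward/filter size is always 4H
    }

print(bert_family(12, 768))    # BERTBASE:  12 heads, feed-forward size 3072
print(bert_family(24, 1024))   # BERTLARGE: 16 heads, feed-forward size 4096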
Training
Pre-training
BERT was pre-trained simultaneously on two tasks:
- Masked language modeling: In this task, BERT ingests a sequence of words, where some words are randomly masked, and BERT tries to predict the original words that had been changed. For example, in the sentence "The cat sat on the [MASK]," BERT would need to predict "mat". This helps BERT learn bidirectional context, meaning it understands the relationships between words not just from left to right or right to left, but from both directions at the same time.
- Next sentence prediction: In this task, BERT is trained to predict whether one sentence logically follows another. For example, given the two sentences "The cat sat on the mat" and "It was a sunny day", BERT has to decide if the second sentence is a valid continuation of the first one. This helps BERT understand relationships between sentences, which is important for tasks like question answering or document classification.
Masked language modeling
In masked language modeling, 15% of the tokens in each sequence are selected for the prediction task, and each selected token is:
- replaced with a [MASK] token with probability 80%,
- replaced with a random word token with probability 10%,
- not replaced with probability 10%.
The reason not every selected token is masked is to avoid a mismatch between pre-training and later use, since downstream text contains no [MASK] tokens. It was later found that more diverse training objectives are generally better. As an illustrative example, consider the sentence "my dog is cute". It would first be divided into tokens like "my1 dog2 is3 cute4". Then a random token in the sentence would be picked. Let it be the 4th one, "cute4". Next, there would be three possibilities:
- with probability 80%, the chosen token is masked, resulting in "my1 dog2 is3 [MASK]4";
- with probability 10%, the chosen token is replaced by a uniformly sampled random token, such as "happy", resulting in "my1 dog2 is3 happy4";
- with probability 10%, nothing is done, resulting in "my1 dog2 is3 cute4".
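The selection-and-replacement procedure can be written down directly. The following sketch uses illustrative names (corrupt, MASK_TOKEN) and Python's random module; it is not the original training code.

import random

MASK_TOKEN = "[MASK]"

def corrupt(tokens, vocabulary, select_prob=0.15):
    """Sketch of BERT's masking: select ~15% of tokens, then apply the 80/10/10 rule."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= select_prob:
            continue                              # token not selected for prediction
        targets[i] = tok                          # the model must predict the original token here
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN             # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocabulary)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets

print(corrupt(["my", "dog", "is", "cute"], vocabulary=["happy", "house", "runs"]))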
Next sentence prediction
Given two sentences, the model predicts if they appear sequentially in the training corpus, outputting either IsNext or NotNext. During training, the algorithm sometimes samples two sentences from a single continuous span in the training corpus, while at other times it samples two sentences from two discontinuous spans.
The first sentence starts with a special token, [CLS]. The two sentences are separated by another special token, [SEP]. After processing the two sentences, the final vector for the [CLS] token is passed to a linear layer for binary classification into IsNext and NotNext. For example:
- Given "
my dog is cutehe likes playing", the model should predict. - Given "
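The packed input for this task can be sketched as follows. The function name build_nsp_input is illustrative, not from the original preprocessing code; the trailing [SEP] that closes the pair follows the input format of the original paper.

def build_nsp_input(sentence_a, sentence_b):
    """Sketch: pack two tokenized sentences into BERT's [CLS] ... [SEP] ... [SEP] format."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment ids: 0 for the first sentence (and its delimiters), 1 for the second.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segments = build_nsp_input(["my", "dog", "is", "cute"],
                                   ["he", "likes", "playing"])
print(tokens)    # ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]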
Fine-tuning
The original BERT paper published results demonstrating that a small amount of fine-tuning allowed it to achieve state-of-the-art performance on a number of natural language understanding tasks:
- GLUE task set;
- SQuAD v1.1 and v2.0;
- SWAG.
For classification tasks, the output vector at the [CLS] input token is fed into a linear-softmax layer to produce the label outputs. The original code base defined the final linear layer as a "pooler layer", in analogy with global pooling in computer vision, even though it simply discards all output tokens except the one corresponding to [CLS].
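Such a task head can be sketched as a single linear-softmax layer applied to the [CLS] output vector. The class name and sizes below are illustrative assumptions, not the original code.

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of a fine-tuning head: classify using only the [CLS] output vector."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)

    def forward(self, encoder_output):                  # (batch, seq_len, hidden)
        cls_vector = encoder_output[:, 0]               # [CLS] is always the first position
        return torch.softmax(self.linear(cls_vector), dim=-1)  # label probabilities

head = ClassificationHead()
print(head(torch.randn(1, 8, 768)).shape)               # torch.Size([1, 2])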