History of Pretrained Language Models Development

Pretraining

If tasks A and B are similar, and a model for task A has already been pretrained on large-scale data, that model can be adapted (fine-tuned or frozen) to task B with only a small amount of task-B data.

If tasks A and B are not similar, BERT can be used to handle the problem (discussed later).

Language Model

A language model assigns a probability to a sentence. To compute the probability of the next word, in principle all of the preceding words in the sentence need to be conditioned on.
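
Written out as a worked equation (the standard chain-rule factorization, added here for clarity):

```latex
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \dots, w_{t-1})
```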

Statistical Language Model (n-gram Language Model)

If using an n-gram language model, only the previous n−1 words before the next word are considered (the model works with n-word windows), which greatly reduces computation.
If an n-gram never appears in the training data, a smoothing strategy assigns it a very small non-zero probability instead of zero (see the sketch below; search for smoothing strategies such as add-one/Laplace or Kneser-Ney for details).
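
An n-gram model approximates P(w_t | w_1, …, w_{t-1}) by P(w_t | w_{t-n+1}, …, w_{t-1}). Below is a minimal Python sketch with n = 2 (bigrams) and add-one (Laplace) smoothing; the corpus is a toy example, and real systems use larger n and more sophisticated smoothing:

```python
from collections import defaultdict

# Toy bigram (n = 2) language model with add-one (Laplace) smoothing.
corpus = [
    ["i", "like", "apples"],
    ["i", "like", "oranges"],
    ["you", "like", "apples"],
]

vocab = {w for sent in corpus for w in sent}
unigram = defaultdict(int)
bigram = defaultdict(int)

for sent in corpus:
    for prev, curr in zip(sent, sent[1:]):
        unigram[prev] += 1
        bigram[(prev, curr)] += 1

def p_next(prev, curr):
    # P(curr | prev) with add-one smoothing: unseen bigrams get a small
    # non-zero probability instead of zero.
    return (bigram[(prev, curr)] + 1) / (unigram[prev] + len(vocab))

print(p_next("like", "apples"))  # seen bigram -> relatively high probability
print(p_next("like", "you"))     # unseen bigram -> small but non-zero
```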

Word Embeddings

Word Embeddings Obtained by One-hot Encoding

  • Initially, word vectors were obtained through one-hot encoding, but a one-hot vector is as long as the vocabulary and almost entirely zeros, so it occupies too much space (see the sketch below).
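
A minimal sketch of one-hot encoding (toy vocabulary chosen for illustration):

```python
import numpy as np

# Toy vocabulary; a real vocabulary has tens of thousands of words,
# so each one-hot vector would be tens of thousands of dimensions long.
vocab = ["i", "like", "apples", "oranges", "you"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

print(one_hot("apples"))  # [0. 0. 1. 0. 0.] -- a single 1, everything else 0
```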

Word Embeddings Obtained by NNLM (Neural Network Language Model)

  • The NNLM (Neural Network Language Model) was originally built to predict the next word, but as a by-product it trains a Q matrix whose rows serve as word vectors. These dense vectors take up far less space than one-hot encoding: a 10-dimensional vector can represent a word from a 100-word vocabulary, instead of a 100-dimensional one-hot vector (see the sketch below).
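
A minimal sketch of the lookup, assuming a 100-word vocabulary and 10-dimensional embeddings as in the note above (the Q matrix here is random, standing in for a trained one):

```python
import numpy as np

vocab_size, embed_dim = 100, 10

# Q would be learned by the NNLM; a random matrix stands in for it here.
Q = np.random.randn(vocab_size, embed_dim)

word_id = 42                      # index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by Q simply selects one row of Q:
# the 10-dimensional word vector.
word_vector = one_hot @ Q
assert np.allclose(word_vector, Q[word_id])
print(word_vector.shape)          # (10,)
```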

Word Embeddings Obtained by word2vec

  • Subsequently, word2vec (CBOW and skip-gram) was designed specifically to train the Q matrix and use it as word embeddings, rather than obtaining it as a by-product of next-word prediction. However, word2vec cannot handle polysemy: for example, apple the edible fruit and Apple the phone brand get the same word vector. ELMo was therefore invented to address polysemy (see the skip-gram sketch below).
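
A minimal skip-gram sketch in numpy (toy corpus, full softmax instead of the negative sampling used in practice; this is illustrative, not the original word2vec implementation):

```python
import numpy as np

# Skip-gram: the center word predicts each word in its context window.
corpus = [["i", "eat", "an", "apple"],
          ["the", "apple", "phone", "is", "expensive"]]

vocab = sorted({w for sent in corpus for w in sent})
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
Q = rng.normal(scale=0.1, size=(V, D))      # input embeddings (the "Q matrix")
W_out = rng.normal(scale=0.1, size=(D, V))  # output (context) weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    for sent in corpus:
        ids = [word_to_id[w] for w in sent]
        for i, center in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j == i:
                    continue
                context = ids[j]
                h = Q[center]                  # embedding lookup
                probs = softmax(h @ W_out)     # P(context word | center word)
                grad = probs.copy()
                grad[context] -= 1.0           # d(cross-entropy)/d(logits)
                grad_h = W_out @ grad          # gradient w.r.t. the embedding
                W_out -= lr * np.outer(h, grad)
                Q[center] -= lr * grad_h

# Q[word_to_id["apple"]] is the single vector for "apple", shared by the
# fruit sense and the phone-brand sense: word2vec cannot separate the two.
print(Q[word_to_id["apple"]])
```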

Word Embeddings Obtained by ELMo

ELMo was invented to address polysemy: it does not just train a static Q matrix, it also feeds in the context around each word, so the same word can receive different vectors in different sentences.
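
A much-simplified PyTorch sketch of the idea (a bidirectional LSTM over static embeddings produces context-dependent vectors); this is not the actual ELMo architecture or training setup:

```python
import torch
import torch.nn as nn

# Toy vocabulary and untrained weights, purely for illustration.
vocab = ["i", "eat", "an", "apple", "bought", "phone", "the"]
word_to_id = {w: i for i, w in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 16)          # static lookup (the "Q matrix")
bilstm = nn.LSTM(16, 16, batch_first=True, bidirectional=True)

def contextual_vectors(sentence):
    ids = torch.tensor([[word_to_id[w] for w in sentence]])
    static = embed(ids)                       # same vector for "apple" everywhere
    contextual, _ = bilstm(static)            # now depends on the whole sentence
    return contextual[0]

v1 = contextual_vectors(["i", "eat", "an", "apple"])[3]               # apple the fruit
v2 = contextual_vectors(["i", "bought", "the", "apple", "phone"])[3]  # apple the brand
print(torch.allclose(v1, v2))  # False: the contextual vectors differ by sentence
```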

Downstream Task Modification

Given a sentence, first one-hot encode its words, then look up word vectors directly from the pretrained word2vec Q matrix, and then train the downstream task in one of two ways (see the sketch after this list):

  1. Freeze: the Q matrix is kept unchanged during downstream training.
  2. Fine-tune: the Q matrix is updated along with the downstream task.
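
A minimal PyTorch sketch of the two options, assuming a pretrained Q matrix is available as a numpy array (the `pretrained_Q` below is a hypothetical stand-in):

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical pretrained Q matrix: vocabulary of 100 words, 10-dim vectors.
pretrained_Q = np.random.randn(100, 10).astype("float32")

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_Q),
    freeze=True,   # Option 1 (freeze): Q stays unchanged during training.
)                  # Set freeze=False for Option 2 (fine-tune): Q is updated.

classifier = nn.Linear(10, 2)  # a stand-in downstream task head

# Only parameters with requires_grad=True are updated by the optimizer.
params = [p for p in list(embedding.parameters()) + list(classifier.parameters())
          if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-3)
```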

Attention (QKV)

The query Q (me looking at the image) is compared against the thing being looked at (the image, represented by keys K and values V). The higher the similarity, the more important that part of V is to Q.
Concretely: calculate the similarity between Q and each key (k1, k2, …, kn) in K by dot product (in the basic setting, K comes from the same source as V and may simply be V or a projection of it), apply softmax to turn the similarities into probabilities, then use those probabilities to weight V, producing a new V', which encodes what is more important and what is not (see the sketch below).
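
A minimal numpy sketch of (scaled) dot-product attention; the shapes are toy values for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # 1) dot products of Q with every key measure similarity,
    # 2) softmax turns the similarities into weights that sum to 1,
    # 3) the weights mix the rows of V into a new V'.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (num_queries, num_keys)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # (num_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))           # 2 queries
K = rng.normal(size=(4, 8))           # 4 keys
V = rng.normal(size=(4, 8))           # 4 values
print(attention(Q, K, V).shape)       # (2, 8): one new vector per query
```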

Self-Attention

In self-attention, Q, K, and V are all computed from the same input sequence via learned projection matrices; Q is then multiplied with K to calculate similarity, exactly as above (see the sketch below).
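
A minimal sketch of self-attention: the same input X is projected into Q, K, and V, then scaled dot-product attention is applied as in the previous sketch (the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))       # one sequence of 5 token embeddings

# Q, K, V are all projections of the same input X.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product attention, as in the previous sketch.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V
print(out.shape)                        # (5, 8): one context-aware vector per token
```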

Transformer

GPT

BERT

References

09 Transformer: What is the Attention Mechanism (Attention), bilibili