History of Pretrained Language Models Development

Pretraining

If tasks A and B are similar, and a model for task A has already been pretrained on large-scale data, that model can be adapted (fine-tuned or frozen) to task B with only a small amount of task-B data.

If tasks A and B are not similar, BERT can be used to handle the problem (discussed later).

Language Model

A language model assigns a probability to a sentence. To compute the probability of the next word, in principle all of the preceding words in the sentence need to be conditioned on.
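
Written out as a worked equation (the standard chain-rule factorization, added here for clarity):

```latex
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \dots, w_{t-1})
```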

Statistical Language Model (n-gram Language Model)

If using an n-gram language model, only the previous n−1 words before the next word are considered (the model works with n-word windows), which greatly reduces computation.
If an n-gram never appears in the training data, a smoothing strategy assigns it a very small non-zero probability instead of zero (see the sketch below; search for smoothing strategies such as add-one/Laplace or Kneser-Ney for details).
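
An n-gram model approximates P(w_t | w_1, …, w_{t-1}) by P(w_t | w_{t-n+1}, …, w_{t-1}). Below is a minimal Python sketch with n = 2 (bigrams) and add-one (Laplace) smoothing; the corpus is a toy example, and real systems use larger n and more sophisticated smoothing:

```python
from collections import defaultdict

# Toy bigram (n = 2) language model with add-one (Laplace) smoothing.
corpus = [
    ["i", "like", "apples"],
    ["i", "like", "oranges"],
    ["you", "like", "apples"],
]

vocab = {w for sent in corpus for w in sent}
unigram = defaultdict(int)
bigram = defaultdict(int)

for sent in corpus:
    for prev, curr in zip(sent, sent[1:]):
        unigram[prev] += 1
        bigram[(prev, curr)] += 1

def p_next(prev, curr):
    # P(curr | prev) with add-one smoothing: unseen bigrams get a small
    # non-zero probability instead of zero.
    return (bigram[(prev, curr)] + 1) / (unigram[prev] + len(vocab))

print(p_next("like", "apples"))  # seen bigram -> relatively high probability
print(p_next("like", "you"))     # unseen bigram -> small but non-zero
```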

Word Embeddings

Word Embeddings Obtained by One-hot Encoding

  • Initially, word vectors were obtained through one-hot encoding, but a one-hot vector is as long as the vocabulary and almost entirely zeros, so it occupies too much space (see the sketch below).
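
A minimal sketch of one-hot encoding (toy vocabulary chosen for illustration):

```python
import numpy as np

# Toy vocabulary; a real vocabulary has tens of thousands of words,
# so each one-hot vector would be tens of thousands of dimensions long.
vocab = ["i", "like", "apples", "oranges", "you"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

print(one_hot("apples"))  # [0. 0. 1. 0. 0.] -- a single 1, everything else 0
```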

Word Embeddings Obtained by NNLM (Neural Network Language Model)

  • The NNLM (Neural Network Language Model) was originally built to predict the next word, but as a by-product it trains a Q matrix whose rows serve as word vectors. These dense vectors take up far less space than one-hot encoding: a 10-dimensional vector can represent a word from a 100-word vocabulary, instead of a 100-dimensional one-hot vector (see the sketch below).
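
A minimal sketch of the lookup, assuming a 100-word vocabulary and 10-dimensional embeddings as in the note above (the Q matrix here is random, standing in for a trained one):

```python
import numpy as np

vocab_size, embed_dim = 100, 10

# Q would be learned by the NNLM; a random matrix stands in for it here.
Q = np.random.randn(vocab_size, embed_dim)

word_id = 42                      # index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by Q simply selects one row of Q:
# the 10-dimensional word vector.
word_vector = one_hot @ Q
assert np.allclose(word_vector, Q[word_id])
print(word_vector.shape)          # (10,)
```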

Word Embeddings Obtained by word2vec

  • Subsequently, word2vec (CBOW and skip-gram) was designed specifically to train the Q matrix and use it as word embeddings, rather than obtaining it as a by-product of next-word prediction. However, word2vec cannot handle polysemy: for example, apple the edible fruit and Apple the phone brand get the same word vector. ELMo was therefore invented to address polysemy (see the skip-gram sketch below).
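
A minimal skip-gram sketch in numpy (toy corpus, full softmax instead of the negative sampling used in practice; this is illustrative, not the original word2vec implementation):

```python
import numpy as np

# Skip-gram: the center word predicts each word in its context window.
corpus = [["i", "eat", "an", "apple"],
          ["the", "apple", "phone", "is", "expensive"]]

vocab = sorted({w for sent in corpus for w in sent})
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
Q = rng.normal(scale=0.1, size=(V, D))      # input embeddings (the "Q matrix")
W_out = rng.normal(scale=0.1, size=(D, V))  # output (context) weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    for sent in corpus:
        ids = [word_to_id[w] for w in sent]
        for i, center in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j == i:
                    continue
                context = ids[j]
                h = Q[center]                  # embedding lookup
                probs = softmax(h @ W_out)     # P(context word | center word)
                grad = probs.copy()
                grad[context] -= 1.0           # d(cross-entropy)/d(logits)
                grad_h = W_out @ grad          # gradient w.r.t. the embedding
                W_out -= lr * np.outer(h, grad)
                Q[center] -= lr * grad_h

# Q[word_to_id["apple"]] is the single vector for "apple", shared by the
# fruit sense and the phone-brand sense: word2vec cannot separate the two.
print(Q[word_to_id["apple"]])
```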

Word Embeddings Obtained by ELMo

ELMo was invented to address polysemy: it does not just train a static Q matrix, it also feeds in the context around each word, so the same word can receive different vectors in different sentences.
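
A much-simplified PyTorch sketch of the idea (a bidirectional LSTM over static embeddings produces context-dependent vectors); this is not the actual ELMo architecture or training setup:

```python
import torch
import torch.nn as nn

# Toy vocabulary and untrained weights, purely for illustration.
vocab = ["i", "eat", "an", "apple", "bought", "phone", "the"]
word_to_id = {w: i for i, w in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 16)          # static lookup (the "Q matrix")
bilstm = nn.LSTM(16, 16, batch_first=True, bidirectional=True)

def contextual_vectors(sentence):
    ids = torch.tensor([[word_to_id[w] for w in sentence]])
    static = embed(ids)                       # same vector for "apple" everywhere
    contextual, _ = bilstm(static)            # now depends on the whole sentence
    return contextual[0]

v1 = contextual_vectors(["i", "eat", "an", "apple"])[3]               # apple the fruit
v2 = contextual_vectors(["i", "bought", "the", "apple", "phone"])[3]  # apple the brand
print(torch.allclose(v1, v2))  # False: the contextual vectors differ by sentence
```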

Downstream Task Modification

Given a sentence, first one-hot encode its words, then look up word vectors directly from the pretrained word2vec Q matrix, and then train the downstream task in one of two ways (see the sketch after this list):

  1. Freeze: the Q matrix is kept unchanged during downstream training.
  2. Fine-tune: the Q matrix is updated along with the downstream task.
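
A minimal PyTorch sketch of the two options, assuming a pretrained Q matrix is available as a numpy array (the `pretrained_Q` below is a hypothetical stand-in):

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical pretrained Q matrix: vocabulary of 100 words, 10-dim vectors.
pretrained_Q = np.random.randn(100, 10).astype("float32")

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_Q),
    freeze=True,   # Option 1 (freeze): Q stays unchanged during training.
)                  # Set freeze=False for Option 2 (fine-tune): Q is updated.

classifier = nn.Linear(10, 2)  # a stand-in downstream task head

# Only parameters with requires_grad=True are updated by the optimizer.
params = [p for p in list(embedding.parameters()) + list(classifier.parameters())
          if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-3)
```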

Attention (QKV)

The query Q (me looking at the image) is compared against the thing being looked at (the image, represented by keys K and values V). The higher the similarity, the more important that part of V is to Q.
Concretely: calculate the similarity between Q and each key (k1, k2, …, kn) in K by dot product (in the basic setting, K comes from the same source as V and may simply be V or a projection of it), apply softmax to turn the similarities into probabilities, then use those probabilities to weight V, producing a new V', which encodes what is more important and what is not (see the sketch below).
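
A minimal numpy sketch of (scaled) dot-product attention; the shapes are toy values for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # 1) dot products of Q with every key measure similarity,
    # 2) softmax turns the similarities into weights that sum to 1,
    # 3) the weights mix the rows of V into a new V'.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (num_queries, num_keys)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # (num_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))           # 2 queries
K = rng.normal(size=(4, 8))           # 4 keys
V = rng.normal(size=(4, 8))           # 4 values
print(attention(Q, K, V).shape)       # (2, 8): one new vector per query
```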

Self-Attention

In self-attention, Q, K, and V are all computed from the same input sequence via learned projection matrices; Q is then multiplied with K to calculate similarity, exactly as above (see the sketch below).
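
A minimal sketch of self-attention: the same input X is projected into Q, K, and V, then scaled dot-product attention is applied as in the previous sketch (the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))       # one sequence of 5 token embeddings

# Q, K, V are all projections of the same input X.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product attention, as in the previous sketch.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V
print(out.shape)                        # (5, 8): one context-aware vector per token
```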

Transformer

GPT

BERT

References

09 Transformer: What is the Attention Mechanism (Attention), bilibili