[Repost] What evaluation metrics are there for large models?

This article is transcoded by SimpRead, original address github.com

You may have heard that Model A is better than Model B, but do you know how to evaluate these models? In the field of large models, there are many metrics that help us evaluate model performance. These metrics help us understand the accuracy, efficiency, and interpretability of models. In this article, we will introduce some commonly used metrics and how to use them to evaluate model performance.

  • During training of large models, we need an objective function (loss function) to guide the model’s gradient descent;
  • After training, we use metrics such as BLEU or ROUGE to evaluate model performance;
  • Before official release, we use various benchmarks to evaluate model performance, such as GLUE, SuperGLUE, SQuAD, CoLA, etc.;
  • Finally, we compare the model with others in an arena to determine performance.

Below, we introduce LLM evaluation metrics from these four aspects.

Cross Entropy

Entropy

Entropy is a very important concept in physics and information theory. It originally comes from the second law of thermodynamics and describes the degree of disorder of a system or the uniformity of energy distribution. In different fields, entropy has different meanings and applications:

  • Entropy in thermodynamics: Thermodynamic entropy is a state function representing the disorder of the system’s energy distribution. An increase in entropy usually means the system becomes more disordered. The second law of thermodynamics states that the entropy of an isolated system tends to increase until thermal equilibrium is reached;

  • Entropy in information theory: Claude Shannon introduced the concept of entropy into information theory as a measure of information uncertainty. In information theory, entropy quantifies the expected amount of information produced by a source: the higher the entropy of an information source, the more uncertain and unpredictable its output;

  • Entropy in statistics and probability theory: In statistics and probability, entropy can be seen as a measure of uncertainty of a random variable. If all possible outcomes of a random variable are equally likely, entropy reaches its maximum.

Mathematically, entropy is usually defined as follows:

  • For a discrete random variable (X) with probability distribution (P(x)), entropy (H(X)) is defined as:
    $$H(X) = -\sum_{x} P(x) \log_b P(x)$$

  • For a continuous random variable (X) with probability density function (p(x)), the differential entropy (H(X)) is defined as:
    $$H(X) = -\int p(x) \log_b p(x) dx$$

In these formulas, (b) is the base of the logarithm, commonly 2, in which case the unit of entropy is bits.
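As a small illustration of the discrete formula, the following sketch computes entropy in bits for a few hand-picked distributions. The function name `entropy` and the example distributions are illustrative assumptions, not part of the original article.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 contribute nothing (p * log p tends to 0 as p -> 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical example distributions.
fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # more predictable, so lower entropy
uniform_die = entropy([1 / 6] * 6)   # equals log2(6)

print(fair_coin, biased_coin, uniform_die)
```

The fair coin reaches the maximum for two outcomes, while the biased coin scores lower, which matches the statement that equally likely outcomes maximize entropy.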

Entropy in Literary Works

Many literary works also reflect the concept of “entropy.” For example, the big boss behind the scenes of Tianxiabachang’s Underground World is “entropy.” Underground World is another long adventure novel series by Tianxiabachang after Ghost Blows Out the Light. It tells the thrilling death journey of an obscure protagonist who follows an expedition team with a mysterious mission deep underground. The author, Tianxiabachang, is known as China’s most imaginative writer with strong market appeal; his stories are exciting, comprehensive, and captivating.

In the 1960s, Sima Hui and Luo Dahai became gang leaders in the Black House area. Influenced and persuaded by a friend’s brother, Xia Tiedong, they joined the Burmese Communist guerrilla army. After years of battles, the guerrillas led by Sima Hui and Luo Dahai withdrew to Wildman Mountain and were forced to join an expedition team led by Yu Feiyan to seek a mysterious underground treasure, embarking on a thrilling life-and-death journey. The group entered the “Ghost Highway,” chased by the tropical wind cyclone “Tufu,” attacked by giant pythons and man-eating leeches, and fell into the massive crevasse of Wildman Mountain. They were hired by someone but did not know what the client was willing to pay any price to find; unexpectedly, they discovered the Golden Spider City built by the ancient Champa King, which had disappeared under dense fog for a thousand years…

Cross Entropy

Cross-entropy is an important concept in machine learning and information theory, often used to measure the difference between two probability distributions. In classification problems, cross-entropy is commonly used to evaluate the difference between model predictions and true labels.

The formula for cross-entropy is usually expressed as:
$$H(p, q) = -\sum_{i} p(i) \log q(i)$$

where (p) is the true probability distribution; (q) is the predicted probability distribution; and (i) is the class index.

For binary classification problems, the cross-entropy loss function simplifies to:
$$H(p, q) = -[p \log q + (1 - p) \log (1 - q)]$$

where (p) is the true label (0 or 1); (q) is the predicted probability by the model.

For multi-class classification problems, the cross-entropy loss function is:
$$H(p, q) = -\sum_{i=1}^{N} p_i \log q_i$$

where (N) is the number of classes, (p_i) is the true probability of class (i) (usually 0 or 1), and (q_i) is the predicted probability for class (i).
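The formulas above can be sketched in a few lines of Python. This is a minimal illustration using natural logarithms; the function names, the `eps` clamp that avoids `log(0)`, and the example predictions are assumptions made for the sketch.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), in nats."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Clamp q_i away from 0 so that log is always defined.
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def binary_cross_entropy(p, q, eps=1e-12):
    """Binary case: p is the true label (0 or 1), q the predicted probability of class 1."""
    q = min(max(q, eps), 1 - eps)
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Hypothetical three-class example: the true class is index 1.
true_label = [0, 1, 0]
confident = [0.05, 0.90, 0.05]
unsure = [0.40, 0.30, 0.30]

loss_confident = cross_entropy(true_label, confident)  # equals -log(0.9)
loss_unsure = cross_entropy(true_label, unsure)        # equals -log(0.3)
print(loss_confident, loss_unsure)
```

With a one-hot true distribution, only the predicted probability of the correct class matters, so the confident prediction receives the smaller loss.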

Perplexity

Perplexity literally means “degree of confusion” and is a metric used to measure the quality of language models. Its minimum value is 1, and a model that guesses uniformly over the candidate vocabulary has a perplexity equal to the vocabulary size. Perplexity indicates how uncertain a language model is when doing next-token prediction. For example, Perplexity = 81 means that, on average, the model is as uncertain about the next token as if it had to choose the correct answer from 81 equally likely candidates.

Given a test set (W = w_1, w_2, w_3, \ldots, w_m),

Perplexity is defined as the reciprocal of the probability of the test set normalized by the number of tokens:

$$\text{Perplexity}(W) = p(w_1, w_2, w_3, \ldots, w_m)^{-1/m} = \sqrt[m]{\prod_{i=1}^{m} \frac{1}{p(w_i \mid w_1, w_2, w_3, \ldots, w_{i-1})}}$$

The probability of the first token is (p(w_1)), that of the second is (p(w_2 \mid w_1)), and that of the (m)-th is (p(w_m \mid w_1, \ldots, w_{m-1})); the perplexity is the geometric mean of the inverses of these conditional probabilities.
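A minimal sketch of this computation follows, assuming the per-token conditional probabilities are already available; in practice they would come from the language model, and the values here are made up for illustration. Taking the geometric mean of inverse probabilities is the same as exponentiating the average negative log-probability, which is how the sketch avoids numerical underflow on long texts.

```python
import math

def perplexity(token_probs):
    """Perplexity from a list of conditional probabilities p(w_i | w_1, ..., w_{i-1})."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token (the cross-entropy in nats).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities, chosen so the results are easy to check by hand.
uniform_81 = perplexity([1 / 81] * 5)    # every token is a 1-in-81 guess
perfect = perplexity([1.0] * 5)          # the model is always certain and right
mixed = perplexity([0.5, 0.25, 0.125])   # geometric mean of 2, 4 and 8

print(uniform_81, perfect, mixed)
```

The three cases evaluate to roughly 81, 1 and 4 respectively.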

Another Explanation of Perplexity

Suppose there is 1 red ball and 80 black balls; the probability of picking the red ball is 1/81, which also means you need to pick the correct one from 81 options (the inverse). The perplexity is thus 81.

The single red ball represents the correct word; the 80 black balls represent the incorrect candidates that the model cannot rule out. The stronger the model, the more black balls it can exclude. A perfect model is left with only the red ball and no black balls, so its perplexity equals 1.

BLEU Score & ROUGE Score

In NLP, traditional metrics such as precision, recall, and F1-score often fail to adequately evaluate generative model performance because the outputs are natural language texts that can have different expressions but the same meaning. Therefore, specific metrics are needed to evaluate generative models.

BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are two important metrics in natural language processing for evaluating machine translation and text summarization.

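BLEU is an n-gram-based evaluation method that assesses translation quality by comparing the machine translation output to a set of reference translations. The core of BLEU is a precision measure: it counts how many n-grams of the candidate translation also occur in the references, clips each count so that a repeated n-gram cannot be rewarded more often than it appears in a reference, combines the precisions for several n-gram sizes, and applies a brevity penalty to candidates shorter than the reference. It is simple and fast, but it is insensitive to semantic similarity: a correct paraphrase that shares few n-grams with the references scores low.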
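To make the idea concrete, here is a simplified BLEU sketch. It assumes a single reference, whitespace tokenization, uniform weights over n-gram sizes, and no smoothing, so it is not a drop-in replacement for an established BLEU implementation. All function names and example sentences are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # An n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of modified precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
repeated = "the the the the the the".split()

print(modified_precision(repeated, reference, 1))  # clipping limits "the" to 2 of 6
print(simple_bleu(candidate, reference, max_n=2))
print(simple_bleu(reference, reference))
```

The `repeated` example shows why clipping matters: without it, a candidate made only of "the" would get a perfect unigram precision.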

ROUGE is a recall-based evaluation metric, primarily used to evaluate automatic summarization and also machine translation. ROUGE measures the quality of generated summaries or translations by comparing n-gram overlaps with reference summaries or translations. ROUGE has several variants, such as ROUGE-N (n-gram recall), ROUGE-L (longest common subsequence), etc. Whereas BLEU is precision-oriented, ROUGE emphasizes recall, that is, how much of the reference is covered by the generated text. Like BLEU, it measures surface overlap rather than semantic similarity, so it is sensitive to differences in wording and sentence structure.

N-gram

N-gram is a common feature representation method in NLP that segments text into contiguous subsequences of length N, which then serve as features. N-gram models are often used in language modeling, text classification, machine translation, and more.

A single word is called a unigram, a sequence of two words is a bigram, and, in general, a sequence of n words is an n-gram.
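A short sketch of n-gram extraction, assuming simple whitespace tokenization (real systems usually use a proper tokenizer); the helper name `ngrams` and the example sentence are illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous subsequences of length n as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

A text of 6 tokens yields 6 unigrams, 5 bigrams and 4 trigrams; asking for n-grams longer than the text returns an empty list.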

ROUGE-N

ROUGE-N calculates overlap based on n-grams, where “N” denotes the size of the n-gram (a contiguous sequence of N elements, usually words).

ROUGE-N mainly focuses on recall: the proportion of n-grams in the reference text that also appear in the generated text.
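The following is a minimal sketch of an n-gram overlap score in the ROUGE-N style, where recall is taken relative to the reference text. It assumes a single reference and whitespace tokenization; the function names and sentences are illustrative, and precision and F1 are included only for comparison.

```python
from collections import Counter

def _ngram_counts(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return recall, precision and F1 of n-gram overlap."""
    cand = _ngram_counts(candidate, n)
    ref = _ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "the cat is on the mat".split()
candidate = "the cat on the mat".split()

rouge_1 = rouge_n(candidate, reference, n=1)
rouge_2 = rouge_n(candidate, reference, n=2)
print(rouge_1)
print(rouge_2)
```

Here the candidate misses one reference word, so unigram recall is 5/6 even though every candidate word appears in the reference (precision 1.0).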

ROUGE-L

ROUGE-L is based on the Longest Common Subsequence (LCS) between generated and reference texts, that is, the longest sequence of tokens that appear in both texts in the same order, though not necessarily contiguously.
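An LCS-based score can be sketched with standard dynamic programming. This is an illustrative sentence-level version with a single reference and a plain F1 combination; the function names are assumptions, and published ROUGE-L variants may weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """Return LCS-based recall, precision and F1."""
    lcs = lcs_length(candidate, reference)
    recall = lcs / len(reference) if reference else 0.0
    precision = lcs / len(candidate) if candidate else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

scores = rouge_l(candidate, reference)
print(lcs_length(candidate, reference))  # "the cat on the mat" -> 5
print(scores)
```

Because a subsequence may skip tokens, the differing middle word does not break the match: the LCS still spans five of the six tokens.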

Benchmarks

Benchmarks for large models are standard test sets and metrics used to evaluate and compare the performance of large language models (LLMs). They comprehensively assess model capabilities across different domains and tasks, including but not limited to knowledge understanding, logical reasoning, multi-turn dialogue, coding ability, etc.

For example, the General Language Understanding Evaluation (GLUE) benchmark is a famous natural language understanding evaluation suite containing multiple tasks and different datasets to assess model performance across various text types and difficulty levels.

In the Chinese domain, there are benchmarks specialized for Chinese large models, such as CMMLU, which contains questions on 67 topics from various disciplines covering natural sciences, social sciences, engineering, humanities, and common sense, aiming to comprehensively evaluate models’ knowledge reserves and language understanding in Chinese.

Additionally, there are benchmarks focused on specific domains, such as MathEval, which thoroughly evaluates large models’ math problem-solving abilities. It includes 20 math domain test sets and nearly 30K math problems covering branches from arithmetic to advanced mathematics.

Arena

What comes to mind first when you think of Arena?

The large model arena is a platform for comparing LLM performance. It allows large models from different sources to be tested on the same tasks and datasets to evaluate and compare their performance. Such an arena provides researchers, developers, and end users with an intuitive way to measure and select the best AI services.

For example, the LMSys Chatbot Arena Leaderboard is a crowdsourcing-based evaluation leaderboard for large models. Users input questions, and one or more anonymous large models return answers simultaneously. Users then vote according to how well each answer meets their expectations, and the votes are aggregated into crowdsourced evaluation results for the different models.
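The article does not say how votes are turned into a ranking. One common way to rank competitors from pairwise outcomes is an Elo-style rating, and the sketch below uses it purely as an assumption, with each vote treated as a comparison between two models. The model names, starting ratings, and `k` factor are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one vote; score_a is 1 (A wins), 0.5 (tie) or 0 (B wins)."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Hypothetical models and votes: (model A, model B, score for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

After two wins and one loss, `model_x` ends above `model_y`, and the total number of rating points is unchanged because each update moves the same amount from one model to the other.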

References

[1] LMSYS Chatbot Arena Leaderboard

[2] deeplearning.ai