Perplexity (PPL)
Disadvantage: it only measures how well the model predicts the held-out test text; a low perplexity does not guarantee good performance on downstream tasks.
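A minimal sketch of how perplexity is computed, assuming we already have the model's per-token log-probabilities for the test text (the function name and inputs here are illustrative, not from any particular library):

```python
import math

def perplexity(token_log_probs):
    # PPL = exp( -(1/N) * sum_i log p(token_i | context) )
    # i.e. the exponentiated average negative log-likelihood per token.
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Example: a model that assigns probability 0.25 to each of 4 tokens
# behaves like a uniform choice among 4 options, so PPL ≈ 4.
print(perplexity([math.log(0.25)] * 4))  # ≈ 4.0
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.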
Human Evaluation
Annotators rate or compare model outputs directly; expensive and slow, but captures quality aspects that automatic metrics miss.
Parseable Benchmark Datasets
These use question formats whose answers can be checked automatically (multiple-choice, fill-in-the-blank, etc.): the model answers, and accuracy is computed against the gold labels.
Examples include MMLU (Massive Multitask Language Understanding), MMLU-PRO, IF-EVAL, BBH, MATH, and GPQA.
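A sketch of the accuracy computation these benchmarks rely on, assuming the model's answers have already been parsed into choice letters (the data below is hypothetical, not from any real benchmark run):

```python
def accuracy(predictions, answers):
    # Fraction of parsed model answers that exactly match the gold labels.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical MMLU-style multiple-choice results (choice letters A-D)
preds = ["B", "C", "A", "D"]
gold  = ["B", "C", "B", "D"]
print(accuracy(preds, gold))  # 0.75
```

In practice the hard part is parsing a free-form model response into a choice letter; benchmarks typically constrain the prompt format or use a regex/extraction step before scoring.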
Large Language Model Evaluation
References
Do you know what metrics are used to evaluate large language model performance? PPL, MMLU, MATH, GPQA, BBH, IF-EVAL, MMLU-PRO (Bilibili video: 你知道用什么指标评价一个大模型的好坏吗?PPL,MMLU,MATH,GPQA,BBH,IF-EVAL,MMLU-PRO)

