Large Language Model Evaluation Metric — Perplexity

Perplexity

Disadvantage: perplexity only measures how well the model predicts the held-out test text (the exponential of its average per-token negative log-likelihood); a low perplexity does not guarantee good performance on downstream tasks.
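A minimal sketch of the computation: given the log-probability the model assigned to each token of the test text, perplexity is the exponential of the average negative log-likelihood. The function name and the toy log-probabilities below are illustrative, not from any particular library.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability
    the model assigns to each token of the test text."""
    n = len(token_logprobs)
    avg_nll = -sum(token_logprobs) / n
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token
# has a perplexity of about 2 ("as confused as a fair coin flip").
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```

Lower is better: a perplexity of k roughly means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.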

Human Evaluation

Parseable Benchmark Datasets

Question formats whose answers can be scored automatically, such as multiple-choice or fill-in-the-blank questions: ask the model to answer, parse its response, and compute accuracy.
Examples include MMLU (Massive Multitask Language Understanding), MMLU-Pro, IFEval, BBH, MATH, and GPQA.
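The parse-and-score loop these benchmarks rely on can be sketched as follows. The two-item mini-benchmark, the regex-based answer extractor, and all names here are illustrative assumptions, not the scoring code of any real benchmark.

```python
import re

# Hypothetical mini-benchmark: multiple-choice items with a gold option letter.
ITEMS = [
    {"question": "2 + 2 = ?  (A) 3  (B) 4", "gold": "B"},
    {"question": "Capital of France?  (A) Paris  (B) Rome", "gold": "A"},
]

def parse_choice(model_output):
    """Extract the first standalone option letter (A-D) from the model's text."""
    m = re.search(r"\b([A-D])\b", model_output)
    return m.group(1) if m else None

def accuracy(items, outputs):
    """Fraction of items where the parsed answer matches the gold answer."""
    correct = sum(parse_choice(out) == item["gold"]
                  for item, out in zip(items, outputs))
    return correct / len(items)

# Suppose the model answered "The answer is B." and "A".
print(accuracy(ITEMS, ["The answer is B.", "A"]))  # → 1.0
```

Real harnesses differ mainly in the parsing step (stricter answer formats, or comparing per-option log-likelihoods instead of parsing free text), but the final metric is the same accuracy.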


References

Bilibili: 你知道用什么指标评价一个大模型的好坏吗?PPL,MMLU,MATH,GPQA,BBH,IF-EVAL,MMLU-PRO ("Do you know what metrics are used to evaluate a large language model? PPL, MMLU, MATH, GPQA, BBH, IF-EVAL, MMLU-PRO")