Perplexity (PPL)
Disadvantage: it only measures how well the model predicts the held-out test text; a low perplexity does not guarantee good performance on downstream tasks.
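A minimal sketch of how perplexity is computed, assuming we already have the model's per-token log-probabilities for the test text (the function name and inputs here are illustrative, not from any particular library):

```python
import math

def perplexity(token_log_probs):
    # PPL = exp( -(1/N) * sum_i log p(token_i | context) )
    # i.e. the exponentiated average negative log-likelihood per token.
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Example: a model that assigns probability 0.25 to each of 4 tokens
# behaves like a uniform choice among 4 options, so PPL ≈ 4.
print(perplexity([math.log(0.25)] * 4))  # ≈ 4.0
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.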
Human Evaluation
Annotators rate or compare model outputs directly; expensive and slow, but captures quality aspects that automatic metrics miss.
Parseable Benchmark Datasets
These use question formats whose answers can be checked automatically (multiple-choice, fill-in-the-blank, etc.): the model answers, and accuracy is computed against the gold labels.
Examples include MMLU (Massive Multitask Language Understanding), MMLU-PRO, IF-EVAL, BBH, MATH, and GPQA.
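A sketch of the accuracy computation these benchmarks rely on, assuming the model's answers have already been parsed into choice letters (the data below is hypothetical, not from any real benchmark run):

```python
def accuracy(predictions, answers):
    # Fraction of parsed model answers that exactly match the gold labels.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical MMLU-style multiple-choice results (choice letters A-D)
preds = ["B", "C", "A", "D"]
gold  = ["B", "C", "B", "D"]
print(accuracy(preds, gold))  # 0.75
```

In practice the hard part is parsing a free-form model response into a choice letter; benchmarks typically constrain the prompt format or use a regex/extraction step before scoring.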
Large Language Model Evaluation
References
Do you know what metrics are used to evaluate large language model performance? PPL, MMLU, MATH, GPQA, BBH, IF-EVAL, MMLU-PRO (Bilibili video: 你知道用什么指标评价一个大模型的好坏吗?PPL,MMLU,MATH,GPQA,BBH,IF-EVAL,MMLU-PRO)

