【Repost】How to Evaluate the Performance of Large Models on Long Text Processing

Notes

These are two related articles on how to measure the long-text processing performance of large models.

Article One

Large Model Performance Evaluation: Needle In A Haystack

  1. Introduction

Large models have a limit on context length. How well do they actually perform when processing long texts, and how should this be evaluated?

gkamradt (Greg Kamradt) conducted an extreme test and found that most people use incorrect methods and fail to bring out the true strength of the AI.

Can AI really find a specific key fact buried in hundreds of thousands of words? In the resulting charts, the redder the color, the more mistakes the AI makes.

gkamradt named this test NeedleInAHaystack, literally “needle in a haystack”: a method for evaluating the long-text capability of large models.

Simply put, it hides a key piece of information (the “needle”) inside a long text prompt (the “haystack”), then asks the large model a question that requires finding that key information.

Because this test genuinely reflects the long-context ability of large models, it has gradually become a standard evaluation method.

  2. Brief Description of the Needle In A Haystack Task

Kamradt placed the hidden sentence (the “needle”) at 15 different depths, from the beginning to the end of the text corpus (the “haystack”), and ran 225 experiments (15 × 15) with 15 corpus lengths evenly distributed from 1K to 128K tokens (200K for models with longer context windows).
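
As a rough illustration of how such a grid of runs can be enumerated (the exact lengths and depth percentages below are assumptions, not gkamradt's actual settings), a minimal Python sketch:

```python
import numpy as np

# Hypothetical grid: 15 context lengths from 1K to 128K tokens and
# 15 needle depths from 0% (start of text) to 100% (end), i.e. 15 x 15 = 225 runs.
context_lengths = np.linspace(1_000, 128_000, num=15, dtype=int)
needle_depths = np.linspace(0, 100, num=15)  # depth as a percentage of the haystack

runs = [(int(length), float(depth))
        for length in context_lengths
        for depth in needle_depths]
print(len(runs))  # 225
```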

Summary of Greg Kamradt’s “Needle In A Haystack” experiment:

Haystack: 218 blog posts by YC founder Paul Graham

Needle: The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.

Question: What is the most fun thing to do in San Francisco based on my context? Don't give information outside the document.

Expected Answer: The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.
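
To make the setup concrete, here is a minimal sketch of hiding this needle at a chosen depth and assembling the prompt; it is not gkamradt's actual harness, and the character-based insertion (rather than token-based, sentence-aligned insertion) is a simplification:

```python
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = ("What is the most fun thing to do in San Francisco based on my context? "
            "Don't give information outside the document.")

def build_niah_prompt(haystack: str, depth_percent: float) -> str:
    """Insert the needle at depth_percent (0 = start, 100 = end) of the haystack,
    then append the question. A real harness would truncate the haystack to the
    target context length in tokens and snap the insertion to a sentence boundary."""
    pos = int(len(haystack) * depth_percent / 100)
    document = haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]
    return f"{document}\n\n{QUESTION}"

# The haystack would be the concatenated Paul Graham essays; the model's answer
# is then scored by checking whether it reproduces the needle.
```
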
  3. Other Needle In A Haystack Methods (OpenCompass)

  • Single-Needle Retrieval Task (S-RT): Evaluates the ability of LLMs to extract a single key piece of information from long texts, testing their precise recall of specific details within broad narratives. This corresponds to the original Needle In A Haystack test task.

  • Multi-Needle Retrieval Task (M-RT): Explores LLM’s ability to retrieve multiple related pieces of information from long texts, simulating complex queries in practical scenarios involving comprehensive documents.

  • Multi-Needle Reasoning Task (M-RS): Assesses LLM’s long-text ability by extracting and utilizing multiple key pieces of information from long texts, requiring the model to have a comprehensive understanding of each key information fragment.

  • Ancestral Trace Challenge (ATC): Tests LLM’s ability to handle multilayered logical challenges in real long texts by designing “relationship needles.” In the ATC task, a series of logical reasoning questions examine the model’s memory and analytical ability for each detail in the long text. In this task, irrelevant text (haystack) is removed, and all text is designed as key information. LLMs must comprehensively use all content and reasoning from the long text to answer questions accurately.
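
To illustrate the ATC idea (every sentence is a “relationship needle” and nothing is filler), here is a minimal, hypothetical construction sketch; the names, wording, and chain length are invented for illustration and are not taken from OpenCompass:

```python
import random

def build_atc_item(names: list[str]) -> tuple[str, str, str]:
    """Build a chain of parent-child 'relationship needles', a question about the
    origin of the chain, and the expected answer. Answering correctly requires
    piecing together every sentence in the context."""
    chain = random.sample(names, len(names))          # a random ancestry chain
    needles = [f"{chain[i]} is the parent of {chain[i + 1]}."
               for i in range(len(chain) - 1)]
    random.shuffle(needles)                           # order gives nothing away
    context = " ".join(needles)
    question = f"Based only on the context, who is the earliest ancestor of {chain[-1]}?"
    return context, question, chain[0]

context, question, answer = build_atc_item(["Alice", "Bob", "Carol", "Dave", "Eve"])
```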

Later, we will take a look at the evaluation scheme of Counting Stars.

References

[1] Needle In A Haystack Experimental Evaluation

[2] AI Large Model Evaluation Report: “Long Text” and “Needle Retrieval” Become Large Model Pain Points

Article Two

Large Model Performance Evaluation: Counting Stars

  1. Introduction

NeedleInAHaystack has become a fundamental method for evaluating the long-text ability of large models. Tencent’s MLPD lab devised a clever way to test this ability by having the “little penguin count stars.”

Tencent’s test has the little penguin counting stars; would Alibaba’s DAMO Academy have “Pingtouge counting cobras”?

  2. Brief Description of the Counting Stars Task

In this study, to evaluate language models’ ability to process long texts and long-distance dependencies, the researchers designed a test in which the text length gradually increases, up to a maximum of 128,000 characters.

The experiment used the Chinese classical novel Dream of the Red Chamber as the baseline text, into which they randomly inserted sentences in a specific format — “The little penguin counted x stars,” where x is a varying number.

Researchers divided the entire text into N parts and inserted M of the above-format sentences within those parts.

The model’s task was then to identify all of the inserted sentences and output the numbers they contain in JSON format, with nothing else in the output.
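
A minimal sketch of this construction (the base text, N, M, the range of x, and the instruction wording are placeholders; the real benchmark fixes these in its paper and repository):

```python
import random

def build_counting_stars_prompt(novel_text: str, n_parts: int, n_insertions: int):
    """Split the base text into n_parts chunks, append a star-counting sentence to
    n_insertions of them, and return the prompt plus the ground-truth numbers."""
    part_len = len(novel_text) // n_parts
    parts = [novel_text[i * part_len:(i + 1) * part_len] for i in range(n_parts)]
    ground_truth = []
    for idx in sorted(random.sample(range(n_parts), n_insertions)):
        count = random.randint(1, 100)  # the varying number x
        parts[idx] += f" The little penguin counted {count} stars."
        ground_truth.append(count)
    instruction = ("Find every sentence about the little penguin counting stars and "
                   "output only the numbers, as a JSON list of integers.")
    return "".join(parts) + "\n\n" + instruction, ground_truth
```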

After the model produced its output, the researchers compared the numbers it recognized with the actually inserted numbers (the ground truth) to calculate the model’s accuracy.
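
A matching sketch of the scoring step, assuming the model replies with a JSON list of integers (the paper’s exact metric may differ; this simply measures the fraction of inserted numbers the model recovered):

```python
import json

def counting_stars_accuracy(model_output: str, ground_truth: list[int]) -> float:
    """Parse the model's JSON list and compute the share of inserted numbers
    (the ground truth) that appear in the model's answer."""
    try:
        predicted = {int(x) for x in json.loads(model_output)}
    except (json.JSONDecodeError, TypeError, ValueError):
        return 0.0  # unparseable output counts as a complete miss
    hits = sum(1 for value in ground_truth if value in predicted)
    return hits / len(ground_truth)

# Example: three of the four inserted numbers were recovered.
print(counting_stars_accuracy("[3, 17, 42]", [3, 17, 42, 99]))  # 0.75
```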

This “Counting Stars” testing method can more accurately measure a model’s ability to process long texts and long-distance dependencies than traditional “Needle In A Haystack” tests. Through this method, researchers can gain deeper insights into a model’s potential in handling complex information and performing detailed tasks.

Comparison with Needle In A Haystack

The multi-needle variant of “Needle In A Haystack” inserts multiple “needles,” i.e., multiple clues, and then asks the large model to find them and infer the connections among them to reach the final answer.

But in actual “multi-needle” Needle In A Haystack tests, the model doesn’t need to find all “needles” to answer correctly; sometimes just the last one is enough.

However, “Counting Stars” is different: because the number of “stars” in each sentence is different, the model must find every inserted sentence to answer the question correctly.

References

[1] Counting-Stars (★): A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models

[2] github: Counting-Stars

[3] “Needle In A Haystack” out, “Counting Stars” becomes a more accurate method for measuring long-text capability, from Tencent