What Is the Intraclass Correlation Coefficient (ICC)?

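In medical research and image processing, the Intraclass Correlation Coefficient (ICC) is a standard metric for evaluating how consistent and reproducible measurements are when different individuals (inter-rater reliability), or the same individual at different times (intra-rater reliability), perform the same task. It quantifies reliability rather than accuracy: two raters can agree closely and still both be wrong.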

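Simply put, in your specific scenario, ICC measures how closely the quantitative measurements (for example, muscle area) derived from two physicians’ contours on the same images agree with each other. It does not measure the spatial overlap of the contours themselves; overlap is normally reported with a separate metric such as the Dice coefficient.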


1. Why Use ICC Instead of Ordinary Correlation Coefficients?

You may be familiar with the Pearson correlation coefficient (r), but it only assesses whether two sets of values are linearly related, not whether they are equal.

  • Example: Suppose that, across a set of images, Physician B’s muscle areas are always exactly double Physician A’s (for instance, 200 wherever A measures 100). Pearson’s r is a perfect 1.0 because the trends agree, yet the absolute measurements differ by a factor of two!
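  • Advantage of ICC: Unlike Pearson’s r, an absolute-agreement ICC evaluates not only correlation but also agreement in the numerical values themselves. If the two physicians’ delineations yield substantially different quantitative measurements, the ICC will be low. (A consistency-type ICC ignores a constant offset between raters, so state which form you report.)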
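
To make the contrast concrete, here is a minimal sketch in plain Python. It assumes the ICC(2,1) form (two-way random effects, absolute agreement, single rater) computed from the usual two-way ANOVA mean squares. The helper names pearson_r and icc_2_1 and the five area values are made up for illustration and are not taken from your data.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icc_2_1(rows):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    rows holds one entry per image; each entry lists one measurement per rater.
    """
    n, k = len(rows), len(rows[0])          # n images, k raters
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(row[j] for row in rows) for j in range(k)]

    # Sums of squares from the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-image mean square
    msc = ss_cols / (k - 1)                 # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical muscle areas: Physician B always measures double Physician A.
area_a = [100.0, 120.0, 140.0, 160.0, 180.0]
area_b = [2 * a for a in area_a]

r = pearson_r(area_a, area_b)
icc = icc_2_1(list(zip(area_a, area_b)))
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")
```

For this toy data Pearson’s r is 1.0 while the ICC comes out at roughly 0.16, because the systematic difference between the two raters counts against absolute agreement. For a real analysis you would normally use a statistics package that also reports confidence intervals.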

2. What Do ICC Values Mean?

ICC values usually fall between 0 and 1, although an estimate can come out slightly negative when agreement is very poor. Higher values indicate smaller discrepancies between the two physicians’ delineations:

ICC Value      Level of Agreement
≥ 0.75         Excellent (a commonly cited target for reliability studies in the medical literature)
0.60 – 0.74    Good
0.40 – 0.59    Fair
< 0.40         Poor

Note: An ICC above 0.8 suggests that the two physicians define the boundaries of these four muscle groups (PM, QL, ES, MF) very consistently and that the resulting measurements are reproducible.
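
The bands above can be turned into a small lookup, sketched here in Python. The function name interpret_icc is hypothetical, and the sketch assumes the bands are contiguous, so a value of exactly 0.75 counts as Excellent and any negative estimate counts as Poor.

```python
def interpret_icc(icc):
    """Map an ICC estimate to the agreement levels listed in the table."""
    if icc >= 0.75:
        return "Excellent"
    if icc >= 0.60:
        return "Good"
    if icc >= 0.40:
        return "Fair"
    return "Poor"  # also covers negative estimates


# Illustrative values only, not real results.
for value in (0.85, 0.66, 0.45, 0.16):
    print(f"ICC = {value:.2f}: {interpret_icc(value)}")
```

Running the same check separately for each muscle is one way to see which boundaries the two physicians define least consistently.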

Summary

ICC essentially “scores” the consistency of your annotations.

  • If ICC is high: The two physicians’ delineations closely agree, so an AI model trained on such annotations is more likely to learn stable, robust features.
  • If ICC is low: The delineation standards are inconsistent, especially for muscles with ambiguous boundaries (e.g., QL or MF), and the physicians should re-align their criteria. Otherwise, the AI model is trained on inconsistent labels and may learn unreliable patterns.