ICLR 2026

Detecting Data Contamination in LLMs via In-Context Learning

A simple, accurate method that reveals what your LLM has already memorized—by measuring how in-context examples change its confidence.

Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta
NVIDIA


Do You Trust Benchmark Scores?

When a model scores 95% on a benchmark, how much of that is genuine reasoning versus memorized answers? As LLMs train on ever-larger web corpora, popular benchmarks inevitably leak into training data. This data contamination silently inflates reported performance, making fair model comparison impossible.

Existing detection methods require access to training data, need extensive parameter tuning per model, or fail to produce reliable, interpretable results at scale.

We need a method that works with any model, any dataset, automatically.

40+ models evaluated
20+ benchmarks tested
99.9% dataset-level AUC

The Key Insight

Models respond differently to in-context examples depending on whether they have already memorized the data.

Unseen Data

When a model hasn't seen a dataset, in-context examples from the same distribution provide useful patterns—new stylistic cues, vocabulary, and structure. The model's confidence on target samples increases.

Confidence increases for unseen data when context is added

Memorized Data

When a model has memorized a dataset, it has already internalized all its patterns. Adding more examples disrupts memorization, and the model's confidence decreases.

Confidence decreases for memorized data when context is added

CoDeC measures this difference: the contamination score is the fraction of samples whose confidence drops when context is added.
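The aggregation described above can be written in a few lines of Python. This is an illustrative sketch, not the authors' released code; `contamination_score` is our name for it:

```python
def contamination_score(deltas):
    """Fraction of samples whose confidence dropped when context was added.

    deltas[i] = (avg log-likelihood with context) - (avg log-likelihood without).
    A negative delta means in-context examples hurt confidence on sample i.
    """
    if not deltas:
        raise ValueError("need at least one per-sample delta")
    return sum(d < 0 for d in deltas) / len(deltas)
```

For example, `contamination_score([-0.2, -0.1, 0.3, -0.05])` returns `0.75`: three of the four samples lost confidence when context was added.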

How CoDeC Works

Four simple steps. Two forward passes. One clear answer.

1. Baseline

Compute the model's average log-likelihood on target tokens without any context.

2. In-Context

Prepend random samples from the same dataset and measure the model's predictions again.

3. Compare

Compute the confidence change Δ = (average log-likelihood with context) - (average log-likelihood without context) for each sample.

4. Aggregate

The contamination score is the fraction of samples where Δ < 0 (confidence dropped).
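The four steps can be sketched end to end, assuming a gray-box scorer `avg_log_likelihood(context, target)` that returns the model's mean per-token log-probability of `target` given `context` (any API exposing token log-probs would do). This is a sketch of the procedure, not the official implementation, and the function names are ours:

```python
import random

def codec_score(avg_log_likelihood, samples, k=4, seed=0):
    """Illustrative sketch of the CoDeC procedure.

    avg_log_likelihood(context, target) -> mean per-token log-prob of target.
    samples: list of text samples from the dataset under test.
    k: number of in-context examples prepended to each target.
    """
    rng = random.Random(seed)
    drops = 0
    for i, target in enumerate(samples):
        # Step 1: baseline confidence without any context.
        baseline = avg_log_likelihood("", target)
        # Step 2: prepend k random other samples from the same dataset.
        others = samples[:i] + samples[i + 1:]
        context = "\n\n".join(rng.sample(others, min(k, len(others))))
        in_context = avg_log_likelihood(context, target)
        # Step 3: per-sample confidence change.
        delta = in_context - baseline
        # Step 4: count confidence drops.
        drops += delta < 0
    return drops / len(samples)
```

Only two forward passes per sample are needed, and no gradients, training data, or per-model tuning.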

Pipeline Overview

CoDeC pipeline overview diagram

For each dataset element, CoDeC augments the context with samples from the same dataset. A decrease in the model's log-likelihood on the element indicates potential contamination. The overall contamination level is estimated as the fraction of samples exhibiting this effect.

Why It Works

The effectiveness of CoDeC stems from several intuitive principles.

Dataset-Specific Priors

Models trained on a dataset internalize its unique style, vocabulary, and implicit assumptions. If these priors are already memorized, additional in-context examples add little useful information.

Context Disrupts Memorization

For contaminated data, adding memorized in-context examples interferes with the model's memorized token sequences. The context triggers conflicting patterns, reducing confidence.

ICL Mirrors Finetuning

Contaminated models resemble finetuned models near saturation—additional training yields minimal gains. CoDeC effectively measures the remaining learning capacity for the target dataset.

Loss Landscape Geometry

Memorized samples sit in narrow, sharp local minima. Even small perturbations (like adding context) destabilize predictions. Unseen data occupies flatter regions that benefit from additional context.

Key Results

CoDeC produces clear, interpretable contamination scores across diverse models and datasets.

CoDeC vs. Baselines

Box plots comparing CoDeC vs baseline methods

Contamination scores for training (red) and unseen (blue) datasets. Each point is a model–dataset pair. CoDeC achieves the best separation, enabling consistent classification across models and datasets.

99.9% dataset-level AUC across all evaluated models. CoDeC cleanly separates seen from unseen data, while baselines (Vanilla Loss, Min-K%, Zlib Ratio) fail to provide reliable separation.
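A dataset-level AUC of this kind can be computed rank-wise (the Mann-Whitney formulation) from per-pair contamination scores; a stdlib-only sketch, with names of our choosing:

```python
def auc(seen_scores, unseen_scores):
    """AUC for separating seen (positive) from unseen (negative) datasets.

    Equals the probability that a randomly chosen seen model-dataset pair
    receives a higher contamination score than a randomly chosen unseen
    pair, with ties counted as one half.
    """
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))
```

Perfect separation, where every seen score exceeds every unseen score, gives `auc(...) == 1.0`; a score near 0.5 means the detector is no better than chance.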

Early Signal

When applied during model development, CoDeC reveals contamination after just 2% of training steps (10k out of 477k), enabling early intervention.

Early Detection During Training

CoDeC scores during OLMo 7B training

CoDeC scores during training of OLMo 7B. Training dataset scores rise sharply early on while unseen dataset scores remain stable.

Contamination Transfer

Contamination transfer to and from other datasets

CoDeC detects indirect contamination from related data. Finetuning on MMLU contaminates even unseen questions. This cannot be captured by simple n-gram overlap checks.

Notable Finding

Qwen 2.5 scored 100% contamination on GSM8K's training set but only 27% on the test set. CoDeC detected this difference, indicating that the training set was deliberately included in training while the test set was excluded.

Explore the full contamination leaderboard →

Applications

CoDeC enables a range of practical use cases for improving LLM evaluation and development.

Fair Benchmark Evaluation

Identify when reported scores may be inflated by training data leakage, enabling more trustworthy performance comparisons.

Model Auditing

Verify training data claims for open-weight models. CoDeC works even when training corpora are undisclosed, requiring only gray-box access.

Training Quality Control

Detect accidental benchmark inclusion during data curation. CoDeC identifies contamination after just 2% of training, enabling early intervention.

Informed Model Comparison

Among models with similar benchmark accuracy, those with lower CoDeC scores should generalize better. CoDeC separates genuine capability from memorization.

Partial Contamination Detection

Catch indirect leakage from related, augmented, or synthetically generated data—contamination that n-gram overlap checks miss.

Contamination Leaderboard

We evaluated 40+ recent models across 20+ popular benchmarks; the table below shows CoDeC contamination scores for a selection. Explore the full results interactively.

| Model | HLE | LiveCodeBench v5 | SWE-Bench (test) | GPQA Diamond | MMLU-Pro | AIME 2024 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-4-Scout-17B-16E-Instruct | 88% | 87% | 77% | 76% | 49% | 21% |
| MiniMax-M2.7 | 87% | 81% | 61% | 50% | 57% | 27% |
| GLM-4.7 | 79% | 56% | 61% | 46% | 43% | 64% |
| GLM-4.6 | 84% | 71% | 54% | 45% | 45% | 59% |
| Phi-4-reasoning-plus | 58% | 42% | 43% | 38% | 52% | 86% |
| Llama-4-Maverick-17B-128E-Instruct | 51% | 36% | 69% | 55% | 53% | 22% |
| Qwen2.5-1.5B-Instruct | 68% | 25% | 44% | 59% | 55% | 60% |
| Phi-4-mini-instruct | 60% | 29% | 44% | 48% | 48% | 50% |

Cite This Work

@inproceedings{zawalski2026detecting,
  title={Detecting Data Contamination in {LLMs} via In-Context Learning},
  author={Zawalski, Micha{\l} and Boubdir, Meriem and Ba{\l}azy, Klaudia and Nushi, Besmira and Ribalta, Pablo},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://openreview.net/forum?id=YlpaaYxx4t}
}