A simple, accurate method that reveals what your LLM has already memorized—by measuring how in-context examples change its confidence.
When a model scores 95% on a benchmark, how much of that is genuine reasoning versus memorized answers? As LLMs train on ever-larger web corpora, popular benchmarks inevitably leak into training data. This data contamination silently inflates reported performance, making fair model comparison impossible.
Existing detection methods require access to training data, need extensive parameter tuning per model, or fail to produce reliable, interpretable results at scale.
We need a method that works with any model, any dataset, automatically.
Models respond differently to in-context examples depending on whether they have already memorized the data.
When a model hasn't seen a dataset, in-context examples from the same distribution provide useful patterns—new stylistic cues, vocabulary, and structure. The model's confidence on target samples increases.
When a model has memorized a dataset, it has already internalized all its patterns. Adding more examples disrupts memorization, and the model's confidence decreases.
CoDeC measures this difference. The contamination score is the fraction of samples whose confidence drops when context is added.
Consider a mathematician preparing for the International Mathematical Olympiad. If they trained exclusively on past IMO problems, they would internalize IMO-specific patterns—certain proof structures, problem styles, implicit assumptions. Giving them more IMO problems as reference during the competition wouldn't help; they've already memorized those patterns. But a mathematician with broad training who has never seen IMO problems would benefit greatly from the same reference material. CoDeC exploits exactly this asymmetry.
Four simple steps. Two forward passes. One clear answer.
Compute the model's average log-likelihood on target tokens without any context.
Prepend random samples from the same dataset and measure the model's predictions again.
Compute Δ = (confidence with context) − (confidence without context) for each sample.
The contamination score is the fraction of samples where Δ < 0 (confidence dropped).
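The four steps above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `avg_loglik` is a hypothetical callable standing in for a model call, and the number of in-context samples (`n_context`), the separator between them, and the choice to exclude the target from its own context are all assumptions made for the example.

```python
import random
from typing import Callable, Sequence

# Hypothetical scorer signature: (context, target) -> average log-likelihood
# of the target tokens given the context. In practice this would wrap an LLM.
Scorer = Callable[[str, str], float]


def codec_score(samples: Sequence[str], avg_loglik: Scorer,
                n_context: int = 1, seed: int = 0) -> float:
    """Fraction of samples whose confidence drops when same-dataset context is prepended."""
    rng = random.Random(seed)
    drops = 0
    for i, target in enumerate(samples):
        # Step 1: baseline confidence without any context.
        baseline = avg_loglik("", target)
        # Step 2: prepend random *other* samples from the same dataset
        # (excluding the target itself is an assumption of this sketch).
        others = [s for j, s in enumerate(samples) if j != i]
        context = "\n\n".join(rng.sample(others, n_context))
        in_context = avg_loglik(context, target)
        # Step 3: per-sample confidence change.
        delta = in_context - baseline
        # Step 4: count the samples where confidence dropped.
        if delta < 0:
            drops += 1
    return drops / len(samples)
```

With a toy scorer whose confidence always rises when context is present, the function returns 0.0; with one whose confidence always falls, it returns 1.0.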
For each dataset element, CoDeC augments the context with samples from the same dataset. A decrease in the model's confidence (its average log-likelihood on the target tokens) indicates potential contamination. The overall contamination level is estimated as the fraction of samples exhibiting this effect.
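The confidence used in both forward passes is an average log-likelihood over the target tokens only, so any prepended context must be excluded from the average. The sketch below shows one way to compute it from raw logits in pure Python. The function names, the `target_start` parameter, and the assumption that the sequence begins with at least one non-target token (for example a beginning-of-sequence token) are illustrative and not taken from the source.

```python
import math
from typing import Sequence


def log_softmax(row: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def avg_target_loglik(logits: Sequence[Sequence[float]],
                      token_ids: Sequence[int],
                      target_start: int) -> float:
    """Mean log-probability of token_ids[target_start:] under a causal LM.

    Assumes logits[t] is the model output at position t, i.e. its
    prediction for token_ids[t + 1]. Tokens before target_start
    (context, BOS) condition the prediction but are not scored.
    """
    if not 1 <= target_start < len(token_ids):
        raise ValueError("target_start must leave at least one preceding token "
                         "and at least one target token")
    total = 0.0
    for pos in range(target_start, len(token_ids)):
        # The prediction for the token at `pos` lives at position pos - 1.
        log_probs = log_softmax(logits[pos - 1])
        total += log_probs[token_ids[pos]]
    return total / (len(token_ids) - target_start)
```

Running the same function once with `target_start` just after the BOS token and once with `target_start` just after the prepended context gives the two numbers whose difference is Δ.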
The effectiveness of CoDeC stems from several intuitive principles.
Models trained on a dataset internalize its unique style, vocabulary, and implicit assumptions. If these priors are already memorized, additional in-context examples add little useful information.
For contaminated data, adding memorized in-context examples interferes with the model's memorized token sequences. The context triggers conflicting patterns, reducing confidence.
Contaminated models resemble finetuned models near saturation—additional training yields minimal gains. CoDeC effectively measures the remaining learning capacity for the target dataset.
Memorized samples sit in narrow, sharp local minima. Even small perturbations (like adding context) destabilize predictions. Unseen data occupies flatter regions that benefit from additional context.
Controlled experiments confirm that CoDeC reliably detects contamination.
We evaluate CoDeC on models with publicly available training data. CoDeC achieves 99.9% dataset-level AUC, cleanly separating seen from unseen data. Baselines (Vanilla Loss, Min-K%, Zlib Ratio) fail to provide reliable separation.
Contamination scores for training (red) and unseen (blue) datasets. Each point is a model–dataset pair.
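For readers unfamiliar with the metric, a dataset-level AUC can be read as the probability that a randomly chosen seen dataset receives a higher contamination score than a randomly chosen unseen one. The sketch below computes it by direct pairwise comparison. The function name is an assumption, and any numbers passed to it in examples are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def dataset_level_auc(seen_scores: Sequence[float],
                      unseen_scores: Sequence[float]) -> float:
    """Probability that a seen dataset outscores an unseen one (ties count half)."""
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))
```

An AUC of 1.0 means every seen dataset scores above every unseen one, so some threshold separates the two groups perfectly; 0.5 means the scores carry no ranking information.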
Finetuning any model on a dataset reliably pushes CoDeC scores toward 100%, consistently across four different architectures. This controlled experiment confirms that CoDeC accurately tracks the introduction of contamination, even for models without public training data.
Understanding the behavior, robustness, and practical characteristics of CoDeC.
Tracking CoDeC scores across OLMo 7B training checkpoints, we find that scores reach their final values after just 2% of training (10k out of 477k steps). This enables early intervention during model development, before significant training resources are spent.
CoDeC scores are primarily determined by the relationship between the training corpus and the target benchmark. For models trained on the same data (e.g., the Pile), scores for each dataset are narrowly distributed. This stability enables meaningful cross-model comparisons: consistent scores reflect a dataset-level property, while deviations signal model-specific memorization.
CoDeC scores for models trained on the Pile. The last 10 datasets are parts of the training data. Scores usually cluster around a dataset-specific mean.
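One way to act on this clustering is to compare each model's score with the typical score for the same dataset and flag large upward deviations. The sketch below uses the per-dataset median as the reference. The function name, the data layout, and especially the `margin` value are hypothetical choices for illustration; the source does not prescribe a cutoff.

```python
from statistics import median


def flag_outliers(scores: dict[str, dict[str, float]],
                  margin: float = 0.2) -> list[tuple[str, str, float]]:
    """Flag (dataset, model) pairs scoring well above the dataset's typical score.

    scores[dataset][model] is a CoDeC score in [0, 1].
    `margin` is an illustrative cutoff, not a value from the source.
    """
    flagged = []
    for dataset, by_model in scores.items():
        # Typical score for this dataset across all evaluated models.
        typical = median(by_model.values())
        for model, score in by_model.items():
            excess = score - typical
            if excess > margin:
                flagged.append((dataset, model, excess))
    return flagged
```

The median is used instead of the mean so that a single heavily contaminated model does not pull the reference toward itself.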
CoDeC detects indirect contamination from rephrased, augmented, or otherwise related data. The model becomes contaminated with MMLU even when trained on questions that were unseen, highly cropped, rephrased, noised, or highly related. This cannot be captured by simple n-gram overlap checks.
Adding in-context samples and finetuning with a moderate learning rate produce nearly identical confidence curves. For unseen data, both increase confidence with diminishing gains; for training data, both decrease it in the first step and then slowly increase it again. The initial drop is attributed to the high variability of local minima in the loss landscape. The similarity explains why CoDeC works: it measures the same underlying phenomenon as actual training, the remaining learning capacity for the target distribution, but at negligible computational cost.
Left: confidence change as in-context samples are added. Right: confidence change during finetuning at various learning rates. The curves are strikingly similar.
CoDeC yields stable estimates with as few as 100 samples, and variance drops below 1% with 1,000 samples. The method requires only two forward passes per sample and is model-agnostic, dataset-agnostic, and parameter-free: no thresholds to tune, no reference models needed.
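Because the score is a fraction of samples, its sampling uncertainty for a given sample size can be estimated directly from the per-sample outcomes. The sketch below uses a generic percentile bootstrap. It is not the stability analysis from the paper, and the function name and default parameters are assumptions.

```python
import random
from typing import Sequence


def bootstrap_interval(drops: Sequence[bool], n_resamples: int = 1000,
                       alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap interval for the fraction of True values in `drops`.

    drops[i] is True when sample i showed a confidence drop (delta < 0).
    """
    rng = random.Random(seed)
    n = len(drops)
    # Resample the per-sample outcomes with replacement and recompute the score.
    estimates = sorted(
        sum(rng.choices(drops, k=n)) / n for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A narrow interval suggests that evaluating more samples would change the score very little; a wide one suggests the sample is too small for a confident reading.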
CoDeC can independently verify decontamination procedures described in technical reports. For example, Qwen 2.5 scores 100% contamination on GSM8K's training set but only 27% on the test set. According to its technical report, the training set of GSM8K was deliberately included in training, but with questions overlapping the test set removed. CoDeC confirms both the deliberate inclusion and the successful decontamination, without requiring access to the training data or any prior knowledge of the data curation process.
CoDeC enables a range of practical use cases for improving LLM evaluation and development.
Identify when reported scores may be inflated by training data leakage, enabling more trustworthy performance comparisons.
Verify training data claims for open-weight models. CoDeC works even when training corpora are undisclosed, requiring only gray-box access.
Detect accidental benchmark inclusion during data curation. CoDeC identifies contamination after just 2% of training, enabling early intervention.
Among models with similar benchmark accuracy, those with lower CoDeC scores should generalize better. CoDeC separates genuine capability from memorization.
Catch indirect leakage from related, augmented, or synthetically generated data—contamination that n-gram overlap checks miss.
CoDeC scores measure how strongly a model's predictions depend on memorized patterns rather than genuine reasoning.
Strong evidence of contamination. The model has likely been trained directly on this dataset or very closely related data. Benchmark results on this dataset should be interpreted with caution.
Ambiguous range. May indicate partial contamination, training on closely related data, or a low-diversity dataset. Compare across models on the same benchmark to identify outliers.
No evidence of contamination. The model is likely reasoning based on general knowledge rather than memorized patterns. Benchmark results can be trusted with higher confidence.
We evaluated 60+ recent models across 20+ popular benchmarks.
@inproceedings{zawalski2026detecting,
title={Detecting Data Contamination in {LLMs} via In-Context Learning},
author={Zawalski, Micha{\l} and Boubdir, Meriem and Ba{\l}azy, Klaudia and Nushi, Besmira and Ribalta, Pablo},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=YlpaaYxx4t}
}