A simple, accurate method that reveals what your LLM has already memorized—by measuring how in-context examples change its confidence.
When a model scores 95% on a benchmark, how much of that is genuine reasoning versus memorized answers? As LLMs train on ever-larger web corpora, popular benchmarks inevitably leak into training data. This data contamination silently inflates reported performance, making fair model comparison impossible.
Existing detection methods require access to training data, need extensive parameter tuning per model, or fail to produce reliable, interpretable results at scale.
We need a method that works with any model, any dataset, automatically.
Models respond differently to in-context examples depending on whether they have already memorized the data.
When a model hasn't seen a dataset, in-context examples from the same distribution provide useful patterns—new stylistic cues, vocabulary, and structure. The model's confidence on target samples increases.
When a model has memorized a dataset, it has already internalized all its patterns. Adding more examples disrupts memorization, and the model's confidence decreases.
CoDeC measures this difference: the contamination score is the fraction of samples whose confidence drops when in-context examples are added.
Four simple steps. Two forward passes. One clear answer.
Compute the model's average log-likelihood on target tokens without any context.
Prepend random samples from the same dataset and compute the average log-likelihood on the same target tokens again.
Compute Δ = (average log-likelihood with context) − (average log-likelihood without context) for each sample.
The contamination score is the fraction of samples where Δ < 0 (confidence dropped).
For each dataset element, CoDeC augments the context with samples from the same dataset. A decrease in the model's log-likelihood indicates potential contamination. The overall contamination level is estimated as the fraction of samples exhibiting this effect.
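The four steps above can be sketched in a few lines of Python. The scoring callable, the prompt format joining shots with blank lines, and the number of in-context examples are illustrative assumptions, not the paper's exact implementation; in practice the two log-likelihoods come from two forward passes of the LLM with teacher forcing.

```python
import random

def codec_score(avg_loglik, samples, n_shots=4, seed=0):
    """Sketch of the CoDeC contamination score.

    avg_loglik(target, context) is assumed to return the model's average
    log-likelihood on the tokens of `target` when `context` is prepended
    (empty string = no context). Returns the fraction of samples whose
    confidence drops once same-dataset examples are added as context.
    """
    rng = random.Random(seed)
    drops = 0
    for i, target in enumerate(samples):
        # Forward pass 1: target alone, no context.
        base = avg_loglik(target, "")
        # Forward pass 2: prepend random other samples from the same dataset.
        pool = samples[:i] + samples[i + 1:]
        shots = rng.sample(pool, min(n_shots, len(pool)))
        context = "\n\n".join(shots) + "\n\n"  # assumed prompt format
        with_ctx = avg_loglik(target, context)
        # Delta < 0 means confidence dropped: evidence of memorization.
        if with_ctx - base < 0:
            drops += 1
    return drops / len(samples)
```

A toy scorer makes the two regimes concrete: a "contaminated" model loses confidence whenever context is added, while a "clean" model gains from it, so their scores land at 1.0 and 0.0 respectively.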
The effectiveness of CoDeC stems from several intuitive principles.
Models trained on a dataset internalize its unique style, vocabulary, and implicit assumptions. If these priors are already memorized, additional in-context examples add little useful information.
For contaminated data, adding memorized in-context examples interferes with the model's memorized token sequences. The context triggers conflicting patterns, reducing confidence.
Contaminated models resemble finetuned models near saturation—additional training yields minimal gains. CoDeC effectively measures the remaining learning capacity for the target dataset.
Memorized samples sit in narrow, sharp local minima. Even small perturbations (like adding context) destabilize predictions. Unseen data occupies flatter regions that benefit from additional context.
CoDeC produces clear, interpretable contamination scores across diverse models and datasets.
Contamination scores for training (red) and unseen (blue) datasets. Each point is a model–dataset pair. CoDeC achieves the best separation, enabling consistent classification across models and datasets.
Dataset-level AUC across all evaluated models. CoDeC cleanly separates seen from unseen data, while baselines (Vanilla Loss, Min-K%, Zlib Ratio) fail to provide reliable separation.
When applied during model development, CoDeC reveals contamination after just 2% of training steps (10k out of 477k), enabling early intervention.
CoDeC scores during training of OLMo 7B. Training dataset scores rise sharply early on while unseen dataset scores remain stable.
CoDeC detects indirect contamination from related data. Finetuning on MMLU contaminates even unseen questions. This cannot be captured by simple n-gram overlap checks.
Qwen 2.5 scored 100% contamination on GSM8K's training set but only 27% on the test set—CoDeC detected this difference, confirming that the training set was deliberately included while the test set was excluded.
CoDeC enables a range of practical use cases for improving LLM evaluation and development.
Identify when reported scores may be inflated by training data leakage, enabling more trustworthy performance comparisons.
Verify training data claims for open-weight models. CoDeC works even when training corpora are undisclosed, requiring only gray-box access.
Detect accidental benchmark inclusion during data curation. CoDeC identifies contamination after just 2% of training, enabling early intervention.
Among models with similar benchmark accuracy, those with lower CoDeC scores should generalize better. CoDeC separates genuine capability from memorization.
Catch indirect leakage from related, augmented, or synthetically generated data—contamination that n-gram overlap checks miss.
We evaluated 40+ recent models across 20+ popular benchmarks. Explore the full results interactively.
@inproceedings{zawalski2026detecting,
title={Detecting Data Contamination in {LLMs} via In-Context Learning},
author={Zawalski, Micha{\l} and Boubdir, Meriem and Ba{\l}azy, Klaudia and Nushi, Besmira and Ribalta, Pablo},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=YlpaaYxx4t}
}