Data contamination traditionally refers to leakage of evaluation data into training sets, which causes overfitting and invalidates test results. This work introduces search-time contamination (STC), a distinct problem in which search-based LLM agents retrieve evaluation data in real time during search, letting them copy answers instead of reasoning to them. The authors found that HuggingFace, a platform that hosts many evaluation datasets, was a common source of such leakage, with agents explicitly citing test questions and answers retrieved from HuggingFace in their reasoning.

Across three popular benchmarks, Humanity's Last Exam (HLE), SimpleQA, and GPQA, roughly 3% of queries were contaminated by direct retrieval of ground-truth data from HuggingFace. This contamination accelerates benchmark obsolescence: repeated leaks erode a test's ability to measure genuine model capability. Blocking HuggingFace sources led to a 15% accuracy drop on the contaminated subset, confirming the impact of STC, and further experiments indicated that HuggingFace is not the only source of contamination, pointing to broader risks from publicly accessible evaluation datasets.

The authors propose best practices for benchmark design and result reporting to mitigate STC and keep evaluations of search-based LLM agents trustworthy. They also released their experimental logs publicly to support auditing and transparency in future research.
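As a rough illustration of the blocking experiment, the sketch below filters HuggingFace-hosted pages out of a search agent's retrieved results before they reach the model. The result structure, function names, and example URLs are hypothetical, not taken from the paper; this is a minimal sketch of domain-level blocking, assuming the agent's search tool returns a list of URL/snippet records.

```python
from urllib.parse import urlparse

# Hosts whose pages may expose benchmark ground truth to a search agent.
# huggingface.co is the source highlighted in the paper; hf.co is an
# illustrative alias, since other dataset hosts could leak as well.
BLOCKED_HOSTS = {"huggingface.co", "hf.co"}

def is_blocked(url: str) -> bool:
    """Return True if the URL's host is a blocked dataset host or a subdomain of one."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return any(host == h or host.endswith("." + h) for h in BLOCKED_HOSTS)

def filter_search_results(results: list[dict]) -> list[dict]:
    """Drop retrieved documents from blocked dataset hosts.

    `results` is assumed to be a list of {"url": ..., "snippet": ...} dicts,
    i.e. whatever the search tool returns before it is shown to the model.
    """
    return [r for r in results if not is_blocked(r["url"])]

# Example (hypothetical URLs): only the non-HuggingFace result survives filtering.
results = [
    {"url": "https://huggingface.co/datasets/some-org/some-benchmark", "snippet": "...test answers..."},
    {"url": "https://en.wikipedia.org/wiki/Quasar", "snippet": "...background article..."},
]
print(filter_search_results(results))
```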
👉 Read the original: arXiv AI Papers