Dynamic LLM Evaluation
ArxivRollBench Leaderboard
ArxivRollBench is a rolling benchmark for measuring how much LLM evaluation scores can be inflated by public-test contamination and uneven training. It builds fresh Sequencing, Cloze, and Prediction (SCP) tasks from recent arXiv papers, evaluates models on them once while the tasks are still private, then releases the expired benchmark for reproducibility.
Releases: 3 (2024B, 2025A, 2026A)
Domains: 8 (Scientific arXiv areas)
Tasks: S/C/P (Sequencing, Cloze, Prediction)
Metric: Valid Acc (Main ranking score)
Leaderboard
Model results by release and domain
| # | Model | Valid Acc | Coverage | Raw Acc | S | C | P |
|---|---|---|---|---|---|---|---|
Diagnostics
Top-model SCP profile
Protocol
Experimental settings
Usage
Evaluate your model
lm-evaluation-harness
lm_eval --model hf --model_args pretrained="your-model",parallelize=True --tasks arxivrollbench2026a-50 --log_samples --output_path ./logs
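For scripted runs, the same invocation can be assembled and launched from Python. The snippet below is a minimal sketch, not part of the benchmark tooling: the helper name `build_lm_eval_command` and its default arguments are hypothetical, and the only flags it uses are the ones in the command above. It assumes `lm_eval` is installed and on your PATH.

```python
import shlex
import subprocess
from typing import List


def build_lm_eval_command(model_name: str,
                          task: str = "arxivrollbench2026a-50",
                          output_path: str = "./logs") -> List[str]:
    """Return the argv list for the lm_eval call shown above.

    Hypothetical helper: it only re-assembles the flags from the
    documented command and introduces no new options.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_name},parallelize=True",
        "--tasks", task,
        "--log_samples",
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # "your-model" is the placeholder from the command above;
    # replace it with your own model identifier.
    cmd = build_lm_eval_command("your-model")
    # Print the shell-quoted command so it can be inspected first.
    print(shlex.join(cmd))
    # Uncomment to actually launch the evaluation.
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the model identifier contains characters such as slashes or spaces.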