Dynamic LLM Evaluation

ArxivRollBench Leaderboard

ArxivRollBench is a rolling benchmark that measures how much LLM evaluation scores can be inflated by public-test contamination and uneven training. It builds fresh SCP (Sequencing, Cloze, Prediction) tasks from recent arXiv papers, evaluates models on them once while the tasks are still private, and then releases the expired benchmark for reproducibility.

Releases

3 (2024B, 2025A, 2026A)

Domains

8 scientific arXiv areas

Tasks

S/C/P: Sequencing, Cloze, Prediction

Metric

Valid Acc: main ranking score

Leaderboard

Model results by release and domain

# | Model | Valid Acc | Coverage | Raw Acc | SCP

Diagnostics

Top-model SCP profile


Protocol

Experimental settings

Primary score
Pooled valid accuracy: accuracy computed over non-null, valid responses. Models are ranked by this score, and high null rates are flagged with warnings.
Secondary score
Raw accuracy and coverage remain visible for models with empty or invalid outputs.
Tasks
SCP: Sequencing, Cloze, and Prediction over recent arXiv article contexts.
Domains
CS, Math, Physics, Statistics, EESS, Quantitative Biology, Quantitative Finance, and Economics.
Prompting
Models are asked to return only the final multiple-choice answer.
Splits
The compact -50 split is the recommended default; full-release data is retained for broad audits.
Evaluation
API and open-weight runs cache every sample response before parsing to avoid unnecessary reruns.
Warnings
Models with high null or empty-response rates stay ranked, but are marked explicitly.
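
The relationship between the three reported scores can be illustrated with a minimal Python sketch. It is not the benchmark's actual scoring code. It assumes that a response is "valid" when it parses to a single option letter, that Valid Acc is accuracy over valid responses only, that Coverage is the share of valid responses, and that Raw Acc counts null or invalid responses as wrong. The names parse_choice, score, and Scores, and the A-D option set, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumption: options are the single letters A-D; the real option set may differ.
ANSWER_RE = re.compile(r"^\s*\(?([A-D])\)?[.:]?\s*$", re.IGNORECASE)


def parse_choice(response: Optional[str]) -> Optional[str]:
    """Return the chosen letter, or None for empty or unparseable output."""
    match = ANSWER_RE.match(response or "")
    return match.group(1).upper() if match else None


@dataclass
class Scores:
    valid_acc: float  # correct / valid responses (assumed primary score)
    coverage: float   # valid responses / all samples
    raw_acc: float    # correct / all samples, nulls counted as wrong


def score(responses: Sequence[Optional[str]], gold: Sequence[str]) -> Scores:
    """Compute the three scores under the assumed definitions above."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    total = len(gold)
    parsed = [parse_choice(r) for r in responses]
    valid = sum(p is not None for p in parsed)
    correct = sum(p is not None and p == g for p, g in zip(parsed, gold))
    return Scores(
        valid_acc=correct / valid if valid else 0.0,
        coverage=valid / total if total else 0.0,
        raw_acc=correct / total if total else 0.0,
    )


if __name__ == "__main__":
    # Two of the five responses are empty or unparseable.
    result = score(["A", "b", "", "I am not sure", "(C)"],
                   ["A", "C", "A", "B", "C"])
    print(result)
```

Under these assumed definitions, Raw Acc equals Valid Acc times Coverage, so a model that often returns empty output can score well on Valid Acc while its Raw Acc stays low. That is one reason to show all three columns together with a null-rate warning.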

Usage

Evaluate your model

lm-evaluation-harness

lm_eval --model hf --model_args pretrained="your-model",parallelize=True --tasks arxivrollbench2026a-50 --log_samples --output_path ./logs