# ArxivRoll Leaderboard

> "Fresh from ArXiv, served once, and never reheated."

TL;DR: ArxivRoll tells you "How much of your score is real, and how much is cheating?"
## What is ArxivRoll?
ArxivRoll is a dynamic, one-time-pad-inspired evaluation framework that audits how much Large Language Models (LLMs) overestimate their true abilities on public benchmarks.
## Key Problems ArxivRoll Tackles
- Data contamination: Public benchmarks (MMLU, GSM8K, etc.) often sneak into pre-training data → inflated scores.
- Biased overtraining: Developers may "teach to the test," tuning models only on popular domains.
- Transparency crisis: Private leaderboards (SEAL, Chatbot Arena) are opaque and hard to reproduce.
## How ArxivRollBench Works
1. Fresh Test Cases
Every 6 months we scrape the latest arXiv preprints (Apr–Sep 2024 → ArxivRollBench-2024b).
Domains: CS, Math, Physics, Bio, Econ, Finance, Statistics, EE.
2. SCP Tasks (Sequencing, Cloze, and Prediction)
Articles are auto-converted into three symbolic tasks (see the first sketch after this list):
- Sequencing → re-order shuffled sentences
- Cloze → fill in masked sentences
- Prediction → choose the correct next sentence
3. Rugged Scores (RS)
- RS-I = % inflation on public vs. private benchmarks
- RS-II = performance variance across domains (biased training detector); both scores are illustrated in the second sketch below
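To make the SCP construction concrete, here is a minimal Python sketch of how a single passage could be turned into the three tasks. The naive sentence splitting, the fixed random seed, and the single-mask cloze are illustrative assumptions only; the actual ArxivRollBench pipeline applies its own parsing and quality filters.

```python
import random

def make_scp_tasks(passage: str, seed: int = 0) -> dict:
    """Turn one passage into Sequencing, Cloze, and Prediction items (toy version)."""
    sentences = [s.strip() + "." for s in passage.split(".") if s.strip()]
    if len(sentences) < 3:
        raise ValueError("need at least three sentences")
    rng = random.Random(seed)

    # Sequencing: shuffle the sentences; answer[j] is the original index
    # of the j-th shuffled sentence.
    order = list(range(len(sentences)))
    rng.shuffle(order)
    sequencing = {"shuffled": [sentences[i] for i in order], "answer": order}

    # Cloze: mask one sentence and ask the model to restore it.
    masked = rng.randrange(len(sentences))
    cloze = {
        "context": ["[MASK]" if i == masked else s for i, s in enumerate(sentences)],
        "answer": sentences[masked],
    }

    # Prediction: given all but the last sentence, choose/produce the next one.
    prediction = {"prefix": sentences[:-1], "answer": sentences[-1]}

    return {"sequencing": sequencing, "cloze": cloze, "prediction": prediction}
```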
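And a similarly hedged sketch of the two Rugged Scores. The exact definitions come from the ArxivRoll paper, so the relative-inflation formula for RS-I and the standard-deviation spread for RS-II below are assumptions chosen only to illustrate the idea.

```python
from statistics import pstdev

def rs_i(public_score: float, private_score: float) -> float:
    """RS-I (illustrative): relative inflation of a model's public-benchmark
    score over its private ArxivRollBench score, in percent."""
    return 100.0 * (public_score - private_score) / private_score

def rs_ii(domain_scores: dict[str, float]) -> float:
    """RS-II (illustrative): spread of private-benchmark scores across domains;
    a large value suggests biased (over)training on some domains."""
    return pstdev(domain_scores.values())

# e.g. 80% on public sets vs. 62% on private SCP tasks -> ~29% inflation
print(round(rs_i(0.80, 0.62), 1))
print(round(rs_ii({"CS": 0.70, "Math": 0.55, "EE": 0.61}), 3))
```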
## Unique Features
- One-Time Use: private benchmarks are used once, then expired and open-sourced
- High Quality: filtered for length and complexity, with minimal math/tables
- Broad Coverage: 8 domains, ~100-word contexts, 1k+ samples per domain
## How Do I Evaluate My Model?
The easiest way is to use lm-evaluation-harness. Install lm-eval from here, then evaluate a Hugging Face model with:
lm_eval --model hf --model_args pretrained="your-model-name",parallelize=True --tasks arxivrollbench2024b --log_samples --output_path your-log-path
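If you would rather drive the harness from Python than from the shell, the CLI call above maps roughly onto lm_eval.simple_evaluate in recent lm-eval releases; "your-model-name" is a placeholder, and the keyword arguments below simply mirror the CLI flags.

```python
# Rough Python-API equivalent of the CLI call above (assumes a recent lm-eval
# release that exposes lm_eval.simple_evaluate; adjust for your installed version).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name,parallelize=True",
    tasks=["arxivrollbench2024b"],
    log_samples=True,
)
print(results["results"])  # per-task metrics, mirroring the --output_path logs
```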
You can also evaluate LLMs via APIs; examples are detailed in ./eval/.
| # | Model | S (Sequencing) | C (Cloze) | P (Prediction) | Avg |
|---|-------|----------------|-----------|----------------|-----|

Leaderboard scores are loaded dynamically from the RobenchSCP data.