# ArxivRoll Leaderboard

> "Fresh from ArXiv, served once, and never reheated."

TL;DR: ArxivRoll tells you "How much of your score is real, and how much is cheating?"
## What is ArxivRoll?
ArxivRoll is a dynamic, one-time-pad-inspired evaluation framework that audits how much Large Language Models (LLMs) overestimate their true abilities on public benchmarks.
## Key Problems ArxivRoll Tackles
- Data contamination: Public benchmarks (MMLU, GSM8K, etc.) often sneak into pre-training data → inflated scores.
- Biased overtraining: Developers may "teach to the test," tuning models only on popular domains.
- Transparency crisis: Private leaderboards (SEAL, Chatbot Arena) are opaque and hard to reproduce.
## How ArxivRollBench Works
1. Fresh Test Cases
Every 6 months we scrape the latest arXiv preprints (Apr–Sep 2024 → ArxivRollBench-2024b).
Domains: CS, Math, Physics, Bio, Econ, Finance, Statistics, EE.
2. SCP Tasks (Sequencing, Cloze, and Prediction)
Articles are auto-converted into three symbolic tasks (see the first sketch after this list):
- Sequencing → re-order shuffled sentences
- Cloze → fill in masked sentences
- Prediction → choose the correct next sentence
3. Rugged Scores (RS)
- RS-I = % inflation on public vs. private benchmarks
- RS-II = performance variance across domains (biased training detector); both scores are illustrated in the second sketch below
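To make the SCP construction concrete, here is a minimal Python sketch of how a single passage could be turned into the three tasks. The naive sentence splitting, the fixed random seed, and the single-mask cloze are illustrative assumptions only; the actual ArxivRollBench pipeline applies its own parsing and quality filters.

```python
import random

def make_scp_tasks(passage: str, seed: int = 0) -> dict:
    """Turn one passage into Sequencing, Cloze, and Prediction items (toy version)."""
    sentences = [s.strip() + "." for s in passage.split(".") if s.strip()]
    if len(sentences) < 3:
        raise ValueError("need at least three sentences")
    rng = random.Random(seed)

    # Sequencing: shuffle the sentences; answer[j] is the original index
    # of the j-th shuffled sentence.
    order = list(range(len(sentences)))
    rng.shuffle(order)
    sequencing = {"shuffled": [sentences[i] for i in order], "answer": order}

    # Cloze: mask one sentence and ask the model to restore it.
    masked = rng.randrange(len(sentences))
    cloze = {
        "context": ["[MASK]" if i == masked else s for i, s in enumerate(sentences)],
        "answer": sentences[masked],
    }

    # Prediction: given all but the last sentence, choose/produce the next one.
    prediction = {"prefix": sentences[:-1], "answer": sentences[-1]}

    return {"sequencing": sequencing, "cloze": cloze, "prediction": prediction}
```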
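And a similarly hedged sketch of the two Rugged Scores. The exact definitions come from the ArxivRoll paper, so the relative-inflation formula for RS-I and the standard-deviation spread for RS-II below are assumptions chosen only to illustrate the idea.

```python
from statistics import pstdev

def rs_i(public_score: float, private_score: float) -> float:
    """RS-I (illustrative): relative inflation of a model's public-benchmark
    score over its private ArxivRollBench score, in percent."""
    return 100.0 * (public_score - private_score) / private_score

def rs_ii(domain_scores: dict[str, float]) -> float:
    """RS-II (illustrative): spread of private-benchmark scores across domains;
    a large value suggests biased (over)training on some domains."""
    return pstdev(domain_scores.values())

# e.g. 80% on public sets vs. 62% on private SCP tasks -> ~29% inflation
print(round(rs_i(0.80, 0.62), 1))
print(round(rs_ii({"CS": 0.70, "Math": 0.55, "EE": 0.61}), 3))
```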
## Unique Features
- One-Time Use: private benchmarks are used once, then expired and open-sourced
- High Quality: filtered for length and complexity, with minimal math/tables
- Broad Coverage: 8 domains, ~100-word contexts, 1k+ samples per domain
## How Do I Evaluate My Model?
The easiest way is to use lm-evaluation-harness. Install lm-eval from here, then evaluate a Hugging Face model with:
lm_eval --model hf --model_args pretrained="your-model-name",parallelize=True --tasks arxivrollbench2024b --log_samples --output_path your-log-path
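If you would rather drive the harness from Python than from the shell, the CLI call above maps roughly onto lm_eval.simple_evaluate in recent lm-eval releases; "your-model-name" is a placeholder, and the keyword arguments below simply mirror the CLI flags.

```python
# Rough Python-API equivalent of the CLI call above (assumes a recent lm-eval
# release that exposes lm_eval.simple_evaluate; adjust for your installed version).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-model-name,parallelize=True",
    tasks=["arxivrollbench2024b"],
    log_samples=True,
)
print(results["results"])  # per-task metrics, mirroring the --output_path logs
```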
You can also evaluate LLMs via APIs; examples are detailed in ./eval/.
| # | Model | S (Sequencing) | C (Cloze) | P (Prediction) | Avg |
|---|-------|----------------|-----------|----------------|-----|

Leaderboard scores are loaded dynamically from the RobenchSCP data.