🚀 ArxivRoll Leaderboard

"Fresh from ArXiv, served once, and never reheated."

📌 TL;DR: ArxivRoll tells you "How much of your score is real, and how much is cheating?"

📊 What is ArxivRoll?

ArxivRoll is a dynamic, one-time-pad-inspired evaluation framework 🛡️ that audits how much Large Language Models (LLMs) overestimate their true abilities on public benchmarks.

โš ๏ธ Key Problems ArxivRoll Tackles

  • ๐Ÿ“ฅ Data contamination: Public benchmarks (MMLU, GSM8K, etc.) often sneak into pre-training data โ†’ inflated scores.
  • ๐ŸŽฏ Biased overtraining: Developers may "teach to the test," tuning models only on popular domains.
  • ๐Ÿ•ต๏ธ Transparency crisis: Private leaderboards (SEAL, Chatbot Arena) are opaque & hard to reproduce.

🧪 How ArxivRollBench Works

1. 🌱 Fresh Test Cases

Every 6 months we scrape the latest ArXiv preprints (e.g., Apr–Sep 2024 → ArxivRollBench-2024b).
🏷️ Domains: CS, Math, Physics, Bio, Econ, Finance, Statistics, EE.

2. 🎲 SCP Tasks (Sequencing, Cloze, and Prediction)

Articles are auto-converted into three symbolic tasks:

  • Sequencing 🔀 → Re-order shuffled sentences
  • Cloze 🕳️ → Fill masked sentences
  • Prediction 🔮 → Choose the correct next sentence
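
The three task types above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ArxivRollBench pipeline: the function names, the `[MASK]` token, the dictionary layout, and the use of externally supplied distractor sentences are all assumptions made for this example.

```python
import random
from typing import Dict, List


def make_sequencing(sentences: List[str], rng: random.Random) -> Dict:
    """Shuffle the sentences; the answer maps each original position
    to where that sentence ended up in the shuffled list."""
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    # answer[k] is the index in `shuffled` of the k-th original sentence.
    answer = [order.index(k) for k in range(len(sentences))]
    return {"task": "sequencing", "shuffled": shuffled, "answer": answer}


def make_cloze(sentences: List[str], rng: random.Random,
               mask_token: str = "[MASK]") -> Dict:
    """Replace one randomly chosen sentence with a mask token."""
    idx = rng.randrange(len(sentences))
    context = sentences[:idx] + [mask_token] + sentences[idx + 1:]
    return {"task": "cloze", "context": context, "answer": sentences[idx]}


def make_prediction(sentences: List[str], distractors: List[str],
                    rng: random.Random) -> Dict:
    """Hold out the last sentence and hide it among distractor options."""
    context, target = sentences[:-1], sentences[-1]
    options = [target] + list(distractors)
    rng.shuffle(options)
    return {"task": "prediction", "context": context,
            "options": options, "answer": options.index(target)}


# Hypothetical four-sentence passage, used only to exercise the functions.
rng = random.Random(0)
passage = ["We study X.", "Prior work uses Y.", "We propose Z.", "Z beats Y."]
seq_item = make_sequencing(passage, rng)
cloze_item = make_cloze(passage, rng)
pred_item = make_prediction(passage, ["Y beats Z.", "X is unrelated."], rng)
```

Seeding the random generator keeps item construction reproducible, which matters if a released benchmark has to be regenerated exactly.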

3. 📈 Rugged Scores (RS)

  • RS-I 🧪 = % inflation on public vs. private benchmarks
  • RS-II ⚖️ = performance variance across domains (biased training detector)
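
As a rough illustration of the two scores, the sketch below takes one plausible reading of each definition: RS-I as the relative gap (in percent) between a public-benchmark score and a private-benchmark score, and RS-II as the population variance of per-domain scores. The exact formulas and normalization used by ArxivRoll are not given here, so the function names, the choice of the private score as the denominator, and the sample numbers are assumptions.

```python
from statistics import pvariance
from typing import Mapping


def rugged_score_1(public_score: float, private_score: float) -> float:
    """Assumed RS-I: percentage by which the public score exceeds
    the private score, relative to the private score."""
    if private_score <= 0:
        raise ValueError("private_score must be positive")
    return 100.0 * (public_score - private_score) / private_score


def rugged_score_2(domain_scores: Mapping[str, float]) -> float:
    """Assumed RS-II: population variance of the per-domain scores.
    A larger value suggests uneven, domain-biased training."""
    return pvariance(list(domain_scores.values()))


# Made-up numbers, for illustration only.
rs1 = rugged_score_1(public_score=60.0, private_score=50.0)
rs2 = rugged_score_2({"CS": 50.0, "Math": 60.0})
```

Under this reading, a model with identical public and private scores has an RS-I of zero, and a model that scores the same in every domain has an RS-II of zero.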

🌟 Unique Features

  • 🕐 One-Time Use: private benchmarks are used once, then expired & open-sourced
  • ✅ High Quality: filtered for length, complexity, minimal math/tables
  • 🌍 Broad Coverage: 8 domains, ~100-word contexts, 1k+ samples per domain

๐Ÿ‘ฉโ€๐Ÿ’ป How Do I Evaluate my Model?

The most easy way is to use llm-eval-harness

Just install lm-eval from here, and then evaluate a huggingface model with:

lm_eval --model hf --model_args pretrained="your-model-name",parallelize=True --tasks arxivrollbench2024b --log_samples --output_path your-log-path
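
If you want to run the same command for several models from a script, a small helper like the one below can assemble the argument list. It is a convenience sketch only: `build_lm_eval_command` is a hypothetical name, and it uses exactly the flags shown in the command above without adding any others.

```python
import shlex
from typing import List


def build_lm_eval_command(model_name: str, output_path: str,
                          task: str = "arxivrollbench2024b") -> List[str]:
    """Assemble the lm_eval invocation shown above as an argv list."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_name},parallelize=True",
        "--tasks", task,
        "--log_samples",
        "--output_path", output_path,
    ]


cmd = build_lm_eval_command("your-model-name", "your-log-path")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To launch the evaluation, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when model names or paths contain unusual characters.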

You can also evaluate LLMs via APIs; examples are provided in ./eval/.
