📐 Live Benchmark — Updated Regularly
LiveMathematicianBench

A live benchmark for evaluating LLMs' capability as mathematicians, testing comprehension of research-level theorems from the latest arXiv papers.

Last updated: March 23, 2026

Model Leaderboard

(Charts: Overall Accuracy · Accuracy by Month · Accuracy by Category, one series per model.)
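
These leaderboard views are simple aggregations over per-question results. Below is a minimal sketch of how they could be computed, assuming a hypothetical results table with columns model, month, categories (multi-valued; see the note under Benchmark Overview), and correct; this is not the benchmark's published evaluation code:

```python
import pandas as pd

# Toy per-attempt results; the column names are assumptions, not the
# benchmark's released schema.
results = pd.DataFrame({
    "model": ["model-a", "model-a", "model-b", "model-b"],
    "month": ["2026-02", "2026-03", "2026-02", "2026-03"],
    "categories": [["Existence"], ["Existence", "Uniqueness"],
                   ["Inequality or Bound"], ["Universal"]],
    "correct": [True, False, True, True],
})

overall = results.groupby("model")["correct"].mean()              # Overall Accuracy
by_month = results.groupby(["model", "month"])["correct"].mean()  # Accuracy by Month
# A multi-category question counts toward each category it belongs to:
by_category = (results.explode("categories")
                      .groupby(["model", "categories"])["correct"].mean())
```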

Benchmark Overview

A dataset of research-level mathematics MCQs, updated monthly and derived from recent arXiv publications.

(Charts: Questions per Month · Category Distribution (All Months) · Category Breakdown by Month. Note: a single question may belong to multiple categories.)

Detailed Statistics

(Table: detailed statistics by category. *A single question may belong to multiple categories.)

Tasks

Browse MCQs from the benchmark. Each question is derived from a theorem in a real arXiv paper.

About the Benchmark

Understanding the design and methodology behind LiveMathematicianBench.

What is it?

LiveMathematicianBench is a live, continuously updated benchmark that evaluates LLMs on their ability to understand and reason about cutting-edge mathematical theorems from newly published arXiv preprints.

Why "Live"?

New papers appear on arXiv every month. We extract theorems from these papers and generate multiple-choice questions that test deep mathematical understanding, ensuring that models cannot rely on memorized training data.
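
To make the "live" claim concrete, here is a minimal sketch of one plausible ingestion step: pulling recent math preprints through the public arXiv API (via the third-party `arxiv` Python client) and scanning their LaTeX source for theorem environments. The helper names and the regex are assumptions for illustration, not the benchmark's published pipeline:

```python
import re
import tarfile

import arxiv  # third-party client for the public arXiv API

THEOREM_RE = re.compile(r"\\begin\{theorem\}(.*?)\\end\{theorem\}", re.DOTALL)

def recent_math_papers(n: int = 20):
    """Yield the n most recently submitted math.* preprints."""
    search = arxiv.Search(
        query="cat:math.*",
        max_results=n,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    yield from arxiv.Client().results(search)

def theorems_in_source(path: str) -> list[str]:
    """Collect raw theorem bodies from a paper's source tarball.
    (Some submissions ship a single gzipped .tex file instead of a
    tarball; a real pipeline would handle that case too.)"""
    bodies: list[str] = []
    with tarfile.open(path) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                tex = tar.extractfile(member).read().decode("utf-8", "ignore")
                bodies.extend(THEOREM_RE.findall(tex))
    return bodies

for paper in recent_math_papers(5):
    archive = paper.download_source()  # downloads the LaTeX source archive
    for theorem in theorems_in_source(archive):
        print(paper.entry_id, theorem[:80].strip())
```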

Question Format

Questions and choices are constructed from theorem statements and proof sketches extracted from arXiv papers. Each question has five carefully crafted choices (one correct, one weaker-but-true, and three false). Only the question and choices are used as input for the model—the original theorem and proof sketch are not provided.
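
This format maps naturally onto a small record type. Below is a minimal sketch with hypothetical field names (the released data schema is not specified here); note that to_prompt renders only the question and choices, per the rule above:

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str         # self-contained question derived from the theorem
    choices: list[str]    # five options: 1 correct, 1 weaker-but-true, 3 false
    answer: int           # index of the correct choice (held out at test time)
    categories: set[str]  # a question may carry several category labels
    source: str           # arXiv identifier of the originating paper

def to_prompt(item: MCQ) -> str:
    """Render only the question and choices; the original theorem and
    proof sketch are deliberately withheld from the model."""
    labels = "ABCDE"
    lines = [item.question, ""]
    lines += [f"{labels[i]}. {choice}" for i, choice in enumerate(item.choices)]
    lines.append("\nAnswer with a single letter.")
    return "\n".join(lines)
```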

Theorem Categories

  • Asymptotic or Limit
  • Biconditional or Equivalence
  • Classification or Bijection
  • Existence
  • Existential–Universal
  • Implication
  • Inequality or Bound
  • Uniqueness
  • Universal
  • Other
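
Because a single result can match several of these patterns (an existence-and-uniqueness theorem, for instance), the labels behave as a set rather than a single tag, consistent with the note in the statistics above. An illustrative sketch; the enum is an assumption that simply mirrors the list verbatim:

```python
from enum import Enum

class Category(Enum):
    ASYMPTOTIC_OR_LIMIT = "Asymptotic or Limit"
    BICONDITIONAL_OR_EQUIVALENCE = "Biconditional or Equivalence"
    CLASSIFICATION_OR_BIJECTION = "Classification or Bijection"
    EXISTENCE = "Existence"
    EXISTENTIAL_UNIVERSAL = "Existential–Universal"
    IMPLICATION = "Implication"
    INEQUALITY_OR_BOUND = "Inequality or Bound"
    UNIQUENESS = "Uniqueness"
    UNIVERSAL = "Universal"
    OTHER = "Other"

# "There exists a unique minimizer ..." gets two tags at once:
labels = {Category.EXISTENCE, Category.UNIQUENESS}
```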