Chenghao Yang1,2*‡, Yinbo Luo2*, Zhoufutu Wen1†, Qi Chu2†, Tao Gong2, Longxiang Liu1, Kaiyuan Zhang1, Jianpeng Jiao1, Ge Zhang1, Wenhao Huang1, Nenghai Yu2
1ByteDance Seed, 2University of Science and Technology of China
*Equal Contribution, †Corresponding Authors, ‡Work done at ByteDance Seed
MARS-Bench is a real-world multi-turn dialogue benchmark that exposes LLM weaknesses on ultra multi-turn, interactive multi-turn, and cross-turn tasks. Built from play-by-play sports commentary, it provides a rigorous evaluation suite together with insights into attention-sink phenomena and the role of explicit reasoning.
To install the required packages:
git clone https://github.com/syuchin/MARS-Bench.git
cd MARS-Bench
# we recommend running the code in a conda environment
conda create -n mars-bench python=3.10
conda activate mars-bench
pip install -r requirements.txt

To get started quickly:
1️⃣ Configure config/config.yaml with your API keys (OpenAI, DeepSeek, etc.); a hedged example is sketched after this list.
2️⃣ Run a single-game demo:
python chat/demo.py --model_name deepseek-chat --game_id 166909 --data_path data/Context_Retrieval.jsonl
3️⃣ Process a specific data file:
python chat/chat.py --input_file data/Context_Retrieval.jsonl --output_path result/chat-result/chat-test.json --model_name deepseek-chat
4️⃣ Conduct evaluation:
python eval/eval.py --input result/chat-result/chat-test.json --output result/eval-result/chat-test-eval.json
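For step 1️⃣, the exact keys in config/config.yaml are defined by the repository's config_wrapper.py; the sketch below is only illustrative, assuming per-model API-key and base-URL fields (the field names here are hypothetical):

```yaml
# Hypothetical sketch of config/config.yaml.
# Field names are illustrative, not the repo's actual schema;
# check the config.yaml shipped with MARS-Bench for the real keys.
models:
  deepseek-chat:
    api_key: "sk-..."                      # your DeepSeek API key
    base_url: "https://api.deepseek.com"
  gpt-4o:
    api_key: "sk-..."                      # your OpenAI API key
    base_url: "https://api.openai.com/v1"
```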
The repository is organized as follows:

chat/
  chat.py: Processes the full dataset for multi-turn dialogues.
  demo.py: Single-game demo script.
  models/: Model API wrappers (e.g., deepseek_api.py, openai_api.py).
eval/
  eval.py: Main evaluation script; computes the metrics.
  deepseek_api.py: Evaluation-specific DeepSeek API implementation.
data/
  *.jsonl: JSONL files with the dialogue data for each task (a hypothetical record is sketched below).
config/
  config.yaml: API keys and model settings.
  config_wrapper.py: Helper for loading configurations.
result/
  chat-result/: Dialogue outputs (e.g., demo.json).
  eval-result/: Evaluation results.
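The task files are standard JSONL: one JSON object per line. The actual schema is defined by the dataset; going only by the CLI flags above (e.g., --game_id), a record presumably carries a game identifier plus the dialogue turns built from the play-by-play commentary. A purely hypothetical example line:

```json
{"game_id": 166909, "task": "Context_Retrieval", "turns": [{"role": "user", "content": "Who scored the last goal?"}, {"role": "assistant", "content": "..."}]}
```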
@inproceedings{MARS-Bench,
title = "{MARS}-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation",
author = "Yang, Chenghao and
Luo, Yinbo and
Wen, Zhoufutu and
Chu, Qi and
Gong, Tao and
Liu, Longxiang and
Zhang, Kaiyuan and
Jiao, Jianpeng and
Zhang, Ge and
Huang, Wenhao and
Yu, Nenghai",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.314/",
pages = "5872--5898",
ISBN = "979-8-89176-335-7",
abstract = "Large Language Models (\textbf{LLMs}), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present \textbf{MARS-Bench}, a \textbf{M}ulti-turn \textbf{A}thletic \textbf{R}eal-world \textbf{S}cenario Dialogue \textbf{Bench}mark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenge when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction."
}
