
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

[EMNLP 2025 Findings] Implementation for the paper.

Chenghao Yang1,2*, Yinbo Luo2*, Zhoufutu Wen1, Qi Chu2, Tao Gong2, Longxiang Liu1, Kaiyuan Zhang1, Jianpeng Jiao1, Ge Zhang1, Wenhao Huang1, Nenghai Yu2

1ByteDance Seed, 2University of Science and Technology of China

*Equal Contribution, Corresponding Authors, Work done at ByteDance Seed


🔔 Introduction

(Figure: MARS-Bench overview)

MARS-Bench is a real-world multi-turn dialogue benchmark that exposes LLM weaknesses on ultra multi-turn, interactive multi-turn, and cross-turn tasks. Built from play-by-play sports commentary, it offers a rigorous evaluation suite along with insights into attention-sink phenomena and the role of explicit reasoning.


⚙️ Installation

To install the required packages:

git clone https://github.com/syuchin/MARS-Bench.git
cd MARS-Bench
# we recommend running the code in a conda environment
conda create -n mars-bench python=3.10
conda activate mars-bench
pip install -r requirements.txt

🚀 Quick Start

Follow these steps to get started:

1️⃣ Configure config/config.yaml with your API keys (OpenAI, DeepSeek, etc.).
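
A hypothetical sketch of what the config might look like; the actual schema is defined by config/config.yaml and config_wrapper.py, so treat every key name below as an assumption and check the shipped file for the real layout:

# config/config.yaml (hypothetical keys; consult the file in the repo)
models:
  deepseek-chat:
    api_key: "YOUR_DEEPSEEK_API_KEY"
    base_url: "https://api.deepseek.com"
  gpt-4o:
    api_key: "YOUR_OPENAI_API_KEY"
    base_url: "https://api.openai.com/v1"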

2️⃣ Run a single-game demo:

python chat/demo.py --model_name deepseek-chat --game_id 166909 --data_path data/Context_Retrieval.jsonl

3️⃣ Process a specific data file:

python chat/chat.py --input_file data/Context_Retrieval.jsonl --output_path result/chat-result/chat-test.json --model_name deepseek-chat

4️⃣ Run the evaluation:

python eval/eval.py --input result/chat-result/chat-test.json --output result/eval-result/chat-test-eval.json

🛠️ Project Structure

eval

  • eval.py: Main evaluation script; computes the benchmark metrics.
  • deepseek_api.py: Evaluation-specific DeepSeek API implementation.

chat

  • chat.py: Processes the full dataset of multi-turn dialogues.
  • demo.py: Single-game demo script.
  • models/: Contains model API wrappers (e.g., deepseek_api.py, openai_api.py).
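
For orientation, here is a minimal sketch of what such a wrapper might look like; the class and method names are hypothetical and the repo's actual wrappers may differ. DeepSeek exposes an OpenAI-compatible endpoint, so the openai client can serve both providers:

# Hypothetical wrapper sketch; see chat/models/ for the real implementations.
from openai import OpenAI

class ChatModel:
    def __init__(self, api_key: str, base_url: str, model: str):
        # DeepSeek and OpenAI both speak the OpenAI chat-completions protocol.
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model

    def chat(self, messages: list[dict]) -> str:
        # messages uses the standard [{"role": ..., "content": ...}] format.
        resp = self.client.chat.completions.create(model=self.model, messages=messages)
        return resp.choices[0].message.content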

data

  • *.jsonl: JSONL files containing the dialogue data for each task.
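
Each line in these files is one standalone JSON record. A minimal sketch for inspecting a task file (no field names inside the records are assumed):

# Peek at the first record of a task file to see which fields it carries.
import json

with open("data/Context_Retrieval.jsonl", encoding="utf-8") as f:
    record = json.loads(next(f))
print(sorted(record.keys()))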

config

  • config.yaml: API keys and model settings.
  • config_wrapper.py: Helper for loading configurations.
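
Loading such a YAML file takes only a few lines; a sketch of what config_wrapper.py presumably does (its actual interface may differ):

# Read the YAML config into a plain dict (requires PyYAML).
import yaml

with open("config/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)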

result

  • chat-result/: Dialogue outputs (e.g., demo.json).
  • eval-result/: Evaluation results.

📄 Citation

@inproceedings{MARS-Bench,
    title = "{MARS}-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation",
    author = "Yang, Chenghao  and
      Luo, Yinbo  and
      Wen, Zhoufutu  and
      Chu, Qi  and
      Gong, Tao  and
      Liu, Longxiang  and
      Zhang, Kaiyuan  and
      Jiao, Jianpeng  and
      Zhang, Ge  and
      Huang, Wenhao  and
      Yu, Nenghai",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.314/",
    pages = "5872--5898",
    ISBN = "979-8-89176-335-7",
    abstract = "Large Language Models (\textbf{LLMs}), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present \textbf{MARS-Bench}, a \textbf{M}ulti-turn \textbf{A}thletic \textbf{R}eal-world \textbf{S}cenario Dialogue \textbf{Bench}mark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenge when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction."
}
