
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

[EMNLP 2025 Findings] Implementation for the paper.

Chenghao Yang1,2*, Yinbo Luo2*, Zhoufutu Wen1, Qi Chu2, Tao Gong2, Longxiang Liu1, Kaiyuan Zhang1, Jianpeng Jiao1, Ge Zhang1, Wenhao Huang1, Nenghai Yu2

1ByteDance Seed, 2University of Science and Technology of China

*Equal Contribution, Corresponding Authors, Work done at ByteDance Seed


🔔 Introduction

(Figure: MARS-Bench overview)

MARS-Bench is a real-world multi-turn dialogue benchmark that exposes LLM weaknesses on ultra multi-turn, interactive multi-turn, and cross-turn tasks. Built from play-by-play sports commentary, it offers a rigorous evaluation suite along with insights into attention-sink phenomena and the role of explicit reasoning.


⚙️ Installation

To install the required packages:

git clone https://github.com/syuchin/MARS-Bench.git
cd MARS-Bench
# we recommend running the code in a conda environment
conda create -n mars-bench python=3.10
conda activate mars-bench
pip install -r requirements.txt

🚀 Quick Start

Follow these steps to get started:

1️⃣ Configure config/config.yaml with your API keys (OpenAI, DeepSeek, etc.).
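
A hypothetical sketch of what the config might look like; the actual schema is defined by config/config.yaml and config_wrapper.py, so treat every key name below as an assumption and check the shipped file for the real layout:

# config/config.yaml (hypothetical keys; consult the file in the repo)
models:
  deepseek-chat:
    api_key: "YOUR_DEEPSEEK_API_KEY"
    base_url: "https://api.deepseek.com"
  gpt-4o:
    api_key: "YOUR_OPENAI_API_KEY"
    base_url: "https://api.openai.com/v1"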

2️⃣ Run a single-game demo:

python chat/demo.py --model_name deepseek-chat --game_id 166909 --data_path data/Context_Retrieval.jsonl

3️⃣ Process a specific data file:

python chat/chat.py --input_file data/Context_Retrieval.jsonl --output_path result/chat-result/chat-test.json --model_name deepseek-chat

4️⃣ Run the evaluation:

python eval/eval.py --input result/chat-result/chat-test.json --output result/eval-result/chat-test-eval.json

🛠️ Project Structure

eval

  • eval.py: Main evaluation script; computes the benchmark metrics.
  • deepseek_api.py: Evaluation-specific DeepSeek API implementation.

chat

  • chat.py: Processes the full dataset of multi-turn dialogues.
  • demo.py: Single-game demo script.
  • models/: Contains model API wrappers (e.g., deepseek_api.py, openai_api.py).
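
For orientation, here is a minimal sketch of what such a wrapper might look like; the class and method names are hypothetical and the repo's actual wrappers may differ. DeepSeek exposes an OpenAI-compatible endpoint, so the openai client can serve both providers:

# Hypothetical wrapper sketch; see chat/models/ for the real implementations.
from openai import OpenAI

class ChatModel:
    def __init__(self, api_key: str, base_url: str, model: str):
        # DeepSeek and OpenAI both speak the OpenAI chat-completions protocol.
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model

    def chat(self, messages: list[dict]) -> str:
        # messages uses the standard [{"role": ..., "content": ...}] format.
        resp = self.client.chat.completions.create(model=self.model, messages=messages)
        return resp.choices[0].message.content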

data

  • *.jsonl: JSONL files containing the dialogue data for each task.
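
Each line in these files is one standalone JSON record. A minimal sketch for inspecting a task file (no field names inside the records are assumed):

# Peek at the first record of a task file to see which fields it carries.
import json

with open("data/Context_Retrieval.jsonl", encoding="utf-8") as f:
    record = json.loads(next(f))
print(sorted(record.keys()))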

config

  • config.yaml: API keys and model settings.
  • config_wrapper.py: Helper for loading configurations.
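
Loading such a YAML file takes only a few lines; a sketch of what config_wrapper.py presumably does (its actual interface may differ):

# Read the YAML config into a plain dict (requires PyYAML).
import yaml

with open("config/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)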

result

  • chat-result/: Dialogue outputs (e.g., demo.json).
  • eval-result/: Evaluation results.

📄 Citation

@inproceedings{MARS-Bench,
    title = "{MARS}-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation",
    author = "Yang, Chenghao  and
      Luo, Yinbo  and
      Wen, Zhoufutu  and
      Chu, Qi  and
      Gong, Tao  and
      Liu, Longxiang  and
      Zhang, Kaiyuan  and
      Jiao, Jianpeng  and
      Zhang, Ge  and
      Huang, Wenhao  and
      Yu, Nenghai",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.314/",
    pages = "5872--5898",
    ISBN = "979-8-89176-335-7",
    abstract = "Large Language Models (\textbf{LLMs}), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present \textbf{MARS-Bench}, a \textbf{M}ulti-turn \textbf{A}thletic \textbf{R}eal-world \textbf{S}cenario Dialogue \textbf{Bench}mark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenge when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction."
}
