Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study

Mohammad Khodadad¹,²,†, Ali Shiraee Kasmaee¹,²,*,†, Mahdi Astaraki¹,², Nicholas Sherck³, Hamidreza Mahyar¹, Soheila Samiee²
¹Department of Computational Science and Engineering McMaster University Canada
²BASF Canada Inc Canada
³BASF Corporation USA
{khodam3, shiraeea, astarakm, mahyarh}@mcmaster.ca {nicholas.sherck, soheila.samiee}@basf.com
*Corresponding Author: [email protected]
†Equal Contribution

Abstract

In this study, we introduce a new benchmark consisting of a curated dataset and a defined evaluation process to assess the compositional reasoning capabilities of large language models within the chemistry domain. We designed and validated a fully automated pipeline, verified by subject matter experts, to facilitate this task. Our approach integrates OpenAI reasoning models with named entity recognition (NER) systems to extract chemical entities from recent literature, which are then augmented with external knowledge bases to form a comprehensive knowledge graph. By generating multi-hop questions across these graphs, we assess LLM performance in both context-augmented and non-context-augmented settings. Our experiments reveal that even state-of-the-art models face significant challenges in multi-hop compositional reasoning. The results reflect the importance of augmenting LLMs with document retrieval, which can substantially improve their performance. However, even perfect retrieval accuracy with full context does not eliminate reasoning errors, underscoring the complexity of compositional reasoning. This work not only benchmarks and highlights the limitations of current LLMs but also presents a novel data generation pipeline capable of producing challenging reasoning datasets across various domains. Overall, this research advances our understanding of reasoning in computational linguistics.

1 Introduction

Large Language Models (LLMs) have achieved impressive performance on a wide range of tasks, yet their ability to perform complex, multi-step reasoning remains an ongoing challenge. Techniques such as chain-of-thought (CoT) prompting [1, 2, 3, 4, 5] and structural innovations [6, 7, 8] have enabled notable improvements in reasoning, particularly in mathematics and coding. OpenAI’s o-series [9, 10] was among the first to introduce inference-time scaling of CoT reasoning depth, and subsequent open-source models such as DeepSeek R1 [11, 12] and Qwen QwQ [13] have adopted similar strategies. Notably, these models also leverage reinforcement learning during training to further refine their CoT reasoning, consistently improving performance with increased test-time compute.

To evaluate the reasoning capabilities of LLMs, the community relies on a suite of benchmarks requiring multi-hop reasoning, spanning domains from mathematics [14, 15] and programming [16, 17, 18] to general question answering, including open-book tasks such as HotpotQA [19] and StrategyQA [20]. Recent advances in scaling reasoning have improved results on these benchmarks. However, the resulting models remain largely general-purpose. In scientific fields such as chemistry, multi-hop reasoning is essential for integrating interconnected, domain-specific knowledge. Although several multi-hop question answering benchmarks exist, evaluations specific to chemical reasoning are limited [21, 22, 23]. Recent work, including datasets targeting subfields such as reticular chemistry [24], highlights the need for more comprehensive and challenging domain-specific benchmarks.

To address this gap, we propose an automated multi-hop reasoning data generation pipeline that leverages OpenAI’s o3-mini and gpt-4o models. Our pipeline systematically extracts and verifies chemical entities via named entity recognition (NER) and links them to external databases to construct a knowledge graph, from which challenging multi-hop question-answer pairs are generated. Our contributions are as follows:

1. We provide extensive experimental evidence that compositional reasoning in scientific domains remains a significant limitation for current state-of-the-art LLMs.
2. We demonstrate that even retrieval augmentation with perfect context does not guarantee flawless reasoning due to inherent compositional complexities.
3. We introduce an automated knowledge graph and multi-hop reasoning data generation pipeline, leveraging OpenAI models and NER, which can potentially generate unlimited domain-specific datasets.
4. We contribute a challenging benchmark for multi-hop QA in chemistry.

The complete Q&A dataset, together with the evaluation code, is available at https://github.com/MohammadKhodadad/ChemKGMultiHopQA.

2 Background and Related Works

Multi-hop question answering (QA) has evolved as a key method to evaluate the multi-step reasoning abilities of large language models [25]. Answering multi-hop questions requires integrating multiple pieces of evidence. Traditionally, datasets such as HotpotQA [19], WikiHop, and MedHop [26] were manually or semi-automatically curated through crowd-sourcing and knowledge-base relations. While these approaches yield high-quality, human-validated questions, they are resource-intensive.

More recently, LLMs have been leveraged to generate multi-hop datasets automatically. In many cases, single-hop QA pairs are first generated and later merged using entity linking techniques, a common approach that connects individual entities across questions. For example, the MuSiQue framework [27] fuses two QA pairs by linking a named entity from the first answer to the subsequent question, thereby forming a chain of reasoning. Other methods, such as MultiHop-RAG [28], extend this paradigm by incorporating retrieval-augmented generation (RAG) to paraphrase factual sentences and group them based on shared topics, reflecting the diverse strategies that are emerging in multi-hop QA generation.

Chemistry Domain.

The chemical sciences pose distinct challenges for multi-hop QA due to the need for expert domain knowledge. Only a small fraction of HotpotQA’s questions (roughly 900) are chemistry-focused, limiting both domain relevance and topical currency. Furthermore, these questions are mostly limited to 2-hop reasoning, which caps their difficulty. More specialized datasets, such as ChemLitQA [21], provide around 1,000 single-hop and 700 multi-hop question-answer pairs generated using LLMs based on ChemRxiv papers. Although the ChemLitQA-multi dataset includes multi-hop questions, these are typically defined by a single linked entity across all hops. This limitation led the authors of [21] to emphasize the need for future studies focused on developing more challenging multi-hop QA benchmarks in the field.

Similarly, the chemistry subset of the OlympicArena benchmark [22] provides high-level Olympiad problems, but these are few in number and are not explicitly designed for multi-hop reasoning.

Figure 1: An overview of the knowledge graph generation pipeline.

Knowledge Graph Generation.

Automated knowledge graph construction from unstructured text has seen diverse approaches aimed at effectively capturing entities and their interrelations. Early systems, such as Grapher [29], take an end-to-end approach by first generating nodes with a fine-tuned pretrained language model and then forming edges via sequential generation or classification techniques. Building on these ideas, frameworks like the Extract-Define-Canonicalize (EDC) approach [30] employ a three-phase strategy: they extract relational triplets without a fixed schema, generate natural language definitions for each relation, and then standardize and merge equivalent triplets, sometimes with an additional refinement stage that leverages retrieval techniques.

Complementing these general-purpose methods, dynamic and domain-specific approaches address specialized challenges in KG construction. For instance, KG-MRC [31] models the evolution of entity states in procedural texts by integrating neural reading comprehension with recurrent graph updates to capture temporal changes. In parallel, domain-specific systems such as CEAR [32] incorporate tailored ontologies and specialized knowledge to generate more accurate graphs from the scientific literature. Similarly, Cai et al. [33] demonstrate an iterative, coarse-to-fine refinement process that adapts a broad biomedical knowledge graph to specialized domains like oncology, reducing reliance on manual annotations while preserving essential domain nuances.

3 Methodology

Our methodology consists of three main components: knowledge graph generation, multi-hop question-answer generation, and evaluating state-of-the-art large language models on the question-answering task. The first two components are described in detail in this section, and the last one is explained in the next section.

3.1 Knowledge Graph Generation

We began by constructing a comprehensive knowledge graph from chemical literature. First, using the ChemRxiv API, we collected all ChemRxiv articles with licenses that permitted redistribution. Next, we cleaned the articles using regular expressions to extract their introductions. Focusing on objective and factual information, we extracted each introduction’s first few paragraphs (up to 500 words). Finally, we segmented the extracted text into chunks of up to 128 words, ensuring that no paragraph was split across chunks.
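The truncation and chunking steps described above can be sketched as follows. This is a minimal illustration rather than the released pipeline code: the function names are hypothetical, words are counted by whitespace splitting, and two behaviors the text does not specify are assumed, namely that a paragraph which would push the introduction past 500 words is dropped, and that a single paragraph longer than 128 words becomes its own oversized chunk.

```python
def truncate_paragraphs(paragraphs, max_words=500):
    """Keep leading paragraphs while the running word count stays within max_words."""
    kept, total = [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        if total + n_words > max_words:
            break  # assumption: drop the paragraph that would exceed the budget
        kept.append(paragraph)
        total += n_words
    return kept


def chunk_paragraphs(paragraphs, max_words=128):
    """Greedily pack whole paragraphs into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)  # paragraphs are never split
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    intro = ["alpha " * 60, "beta " * 60, "gamma " * 60]  # three 60-word paragraphs
    for i, chunk in enumerate(chunk_paragraphs(truncate_paragraphs(intro))):
        print(i, len(chunk.split()))
```

With three 60-word paragraphs, the first two share a chunk (120 words) and the third starts a new one, since adding it would exceed the 128-word limit.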

Next, we applied named entity recognition (NER) models to these text chunks to identify chemical entities. In particular, we utilized an NER model [34] that leverages a PubMedBERT architecture [35] fine-tuned on various chemical datasets. To ensure that the extracted entities were specific, verified, and chemically relevant, we utilized OpenAI’s gpt-4o to review and refine the outputs. The same model was also employed to extract relations between these verified entities, forming triplets that capture the interactions and associations present in the text. Additionally, large language models were utilized to extract descriptive features from textual data associated with each entity. To enrich the nodes further during the construction of the knowledge graph, supplementary information from Wikipedia and the PubChem dataset [36] was integrated. Consequently, the finalized knowledge graph comprises nodes representing chemical entities, enhanced by metadata and descriptive annotations from these sources, as well as edges representing the relationships extracted from the textual data. Figure 1 illustrates the procedure followed to generate the knowledge graph.
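One way to picture the resulting structure is a small container holding entity nodes with metadata and source-tagged edges. The sketch below is an assumption about the data layout, not the authors' implementation: the class, its method names, and the example entities and metadata values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    # entity name -> metadata dictionary (e.g., descriptions from external sources)
    nodes: dict = field(default_factory=dict)
    # list of (head, relation, tail, source_id) tuples extracted from text chunks
    edges: list = field(default_factory=list)

    def add_triplet(self, head, relation, tail, source_id):
        """Register both entities and record the relation with its source chunk."""
        for entity in (head, tail):
            self.nodes.setdefault(entity, {})
        self.edges.append((head, relation, tail, source_id))

    def enrich(self, entity, **metadata):
        """Attach supplementary metadata to a node."""
        self.nodes.setdefault(entity, {}).update(metadata)

    def adjacency(self):
        """Return node -> list of (relation, neighbor, source_id) for traversal."""
        adjacency = {}
        for head, relation, tail, source_id in self.edges:
            adjacency.setdefault(head, []).append((relation, tail, source_id))
        return adjacency


if __name__ == "__main__":
    kg = KnowledgeGraph()
    kg.add_triplet("compound A", "inhibits", "enzyme B", "chunk-001")  # placeholder triplet
    kg.enrich("compound A", description="placeholder description")
    print(kg.nodes)
    print(kg.adjacency())
```

Keeping the source identifier on every edge is what later allows path sampling to require that each hop comes from a different text.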

3.2 Multi-hop Question-Answer Generation

To generate multi-hop questions, we first sampled paths of varying lengths from the constructed knowledge graph using a randomized breadth-first search (BFS) path sampling algorithm. During path sampling, we ensured that the sources for the edges were distinct, encouraging solutions to integrate information from multiple sources and different parts of the context to answer the questions. Therefore, each path with a length of K involves K+1 entities, coming from K distinct source texts extracted from the original ChemRxiv database.
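A minimal sketch of such a sampler is given below. The exact randomization scheme is not specified in the text, so this version assumes a breadth-first search over partial paths with neighbors shuffled at each expansion; the adjacency format, the function name, and the toy graph are hypothetical. The sketch enforces the two constraints stated above: no entity is revisited, and every edge on a path comes from a different source.

```python
import random
from collections import deque


def sample_path(graph, start, k, rng):
    """Return a path of k edges starting at `start`, or None if none exists.

    graph: dict mapping node -> list of (relation, neighbor, source_id).
    Each returned edge is a (head, relation, tail, source_id) tuple.
    """
    queue = deque([(start, [])])  # (current node, edges collected so far)
    while queue:
        node, edges = queue.popleft()
        if len(edges) == k:
            return edges
        visited = {start} | {edge[2] for edge in edges}
        used_sources = {edge[3] for edge in edges}
        neighbors = list(graph.get(node, []))
        rng.shuffle(neighbors)  # randomize the expansion order
        for relation, neighbor, source_id in neighbors:
            if neighbor in visited or source_id in used_sources:
                continue  # skip revisited entities and repeated sources
            queue.append((neighbor, edges + [(node, relation, neighbor, source_id)]))
    return None


if __name__ == "__main__":
    toy_graph = {
        "A": [("r1", "B", "doc1"), ("r2", "C", "doc1")],
        "B": [("r3", "D", "doc1"), ("r4", "E", "doc2")],
        "E": [("r5", "F", "doc3")],
    }
    print(sample_path(toy_graph, "A", 3, random.Random(0)))
```

In the toy graph, the edge from B to D is rejected because it shares a source with the first hop, so the only valid 3-hop path visits four entities (A, B, E, F) drawn from three distinct sources.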

Adopting a bottom-up approach, we began by sampling paths and generating individual 1-hop questions from each hop. Specifically, for every $(\text{entity}_{1},\text{relation},\text{entity}_{2})$ triplet, we formulated a corresponding question in which $\text{entity}_{1}$ served as the answer: the prompt asked for the entity that holds the specified relation to $\text{entity}_{2}$. When a question lacked sufficient specificity, we instructed an LLM to enrich it with additional metadata or context from the original text, thereby enhancing clarity and precision.

Model	Context	Correctness Rate (%)	Avg Duration (s)	Avg Input Tokens	Avg Output Tokens	Total Input Tokens (K)	Total Output Tokens (K)
Anthropic Claude Sonnet 3.5 V2	✗	40.06	1.54	567	29	550.93	28.69
Anthropic Claude Sonnet 3.5 V2	✓	72.50	1.68	2210	30	2146.11	29.18
Anthropic Claude Sonnet 3.7	✗	44.80	1.61	567	30	550.93	29.35
Anthropic Claude Sonnet 3.7	✓	80.02	1.84	2210	30	2146.11	29.49
Anthropic Claude Sonnet 3.7 (Thinking)	✗	45.73	39.01	583	1777	566.09	1725.79
Anthropic Claude Sonnet 3.7 (Thinking)	✓	84.35	15.35	2228	715	2163.59	694.78
OpenAI GPT-4o-mini	✗	32.34	0.63	204	9	198.60	9.63
OpenAI GPT-4o-mini	✓	62.82	0.71	1628	10	1581.57	10.01
OpenAI GPT-4o	✗	40.27	0.63	204	9	198.60	9.53
OpenAI GPT-4o	✓	68.80	0.71	1628	10	1581.57	9.95
OpenAI o1-mini	✗	41.09	7.78	160	1047	155.88	1017.55
OpenAI o1-mini	✓	71.99	5.68	1609	718	1562.70	697.41
OpenAI o3-mini	✗	47.58	10.84	210	1187	204.43	1153.12
OpenAI o3-mini	✓	80.33	6.12	1634	558	1587.40	542.46
Mistral Large	✗	35.53	0.41	177	13	172.45	13.40
Mistral Large	✓	73.94	0.57	1913	14	1857.70	14.22
Llama 3.3 70B Instruct	✗	32.13	0.33	330	10	320.47	10.56
Llama 3.3 70B Instruct	✓	65.19	0.40	1781	11	1729.91	10.75
Google Gemma 3 27B	✗	32.03	0.89	163	12	158.95	11.94
Google Gemma 3 27B	✓	69.72	1.00	1587	12	1541.55	12.57
DeepSeek R1	✗	44.39	21.06	159	1466	154.40	1423.73
DeepSeek R1	✓	81.98	8.61	1551	573	1506.14	556.55
Qwen QwQ 32B	✗	35.74	68.29	168	2167	163.51	2104.86
Qwen QwQ 32B	✓	79.81	25.18	1665	757	1617.45	735.86
DeepSeek R1 Distill Qwen 32B	✗	34.19	32.04	159	1074	154.70	1043.56
DeepSeek R1 Distill Qwen 32B	✓	79.09	12.11	1633	400	1586.25	389.31

Table 1: Performance of the tested models across several evaluation metrics in the contextual (✓) and non-contextual (✗) setups.

These individual questions were then combined into a single multi-hop question using OpenAI’s o3-mini model. Importantly, the final aggregated question was constructed to begin with the last sub-question and chain the entities up to the $\text{entity}_{1}$ of the first relation, ensuring that the final answer corresponds to the answer of the first question. During the verification phase, each one-hop question was first reviewed for clarity, relevance to chemistry, and alignment with the corresponding text that provided the answer. The multi-hop question was then assessed through an additional evaluation step, ensuring that its logical flow effectively led to the final answer. An LLM-based verification process was employed to confirm factual accuracy, answerability based on available context and metadata, and the logical coherence of the sub-questions. Feedback from domain experts was continuously incorporated into the prompts to enhance verification accuracy. To minimize ambiguity, questions that were answered incorrectly by all evaluated models were excluded from the benchmark. Figure 2 illustrates the detailed pipeline of Multi-hop QA generation.
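The ordering of the composition can be illustrated with a purely template-based stand-in. In the pipeline this step is performed by o3-mini, so the function below is only a sketch of the chaining logic under simplified assumptions: the function name, the question template, and the placeholder entities are hypothetical.

```python
def compose_multi_hop(triplets):
    """Nest one-hop questions so that only the last tail entity is named explicitly.

    triplets: list of (entity_1, relation, entity_2) along a path, where the
    tail of each triplet is the head of the next one.
    Returns (question, answer), with the answer being the head of the first triplet.
    """
    # Start from the last sub-question: its tail is the only anchor given to the solver.
    description = triplets[-1][2]
    # Walk back toward the first relation, replacing each intermediate entity
    # with a description that must be resolved by answering a sub-question.
    for _head, relation, _tail in reversed(triplets[1:]):
        description = f"the entity that {relation} {description}"
    question = f"Which entity {triplets[0][1]} {description}?"
    return question, triplets[0][0]


if __name__ == "__main__":
    path = [("A", "binds", "B"), ("B", "inhibits", "C")]  # placeholder entities
    print(compose_multi_hop(path))
```

For the two-hop placeholder path, the intermediate entity B never appears in the question: the solver must first identify what inhibits C and only then determine what binds it, arriving at A.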

To minimize the impact of writing style and summarization on accuracy evaluation, all questions are designed to have short answers. Answering these questions requires breaking down the main question into smaller sub-questions, finding the answer to each, and combining them to arrive at the final answer. Even with full context available, a correct answer cannot be obtained if the model is not capable of inferring and integrating different pieces of knowledge.

4 Experiments and Results

In our experiments, we evaluated the domain-specific multi-hop question-answering capabilities of a wide variety of state-of-the-art large language models, including both reasoning-focused and general-purpose models. For clarity, throughout this work we refer to models specifically optimized to scale test-time compute as reasoning models. These included open-source and proprietary variants, tested with or without provided context. The summary of tested models and their performance is provided in Table 1.

To access the models selected for evaluation, we used different API providers: (i) all tested OpenAI models (gpt-4o, gpt-4o-mini, o1-mini, and o3-mini) were accessed via the OpenAI platform; (ii) the Amazon Bedrock platform was used to access Anthropic Sonnet 3.7 (with and without extended thinking), Anthropic Sonnet 3.5 V2, Mistral Large, DeepSeek R1, and Llama 3.3 70B Instruct; and (iii) Google Gemma 3 27B, Qwen QwQ 32B, and DeepSeek R1 Distill Qwen 32B were accessed via the OpenRouter platform and operate at bf16 precision. All of these models can perform function-calling tool use, so they were instructed to produce valid JSON outputs to ensure consistency and enable automated validation. All models were evaluated in two settings: with and without provided context. The first scenario reflects performance when the models are paired with an ideal retrieval-augmented generation (RAG) system, while the second scenario relies on the model’s internal memory to answer the questions. After parsing the JSON, we checked whether the output was an exact match to the ground truth. If not, OpenAI GPT-4o was instructed to perform a binary assessment of whether the answer was correct. Together, these two checks determine the Correctness Rate (%) metric. Our dataset comprises 971 questions spanning 1 to 4 hops (on average 245 questions per hop), generated using the approach described in Section 3.2.
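The two-stage scoring described above might look like the sketch below. Several details are assumptions rather than facts from the paper: the JSON field is taken to be named "answer", the exact-match check ignores case and surrounding whitespace, unparseable outputs are counted as incorrect, and the `judge` argument is a hypothetical callable standing in for the GPT-4o binary assessment.

```python
import json


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def score_response(raw_output, ground_truth, judge, answer_key="answer"):
    """Return True if the model output is judged correct."""
    try:
        answer = json.loads(raw_output)[answer_key]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # assumption: invalid or incomplete JSON counts as incorrect
    if normalize(str(answer)) == normalize(ground_truth):
        return True  # stage 1: exact match
    return bool(judge(str(answer), ground_truth))  # stage 2: LLM-based binary check


def correctness_rate(results):
    """Percentage of questions scored as correct."""
    return 100.0 * sum(results) / len(results)


if __name__ == "__main__":
    # Toy judge that accepts a single known synonym pair; a real judge would call an LLM.
    def toy_judge(answer, truth):
        return {answer.lower(), truth.lower()} == {"ethanol", "ethyl alcohol"}

    outputs = [
        ('{"answer": "Benzene"}', "benzene"),
        ('{"answer": "ethanol"}', "ethyl alcohol"),
        ("not valid json", "toluene"),
    ]
    results = [score_response(raw, truth, toy_judge) for raw, truth in outputs]
    print(results, correctness_rate(results))
```

Running exact match first keeps the judge out of the loop for the easy cases, so the LLM-based check is only needed for paraphrases and synonyms.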

4.1 Models Performance

Figure 3 illustrates the performance of 13 large language models evaluated with respect to correctness rate, cost, and latency in both the context-provided and context-not-provided setups. The Llama 3.3 70B Instruct and GPT-4o models achieved the lowest cost and notably low latency, but they also registered the lowest correctness rates, making them cost-efficient yet less accurate options. In contrast, Claude Sonnet 3.7 (with extended thinking) achieved the highest correctness rate when context is provided (84.35%), albeit at the expense of significantly higher cost and latency. Meanwhile, both Qwen QwQ 32B and DeepSeek R1 Distill Qwen 32B maintained a favorable balance between cost and correctness rate when context is provided (i.e., when equipped with a perfect RAG system), though they incurred above-average latency. Claude Sonnet 3.7 had the highest correctness rate among non-reasoning models.

In the context-not-provided setup, open-source reasoning models (DeepSeek R1 Distill Qwen 32B, Qwen QwQ 32B, and DeepSeek R1) tend to rank lower than other models. For OpenAI models we observed the opposite trend, which may indicate richer pre-training data. For Claude Sonnet 3.7, extended thinking did not lead to meaningfully better performance than the no-thinking setup when context is not provided (45.73% versus 44.80%); instead, it increased the output tokens and cost. For a comprehensive breakdown of model performance and experimental settings, refer to Table 1, which details metrics such as correctness rate, latency, and token usage. Note that the output token counts for reasoning models also include the reasoning tokens.
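As a reading aid for the token columns of Table 1, the short check below relates the total and average columns. It assumes that each total covers all 971 questions; the function name is hypothetical, and the three rows are copied from the table.

```python
NUM_QUESTIONS = 971  # dataset size reported in Section 4


def average_from_total(total_thousands, num_questions=NUM_QUESTIONS):
    """Recover a per-question average from a total reported in thousands of tokens."""
    return total_thousands * 1000 / num_questions


# (total tokens in thousands, reported average) taken from Table 1
rows = {
    "Claude Sonnet 3.5 V2, no context, input": (550.93, 567),
    "OpenAI o3-mini, no context, output": (1153.12, 1187),
    "OpenAI GPT-4o-mini, no context, output": (9.63, 9),
}

for label, (total_k, reported_avg) in rows.items():
    average = average_from_total(total_k)
    print(f"{label}: {average:.2f} per question (reported {reported_avg})")
```

For these rows, the recomputed averages agree with the reported ones once the fractional part is dropped, which suggests that the average columns are truncated rather than rounded.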

4.2 Comparison with HotpotQA

To demonstrate that large language models, even those designed for reasoning, often struggle with domain-specific multi-hop questions, we created a chemistry-related subset of HotpotQA [19], a well-known general text benchmark primarily sourced from Wikipedia. We sampled chemistry questions by starting from Wikipedia’s Chemistry category, recursively exploring its subcategories (up to three levels), and filtering HotpotQA based on exact title matches. To maintain consistency with our evaluation scheme, we excluded distractors and included only supporting documents as context. Figure 4 illustrates the average performance of all models on each dataset under two conditions: with context provided and without context in the prompt. The results indicate that when context is provided, models achieve similar performance on the two datasets, with our benchmark yielding marginally lower average performance and reduced variability across models. In the setup without context, the models found our benchmark more challenging than the HotpotQA chemistry subset. This may be because HotpotQA was built exclusively from Wikipedia, which has been used in the pre-training of all evaluated models, while the new benchmark is constructed from more recent ChemRxiv papers enriched with PubChem and Wikipedia.
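The subset construction can be sketched as a depth-limited traversal followed by a title filter. To stay self-contained, the sketch works on in-memory dictionaries instead of querying Wikipedia, and it uses a simplified example record with a hypothetical `supporting_titles` field rather than the actual HotpotQA schema. It also assumes that a question is kept when any of its supporting titles matches, which the text does not specify.

```python
from collections import deque


def collect_titles(root, subcategories, pages, max_depth=3):
    """Gather page titles from `root` and its subcategories up to max_depth levels down.

    subcategories: dict category -> list of child categories.
    pages: dict category -> list of page titles directly in that category.
    """
    seen = {root}
    titles = set(pages.get(root, []))
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in subcategories.get(category, []):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                titles.update(pages.get(child, []))
                queue.append((child, depth + 1))
    return titles


def filter_examples(examples, titles):
    """Keep examples with at least one supporting title that exactly matches."""
    return [ex for ex in examples if any(t in titles for t in ex["supporting_titles"])]


if __name__ == "__main__":
    subcats = {"Chemistry": ["Organic chemistry"], "Organic chemistry": ["Alkenes"]}
    pages = {"Chemistry": ["Chemistry"], "Alkenes": ["Ethylene"]}
    examples = [
        {"question": "q1", "supporting_titles": ["Ethylene", "Plastic"]},
        {"question": "q2", "supporting_titles": ["Jazz"]},
    ]
    titles = collect_titles("Chemistry", subcats, pages)
    print([ex["question"] for ex in filter_examples(examples, titles)])
```

Tracking visited categories matters in practice because category hierarchies are not strict trees, so the same subcategory can be reachable along several routes.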

5 Analysis and Ablation

This section presents detailed ablation studies and analyses of the benchmark. First, we examine how context availability and test-time reasoning affect model performance and efficiency. Next, we explore how the number of reasoning hops, an indicator of question difficulty, impacts model correctness and the number of tokens needed to generate an answer.

5.1 Context and Reasoning

In this analysis, we compare the performance of reasoning and non-reasoning models (i.e., models with and without test-time reasoning capabilities, respectively) in two scenarios: with context provided and without context provided. As illustrated in Figure 5-A, providing context significantly improves model performance, nearly doubling the correctness of both reasoning and non-reasoning models. Additionally, reasoning models outperform non-reasoning models in correctly answering questions, benefiting further from their reasoning capabilities in the context-provided setup. As expected and shown in Figure 5-B, non-reasoning models are faster than reasoning models. Although non-reasoning models benefit significantly from context, their latency did not change significantly when context was provided. This could be explained by the fact that the number of output tokens in these models is not sensitive to the length of the input prompt or the availability of context (see Figure S6). For reasoning models, we observed that the availability of context lowers latency and output token count, likely because available context streamlines the thought process. For further analysis of context and reasoning in these models, refer to the Appendix.

5.2 Impact of the Number of Hops

In this section, we investigate how the number of reasoning hops influences the correctness rate and the output token count. Figure 6 presents the results for the context-provided setup. For reasoning models, Figure 6-A illustrates the distribution of answer correctness in relation to the number of generated output tokens. The first observation is that as the number of hops increases, the output token count, which reflects the number of thinking tokens, also increases. However, the answer correctness rate remains relatively constant for multi-hop questions, albeit slightly lower than that observed in single-hop scenarios. For single-hop questions with context provided, we see a decrease in the correctness rate as the output token count increases, indicating a performance trade-off associated with deeper reasoning in simple questions. These trends are not present when context is not provided to the models, as shown in Supplementary Figure S7-A. In non-reasoning models, the output token count is not sensitive to the context or complexity of the question; thus, these models are evaluated solely based on answer correctness. Figure 6-B depicts the distribution of answer correctness rate across the evaluated reasoning models for different numbers of hops. Single-hop questions are answered at a higher correctness rate than multi-hop questions, a trend that is not observed in the setup without context (Supplementary Figure S7-B).
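A per-hop breakdown of this kind can be computed from per-question evaluation records, as in the sketch below. The record fields (`hops`, `correct`, `output_tokens`) and the function name are hypothetical, and the sample records are invented for illustration only; they are not results from the benchmark.

```python
from collections import defaultdict


def summarize_by_hops(records):
    """Group per-question records by hop count and aggregate two metrics."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["hops"]].append(record)
    summary = {}
    for hops, rows in sorted(grouped.items()):
        summary[hops] = {
            # share of correctly answered questions, in percent
            "correctness_rate": 100.0 * sum(r["correct"] for r in rows) / len(rows),
            # mean number of generated tokens (including reasoning tokens, if any)
            "avg_output_tokens": sum(r["output_tokens"] for r in rows) / len(rows),
        }
    return summary


if __name__ == "__main__":
    toy_records = [  # invented values, for illustration only
        {"hops": 1, "correct": True, "output_tokens": 100},
        {"hops": 1, "correct": False, "output_tokens": 300},
        {"hops": 2, "correct": True, "output_tokens": 500},
    ]
    for hops, stats in summarize_by_hops(toy_records).items():
        print(hops, stats)
```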

6 Conclusion

In this study, we developed a domain-specific multi-hop question-answering (QA) system and evaluated state-of-the-art large language models within the chemistry domain. Our findings reveal that these models struggle with in-domain multi-hop scientific questions, correctly answering fewer than half of the queries when the context is unavailable. Although reasoning fine-tuned models show marginally improved performance, they still face significant challenges. The incorporation of context leads to substantial enhancements, nearly doubling the performance of both reasoning and non-reasoning models. However, even with context, no model, including those fine-tuned for reasoning, achieved a perfect score. Additionally, we contribute to the field by proposing an automated pipeline that integrates advanced named entity recognition with knowledge graph construction to generate intricate multi-hop reasoning tasks, which were utilized for the benchmark. Notably, this potentially domain-agnostic framework can be adapted for various fields by replacing chemistry-specific named entity recognition with suitable alternatives, laying a robust foundation for future research focused on improving reasoning capabilities across diverse specialized domains.

Limitations

Like any research, our study has certain limitations. Specifically, our benchmarking was conducted in two setups: without context and with full context provided. However, in real-world applications, each piece of context is typically collected step by step throughout the reasoning process. This raises the possibility of partial context retrieval, especially for non-reasoning models, which generally rely on a single retrieval round prior to generating answers. Equipping language models with a retrieval-augmented generation pipeline and allowing for multi-step retrieval could introduce a third scenario that better resembles real-world applications. In the past couple of years, advancements in reasoning models have led to various approaches for integrating generative models with RAG systems for multi-step retrieval [28, 37, 38, 39].

Building on these developments, future research could focus on designing and validating an industrial-grade, multi-step retrieval pipeline tailored to chemical text. Such a system would support incremental context acquisition and structured reasoning over lengthy or loosely organized documents, thereby enabling accurate answers to more challenging chemical queries.

Acknowledgements

The author(s) gratefully acknowledge the financial support provided by MITACS under funding number IT32409 for the research leading to the publication of this article. We also thank Adam Wojciech Bartwiki for project management support, Tobias Roth for his essential contributions to the infrastructure, and Stephen Dokas for his invaluable recommendations in the chemistry domain.

References

[1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[2] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[3] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
[4] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024.
[5] Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, et al. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought. arXiv preprint arXiv:2501.04682, 2025.
[6] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.
[7] Artur d’Avila Garcez and Luis C Lamb. Neurosymbolic ai: the 3rd wave. arXiv e-prints, pages arXiv–2012, 2020.
[8] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. Advances in neural information processing systems, 30, 2017.
[9] OpenAI. Openai o1 system card, 2024. Accessed: 2025-03-20.
[10] OpenAI. Openai o3 mini system card, 2024. Accessed: 2025-03-20.
[11] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
[12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[13] Avinash Patil. Advancing reasoning in large language models: Promising methods and approaches. arXiv preprint arXiv:2502.03671, 2025.
[14] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[15] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[16] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[17] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[18] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
[19] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
[20] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
[21] Geemi Wellawatte, Huixuan Guo, Magdalena Lederbauer, Anna Borisova, Matthew Hart, Marta Brucka, and Philippe Schwaller. Chemlit-qa: A human evaluated dataset for chemistry rag tasks. In AI for Accelerated Materials Design-NeurIPS 2024.
[22] Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253, 2024.
[23] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
[24] Zhiling Zheng, Nakul Rampal, Theo Jaffrelot Inizan, Christian Borgs, Jennifer T Chayes, and Omar M Yaghi. Large language models for reticular chemistry. Nature Reviews Materials, pages 1–13, 2025.
[25] Vaibhav Mavi, Anubhav Jangra, Adam Jatowt, et al. Multi-hop question answering. Foundations and Trends® in Information Retrieval, 17(5):457–586, 2024.
[26] Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302, 2018.
[27] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
[28] Yixuan Tang and Yi Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.
[29] Igor Melnyk, Pierre Dognin, and Payel Das. Knowledge graph generation from text. arXiv preprint arXiv:2211.10511, 2022.
[30] Bowen Zhang and Harold Soh. Extract, define, canonicalize: An llm-based framework for knowledge graph construction. arXiv preprint arXiv:2404.03868, 2024.
[31] Rajarshi Das, Tsendsuren Munkhdalai, Xingdi Yuan, Adam Trischler, and Andrew McCallum. Building dynamic knowledge graphs from text using machine reading comprehension. arXiv preprint arXiv:1810.05682, 2018.
[32] Stefan Langer, Fabian Neuhaus, and Andreas Nürnberger. Cear: Automatic construction of a knowledge graph of chemical entities and roles from scientific literature. arXiv preprint arXiv:2407.21708, 2024.
[33] Wenxiong Liao, Zhengliang Liu, Yiyang Zhang, Xiaoke Huang, Fei Qi, Siqi Ding, Hui Ren, Zihao Wu, Haixing Dai, Sheng Li, et al. Coarse-to-fine knowledge graph domain adaptation based on distantly-supervised iterative training. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1294–1299. IEEE, 2023.
[34] Pedro Ruas and Francisco M Couto. Nilinker: attention-based approach to nil entity linking. Journal of Biomedical Informatics, 132:104137, 2022.
[35] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing, 2020.
[36] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. Pubchem in 2021: new data content and improved web interfaces. Nucleic acids research, 49(D1):D1388–D1395, 2021.
[37] Zhengliang Shi, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, and Zhaochun Ren. Generate-then-ground in retrieval-augmented generation for multi-hop question answering. arXiv preprint arXiv:2406.14891, 2024.
[38] Xiaoming Zhang, Ming Wang, Xiaocui Yang, Daling Wang, Shi Feng, and Yifei Zhang. Hierarchical retrieval-augmented generation model with rethink for multi-hop question answering. arXiv preprint arXiv:2408.11875, 2024.
[39] Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, and Wentao Zhang. Hoprag: Multi-hop reasoning for logic-aware retrieval-augmented generation. arXiv preprint arXiv:2502.12442, 2025.

Appendix

Appendix S1 Detailed Performance Based on Context Availability

Figure S1 shows that model performance is strongly influenced by whether context is provided in the input. In particular, Claude Sonnet 3.7 with extended thinking achieved a correctness rate of 84% with context, whereas o3-mini recorded the highest correctness rate (48%) when context was absent. Note that o3-mini was primarily used to generate the questions, which may have introduced a slight bias in its favor. Figures S2 and S3 illustrate the token usage and latency of the models, respectively.

S1.1 Reasoning without Provided Context

Figure S4 illustrates the performance of the models when no context is provided. As shown, o3-mini achieves the highest correctness rate, although at a higher cost. Note that o3-mini was primarily used for question generation, which may give it a slight advantage. In contrast, Llama 3.3 70B Instruct records the lowest cost and the lowest correctness rate, while also exhibiting low latency. Meanwhile, Claude Sonnet 3.7 (without thinking tokens) balances correctness and cost with acceptably low latency, and it achieves the highest correctness rate among the non-reasoning models.

Appendix S2 Performance of Models on the Chemistry Subset of HotpotQA

Table S1 reports each model's correctness rate, latency, and token usage on the chemistry subset of HotpotQA [19], both with and without context provided.

Model	Context	Correctness Rate (%)	Avg Duration (s)	Avg Input Tokens	Avg Output Tokens	Total Input Tokens (K)	Total Output Tokens (K)
Anthropic Claude Sonnet 3.5 V2	✗	53.27	1.26	517	29	507.47	29.20
Anthropic Claude Sonnet 3.5 V2	✓	84.80	1.32	618	29	605.73	28.85
Anthropic Claude Sonnet 3.7	✗	58.27	1.87	517	29	507.47	28.49
Anthropic Claude Sonnet 3.7	✓	86.22	1.94	618	29	605.73	28.71
Anthropic Claude Sonnet 3.7 (Thinking)	✗	65.31	17.54	539	726	528.48	711.49
Anthropic Claude Sonnet 3.7 (Thinking)	✓	87.14	10.10	640	390	627.29	382.77
OpenAI GPT-4o-mini	✗	45.61	0.43	170	7	166.96	7.72
OpenAI GPT-4o-mini	✓	80.31	0.50	257	8	252.39	8.00
OpenAI GPT-4o	✗	55.10	0.62	170	8	166.96	8.76
OpenAI GPT-4o	✓	81.33	0.63	257	8	252.39	8.72
OpenAI o1-mini	✗	50.82	4.82	127	719	124.82	705.00
OpenAI o1-mini	✓	85.51	3.41	217	426	213.30	418.09
OpenAI o3-mini	✗	59.69	9.65	166	977	163.04	957.92
OpenAI o3-mini	✓	87.65	3.74	253	247	248.47	242.51
Mistral Large	✗	4.59	0.49	198	18	194.44	17.86
Mistral Large	✓	0.92	0.36	303	12	297.72	12.32
Llama 3.3 70B Instruct	✗	44.49	0.32	284	9	278.92	9.04
Llama 3.3 70B Instruct	✓	79.29	0.29	373	9	365.78	9.03
Google Gemma 3 27B	✗	40.51	0.77	127	10	124.58	9.86
Google Gemma 3 27B	✓	79.80	0.82	218	11	214.04	11.26
DeepSeek R1	✗	59.08	8.93	125	612	122.76	600.15
DeepSeek R1	✓	85.10	5.12	212	358	208.25	351.01
Qwen QwQ 32B	✗	51.94	27.91	126	865	124.45	848.04
Qwen QwQ 32B	✓	88.37	11.01	219	412	215.34	403.94
DeepSeek R1 Distill Qwen 32B	✗	46.63	16.20	119	565	117.10	553.77
DeepSeek R1 Distill Qwen 32B	✓	86.84	7.98	208	287	204.53	282.17

Table S1: Performance of the tested models on the HotpotQA chemistry subset across several evaluation metrics, with context (✓) and without context (✗).
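The effect of context in Table S1 can be made explicit by subtracting the no-context correctness rate from the with-context rate for each model. The following is a minimal sketch of that arithmetic for a handful of rows; the correctness values are copied from Table S1, while the dictionary layout and the `context_gain` helper are illustrative names and not part of the original evaluation code.

```python
# Correctness rates (%) copied from Table S1 as (without context, with context).
# The selection of rows and the variable names are illustrative assumptions.
table_s1 = {
    "Claude Sonnet 3.7 (Thinking)": (65.31, 87.14),
    "OpenAI GPT-4o": (55.10, 81.33),
    "OpenAI o3-mini": (59.69, 87.65),
    "Llama 3.3 70B Instruct": (44.49, 79.29),
    "Qwen QwQ 32B": (51.94, 88.37),
}


def context_gain(no_ctx: float, with_ctx: float) -> float:
    """Absolute gain, in percentage points, from providing the context."""
    return round(with_ctx - no_ctx, 2)


gains = {model: context_gain(*rates) for model, rates in table_s1.items()}

# Print models from the largest to the smallest gain.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:30s} +{gain:.2f} pp")
```

For these rows, the gain ranges from roughly 22 to 36 percentage points, with the models that start from a lower no-context rate tending to gain more.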

Appendix S3 A Multi-Hop QA Generation Example

Figure S5 illustrates a typical multi-hop QA example derived from our knowledge-graph-based methodology. The context is drawn from chemical literature discussing the use of carbon dioxide as a renewable feedstock for formic acid, which then serves as a non-gaseous CO surrogate in carbonylation reactions. By chaining these facts together, our approach constructs a question that requires integrating multiple pieces of information to arrive at the correct answer. This demonstrates how multi-hop reasoning, guided by entity relations and supplemented with descriptive metadata, enables more complex question generation and evaluation of large language models. Additionally, Figure 2 shows the step-by-step process of deriving multi-hop questions from a knowledge graph, illustrating how entities, relations, and descriptive metadata are combined to construct more complex queries.

Context:
Carbonylation reactions constitute a potent tool to manufacture carboxylic acids and their derivatives both in industry and academic organic synthesis. In general, carbonylation requires the use of toxic carbon monoxide, which thus usually demands certified high-pressure reaction vessels. Therefore, developing non-gaseous CO surrogate for conducting safe and facile-operation carbonylation is an important and ongoing research topic. Among these established CO surrogates, formic acid is one kind of versatile atom. The utilization of carbon dioxide as a C1 feedstock for the generation of industrially relevant chemicals is also an interesting approach. CO₂ is an attractive renewable C1 source, which can lead to formic acid. Those approaches would not only reduce carbon dioxide emissions through carbon capture but also compensate sequestration costs by producing chemicals in global demand.

Question:
What is the process that uses a substance, produced from carbon dioxide and known as the simplest carboxylic acid with antibacterial and preservative properties, as a non-gaseous surrogate to safely form carboxylic acids and their derivatives under mild conditions?

Answer: carbonylation reactions

Sentence-level supporting facts:
1) formic acid is the simplest carboxylic acid with antibacterial and preservative properties.
2) formic acid can be produced from carbon dioxide.
3) formic acid can act as a non-gaseous CO surrogate.
4) carbonylation reactions safely produce carboxylic acids under mild conditions using formic acid as a CO surrogate.

Path (multi-hop chain of reasoning):
carbon dioxide $\rightarrow$ formic acid $\rightarrow$ carbonylation reactions

Figure S5: An example of a multi-hop question-answer.
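To make the structure of such an example concrete, the sketch below stores the Figure S5 question, answer, supporting facts, and path in a small record and runs two sanity checks on it. The class name, its fields, the definition of a hop as one edge between consecutive path entities, and the well-formedness checks are all assumptions made for illustration; they do not describe the actual data schema or validation used in the pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class MultiHopExample:
    """Hypothetical container for one generated multi-hop QA item."""
    question: str
    answer: str
    path: list[str]
    supporting_facts: list[str] = field(default_factory=list)

    @property
    def num_hops(self) -> int:
        # Assumption: one hop per edge between consecutive path entities.
        return len(self.path) - 1

    def is_well_formed(self) -> bool:
        # Check 1: the answer is the last entity on the reasoning path.
        ends_at_answer = self.path[-1].lower() == self.answer.lower()
        # Check 2: every path entity appears in at least one supporting fact.
        facts = [fact.lower() for fact in self.supporting_facts]
        all_grounded = all(
            any(entity.lower() in fact for fact in facts)
            for entity in self.path
        )
        return ends_at_answer and all_grounded


example = MultiHopExample(
    question=(
        "What is the process that uses a substance, produced from carbon "
        "dioxide and known as the simplest carboxylic acid with antibacterial "
        "and preservative properties, as a non-gaseous surrogate to safely "
        "form carboxylic acids and their derivatives under mild conditions?"
    ),
    answer="carbonylation reactions",
    path=["carbon dioxide", "formic acid", "carbonylation reactions"],
    supporting_facts=[
        "formic acid is the simplest carboxylic acid with antibacterial "
        "and preservative properties.",
        "formic acid can be produced from carbon dioxide.",
        "formic acid can act as a non-gaseous CO surrogate.",
        "carbonylation reactions safely produce carboxylic acids under mild "
        "conditions using formic acid as a CO surrogate.",
    ],
)

print(example.num_hops, example.is_well_formed())
```

Under the edge-counting assumption, the three-entity path in Figure S5 corresponds to a two-hop question.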

Appendix S4 Impact of Context and Reasoning on Output Token Count

Figure S6 illustrates the impact of context availability on the average number of output tokens generated by reasoning and non-reasoning models when answering questions. Non-reasoning models produce a similar number of tokens in both settings, as they do not engage in test-time reasoning. In contrast, reasoning models generate fewer tokens when context is available, suggesting that less reasoning is needed to reach an answer in that setting.
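The same kind of comparison can be carried out on the HotpotQA chemistry subset by dividing the average output tokens with context by the average without context. The sketch below does this for several rows of Table S1; note that these values come from Table S1 and not from the data behind Figure S6, and that the grouping into reasoning and non-reasoning models and all variable names are illustrative assumptions.

```python
# Average output tokens copied from Table S1 as (without context, with context).
# These are HotpotQA-subset values, not the data plotted in Figure S6.
reasoning = {
    "Claude Sonnet 3.7 (Thinking)": (726, 390),
    "OpenAI o1-mini": (719, 426),
    "OpenAI o3-mini": (977, 247),
    "DeepSeek R1": (612, 358),
    "Qwen QwQ 32B": (865, 412),
}
non_reasoning = {
    "Claude Sonnet 3.7": (29, 29),
    "OpenAI GPT-4o": (8, 8),
    "Llama 3.3 70B Instruct": (9, 9),
}


def token_ratio(no_ctx: int, with_ctx: int) -> float:
    """Ratio of with-context to no-context output tokens (1.0 means no change)."""
    return with_ctx / no_ctx


reasoning_ratios = {m: token_ratio(*t) for m, t in reasoning.items()}
non_reasoning_ratios = {m: token_ratio(*t) for m, t in non_reasoning.items()}

for group, ratios in (("reasoning", reasoning_ratios),
                      ("non-reasoning", non_reasoning_ratios)):
    for model, ratio in ratios.items():
        print(f"{group:14s} {model:30s} ratio = {ratio:.2f}")
```

In these rows, every reasoning model has a ratio below 1, while the listed non-reasoning models are unchanged, which is consistent with the trend described for Figure S6.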

S4.1 Performance Analysis Based on Number of Hops

Figure S7 illustrates how the number of hops affects performance in the absence of context. As the number of hops increases, the correctness rate declines while the number of output tokens rises. Additionally, both Figure S7 and its with-context counterpart show a negative correlation between token count and correctness for one-hop questions, which may suggest that overanalyzing simpler questions leads to errors in the final answer. Figures S8 and S9 show the overall performance of all models and how the correctness rate varies with the number of reasoning hops. Token usage and latency are depicted separately in Figures S10, S11, S12, and S13 to provide a more detailed view of efficiency and resource consumption as the reasoning depth increases.
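A per-hop breakdown of this kind amounts to grouping the evaluated questions by hop count and aggregating correctness and token usage within each group. The following is a minimal sketch under the assumption that each evaluated question is available as a `(num_hops, is_correct, output_tokens)` tuple; the function name, the record layout, and the toy records are hypothetical, and the toy values are made-up placeholders rather than results from this study.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_hops(records):
    """Aggregate (num_hops, is_correct, output_tokens) records per hop count."""
    groups = defaultdict(list)
    for num_hops, is_correct, output_tokens in records:
        groups[num_hops].append((is_correct, output_tokens))

    summary = {}
    for num_hops in sorted(groups):
        rows = groups[num_hops]
        summary[num_hops] = {
            "n": len(rows),
            # Share of correctly answered questions, in percent.
            "correctness_rate": 100.0 * sum(c for c, _ in rows) / len(rows),
            "avg_output_tokens": mean(t for _, t in rows),
        }
    return summary


# Made-up placeholder records, used only to show the call; not paper results.
toy_records = [
    (1, True, 120), (1, False, 400),
    (2, True, 300), (2, False, 500), (2, True, 350), (2, False, 450),
]

summary = summarize_by_hops(toy_records)
for hops, stats in summary.items():
    print(hops, stats)
```

The resulting per-hop dictionary can then be plotted or tabulated separately for the with-context and no-context runs.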

The following figures illustrate each model’s performance, latency, and token usage for question clusters that require different numbers of hops to answer.