GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning

Luu Quy Tung
GreenNode.ai
[email protected]
Hoang Quoc Viet
GreenNode.ai
[email protected]
Vo Trong Thu
John Von Neumann Institute
[email protected]
Corresponding author: [email protected]
Abstract

Chain-of-Thought (CoT) is a robust approach for tackling LLM tasks that require intermediate reasoning steps before generating a final answer. In this paper, we present GreenMind-Medium-14B-R1 (https://huggingface.co/GreenNode/GreenMind-Medium-14B-R1), a Vietnamese reasoning model fine-tuned with a strategy based on Group Relative Policy Optimization (GRPO). We also leverage a high-quality synthesized Vietnamese reasoning dataset and design two reward functions to tackle the main limitations of this technique: i) language mixing, where we explicitly detect the presence of biased language characters during token sampling, and ii) reasoning quality, where we leverage Sentence Transformer-based models to ensure that the generated reasoning content maintains factual correctness and does not distort the final output. Experimental results on the Vietnamese dataset from the VLSP 2023 Challenge demonstrate that our model outperforms prior works and enhances linguistic consistency in its responses. Furthermore, we extend our evaluation to SeaExam, a multilingual multiple-choice dataset, showing the effectiveness of our reasoning method compared to few-shot prompting techniques.

1 Introduction

The rapid advancement of Large Language Models (LLMs) has transformed the approach to handling complex tasks such as question answering and multiple-choice problems. Many open-source LLMs have demonstrated impressive capabilities in natural language understanding. However, for tasks like question answering and multiple-choice reasoning, prompting a model to produce only a direct answer often fails to ensure accuracy. At each generation step, a model relies on the probability distribution over candidate tokens and selects one by greedy or random sampling. Consequently, producing only a short sequence of tokens as the final output does not guarantee correctness, as these distributions are conditioned solely on the preceding tokens. This implies that the model often lacks the context necessary for reasoning toward a correct answer. To address this issue, the CoT technique Wei et al. (2022b) remains an effective approach to fully leverage the power of next-token prediction. CoT encourages the model to articulate a sequence of intermediate reasoning steps, which facilitates the resolution of tasks that require multi-step logical thinking. To further enhance the reasoning capabilities of language models, a series of reinforcement learning-based methods have been proposed. Reinforcement Learning from Human Feedback (RLHF) Ouyang et al. (2022) leverages human-provided feedback to refine LLM outputs, ensuring that the reasoning steps generated by CoT align more closely with human judgment. Proximal Policy Optimization (PPO) balances exploration and exploitation by updating the policy with a clipped objective function, which helps avoid large, destabilizing changes while enhancing CoT reasoning across multiple steps.

In this study, we introduce GreenMind-Medium-14B-R1, a fine-tuned LLM capable of reasoning for tasks within the Vietnamese community. Our model leverages the GRPO technique Shao et al.; Guo et al. (2025), which has been shown to enhance reasoning effectiveness for the CoT method as well as reduce computational costs. However, a limitation of this approach is its inability to control for the linguistic bias (typically toward English and Chinese) inherent in the base models, which means that generated responses may contain characters from the languages that dominate the training data. Additionally, quality control of the reasoning process is not addressed in the original work Guo et al. (2025), which may lead to content that drifts from the original query. To tackle these challenges, we augment each sample in the training dataset with a synthesized sequence of reasoning steps produced by a state-of-the-art LLM for reasoning tasks. We then re-check the data against the label of each sample. We design two reward functions: one for language checking, which uses a banned-letter dictionary, and another for reasoning content, which employs Sentence Transformer models to measure the semantic similarity between the generated response and the corresponding reasoning data.

Our contributions are described as follows:

  • We propose algorithms and utilize our Vietnamese reasoning dataset to address the issue of language bias and ensure strict control over the reasoning content.

  • We release a medium-sized Vietnamese reasoning model with 14.7 billion parameters, achieving an overall accuracy of over 70% on multiple-choice datasets, including the VLSP 2023 Challenge Le et al. (2024) and SeaExam Li et al. (2024).

  • We also conduct experiments across multiple languages and demonstrate that reasoning-based answering significantly improves accuracy compared to few-shot learning techniques.

2 Related Work

2.1 Chain-of-Thought

Chain‑of‑Thought (CoT) prompting Wei et al. (2022a) was introduced to encourage models to “think step by step”, providing a few exemplars with intermediate reasoning steps to improve multi‑step inference. Empirical results show that CoT significantly boosts performance on arithmetic, commonsense, and symbolic reasoning benchmarks, with a 540‑billion‑parameter model achieving state‑of‑the‑art accuracy on GSM8K using just eight CoT exemplars. Follow‑up work on self‑consistency decoding samples multiple reasoning paths and selects the most consistent answer, yielding substantial gains on GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%), and ARC‑challenge (+3.9%) Wang et al. (2022). These studies reveal that structured intermediate reasoning can be an emergent capability in sufficiently large models.
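
The selection step of self-consistency decoding can be illustrated with a short Python sketch. It assumes the final answers have already been extracted from each sampled reasoning path; the function name and the example answers are hypothetical and are not taken from Wang et al. (2022).

```python
from collections import Counter
from typing import Optional


def self_consistent_answer(sampled_answers: list[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer among sampled reasoning paths."""
    # Drop paths from which no final answer could be extracted.
    valid = [a for a in sampled_answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(answer, count)]; equal counts keep first-seen order.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled chains of thought.
answers = ["18", "18", "20", None, "18"]
print(self_consistent_answer(answers))
```

A plain majority vote is the simplest aggregation rule; weighting each path by its likelihood is a possible variation that this sketch does not cover.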

2.2 Vietnamese Large Language Models

While the domain of open-source models for the Vietnamese language is relatively nascent, there are already some notable models available. These include Vietcuna-3B (https://huggingface.co/vilm/vietcuna-3b), Vietcuna-7B-v3 (https://huggingface.co/vilm/vietcuna-7b-v3), URA-LLaMA-7B (https://huggingface.co/ura-hcmut/ura-llama-7b), and URA-LLaMA-13B (https://huggingface.co/ura-hcmut/ura-llama-13b). Vietcuna-3B and Vietcuna-7B-v3 were developed from the foundational models BLOOMZ-3B (https://huggingface.co/bigscience/bloomz-3b) and BLOOMZ-7B1 (https://huggingface.co/bigscience/bloomz-7b1) Scao et al. (2022), respectively, and were further trained using 12GB of Vietnamese news texts for causal language modeling (https://www.vilm.org/research/how-did-we-train-vietcuna). This process included fine-tuning with 200K instructional question-and-answer pairs and 400K conversational samples. The URA-LLaMA models, originating from LLaMA-2, were pre-trained on Vietnamese content from Wikipedia and online news sources, with additional fine-tuning for instruction following. Furthermore, Nguyen et al. (2023) recently introduced the PhoGPT series, a new addition to the open-source generative models for Vietnamese, which includes a base 7.5B-parameter model and its instruction-following variant.

2.3 Group Relative Policy Optimization

Reinforcement learning (RL) is a subfield of Machine Learning (ML) in which an agent learns to make decisions through interactions with its environment, aiming to maximize cumulative rewards. When applied to LLMs, RL helps fine-tune these models to better align with human preferences and improve their performance on specialized tasks that require complex reasoning processes. A key category of RL algorithms is policy optimization, which focuses on directly refining the policy, i.e., the decision-making strategy an agent follows in different states. Group Relative Policy Optimization (GRPO) was introduced in DeepSeekMath Shao et al. with the aim of improving the reasoning abilities of LLMs, especially in mathematical problem-solving and code generation. The reward is the source of the training signal and guides the optimization direction in reinforcement learning. To train DeepSeek-R1-Zero Guo et al. (2025), the authors implemented a rule-based reward mechanism comprising two primary reward types:

  • Format rewards: This function is used to evaluate the model’s ability to generate responses that adhere to the desired structure.

  • Accuracy rewards: This function is used to evaluate whether the extracted result (obtained from the response using a heuristic or structure-based method) matches the ground truth.
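
To make the "group relative" part of GRPO concrete, the sketch below normalizes the rewards of several completions sampled for the same prompt by the mean and standard deviation of that group, which is how the advantage is commonly described for outcome rewards. The function name, the epsilon term, and the use of the population standard deviation are assumptions of this sketch, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical summed rewards for four completions sampled from one prompt.
rewards = [2.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.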

3 Vietnamese Reasoning Dataset

3.1 Problem Definition

Figure 1: Reasoning Data Curation

In this section, we concentrate on curating high-quality Vietnamese reasoning tasks with verifiable answers. Each instance in the dataset consists of a question-answer instruction $i \in I$, where $I$ represents the space of instruction problems. The objective is to generate both a final answer $a \in A$ and a corresponding reasoning chain $r \in R$. We define a reasoning chain $r$ as a structured sequence of intermediate steps $\{s_1, s_2, \ldots, s_n\}$, where each step $s_i$ constitutes a logical deduction that incrementally bridges the initial question to the final answer. To enrich factual correctness and coverage, we also retrieve supplementary context $c \in C$ from the web using Google Search (https://google.com). This additional context serves to enhance both the precision and depth of the model’s reasoning.

Formally, the reasoning process can be modeled as a function:

$$f : I \times C \to R \times A$$
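
As an illustration of this signature, the following Python sketch models one dataset instance and the mapping $f$ as a callable type. The class name, field names, and the placeholder generator `toy_f` are hypothetical; they only mirror the symbols $i$, $c$, $r$, and $a$ defined above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReasoningSample:
    instruction: str            # i in I
    context: str                # c in C, retrieved from the web
    reasoning: tuple[str, ...]  # r = (s_1, ..., s_n)
    answer: str                 # a in A


# f : I x C -> R x A, written as a Python callable type.
ReasoningFn = Callable[[str, str], tuple[tuple[str, ...], str]]


def build_sample(f: ReasoningFn, instruction: str, context: str) -> ReasoningSample:
    """Apply f to an instruction and its context, then package the result."""
    reasoning, answer = f(instruction, context)
    return ReasoningSample(instruction, context, reasoning, answer)


def toy_f(instruction: str, context: str) -> tuple[tuple[str, ...], str]:
    # Placeholder standing in for the LLM-backed generator.
    return (("2 + 3 = 5", "5 * 2 = 10"), "10")


sample = build_sample(toy_f, "Compute (2 + 3) * 2.", "")
print(sample.answer, len(sample.reasoning))
```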

3.2 Instruction Selection

To ensure the robustness and generalizability of the reasoning capabilities of GreenMind, we design a multi-stage pipeline for selecting and curating instruction problems tailored for Vietnamese logical reasoning. Our selection process emphasizes linguistic diversity, logical depth, and cultural relevance. Specifically, we adopt the following criteria for instruction selection:

  • Task Type Diversity: We include a broad range of reasoning tasks such as arithmetic word problems, commonsense inference, symbolic logic, deductive and inductive reasoning, multi-hop question answering, and ethical dilemma evaluation.

  • Linguistic Complexity: Instructions are sampled across varying syntactic and lexical complexities to challenge the model’s understanding of nuanced Vietnamese expressions.

  • Reasoning Depth: We prioritize tasks that require multi-step deductions, analogical thinking, and counterfactual reasoning over those solvable with shallow pattern matching.

  • Verifiability: Each instruction-answer pair is manually verified or derived from trusted Vietnamese educational and encyclopedic sources, ensuring factual accuracy and clarity in logical steps.

3.3 Reasoning Chain Generation

Beyond high-quality instructions, the generation of structured, verifiable reasoning chains is essential for training large language models capable of logical inference and multi-step deduction. To curate such high-quality solutions, we adopt an automated pipeline that incorporates web-scale retrieval to ensure factual correctness and logical coherence.

Given an instruction $i \in I$, we first retrieve supplementary context $c \in C$ from the web. This retrieved information often includes relevant definitions, background knowledge, or factual references that are not explicitly included in the instruction. The context $c$ serves as external knowledge to support the reasoning process, especially in tasks requiring factual grounding or domain-specific expertise.

Subsequently, we generate a reasoning chain $r = \{s_1, s_2, \ldots, s_n\}$, where each step $s_i$ is a logically valid and interpretable inference that incrementally bridges the gap between the given question and the final answer $a \in A$. These steps are structured to reflect a natural flow of thought, ensuring that the reasoning path remains traceable, coherent, and grounded in both the instruction and the retrieved context.

To ensure the quality of the generated reasoning chains, we apply a multi-stage filtering and validation process:

  • Consistency Check: We verify that the reasoning steps logically lead to the final answer and are internally consistent.

  • Redundancy Elimination: Duplicate or unnecessary steps are pruned to maintain conciseness without sacrificing interpretability.

  • Format Conformity: The reasoning chain must follow a step-by-step format to ensure compatibility with chain-of-thought (CoT) supervision.
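
The three checks above could be chained roughly as in the Python sketch below. Every heuristic here is an assumption made for illustration: the step prefix pattern, the duplicate detection by normalized text, and the consistency proxy (the last step must mention the answer) are not described in the paper, whose actual validation may rely on an LLM or on manual review.

```python
import re
from typing import Optional

# Hypothetical step prefix such as "Bước 1:" (Vietnamese) or "Step 1:".
STEP_PATTERN = re.compile(r"^(?:Bước|Step)\s+\d+\s*[:.]")


def drop_redundant_steps(steps: list[str]) -> list[str]:
    """Remove steps that repeat an earlier one, ignoring case and spacing."""
    seen: set[str] = set()
    kept = []
    for step in steps:
        key = " ".join(step.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(step)
    return kept


def follows_step_format(steps: list[str]) -> bool:
    """Format conformity: every step starts with a numbered step prefix."""
    return bool(steps) and all(STEP_PATTERN.match(s) for s in steps)


def is_consistent(steps: list[str], answer: str) -> bool:
    """Crude consistency proxy: the last step must mention the final answer."""
    return bool(steps) and answer.strip().lower() in steps[-1].lower()


def validate_chain(steps: list[str], answer: str) -> Optional[list[str]]:
    """Return the cleaned chain, or None if it fails a check."""
    steps = drop_redundant_steps(steps)
    if follows_step_format(steps) and is_consistent(steps, answer):
        return steps
    return None


chain = ["Bước 1: 2 + 3 = 5", "Bước 1: 2 + 3 = 5", "Bước 2: 5 * 2 = 10"]
print(validate_chain(chain, "10"))
```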

Moreover, to promote generalization and robustness, we also include examples with multiple valid reasoning chains for the same instruction. This encourages the model to develop a flexible reasoning strategy rather than memorizing fixed templates.

By focusing on both correctness and interpretability, our approach to reasoning chain generation enables GreenMind to demonstrate superior performance in structured reasoning tasks, setting a strong foundation for Vietnamese LLMs with transparent and explainable outputs.

4 GreenMind-Medium-14B-R1

In this section, we present the base architecture, provide statistics on the Vietnamese training data, and describe the optimization strategy we used to transform the pretrained model into a Vietnamese-focused reasoning model.

Base Model. We utilize Qwen2.5-14B-Instruct Team (2024) as the base model for the fine-tuning process. Qwen2.5-14B-Instruct is a dense, decoder-only Transformer language model comprising approximately 14.7 billion parameters. The architecture features 48 layers with a hidden state dimensionality of 5,120 and incorporates SwiGLU Shazeer (2020) feed-forward blocks alongside RMSNorm Zhang and Sennrich (2019) normalization. The model employs Grouped-Query Attention (GQA) Dhingra et al. (2016) with 40 query heads and 8 key-value heads, augmented by Rotary Position Embeddings (RoPE) Su et al. (2024) combined with YaRN Peng et al. scaling to support an extended context window of up to 128,000 tokens and the generation of up to 8,192 tokens per request. Pre-training was conducted on an expansive multilingual corpus totaling 18 trillion tokens across more than 29 languages, including Vietnamese, representing a 2.5-fold increase over its previous version. Subsequently, the model underwent supervised fine-tuning on over one million high-quality instruction-response pairs, followed by staged reinforcement learning-based preference optimization. These design choices collectively enhance the model’s capacity for long-context understanding, multilingual comprehension, and instruction following, making it well-suited for complex natural language processing tasks including code generation, mathematical reasoning, and structured data interpretation. According to its technical report Team (2024), this instruct model demonstrates strong performance across a range of academic and practical benchmarks, often outperforming models of similar or even larger sizes in several key domains.
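
The attention figures quoted above imply a few derived quantities, worked out in the short Python sketch below. It assumes the per-head dimension equals the hidden size divided by the number of query heads and that keys and values are cached in a 16-bit format; both are assumptions of this sketch rather than statements from the paper.

```python
hidden_size = 5120
num_layers = 48
num_query_heads = 40
num_kv_heads = 8

head_dim = hidden_size // num_query_heads              # assumed per-head width
queries_per_kv_head = num_query_heads // num_kv_heads  # query heads sharing one KV head


def kv_cache_bytes_per_token(kv_heads: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for one token across all layers (2 bytes assumes 16-bit storage)."""
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value


gqa = kv_cache_bytes_per_token(num_kv_heads)     # cache with 8 key-value heads
mha = kv_cache_bytes_per_token(num_query_heads)  # cache if every query head had its own KV head
print(head_dim, queries_per_kv_head, gqa, mha // gqa)
```

Under these assumptions, sharing each key-value head among five query heads shrinks the key-value cache by the same factor of five, which matters most for long contexts.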

Training data. We curated a high-quality Vietnamese instruction dataset with 55,418 samples, each containing a question, a reasoning chain, and a final answer. To ensure broad generalization, instructions were drawn from diverse domains:

  • Mathematics: Mathematical problems train the model in symbolic reasoning, structured logic, and step-by-step problem solving, which are foundational for strong STEM-related performance OLMo et al. (2024).

  • Cultural: These instructions cover Vietnamese idioms, proverbs, traditional practices, historical events, and literary references. This domain strengthens the model’s ability to interpret language with deep cultural semantics and regional specificity.

  • Legal and Civic Knowledge: Focused on basic legal concepts and civic education, particularly relevant in localized Vietnamese contexts such as laws, public policy, and social norms.

  • Education and Exams: Inspired by real-world school and university-level examination formats in Vietnam, fostering academic problem-solving patterns.

Reward function 1 Format
Input: Completions $\mathcal{C}$, regex of sequence format $rg_s$, regex of answer format $rg_a$, list of candidate results $l_{ans}$, score of completion structure $score_c$, score of answering structure $score_a$, score of answering candidate structure $score_{ac}$
function FORMAT-REWARDS($\mathcal{C}$, $rg_s$, $rg_a$, $l_{ans}$, $score_c$, $score_a$, $score_{ac}$)
    $l_{scores} = []$ ▷ List of scores
    for $i = 1$ to $length(\mathcal{C})$ do
        $score \leftarrow 1.0$
        if $not\_match(\mathcal{C}_i, rg_s)$ then
            $score \leftarrow score - score_c$
        end if
        $\hat{p} \leftarrow find(\mathcal{C}_i, rg_a)$ ▷ Get predictions
        if $length(\hat{p}) == 1$ then
            if $\hat{p} \nsubseteq l_{ans}$ then
                $score \leftarrow score - score_{ac}$
            end if
        else
            $score \leftarrow score - score_a$
        end if
        Append $score$ to $l_{scores}$
    end for
    return $l_{scores}$
end function
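
Algorithm 1 can be sketched in Python as follows. The two regular expressions and the default penalty values (0.4, 0.6, 0.2) are assumptions chosen for illustration; the paper does not state the exact patterns or values it uses.

```python
import re

# Assumed patterns: one <think> block followed by one <answer> block.
THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_rewards(completions, candidates, score_c=0.4, score_a=0.6,
                   score_ac=0.2, rg_s=THINK_ANSWER, rg_a=ANSWER):
    """Start from 1.0 and subtract a penalty for each structural violation."""
    scores = []
    for completion in completions:
        score = 1.0
        # Penalize completions that break the overall <think>/<answer> layout.
        if not rg_s.match(completion.strip()):
            score -= score_c
        predictions = [p.strip() for p in rg_a.findall(completion)]
        if len(predictions) == 1:
            # Exactly one answer block, but its content is not a valid candidate.
            if predictions[0] not in candidates:
                score -= score_ac
        else:
            # Missing or repeated answer block.
            score -= score_a
        scores.append(score)
    return scores


completions = [
    "<think>2 + 2 = 4</think><answer>B</answer>",  # well formed
    "<think>x</think><answer>Z</answer>",          # answer outside the candidates
    "B",                                           # no tags at all
]
print(format_rewards(completions, candidates=["A", "B", "C", "D"]))
```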

Optimization with reward functions. We fine-tune the model to focus on tasks that require generating concise answers, which involve a step-by-step reasoning process. Following DeepSeek-R1 Guo et al. (2025), we design two fundamental reward functions.

  • Format rewards: Our objective is to ensure that reasoning chains are enclosed within the <think>...</think> tags, and that the final answer is enclosed within the <answer>...</answer> tags. Among these, we place greater emphasis on the structure of the final answer, as it remains the ultimate goal to be achieved. Details on how the format reward is computed are described in Algorithm 1. To enable smooth reward assignment, we recommend that:

    $$\begin{cases} 0.0 < \text{score}_{ac} < \text{score}_{a},\ \text{score}_{c} < 1.0 \\ \text{score}_{c} + \text{score}_{a} = 1.0 \end{cases}$$
  • Accuracy rewards: This function assigns scores to the generated answers. A prerequisite is that the prediction can be extracted from the completion based on the expected answering structure. For multiple-choice tasks, we still encourage assigning a partial score when the prediction is wrong but is one of the candidates, e.g. $[A, B, C, D, E]$, to help the model capture the scope of possible outcomes (see Algorithm 2).
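
A minimal Python sketch of the accuracy reward follows. The answer regex, the partial score of 0.2, and the choice to pass one ground truth per completion are assumptions of this sketch and are not values reported in the paper.

```python
import re

ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)  # assumed answer pattern


def answering_rewards(completions, candidates, ground_truths,
                      score_ac=0.2, rg_a=ANSWER):
    """1.0 for a correct answer, score_ac for a wrong but valid candidate, else 0.0."""
    scores = []
    for completion, gt in zip(completions, ground_truths):
        predictions = [p.strip() for p in rg_a.findall(completion)]
        score = 0.0
        if len(predictions) == 1:  # exactly one extractable answer
            if predictions[0] == gt:
                score = 1.0
            elif predictions[0] in candidates:
                score = score_ac   # partial credit for multiple-choice tasks
        scores.append(score)
    return scores


completions = [
    "<think>...</think><answer>B</answer>",   # correct
    "<think>...</think><answer>C</answer>",   # wrong, but a valid option
    "<answer>A</answer><answer>B</answer>",   # two answers, so no credit
    "<think>...</think><answer>42</answer>",  # not a valid option
]
print(answering_rewards(completions, ["A", "B", "C", "D"], ["B"] * 4))
```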

To the best of our knowledge, the papers and technical reports by the authors of the Qwen model family contain no specific statistics on the language distribution of the training data. Therefore, we identify potential biases toward specific languages through empirical observations. The results for the base model indicate that the sampling process can effectively handle the Vietnamese language, although Chinese characters occasionally appear. Additionally, the reasoning content should be tightly controlled and remain within the scope of the original topic to ensure alignment with the final answer or the list of possible answers. In summary, we propose two additional reward functions to address the above challenges:

  • Language rewards: We define a banned character list to penalize undesired language usage. Our goal is to guide the model to generate content in a single language that matches the input query. Therefore, a completion receives a reward if and only if it does not contain any characters from the banned list (see Algorithm 3).

  • Semantic similarity rewards: Based on our proposed Vietnamese reasoning dataset, we measure the closeness of each completion to its reference reasoning using a Sentence Transformers-based model. The selected model should be validated to ensure good performance in the target monolingual setting. With Algorithm 4, the cosine score is preserved if it reaches a predefined threshold; otherwise, it is set to zero to mitigate the risk of hallucination.

Reward function 2 Answering
Input: Completions $\mathcal{C}$, regex of answering format $rg_a$, list of candidate results $l_{ans}$, score of answering candidates $score_{ac}$, ground truth $gt$
function ANSWERING-REWARDS($\mathcal{C}$, $rg_a$, $l_{ans}$, $score_{ac}$, $gt$)
    $l_{scores} = []$ ▷ List of scores
    for $i = 1$ to $length(\mathcal{C})$ do
        $score \leftarrow 0.0$
        $\hat{p} \leftarrow find(\mathcal{C}_i, rg_a)$ ▷ Get predictions
        if $length(\hat{p}) == 1$ then
            if $\hat{p} == gt$ then
                $score \leftarrow 1$
            else if $\hat{p} \subset l_{ans}$ then
                $score \leftarrow score_{ac}$ ▷ For multiple-choice tasks
            else
                $score \leftarrow 0$
            end if
        else
            $score \leftarrow 0$
        end if
        Append $score$ to $l_{scores}$
    end for
    return $l_{scores}$
end function
Reward function 3 Language
Input: Completions $\mathcal{C}$, list of banned letters $l_{bl}$
function LANGUAGE-REWARDS($\mathcal{C}$, $l_{bl}$)
    $l_{scores} = []$ ▷ List of scores
    for $i = 1$ to $length(\mathcal{C})$ do
        $score \leftarrow 1.0$
        if $exist(c \in \mathcal{C}_i, l_{bl})$ then
            $score \leftarrow 0.0$
        end if
        Append $score$ to $l_{scores}$
    end for
    return $l_{scores}$
end function
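
The language reward is simple enough to sketch directly in Python. The banned list is approximated here by the CJK Unified Ideographs block, which is an assumption of this sketch: this section does not list the characters actually banned during training.

```python
# Assumption: approximate the banned letters with the CJK Unified Ideographs
# block (U+4E00 to U+9FFF); the actual banned list is not given in this section.
BANNED_LETTERS = frozenset(chr(cp) for cp in range(0x4E00, 0xA000))


def language_rewards(completions, banned=BANNED_LETTERS):
    """Return 1.0 for completions free of banned characters, 0.0 otherwise."""
    scores = []
    for completion in completions:
        score = 0.0 if any(ch in banned for ch in completion) else 1.0
        scores.append(score)
    return scores


print(language_rewards(["Đáp án là B.", "Đáp án 是 B."]))
```

Because the check is all-or-nothing, a single stray character removes the whole reward, which matches the "if and only if" wording of the language reward description.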

5 Experiment

5.1 Implementation Details

We fine-tune all parameters of the model (full fine-tuning), allowing it to fully adapt to the downstream task and leverage the capacity of all layers for improved generalization. We use the DeepSpeed framework Rajbhandari et al. (2020) to enable efficient large-scale model training by reducing the memory footprint and accelerating training throughput. Specifically, we leverage ZeRO Stage 3 to partition optimizer states, gradients, and parameters across GPUs, which allows us to train models that would otherwise exceed device memory limitations. In addition, mixed-precision training further improves computational efficiency without sacrificing model accuracy. The hyperparameters used during fine-tuning are detailed in Table 1.
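
For orientation, a DeepSpeed configuration consistent with this description might look like the Python dictionary below. The batch values come from Table 1 and the seven training GPUs reported in this section; the choice of bf16 is an assumption, since the text only says "mixed-precision", and the authors' actual configuration file is not given.

```python
import json

num_gpus = 7             # GPUs used for fine-tuning, as reported in this section
micro_batch_per_gpu = 1  # per_device_train_batch_size in Table 1
grad_accum_steps = 8     # gradient_accumulation_steps in Table 1

ds_config = {
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": grad_accum_steps,
    # DeepSpeed expects this to equal micro batch * accumulation steps * GPUs.
    "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
    # Stage 3 partitions optimizer states, gradients, and parameters.
    "zero_optimization": {"stage": 3},
    # Assumption: bf16 as the mixed-precision format.
    "bf16": {"enabled": True},
}
print(json.dumps(ds_config, indent=2))
```

With these values, each optimizer step aggregates gradients from 56 sequences.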

Reward function 4: Semantic Similarity of Reasoning Content
Input: completions $\mathcal{C}$, Sentence Transformers model $ST\_model$, reasoning data $rs$, similarity threshold $\xi$
function SS-REWARDS($\mathcal{C}$, $ST\_model$, $rs$, $\xi$)
    $l_{scores} = []$  $\triangleright$ List of scores
    for $i = 1$ to $length(\mathcal{C})$ do
        $score \leftarrow ST\_model(\mathcal{C}_i, rs_i)$
        if $score < \xi$ then
            $score \leftarrow 0.0$
        end if
        Append $score$ to $l_{scores}$
    end for
    return $l_{scores}$
end function
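The reward function above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: `bow_cosine` is a hypothetical bag-of-words stand-in for the Sentence Transformers similarity score, used only so that the example runs without external dependencies, and the function and parameter names are assumptions.

```python
import math
from collections import Counter
from typing import Callable, List


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for ST_model: cosine similarity of bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ss_rewards(
    completions: List[str],
    references: List[str],
    threshold: float,
    similarity: Callable[[str, str], float] = bow_cosine,
) -> List[float]:
    """Score each completion against its reference reasoning.

    Scores below `threshold` are set to 0.0, as in the pseudocode.
    """
    scores = []
    for completion, reference in zip(completions, references):
        score = similarity(completion, reference)
        if score < threshold:
            score = 0.0
        scores.append(score)
    return scores


rewards = ss_rewards(
    ["the cube has volume 125", "storms bring floods"],
    ["the cube has volume 125", "earthquakes are unrelated"],
    threshold=0.5,
)
print(rewards)
```

In practice, `similarity` would be replaced by a function that embeds both texts with a Sentence Transformers model and returns their cosine similarity.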

We use seven GPUs for model fine-tuning and one GPU for inference with the vLLM framework Kwon et al. (2023). Our objective is to produce completions that are creative while mitigating hallucinations. The inference hyperparameters are listed in Table 2. All experiments were conducted on 8 H100 GPUs.

Table 1: Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| epochs | 4 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 8 |
| gradient_checkpointing | true |
| learning_rate | 5.0e-7 |
| lr_scheduler_type | cosine |
| warmup_ratio | 0.03 |
| beta | 0.001 |
| max_prompt_length | 256 |
| max_completion_length | 1024 |
| num_generations | 4 |
| use_vllm | true |
| vllm_gpu_memory_utilization | 0.9 |

Table 2: vLLM Inference Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Repetition Penalty | 1.2 |
| Temperature | 0.6 |
| Top-p (nucleus) | 0.8 |
| Top-k | 4 |

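To show how the four inference hyperparameters in Table 2 interact, the sketch below applies them in sequence to a toy logit vector. It is a simplified illustration under stated assumptions, not vLLM's implementation: the repetition penalty follows the common convention of dividing positive logits and multiplying negative logits of already generated tokens, and the function name and toy logits are hypothetical.

```python
import math
from typing import Dict, List


def sampling_distribution(
    logits: List[float],
    generated: List[int],
    temperature: float = 0.6,
    top_p: float = 0.8,
    top_k: int = 4,
    repetition_penalty: float = 1.2,
) -> Dict[int, float]:
    """Return the renormalized next-token distribution after all filters."""
    # 1) Repetition penalty: make already generated tokens less likely.
    adjusted = list(logits)
    for tok in set(generated):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # 2) Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in adjusted]

    # 3) Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = scaled[order[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # 4) Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = sampling_distribution([2.0, 1.0, 0.5, 0.1, -1.0, -3.0], generated=[0])
print(dist)
```

With a small top-k and a temperature below 1, only a handful of high-probability tokens remain candidates at each step, which is one way to limit hallucinated continuations while still allowing some variation.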
Table 3: SeaExam performance compared to SOTA models

| Model | SeaExam-ID | SeaExam-TH | SeaExam-VI | Avg |
| --- | --- | --- | --- | --- |
| Meta-Llama-3.1-70B-Instruct | 65.8 | 70.6 | 72.6 | 69.7 |
| gemma3-27b-it | 64.4 | 67.5 | 73.1 | 68.4 |
| Qwen2.5-14B-Instruct | 67.6 | 68.8 | 73.1 | 69.8 |
| GreenMind-Medium-14B-R1 | 74.36 | 69.75 | 74.44 | 72.79 |

Table 4: VMLU performance compared to fine-tuned models

| Model | Access | STEM | Social Science | Humanities | Others | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| VNPTAI.IO-Medium-R1 | Private | 77.09 | 82.3 | 78.85 | 69.98 | 77.43 |
| MISA-Llama3-v1.1 | Private | 77.5 | 80.75 | 76.62 | 71.6 | 76.87 |
| BnK-AI-Medium-v2 | Private | 80.94 | 80.76 | 70.7 | 74.06 | 76.66 |
| VNPTAI.IO-Large-v4 | Private | 78.05 | 79.05 | 75.39 | 70.37 | 76.21 |
| GreenNode-xMedium-v1 | Private | 75.7 | 81.09 | 75.25 | 69.33 | 75.5 |
| GreenMind-Medium-14B-R1 (Ours) | Weight | 76.78 | 77.36 | 72.32 | 69.03 | 74.29 |
| CakebyVPBank-Large | Private | 77.75 | 78.11 | 70.38 | 67.82 | 73.99 |
| DeepSeek-R1-Distill-Llama-70B | Weight | 76.77 | 76.23 | 67.98 | 66.82 | 72.41 |

Table 5: VLSP 2023 Challenge results. Our model outperforms most SOTA models.

| Model | ComprehensionQA-vi | Exams-vi | LAMBADA-vi | WikiQA-vi | MMLU-vi |
| --- | --- | --- | --- | --- | --- |
| cpt-smartbot-13b | 0.6633 | 0.3473 | 21.9864 | 0.4455 | 0.414 |
| ura-llama-13b | 0.6556 | 0.342 | 17.5614 | 0.438 | 0.3973 |
| greennode-7b (prior work) | 0.6122 | 0.2892 | 189.7782 | 0.3335 | 0.387 |
| greennode-14b (prior work) | 0.6711 | 0.3672 | 29.5967 | 0.468 | 0.5281 |
| GreenMind-Medium-14B-R1 (ours) | 0.8689 | 0.7796 | 10.7609 | 0.7915 | 0.7124 |

5.2 Experimental Results

Fine-tuning results. We present a basic analysis after fine-tuning for approximately 4 epochs, as reported in Figure 2. The GRPO loss starts at 0 and gradually increases. The reason is that the Kullback-Leibler divergence term is zero while $\pi_{\theta}$ and $\pi_{ref}$ coincide at the start of training, and it grows as the two distributions become more different.
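The behaviour of the KL term can be illustrated numerically. The sketch below assumes the per-token estimator commonly paired with GRPO, $\frac{\pi_{ref}}{\pi_{\theta}} - \log\frac{\pi_{ref}}{\pi_{\theta}} - 1$, as given in the DeepSeekMath paper of Shao et al.; this section does not state which estimator was used, so that choice is an assumption. The value of `beta` is taken from Table 1, and the log-probabilities are made-up toy numbers.

```python
import math

BETA = 0.001  # KL coefficient "beta" from Table 1


def kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token KL estimator: r - log(r) - 1 with r = pi_ref / pi_theta.

    It is 0 when both policies assign the same probability and
    strictly positive otherwise.
    """
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


# Toy example: the policy drifts further from the reference over training.
logp_ref = -2.0
for logp_theta in (-2.0, -1.5, -1.0, -0.5):
    penalty = BETA * kl_estimate(logp_theta, logp_ref)
    print(f"logp_theta={logp_theta:+.1f}  beta*KL={penalty:.6f}")
```

At the first step the policy equals the reference and the penalty is exactly zero; as the policy's log-probability moves away from the reference value, the penalty increases, which is consistent with a loss curve that starts at 0 and rises.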

Quantitative evaluation. Experimental results on the SeaExam multiple-choice dataset Li et al. (2024) show that our reasoning model outperforms the baseline models on most Southeast Asian languages (Indonesian and Vietnamese, though not Thai) and on the overall average, even though some baselines have significantly more parameters, up to 70 billion (see Table 3). Notably, these baseline models were evaluated under few-shot prompting settings. On the VLSP 2023 Challenge dataset Le et al. (2024), our model achieves superior performance over all previously reported models (see Table 5). Among these, greennode-7b and greennode-14b were trained using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) Rafailov et al. (2023), respectively. On the VMLU dataset (https://vmlu.ai), the only other model with accessible weights at the time of writing is DeepSeek-R1-Distill-Llama-70B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B); see Table 4. Results indicate that our model performs slightly better than this model across all four topic groups, including both STEM and social science.
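The per-language comparison on SeaExam can be checked mechanically. The sketch below copies the scores from Table 3 and reports the top-scoring model in each column; the dictionary layout and variable names are illustrative choices, not part of the evaluation pipeline.

```python
# Scores copied from Table 3 (SeaExam).
seaexam = {
    "Meta-Llama-3.1-70B-Instruct": {"SeaExam-ID": 65.8, "SeaExam-TH": 70.6, "SeaExam-VI": 72.6, "Avg": 69.7},
    "gemma3-27b-it": {"SeaExam-ID": 64.4, "SeaExam-TH": 67.5, "SeaExam-VI": 73.1, "Avg": 68.4},
    "Qwen2.5-14B-Instruct": {"SeaExam-ID": 67.6, "SeaExam-TH": 68.8, "SeaExam-VI": 73.1, "Avg": 69.8},
    "GreenMind-Medium-14B-R1": {"SeaExam-ID": 74.36, "SeaExam-TH": 69.75, "SeaExam-VI": 74.44, "Avg": 72.79},
}

columns = ["SeaExam-ID", "SeaExam-TH", "SeaExam-VI", "Avg"]

# For each column, pick the model with the highest score.
best = {col: max(seaexam, key=lambda model: seaexam[model][col]) for col in columns}

for col in columns:
    print(f"{col}: {best[col]} ({seaexam[best[col]][col]})")
```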

Qualitative results. The qualitative results are presented in Table 6. They show that the model's responses follow the required structure for the reasoning and the final answer. We showcase reasoning chains across various topics, ranging from natural sciences to social sciences. The examples illustrate that the model performs a sequence of logical inferences before arriving at the final answer.

Figure 2: Training Loss.
Task: Mathematics
Prompt: Hình lập phương có cạnh là 5 m. Vậy thể tích hình đó là: A. $150m^{3}$ B. $25m^{3}$ C. $100m^{3}$ D. $125m^{3}$
Answer: <think>Đầu tiên, chúng ta cần nhớ công thức tính thể tích của hình lập phương: Thể tích = a³ (trong đó a là độ dài cạnh của hình lập phương) Trong trường hợp này, cạnh của hình lập phương là 5m. Tiếp theo, áp dụng công thức vào dữ liệu cho sẵn: V = 5³ = 5 × 5 × 5 = 125 m³ Cuối cùng, so sánh kết quả với các lựa chọn được cung cấp: - A. 150 m³ - B. 25 m³ - C. 100 m³ - D. 125 m³ Kết quả đúng nhất chính là 125 m³.</think> <answer>D</answer>
Task: Natural Science
Prompt: Các cơ quan sinh dục cái và cơ quan sinh dục đực của thực vật có hoa là: A. Đài hoa và nhụy hoa. B. Cánh hoa và bao phấn. C. Nhụy và nhị. D. Cả 3 ý trên.
Answer: <think>Đầu tiên, chúng ta cần xác định: - Cơ quan sinh dục cái (nữ) ở thực vật có hoa là gì? - Cơ quan sinh dục đực (nam) ở thực vật có hoa là gì? 1. Cơ quan sinh dục cái thường gồm phần noãn hay kén noãn nằm bên dưới. 2. Cơ quan sinh dục đực thường gồm nhị hoa sản xuất tinh trùng dạng phấn hoa. Tiếp theo, phân tích mỗi phương án: A - Đài hoa và nhụy hoa không chính xác vì đây chỉ là bộ phận bảo vệ hoặc phụ trợ. B - Cánh hoa và bao phấn cũng tương tự như trên. C - Nhụy và nhị đúng với mô tả phía trên. D - Tất cả ba đều sai khi xét đến vị trí cụ thể của cơ quan sinh dục. Vậy rõ ràng nhất từ góc nhìn cấu trúc chức năng thì C sẽ phù hợp.</think> <answer>C</answer>
Task: Social Science
Prompt: Chống bão ở nước ta phải luôn kết hợp với chống A. hạn hán. B. động đất. C. ngập lụt. D. sương muối.
Answer: <think>Đầu tiên, chúng cần xác định vấn đề chính của câu hỏi: “chống bão”. Tiếp theo, phân tích mỗi phương án: - Phương án A: Hạn hán - Đây không liên quan trực tiếp đến hậu quả thường gặp khi có bão. - Phương án B: Động đất - Cũng không liên quan nhiều tới tình hình thời tiết do bão gây nên. - Phương án C: Ngập lụt - Khi xảy ra bão, mưa lớn dẫn đến nguy cơ ngập lụt rất cao. - Phương án D: Sương muối - Thường xuất hiện vào mùa khô hoặc trên vùng biển nhất định, ít liên quan đến bão. Từ đó, rõ ràng phương án phù hợp nhất sẽ là sự kết hợp giữa chống bão và chống ngập lụt vì hai yếu tố này đi cùng nhau trong thực tế.</think> <answer>C</answer>
Table 6: Qualitative results on SeaExam with GreenMind-Medium-14B-R1.
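The responses in Table 6 follow a fixed layout: the reasoning is wrapped in `<think>` tags and the final choice in `<answer>` tags. A minimal sketch of how such a completion could be split into its two parts is shown below; the regular expression and the function name are illustrative assumptions, not the authors' evaluation code.

```python
import re
from typing import Optional, Tuple

# Reasoning inside <think>...</think>, followed by <answer>...</answer>.
COMPLETION_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_completion(text: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer), or None if the format is not respected."""
    match = COMPLETION_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


sample = "<think>V = 5³ = 5 × 5 × 5 = 125 m³</think> <answer>D</answer>"
parsed = parse_completion(sample)
print(parsed)
```

Returning `None` for malformed completions makes it easy to count format violations separately from wrong answers when scoring multiple-choice benchmarks.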

6 Conclusion

We release GreenMind-Medium-14B-R1, a medium-sized Vietnamese language model capable of effectively addressing questions that require intermediate-level reasoning, such as general knowledge and social science topics. By leveraging the GRPO strategy for fine-tuning, we guide the model to generate logically coherent responses. This approach aims to provide users with informative answers and intuitive explanations, which are valuable not only for end users but also for further research on improving data quality and sampling techniques.

Acknowledgments

We sincerely express our deep appreciation to GreenNode.ai (https://greennode.ai/), our affiliated organization, for their unwavering support throughout the course of this research. GreenNode.ai has played a pivotal role by providing essential resources, most notably access to high-performance H100 GPUs, which significantly accelerated the fine-tuning process of our models. This generous support was instrumental in the successful development of a Vietnamese reasoning language model.

References

  • Dhingra et al. (2016) Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.
  • Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • Le et al. (2024) Hoang-Quynh Le, Duy-Cat Can, Khanh-Vinh Nguyen, and Mai-Vu Tran. 2024. Overview of the vlsp 2023 – comom shared task: A data challenge for comparative opinion mining from vietnamese product reviews. arXiv preprint arXiv:2402.13613.
  • Li et al. (2024) Yixuan Li, Xu Tan, Yichong Wang, Zihan Zhang, Longyue Wang, Shuo Wang, Xiaohua Liu, Rui Wang, Jingjing Liu, and Tie-Yan Liu. 2024. Seaexam: Benchmarking large language models for southeast asian languages with human exam questions. arXiv preprint arXiv:2404.11086.
  • Nguyen et al. (2023) Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Nhung Nguyen, Thien Huu Nguyen, Dinh Phung, and Hung Bui. 2023. PhoGPT: Generative Pre-training for Vietnamese. arXiv preprint, arXiv:2311.02945.
  • OLMo et al. (2024) Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi. 2024. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  • Peng et al. (2023) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.
  • Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  • Scao et al. (2022) Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, et al. 2022. What language model to train if you have one million gpu hours? arXiv preprint arXiv:2210.15424.
  • Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
  • Shazeer (2020) Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202.
  • Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  • Team (2024) Qwen Team. 2024. Qwen2.5: A party of foundation models.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022a. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
  • Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.