
Tensorflow-seq2seq from scratch: sample model code

ZIP file | 211 KB | Updated 2024-11-28

This repository contains not only a basic seq2seq model but may also include more advanced variants, such as the attention mechanism and applications of long short-term memory (LSTM) networks. Through this project, users can learn, in a Jupyter Notebook environment, how to implement sequence transduction tasks for natural language processing (NLP) and machine translation.

The project's code likely covers the following points:

1. seq2seq model basics: the model consists of an encoder and a decoder. The encoder converts the input sequence into an internal state, and the decoder uses that state to generate the output sequence. This is especially useful in machine translation: the encoder reads a source-language sentence and produces a vector representing its meaning, and the decoder turns that vector into a sentence in the target language.

2. Attention mechanism: attention lets the model focus on different parts of the input sequence at each step of generating the output sequence. It is an improved seq2seq structure that can raise translation quality on long sentences.

3. LSTM networks: long short-term memory is a special kind of recurrent neural network (RNN) that can learn long-range dependencies. In a seq2seq model, LSTMs handle sequence data effectively and avoid the vanishing- and exploding-gradient problems that plain RNNs suffer from on long-range dependencies.

4. Jupyter Notebook examples: Jupyter Notebook is an open-source web application for creating and sharing documents that combine live code, equations, visualizations, and text. The project likely provides a series of Jupyter Notebook files so that users can step through how the seq2seq model is built and applied.

5. Built with Tensorflow: Tensorflow is Google's open-source machine-learning framework; it supports multiple languages and is a popular choice for implementing deep-learning models. The project builds the seq2seq model from scratch with Tensorflow, so learners can study how the model works directly from the basic code.

For students or researchers who want to dig deeper into NLP and machine translation, this repository is a valuable resource. By analyzing and running the example code, users can gain a more thorough understanding of how seq2seq models work and pick up the skills needed to implement complex models with the Tensorflow framework.

The file name "Tensorflow-seq2seq-from-scratch-master" indicates that this is the master version of the project, meaning it likely contains all of the core code and documentation and may include examples in multiple languages. The master branch is usually the most stable and complete version, suitable for users who are new to the project or who need a systematic overview of it.
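The repository's own notebooks are not reproduced here, but the encoder-decoder structure described in points 1 and 3 can be sketched in a few lines of TensorFlow/Keras. This is a minimal illustrative sketch, not the project's code: the vocabulary sizes, embedding and state dimensions, and layer names below are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative hyperparameters only; the repository's real values are unknown.
src_vocab_size = 8000
tgt_vocab_size = 8000
embed_dim = 256
latent_dim = 512

# Encoder: embed the source tokens and keep only the final LSTM states,
# which serve as the "internal state" handed to the decoder.
encoder_inputs = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab_size, embed_dim, mask_zero=True)(encoder_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: initialized with the encoder states, it predicts the next target
# token at every position (teacher forcing during training).
decoder_inputs = layers.Input(shape=(None,), name="target_tokens_in")
dec_emb = layers.Embedding(tgt_vocab_size, embed_dim, mask_zero=True)(decoder_inputs)
dec_outputs, _, _ = layers.LSTM(
    latent_dim, return_sequences=True, return_state=True
)(dec_emb, initial_state=encoder_states)
probs = layers.Dense(tgt_vocab_size, activation="softmax")(dec_outputs)

model = Model([encoder_inputs, decoder_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

During training one would feed the target sequence shifted right as the decoder input and the unshifted sequence as the label, e.g. model.fit([source_ids, target_ids[:, :-1]], target_ids[:, 1:]). An attention variant (point 2) would additionally pass the encoder's full output sequence to the decoder, for example through tf.keras.layers.AdditiveAttention, before the final projection.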
