High-Quality Chinese Pre-trained Models in Python (CSDN download)

211 files in total

py: 123

sh: 52

gitignore: 15

Uploaded 2022-03-20 20:53:13 · 938KB RAR

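In natural language processing (NLP), pre-trained models have become a core technology, and this holds for Chinese text as well. High-quality Chinese pre-trained models with Python implementations give developers and researchers a strong starting point for many tasks. These models are typically trained on large unlabeled corpora to learn rich language representations, and are then fine-tuned on downstream tasks such as sentiment analysis, question answering, and machine translation.

**Advantages of Python as the development language**

Python is the most widely used language in NLP because it is readable, has a rich library ecosystem, and has an active community. Deep learning frameworks such as `TensorFlow`, `PyTorch`, and `Keras` make it straightforward to build and train pre-trained models, and the surrounding ecosystem makes it easy to integrate, optimize, and iterate on them.

**Large and small models**

Large models such as BERT, the GPT series, ERNIE, and T5 have on the order of hundreds of millions to billions of parameters and perform well across a wide range of tasks. They use self-attention to capture context, but they need substantial compute, so they are best suited to server-side and high-performance environments. Small models such as DistilBERT, MobileBERT, and MiniLM are compressed versions of large models. They give up some accuracy in exchange for faster inference and lower memory use, which makes them a better fit for mobile and other resource-constrained settings.

**Models for similarity and sentence-pair tasks**

For tasks such as semantic similarity and question-answer matching, there are models trained specifically to produce comparable sentence representations, such as SimCSE and sentence-embedding models built on MPNet. Because they are optimized for comparing sentences, they tend to capture semantic relations more precisely and improve accuracy on these tasks.

**Applications of pre-trained models**

1. **Text classification**: extract text features with a pre-trained model and feed them to a classifier, for example for sentiment analysis or news classification.
2. **Named entity recognition (NER)**: identify entities in text, such as person names, locations, and times.
3. **Machine translation**: use a pre-trained model in an encoder-decoder architecture to translate between languages.
4. **Question answering**: have the model interpret the question, retrieve relevant information, and produce an answer.
5. **Dialogue generation**: train the model to hold natural, fluent conversations.
6. **Text generation**: for example summarization and article writing.

**Fine-tuning and deployment**

A pre-trained model usually needs to be fine-tuned on labeled data for the target task before it fits a specific application. After fine-tuning, the model can be serialized to disk and loaded in production. It can also be wrapped in a model-serving framework and exposed as an API for other applications to call.

In summary, this collection of Chinese pre-trained models with Python implementations offers practical tools for a variety of NLP tasks. Developers can pick a model that matches their requirements and use the Python ecosystem to train, fine-tune, and deploy it.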
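
As a small illustration of the similarity and sentence-pair tasks mentioned above, the sketch below ranks candidate sentences by cosine similarity to a query. It is only a sketch under stated assumptions: the 4-dimensional vectors are made-up numbers standing in for embeddings that a sentence-embedding model would produce, and the names `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of any library in this package.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, named_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in named_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sentence embeddings (made-up numbers, for illustration only).
query = [0.2, 0.8, 0.1, 0.4]
candidates = {
    "paraphrase": [0.25, 0.75, 0.05, 0.45],
    "unrelated": [0.9, -0.1, 0.6, -0.3],
}

ranking = rank_by_similarity(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real pipeline the vectors would come from a model's encoder output rather than being written by hand; the ranking step itself stays the same.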

High-quality Chinese pre-trained models in Python (211 files)

.gitignore 1KB

predicting_movie_reviews_with_bert_on_tf_hub.ipynb 65KB

LICENSE 11KB

multilingual.md 11KB

README.md 4KB

README.md 2KB

CONTRIBUTING.md 1KB

RoBERTa_zh_Large_Learning_Curve.png 191KB

corpus.png 69KB

put_data_here 0B

zh_wiki.py 140KB

modeling_xlnet.py 71KB

modeling_bert.py 58KB

pytorch_modeling.py 57KB

modeling_albert.py 54KB

tokenization_utils.py 54KB

run_squad.py 45KB

modeling_xlm.py 44KB

modeling_utils.py 42KB

run_classifier.py 42KB

run_classifier.py 41KB

modeling_transfo_xl.py 39KB

modeling.py 37KB

tokenization_xlm.py 36KB

modeling_auto.py 36KB

run_classifier.py 35KB

modeling_distilbert.py 34KB

run_c3.py 34KB

run_ner.py 33KB

modeling_gpt2.py 32KB

run_classifier.py 31KB

classifier_utils.py 30KB

modeling_openai.py 30KB

create_pretraining_data.py 25KB

modeling_roberta.py 25KB

modeling_ctrl.py 23KB

google_albert_pytorch_modeling.py 22KB

tokenization_bert.py 22KB

tokenization_transfo_xl.py 21KB

cmrc2018_output.py 19KB

DRCD_output.py 19KB

run_pretraining.py 19KB

clue.py 18KB

run_pretraining.py 18KB

create_pretraining_data.py 16KB

run_multichoice_mrc.py 16KB

create_pretraining_data.py 16KB

cmrc2018_preprocess.py 15KB

CHID_preprocess.py 15KB

DRCD_preprocess.py 14KB

official_tokenization.py 14KB

extract_features.py 14KB

run_mrc.py 13KB

modeling_transfo_xl_utilities.py 13KB

tokenization.py 13KB

tokenization.py 12KB

common.py 12KB

file_utils.py 11KB

run_classifier_with_tfhub.py 11KB

configuration_utils.py 11KB

tokenization_xlnet.py 10KB

conlleval.py 10KB

# CLUE_pytorch

Chinese Language Understanding Evaluation benchmark (CLUE)

**Note**: This is a personal development version (it currently supports all classification tasks). For the official version, see https://github.com/CLUEbenchmark/CLUE

## Code directory layout

```text
├── CLUEdatasets                 # datasets
|  └── tnews
|  └── wsc
|  └── ...
├── metrics                      # metric computation
|  └── clue_compute_metrics.py
├── outputs                      # saved model outputs
|  └── tnews_output
|  └── wsc_output
|  └── ...
├── prev_trained_model           # pre-trained models
|  └── albert_base
|  └── bert-wwm
|  └── ...
├── processors                   # data processing
|  └── clue.py
|  └── ...
├── tools                        # general-purpose scripts
|  └── progressbar.py
|  └── ...
├── transformers                 # model implementations
|  └── modeling_albert.py
|  └── modeling_bert.py
|  └── ...
├── convert_albert_original_tf_checkpoint_to_pytorch.py  # model file conversion
├── run_classifier.py            # main program
├── run_classifier_tnews.sh      # task run script
├── download_clue_data.py        # dataset download
```

### Dependencies

- pytorch=1.1.0
- boto3=1.9
- regex
- sacremoses
- sentencepiece
- python3.7+

### How to run

**1. Download the CLUE datasets by running:**

```shell
python download_clue_data.py --data_dir=./CLUEdatasets --tasks=all
```

The command above downloads all CLUE datasets. You can also pass `--tasks` to download only the datasets for specific tasks. Data is saved under `./CLUEdatasets/{task}` by default.

**2. If you downloaded TensorFlow model weights, run the conversion script (skip this step if you downloaded PyTorch weights). For example, to convert `albert_base_tf`:**

```shell
python convert_albert_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=./prev_trained_model/albert_base_tf \
    --bert_config_file=./prev_trained_model/albert_base_tf/albert_config_base.json \
    --pytorch_dump_path=./prev_trained_model/albert_base/pytorch_model.bin
```

**Note**: After converting a model (or downloading PyTorch weights directly), the model folder must also contain the `config.json` and `vocab.txt` files, for example:

```text
├── prev_trained_model           # pre-trained models
|  └── bert-base
|  |  └── vocab.txt
|  |  └── config.json
|  |  └── pytorch_model.bin
```

**3. Run the sh script for the task directly, for example:**

```shell
sh run_classifier_tnews.sh
```

**4. Evaluation**

By default the last checkpoint is used for evaluation. You can also pass `--predict_checkpoints` to evaluate a specific checkpoint, for example:

```shell
CURRENT_DIR=`pwd`
export BERT_BASE_DIR=$CURRENT_DIR/prev_trained_model/bert-base
export GLUE_DIR=$CURRENT_DIR/CLUEdatasets
export OUTPUT_DIR=$CURRENT_DIR/outputs
TASK_NAME="copa"

python run_classifier.py \
  --model_type=bert \
  --model_name_or_path=$BERT_BASE_DIR \
  --task_name=$TASK_NAME \
  --do_predict \
  --predict_checkpoints=100 \
  --do_lower_case \
  --data_dir=$GLUE_DIR/${TASK_NAME}/ \
  --max_seq_length=128 \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --learning_rate=1e-5 \
  --num_train_epochs=2.0 \
  --logging_steps=50 \
  --save_steps=50 \
  --output_dir=$OUTPUT_DIR/${TASK_NAME}_output/ \
  --overwrite_output_dir \
  --seed=42
```

### Model list

```python
# Excerpt from run_classifier.py: maps --model_type to (config, model, tokenizer) classes.
MODEL_CLASSES = {
    # bert, ernie, bert_wwm, bert_wwm_ext
    'bert': (BertConfig, BertForSequenceClassification, BertTokenizer),
    # xlnet_base, xlnet_mid, xlnet_large
    'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer),
    # roberta_base, roberta_wwm, roberta_wwm_ext, roberta_wwm_large_ext
    'roberta': (BertConfig, BertForSequenceClassification, BertTokenizer),
    # albert_tiny, albert_base, albert_large, albert_xlarge
    'albert': (BertConfig, AlbertForSequenceClassification, BertTokenizer)
}
```

**Note**: Models such as bert, ernie, bert_wwm, and bert_wwm_ext differ only in their weights and share the same architecture, so they all use `model_type=bert`. The same applies to the other groups.

### Results

With the parameters provided by https://github.com/CLUEbenchmark/CLUE, the results are broadly consistent with the official ones on every task except **COPA**, which could not be reproduced.
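
The note in step 2 says a model folder must contain `config.json`, `vocab.txt`, and `pytorch_model.bin` before a task script is run. The sketch below is one way to check that up front. It is not part of the repository: the helper name `missing_model_files` is hypothetical, and the example path simply reuses the `bert-base` folder from the README.

```python
from pathlib import Path

# File names taken from the README's note on model folders.
REQUIRED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Example path from the README; adjust it to the model you downloaded.
    missing = missing_model_files("./prev_trained_model/bert-base")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Model directory looks complete.")
```

Running a check like this before `sh run_classifier_tnews.sh` surfaces a missing vocabulary or config file as a clear message rather than a loading error partway through the run.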
