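[Introduction to Machine Learning] Lecture 27: Advanced Natural Language Processing: Transformer-Based Text Classification and Generation in Practice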

Introduction: What Happens When AI Can "Read" Human Language?

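Suppose you are building an intelligent news aggregation platform:

Goal: automatically classify large volumes of news (politics / technology / sports) and generate summaries
Challenges:
- Text length varies widely (short bulletins vs. long articles)
- Context dependencies must be captured (for example, whether "Apple" means the company or the fruit)
- Generated summaries must not lose key information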

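The Transformer fundamentally changed NLP by replacing recurrence with self-attention. This lecture walks through Transformer-based text classification and summary generation in practice.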

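1. Transformer Architecture Essentials

1.1 Self-Attention: Letting the Model "Focus" on What Matters

Query-Key-Value computation:

Attention(Q,K,V) = softmax(QK^T/√d_k)V

Here d_k is the dimension of the key vectors. Dividing by √d_k keeps the dot products from growing with the dimension and saturating the softmax.

Multi-head attention: several attention heads run in parallel, each capturing a different kind of semantic relationship.

Self-attention in code: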

import torch  
import torch.nn.functional as F  

def self_attention(query, key, value, mask=None):  
    d_k = query.size(-1)  
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)  
    if mask is not None:  
        scores = scores.masked_fill(mask == 0, -1e9)  
    p_attn = F.softmax(scores, dim=-1)  
    return torch.matmul(p_attn, value), p_attn  

# Example: sequence length 5, embedding dimension 64
query = key = value = torch.randn(2, 5, 64)  # batch_size=2
output, attn_weights = self_attention(query, key, value)
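
The multi-head variant mentioned above can be sketched in the same style. This is a minimal illustration, not a reference implementation: the projection matrices w_q, w_k, w_v, w_o and the choice of 8 heads are assumptions made for the example, and the weights are random rather than learned.

```python
import torch
import torch.nn.functional as F


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over x of shape (batch, seq_len, d_model)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently for every head
    scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)
    weights = F.softmax(scores, dim=-1)
    context = weights @ v  # (batch, num_heads, seq_len, d_head)

    # Concatenate the heads and apply the output projection
    context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
    return context @ w_o, weights


torch.manual_seed(0)
d_model, num_heads = 64, 8  # hypothetical sizes: 8 heads of dimension 8
x = torch.randn(2, 5, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4))
mha_output, mha_weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(mha_output.shape)   # torch.Size([2, 5, 64])
print(mha_weights.shape)  # torch.Size([2, 8, 5, 5])
```

Each head attends over the same 5 positions but in its own 8-dimensional subspace, which is what lets different heads pick up different relationships.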

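1.2 Transformer vs. RNN

Dimension	RNN	Transformer
Parallelism	Limited by step-by-step time dependence	All positions processed in parallel (during training)
Long-range dependencies	Vanishing gradients cause forgetting	Self-attention directly connects any two positions
Training speed	Slow (sequential computation)	Fast (batched matrix operations)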

2. Text Classification in Practice: News Topic Recognition

2.1 Quickly Fine-Tuning BERT with Hugging Face

from transformers import BertTokenizer, BertForSequenceClassification  
import torch  

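# Load the pretrained model and tokenizer.
# The 3-class classification head is newly initialized, so it must be
# fine-tuned on labeled news data before its predictions are meaningful.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the input texts
texts = ["New vaccine developed for COVID-19", "Football league season postponed"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=1)
print("Predicted classes:", predictions)  # e.g. tensor([1, 2]) after fine-tuning; arbitrary before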
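
The snippet above only runs inference, so here is a rough sketch of what the fine-tuning step itself could look like. To keep it runnable offline it uses a tiny, randomly initialized BERT built from a BertConfig with made-up sizes, plus a toy batch of random token IDs and labels. All of these are assumptions for illustration. In practice you would load 'bert-base-uncased' with from_pretrained and iterate over a real labeled dataset.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Tiny, randomly initialized BERT (hypothetical sizes) so the sketch runs offline
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64, num_labels=3,
    hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0,
)
model = BertForSequenceClassification(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Toy batch: 4 "documents" of 8 random token IDs with made-up labels
input_ids = torch.randint(0, config.vocab_size, (4, 8))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 2, 1])

model.train()
losses = []
for step in range(20):
    # Passing labels makes the model return a cross-entropy loss
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    losses.append(outputs.loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same loop structure applies to the full-size model. Typical fine-tuning uses a much smaller learning rate (on the order of 2e-5 to 5e-5) and a few epochs over the training set.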

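2.2 Techniques for Long Texts

Sliding window: split a long document into overlapping segments and process each one separately
Hierarchical model: extract segment-level features first, then aggregate them into document-level features

# Sliding-window example (window length 128, stride 64)
# `long_texts` is assumed to be a list of strings defined elsewhere.
max_length = 128
stride = 64
window_size = max_length - 2  # leave room for [CLS] and [SEP]
for text in long_texts:
    tokens = tokenizer.encode(text, add_special_tokens=False)
    for i in range(0, len(tokens), stride):
        window = tokens[i:i + window_size]
        # Wrap the window's token IDs with [CLS] and [SEP]
        input_ids = tokenizer.build_inputs_with_special_tokens(window)
        inputs = {"input_ids": torch.tensor([input_ids])}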
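
To turn per-window predictions into one document-level prediction, the windows have to be aggregated somehow. The sketch below is one simple option under stated assumptions: it stops generating windows once the end of the document is covered, and it averages per-window class probabilities. The helper names sliding_windows and average_probs and the probability values are hypothetical, and the token IDs are stand-ins.

```python
def sliding_windows(token_ids, window_size=126, stride=64):
    """Split token_ids into overlapping windows; stop once the end is covered."""
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows


def average_probs(window_probs):
    """Average per-window class probabilities into one document-level vector."""
    num_classes = len(window_probs[0])
    return [sum(p[c] for p in window_probs) / len(window_probs) for c in range(num_classes)]


tokens = list(range(300))  # stand-in for real token IDs
windows = sliding_windows(tokens)
print([(w[0], w[-1]) for w in windows])
# [(0, 125), (64, 189), (128, 253), (192, 299)]

# Hypothetical per-window probabilities for (politics, technology, sports)
doc_probs = average_probs([[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.5, 0.2]])
print(doc_probs.index(max(doc_probs)))  # 1, i.e. technology
```

Averaging treats every window equally. Taking the maximum per class, or weighting windows by length, are alternatives worth comparing on validation data.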

3. Text Generation in Practice: A News Summarization System

3.1 Generating Summaries with PEGASUS

from transformers import PegasusTokenizer, PegasusForConditionalGeneration  

model_name = 'google/pegasus-xsum'  
tokenizer = PegasusTokenizer.from_pretrained(model_name)  
model = PegasusForConditionalGeneration.from_pretrained(model_name)  

article = """  
OpenAI announced GPT-4, the latest version of its language model.  
The new model demonstrates improved reasoning capabilities and multimodal input support.  
Experts warn about potential misuse but acknowledge its technological breakthrough.  
"""  

inputs = tokenizer(article, max_length=512, return_tensors='pt', truncation=True)
summary_ids = model.generate(**inputs, max_length=100)  # passes input_ids and attention_mask
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
# Illustrative output (the actual summary depends on the model and decoding settings):
# "OpenAI releases GPT-4 with enhanced reasoning and multimodal capabilities, raising both excitement and ethical concerns."

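3.2 Controlling Generation

Parameter	Effect	Example setting
Temperature	Controls randomness (low → conservative, high → creative); only used when sampling	0.7 (balanced)
Top-k	Samples only from the k most probable candidate tokens	50
Repetition penalty	Discourages repeated output	1.2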

generation_config = {
    "max_length": 100,
    "do_sample": True,  # temperature and top_k only take effect when sampling
    "temperature": 0.7,
    "top_k": 50,
    "repetition_penalty": 1.2,
    "num_return_sequences": 1
}
summary_ids = model.generate(**inputs, **generation_config)
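
To make the three parameters concrete, here is a small pure-Python sketch of how they could act on the logits for one decoding step. It illustrates the ideas and is not the library's internal code. The function names and the 4-token toy vocabulary are assumptions, and the repetition penalty follows the common CTRL-style rule (divide positive logits, multiply negative ones).

```python
import math
import random


def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely (CTRL-style penalty)."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_k_probs(logits, temperature=0.7, top_k=50):
    """Temperature-scaled softmax restricted to the top_k highest logits."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = {i: logits[i] / temperature for i in keep}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exp = {i: math.exp(s - max_s) for i, s in scaled.items()}
    total = sum(exp.values())
    return {i: e / total for i, e in exp.items()}


def sample_next(logits, generated_ids, rng, temperature=0.7, top_k=50, penalty=1.2):
    """Penalize repeats, keep the top_k candidates, then sample one token ID."""
    penalized = apply_repetition_penalty(logits, generated_ids, penalty)
    probs = top_k_probs(penalized, temperature, top_k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy vocabulary of 4 tokens
probs = top_k_probs(logits, temperature=0.7, top_k=2)
print(sorted(probs))  # [0, 1]: only the two highest-scoring tokens survive
next_id = sample_next(logits, generated_ids=[0], rng=random.Random(0), top_k=2)
print(next_id)  # either 0 or 1
```

Lowering the temperature concentrates probability on the top token, while raising it flattens the distribution over the surviving k candidates.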

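4. Model Optimization and Deployment

4.1 Compressing the Model with Knowledge Distillation

import torch
import torch.nn.functional as F
from transformers import DistilBertForSequenceClassification

# The loop is written out by hand, since `transformers` has no importable Distiller class.
# Assumed to exist: `bert_model` (fine-tuned teacher), `train_loader` (dict batches), `optimizer`.

# Initialize the student model
student = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Distillation settings
temperature = 2.0
alpha_ce = 0.5   # weight of the soft-target (KL divergence) loss
alpha_mse = 0.5  # weight of the MSE loss between student and teacher logits

# Run distillation
bert_model.eval()
student.train()
for epoch in range(3):
    for batch in train_loader:
        model_inputs = {k: batch[k] for k in ("input_ids", "attention_mask")}
        with torch.no_grad():
            teacher_logits = bert_model(**model_inputs).logits
        student_logits = student(**model_inputs).logits
        loss_ce = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                           F.softmax(teacher_logits / temperature, dim=-1),
                           reduction="batchmean") * temperature ** 2
        loss_mse = F.mse_loss(student_logits, teacher_logits)
        loss = alpha_ce * loss_ce + alpha_mse * loss_mse
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()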

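4.2 Deployment with ONNX Runtime

from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Convert the BERT model to ONNX.
# convert_graph_to_onnx is a legacy module (newer releases point to the
# separate `optimum` package); it expects a Path inside an empty folder.
convert(framework="pt", model="bert-base-uncased", output=Path("onnx/bert.onnx"), opset=12)

# Run inference with ONNX Runtime
import onnxruntime as ort

session = ort.InferenceSession("onnx/bert.onnx")
# Feed every input the exported graph declares (for BERT this typically also
# includes token_type_ids); `inputs` is the BERT tokenizer output from section 2.1.
inputs_onnx = {i.name: inputs[i.name].numpy() for i in session.get_inputs()}
outputs = session.run(None, inputs_onnx)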

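5. Monitoring and Continuous Improvement

5.1 Monitoring Metrics for Text Classification

Accuracy: overall classification accuracy
Class-wise F1: F1 score for each class (informative under class imbalance)
Confidence Distribution: distribution of the model's prediction confidence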
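
The three metrics above can be computed with a few lines of plain Python. The sketch below is illustrative only: the labels, predictions, and confidence values are made up, and classification_report is a hypothetical helper name rather than a library function.

```python
from collections import Counter


def classification_report(y_true, y_pred, confidences, num_bins=5):
    """Accuracy, per-class F1, and a histogram of prediction confidences."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1 = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        f1[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    # Bucket confidences into equal-width bins over [0, 1]
    bins = Counter(min(int(conf * num_bins), num_bins - 1) for conf in confidences)
    histogram = [bins.get(b, 0) for b in range(num_bins)]
    return accuracy, f1, histogram


# Hypothetical labels: 0 = politics, 1 = technology, 2 = sports
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
confidences = [0.95, 0.55, 0.90, 0.85, 0.99, 0.45]
accuracy, f1, histogram = classification_report(y_true, y_pred, confidences)
print(round(accuracy, 3))  # 0.667
print({c: round(v, 3) for c, v in f1.items()})  # {0: 0.5, 1: 0.8, 2: 0.667}
print(histogram)  # [0, 0, 2, 0, 4]
```

In this toy data both wrong predictions fall in the middle confidence bin, which is the kind of pattern a confidence histogram makes visible in production.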

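5.2 Evaluating Text Generation Quality

Metric	What it measures	Recommended tool
ROUGE	Lexical overlap with reference summaries	rouge-score library
BERTScore	Semantic similarity	bert-score library
Human evaluation	Fluency, information completeness	Crowdsourcing platforms (Amazon Mechanical Turk)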
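
To show what "lexical overlap" means for ROUGE, here is a simplified ROUGE-N written from scratch. It is a sketch under stated assumptions, not the rouge-score library: it splits on whitespace, lowercases, and skips stemming and ROUGE-L. The function name rouge_n and the two example sentences are made up.

```python
from collections import Counter


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: n-gram overlap precision, recall, and F1."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "openai releases gpt-4 with improved reasoning"
candidate = "openai announced gpt-4 with better reasoning"
r1 = rouge_n(reference, candidate, n=1)
r2 = rouge_n(reference, candidate, n=2)
print([round(v, 3) for v in r1])  # [0.667, 0.667, 0.667]
print([round(v, 3) for v in r2])  # [0.2, 0.2, 0.2]
```

The two sentences mean nearly the same thing, yet the bigram score is low. That gap is why the table pairs ROUGE with a semantic metric such as BERTScore and with human evaluation.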