Hugging Face in Practice: Named Entity Recognition

纤度

Last modified: 2024-12-05 00:41:41

License: CC 4.0 BY-SA

Column: hugging face入门与实战 (Hugging Face: Introduction and Practice). Tag: artificial intelligence

First published: 2024-12-05 00:41:08

Article link: https://blog.csdn.net/weixin_56696781/article/details/144251770

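Preliminaries

This series summarizes my notes from following a Bilibili course. Course link:

【手把手带你实战HuggingFace Transformers】 (Hands-on HuggingFace Transformers)

This part fine-tunes a model for the named entity recognition (NER) task.

First, load the packages we need: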

from transformers import AutoTokenizer,AutoModelForTokenClassification,Trainer,TrainingArguments,DataCollatorForTokenClassification
import evaluate
from datasets import load_dataset

Here we use the peoples_daily_ner dataset:

datas = load_dataset("peoples_daily_ner",cache_dir="./data",trust_remote_code=True)

Its structure is:

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 20865
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2319
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 4637
    })
})

Let's look at one example:

datas["train"][0]

The result is:

{'id': '0',
 'tokens': ['海',
  '钓',
  '比',
  '赛',
  '地',
  '点',
  '在',
  '厦',
  '门',
  '与',
  '金',
  '门',
  '之',
  '间',
  '的',
  '海',
  '域',
  '。'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 5, 6, 0, 5, 6, 0, 0, 0, 0, 0, 0]}

The sentence has been split into tokens and every token carries its own label, which is characteristic of NER. The labels follow the IOB2 scheme. Let's see which labels exist:

label_list = datas["train"].features["ner_tags"].feature.names
#['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

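Here 'O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC' mean, respectively: not an entity, beginning of a person name, inside a person name, beginning of an organization name, inside an organization name, beginning of a location name, inside a location name.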
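To make the scheme concrete, here is a minimal sketch that decodes the ner_tags of the first training example back into entities. The helper name iob2_to_spans is my own, not part of datasets or transformers, and the tokens are joined without spaces because each token here is a single Chinese character.

```python
# Label names copied from the dataset's ner_tags feature shown above.
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

def iob2_to_spans(tokens, tag_ids, label_list):
    """Collect (entity_type, text) pairs from an IOB2-tagged token sequence.

    Hypothetical helper written for illustration only.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_type is not None:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # 'O' (or a stray I- tag) closes any open entity.
            if current_type is not None:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, "".join(current_tokens)))
    return spans

# First training example, as printed above.
tokens = list("海钓比赛地点在厦门与金门之间的海域。")
ner_tags = [0, 0, 0, 0, 0, 0, 0, 5, 6, 0, 5, 6, 0, 0, 0, 0, 0, 0]

spans = iob2_to_spans(tokens, ner_tags, label_list)
print(spans)  # [('LOC', '厦门'), ('LOC', '金门')]
```

The tag pairs 5, 6 (B-LOC, I-LOC) are what mark 厦门 and 金门 as two separate location entities.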

With the labeling scheme clear, let's see how the tokenizer handles this data, using the first example:

tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
example_1 = tokenizer(datas["train"][0]["tokens"])

The result is:

{'input_ids': [[101, 3862, 102], [101, 7157, 102], [101, 3683, 102], [101, 6612, 102], [101, 1765, 102], [101, 4157, 102], [101, 1762, 102], [101, 1336, 102], [101, 7305, 102], [101, 680, 102], [101, 7032, 102], [101, 7305, 102], [101, 722, 102], [101, 7313, 102], [101, 4638, 102], [101, 3862, 102], [101, 1818, 102], [101, 511, 102]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]}

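As we can see, calling the tokenizer directly treats every word as a separate sentence. We therefore add the argument is_split_into_words=True so that the list of words is treated as one pre-split sentence: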

example_2 = tokenizer(datas["train"][0]["tokens"],is_split_into_words=True)

The result is:

{'input_ids': [101, 3862, 7157, 3683, 6612, 1765, 4157, 1762, 1336, 7305, 680, 7032, 7305, 722, 7313, 4638, 3862, 1818, 511, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

When studying the tokenizer we learned that a single word may be split into several tokens. To find out which word each token belongs to, call the word_ids() method:

ids = tokenizer("property aspect")
ids.word_ids()

The result is:

[None, 0, 0, 0, 1, 1, None]

With that in place, we need to re-label the tokenized data so that every token has a label. We do this with a mapping function, collet_fn:

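def collet_fn(example):
    # Tokenize a batch of pre-split sentences; each sentence is a list of words.
    tokenized_data = tokenizer(example["tokens"],max_length=128,truncation=True,is_split_into_words=True)
    labels = []
    for i,label in enumerate(example["ner_tags"]):
        # word_ids maps every token of sentence i back to the index of its source word.
        word_ids = tokenized_data.word_ids(batch_index=i)
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                # Special tokens such as [CLS] and [SEP] get -100 so the loss ignores them.
                label_ids.append(-100)
            else:
                # Every token inherits the label of the word it came from.
                label_ids.append(label[word_id])
        labels.append(label_ids)
    tokenized_data["labels"] = labels
    return tokenized_data
data_pro = datas.map(collet_fn,batched=True)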
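The alignment step can be tried without a tokenizer. The sketch below reuses the word_ids list printed earlier and a made-up pair of word labels. The function align_labels and its label_all_subtokens flag are hypothetical names of mine. With the flag on it does what collet_fn does. With the flag off it shows a common alternative that labels only the first sub-token of each word. The value -100 is the default ignore_index of PyTorch's CrossEntropyLoss, which is why those positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids, word_labels, label_all_subtokens=True):
    """Expand word-level labels to token-level labels (illustrative helper).

    word_ids: per-token source word index, None for special tokens.
    word_labels: one label id per source word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens never get a real label.
            aligned.append(IGNORE_INDEX)
        elif label_all_subtokens or word_id != previous:
            # Either label every sub-token, or only the first one of each word.
            aligned.append(word_labels[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned

# word_ids as printed above for "property aspect".
word_ids = [None, 0, 0, 0, 1, 1, None]
# Made-up labels for the two words, for illustration only (3 = B-ORG, 0 = O).
word_labels = [3, 0]

all_sub = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, label_all_subtokens=False)
print(all_sub)     # [-100, 3, 3, 3, 0, 0, -100]
print(first_only)  # [-100, 3, -100, -100, 0, -100, -100]
```

For peoples_daily_ner the two variants coincide almost everywhere, because each word is a single character that usually maps to a single token.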

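Next we define the model. Note that the checkpoint we load has a classification head with num_labels=2 by default, whereas our task has 7 labels, so we override it: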

model = AutoModelForTokenClassification.from_pretrained("IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment",num_labels=len(label_list),ignore_mismatched_sizes=True)

For NER we evaluate with seqeval, which has to be installed first: pip install seqeval

import numpy as np

eval_fn = evaluate.load("seqeval")
def eval_process(pred):
    predictions,labels = pred
    # Pick the highest-scoring label id for every token.
    predictions = np.argmax(predictions,axis=-1)
    # Drop positions labelled -100 (special tokens and padding) and map ids to tag names.
    true_predictions = [
        [label_list[p] for p,l in zip(prediction,label) if l != -100]
        for prediction,label in zip(predictions,labels)
    ]
    true_labels = [
        [label_list[l] for p,l in zip(prediction,label) if l != -100]
        for prediction,label in zip(predictions,labels)
    ]
    result = eval_fn.compute(references=true_labels,predictions=true_predictions)
    return {
        "overall f1":result["overall_f1"]
    }

Above we defined the evaluation function. It passes the true labels and the predictions to compute and returns one of the resulting metrics, overall_f1.
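The filtering inside eval_process is easy to check on a toy input. The following sketch uses plain Python lists instead of NumPy, and the scores are invented one-hot vectors rather than real model logits. strip_ignored, argmax and one_hot are hypothetical helper names.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

def argmax(scores):
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)

def strip_ignored(logits, label_ids, label_list, ignore_index=-100):
    """Return (predicted_tags, true_tags) for one sentence, skipping ignored positions."""
    pred_tags, true_tags = [], []
    for token_scores, label_id in zip(logits, label_ids):
        if label_id == ignore_index:
            continue  # special token or padding: not scored
        pred_tags.append(label_list[argmax(token_scores)])
        true_tags.append(label_list[label_id])
    return pred_tags, true_tags

def one_hot(index, size=7):
    """Fake per-token scores that put all the mass on one label."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Toy sentence "[CLS] 厦 门 [SEP]" with invented scores:
# the model gets 厦 right (B-LOC) but predicts O for 门.
logits = [one_hot(0), one_hot(5), one_hot(0), one_hot(0)]
label_ids = [-100, 5, 6, -100]

pred_tags, true_tags = strip_ignored(logits, label_ids, label_list)
print(pred_tags)  # ['B-LOC', 'O']
print(true_tags)  # ['B-LOC', 'I-LOC']
```

Lists like these, one per sentence, are what eval_fn.compute receives. seqeval scores whole entities rather than individual tokens, so in this toy case the half-recognized 厦门 would not count as a correct match.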

To visualize the metrics more clearly I recommend wandb. Install it first with pip install wandb, then initialize it before training:

import wandb

# Initialize W&B
wandb.init(
    project="tokenclassification",  # replace with your own project name
    name="tokenclassification",        # optional: name of this run
    config={                     # optional: hyperparameters to record with the run
        "learning_rate": 0.001,
        "epochs": 5,
    }
)

Next come the Trainer arguments. For details see the Trainer article in this series:

args = TrainingArguments(
    output_dir="model_for_ner",
    per_device_train_batch_size=32,  # per_gpu_* is deprecated in favor of per_device_*
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",     # newer transformers releases name this eval_strategy
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb"
)
train = Trainer(model=model,args=args,train_dataset=data_pro["train"],eval_dataset=data_pro["validation"],compute_metrics=eval_process,data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer))
train.train()  # run the fine-tuning

Finally, we use pipeline to run the fine-tuned model:

from transformers import pipeline
ner_pip = pipeline(task="token-classification",model=model,tokenizer=tokenizer)
ner_pip("小明在厦门上班")

The output is:

[{'entity': 'LABEL_1',
  'score': 0.9702725,
  'index': 1,
  'word': '小',
  'start': 0,
  'end': 1},
 {'entity': 'LABEL_2',
  'score': 0.98203045,
  'index': 2,
  'word': '明',
  'start': 1,
  'end': 2},
 {'entity': 'LABEL_0',
  'score': 0.99971205,
  'index': 3,
  'word': '在',
  'start': 2,
  'end': 3},
 {'entity': 'LABEL_5',
  'score': 0.9953642,
  'index': 4,
  'word': '厦',
  'start': 3,
  'end': 4},
 {'entity': 'LABEL_6',
...
  'score': 0.999699,
  'index': 7,
  'word': '班',
  'start': 6,
  'end': 7}]

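This output is not what we want: the entities show up as LABEL_1, LABEL_2 and so on, because id2label in model.config does not contain our IOB2 label names. Let's replace it: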

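model.config.id2label = dict(enumerate(label_list))  # {0: 'O', 1: 'B-PER', ...}
model.config.label2id = {label: i for i,label in enumerate(label_list)}

Calling ner_pip("小明在厦门上班") again now gives: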

[{'entity': 'B-PER',
  'score': 0.9702725,
  'index': 1,
  'word': '小',
  'start': 0,
  'end': 1},
 {'entity': 'I-PER',
  'score': 0.98203045,
  'index': 2,
  'word': '明',
  'start': 1,
  'end': 2},
 {'entity': 'B-LOC',
  'score': 0.9953642,
  'index': 4,
  'word': '厦',
  'start': 3,
  'end': 4},
 {'entity': 'I-LOC',
  'score': 0.9987676,
  'index': 5,
  'word': '门',
  'start': 4,
  'end': 5}]
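An alternative worth considering is to supply the label names when the model is loaded, so that the saved config already contains them. The sketch below assumes that from_pretrained forwards id2label and label2id to the model config, which is how I understand the transformers API. The wrapper load_model is a hypothetical name, and it is only defined here, not called, because calling it downloads the checkpoint.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

# Both directions of the mapping, built from the dataset's label names.
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}

def load_model(checkpoint="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment"):
    """Load the checkpoint with a 7-way token classification head and readable labels.

    Hypothetical wrapper for illustration; requires transformers and network access.
    """
    from transformers import AutoModelForTokenClassification
    return AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(label_list),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # the checkpoint's original head has 2 labels
    )

print(id2label[5], label2id["I-PER"])  # B-LOC 2
```

With the mappings stored in the config, a pipeline built from a reloaded checkpoint should report B-PER, I-LOC and so on without any patching after training.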

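To merge the tokens of each entity, pass aggregation_strategy="simple" to pipeline. The merged entity text then contains spaces between the characters, because the BERT tokenizer inserts spaces when decoding. We handle that at the same time by slicing the original string with the start and end offsets: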

ner_pip = pipeline(task="token-classification",model=model,tokenizer=tokenizer,aggregation_strategy="simple")
x = "小明在厦门上班"
res = ner_pip(x)
ner_result = {}
for r in res:
    # Slice the original text by character offsets to avoid the spaces in r["word"].
    if r["entity_group"] not in ner_result:
        ner_result[r["entity_group"]]=[]
    ner_result[r["entity_group"]].append(x[r["start"]:r["end"]])

The output is:
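Given the token-level offsets shown earlier (小 and 明 cover characters 0 to 2, 厦 and 门 cover 3 to 5), the loop would be expected to collect 小明 under PER and 厦门 under LOC. As a minimal sketch, the grouping can be wrapped in a function and run on hand-written records in the shape the aggregated pipeline returns. group_entities is a hypothetical name, and sample_results is illustrative input derived from those offsets, not real model output.

```python
def group_entities(text, results):
    """Group entity strings by type, slicing the original text by offsets.

    Slicing text[start:end] avoids the spaces that appear in each record's "word".
    """
    grouped = {}
    for r in results:
        grouped.setdefault(r["entity_group"], []).append(text[r["start"]:r["end"]])
    return grouped

text = "小明在厦门上班"
# Hand-written records mimicking aggregation_strategy="simple" output (scores omitted).
sample_results = [
    {"entity_group": "PER", "word": "小 明", "start": 0, "end": 2},
    {"entity_group": "LOC", "word": "厦 门", "start": 3, "end": 5},
]

grouped = group_entities(text, sample_results)
print(grouped)  # {'PER': ['小明'], 'LOC': ['厦门']}
```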