bert-for-token-classification-ner-tutorial
Named Entity Recognition, NER for short, is a method of information extraction from
text data that falls under the NLP space. The most common entities extracted from text
data are names of people, countries, and companies, and contact information such as
email IDs, phone numbers, and home addresses.
However, NER tasks are not limited to these standard entities; models can also be
fine-tuned/trained to identify custom entities of our choosing from the text data.
There are functions in NLTK and spaCy that can be used for NER tasks. In this
tutorial, however, we will go through how to identify the entities of our interest with
BERT, using the BertForTokenClassification class from the HuggingFace Transformers
package with PyTorch.
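As a point of comparison for those standard entities, here is a quick spaCy baseline. This is a minimal sketch that assumes the small English model en_core_web_sm has been installed (python -m spacy download en_core_web_sm); the example sentence is made up.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook is the CEO of Apple, based in Cupertino, California.")

# Each detected entity carries its text span and a label such as PERSON, ORG or GPE.
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Tim Cook', 'PERSON'), ('Apple', 'ORG'), ('Cupertino', 'GPE'), ('California', 'GPE')]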
There are extensive tutorials available online to understand BERT. You can also check
out Jay Alammar's blog post, A Visual Guide to Using BERT for the First Time, which is
one of the best ways to learn BERT visually.
If it is a multiclass problem, the number of possible classes will be N, and there will
be N class probabilities for each token.
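As a minimal sketch of what that looks like in code (assuming bert-base-uncased and N = 3 classes purely for illustration), BertForTokenClassification produces one N-way score vector per token:

import torch
from transformers import AutoTokenizer, BertForTokenClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)

enc = tokenizer("census of agriculture data were used", return_tensors='pt')
with torch.no_grad():
    outputs = model(**enc)

logits = outputs[0]              # shape: (1, number_of_tokens, 3)
probs = logits.softmax(dim=-1)   # N = 3 class probabilities for each token
print(probs.shape)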
The textual information comes from different publications, each stored as a json file
with its Id as the unique key to identify it. The labels for the publications are stored
in train.csv, which we will use to label the text documents from each publication. I
would recommend going through the competition overview to understand the data in
detail.
Contents
Declare configurations
Reading the train dataset and preprocessing the text files
Tokenization and Data Labelling for the NER task
Define the DataLoader to batch the train dataset for training
Model training using BertForTokenClassification
Evaluation function to evaluate the model while training
Reading the test dataset
Prediction function to get the prediction for the test data
Result consolidation
I have written a similar notebook on how we can use RoBERTa for NER tasks. Do check
out that notebook for the tutorial. Consider leaving an upvote if you like this one,
and comment if you have any questions :)
In [1]: import os
import pandas as pd
import numpy as np
import json
import re
from nltk.tokenize import sent_tokenize
from transformers import BertTokenizer, AutoTokenizer
from torch.utils.data import DataLoader
import torch
import torch.nn as nn
import transformers
from tqdm import tqdm
import glob
from sklearn.model_selection import train_test_split
import datetime
import warnings
warnings.filterwarnings('ignore')
Config
In [2]: platform = 'Kaggle'
model_name = 'model1_bert_base_uncased.bin'

if platform == 'Kaggle':
    bert_path = '../input/huggingface-bert/bert-base-uncased/'
    train_path = '/kaggle/input/coleridgeinitiative-show-us-the-data/train/'
    test_path = '/kaggle/input/coleridgeinitiative-show-us-the-data/test/*'
    model_path = '../input/coleridgemodels/' + model_name

config = {'MAX_LEN': 128,
          'tokenizer': AutoTokenizer.from_pretrained(bert_path, do_lower_case=True),
          'batch_size': 5,
          'Epoch': 1,
          'train_path': train_path,
          'test_path': test_path,
          'device': 'cuda' if torch.cuda.is_available() else 'cpu',
          'model_path': model_path,
          'model_name': model_name
          }
4        0010357a-6365-4e5f-b982-582e6d32c3ee    1    genome sequence of covid 19
14311    ffd19b3c-f941-45e5-9382-934b5041ec96    1    census of agriculture
14313    ffe7f334-245a-4de7-b600-d7ff4e28bfca    1    genome sequences of sars cov 2
return text_data
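For reference, here is a minimal sketch of how such a text_data dictionary keyed by publication Id can be built; it assumes one json file per publication under config['train_path'] and is an illustrative helper rather than the notebook's original cell.

def read_text_files(folder):
    # Illustrative helper (assumption): each publication is stored as <Id>.json.
    text_data = {}
    for file_path in glob.glob(folder + '*.json'):
        file_id = os.path.basename(file_path).replace('.json', '')
        with open(file_path, 'r') as f:
            text_data[file_id] = json.load(f)
    return text_data

data_dict = read_text_files(config['train_path'])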
def data_joining(data_dict_id):
    '''
    This function is to join all the text data from the different
    sections in the json into a single text string.
    '''
    data_length = len(data_dict_id)
    # Each section of the publication json is assumed to carry its content under the 'text' key.
    temp = " ".join([data_dict_id[i]['text'] for i in range(data_length)])
    return temp
def make_shorter_sentence(sentence):
    '''
    This function is to split long sentences into chunks of shorter sentences
    with at most the number of words specified in config['MAX_LEN'].
    '''
    sent_tokenized = sent_tokenize(sentence)

    max_length = config['MAX_LEN']
    overlap = 20

    final_sentences = []
    for tok_sentence in sent_tokenized:
        # Cleaning step (assumed): lower-case and keep only alphanumerics and spaces.
        sent_tokenized_clean = re.sub(r'[^A-Za-z0-9 ]+', '', tok_sentence.lower()).strip()
        tok_sent = sent_tokenized_clean.split(" ")

        if len(tok_sent) < max_length:
            final_sentences.append(sent_tokenized_clean)
        else:
            # print("Making shorter sentences")
            start = 0
            end = len(tok_sent)
            # Slide a window of max_length words with an `overlap`-word overlap between chunks.
            for i in range(start, end, max_length - overlap):
                final_sentences.append(" ".join(tok_sent[i: i + max_length]))
    return final_sentences
Check out this blog on the different tokenization methods to understand why BERT tokenizes
the sentence the way it does.
Labelling
The function form_labels() is where we actually label the data that is fed to the model
later in the notebook.
Example sentence: "control samples were selected from individuals who had
participated in genomewide association studies performed by our group 787 samples
from the neurogenetics collection at the coriell cell repository and 728 from the
baltimore longitudinal study of aging blsa"
Portions to label: "genomewide association studies" and "baltimore longitudinal study of
aging"
How do we do this?
Tokenize the train sentence.
['control', 'samples', 'were', 'selected', 'from', 'individuals', 'who', 'had', 'participated', 'in',
'genome', '##wide', 'association', 'studies', 'performed', 'by', 'our', 'group', '78', '##7',
'samples', 'from', 'the', 'ne', '##uro', '##gen', '##etic', '##s', 'collection', 'at', 'the', 'co',
'##rie', '##ll', 'cell', 'repository', 'and', '72', '##8', 'from', 'the', 'baltimore', 'longitudinal',
'study', 'of', 'aging', 'b', '##ls', '##a']
For each tokenized label, loop over the tokenized train sentence, and wherever you find a
match, label those tokens as 'B' and the rest as 'O'.
Since there are two labels, we loop over the train sentence twice.
At the end of the 1st loop for ['genome', '##wide', 'association', 'studies'], the label
looks like this
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', 'B', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O']
At the end of the 2nd loop for ['baltimore', 'longitudinal', 'study', 'of', 'aging'], the
final label looks like this
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', 'B', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B',
'B', 'B', 'B', 'B', 'O', 'O', 'O']
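To make this match-and-mark step concrete, here is a minimal, self-contained sketch for a single keyword; the helper label_keyword below is illustrative and is not the notebook's form_labels().

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def label_keyword(sentence, keyword):
    # Tokenize both the sentence and the keyword, then mark every token
    # covered by a keyword match as 'B' and everything else as 'O'.
    tok = tokenizer.tokenize(sentence)
    kword_split = tokenizer.tokenize(keyword)
    z = ['O'] * len(tok)
    for i in range(len(tok) - len(kword_split) + 1):
        if tok[i: i + len(kword_split)] == kword_split:
            z[i: i + len(kword_split)] = ['B'] * len(kword_split)
    return tok, z

tok, labels = label_keyword(
    "787 samples from the baltimore longitudinal study of aging blsa",
    "baltimore longitudinal study of aging")
print(list(zip(tok, labels)))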
Most tutorials follow a BIO-style labelling pattern (B-begin, I-inside, O-outside). In the
variant shown below, the first word of an entity is labelled B, the intermediate words I,
the last word O, and every other token some other character, say X:

['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'B', 'I', 'I', 'O', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X',
'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'B', 'I', 'I', 'I', 'O', 'X', 'X',
'X']
However, I would like to stick to labelling the entity tokens only as 'B', as I find it simpler.
# Since there are many sentences with more than 512 words,
# let's make the max length 128 words per sentence.
tokens = make_shorter_sentence(sentence)

    # Keyword found at token position i (the enclosing loop and match check
    # are not shown here): mark every covered position as 'B'.
    if (len(kword_split) == 1):
        z[i] = 'B'
    else:
        z[i] = 'B'
        z[(i+1): (i + len(kword_split))] = 'B'

    # If more than one keyword matched this sentence, merge into the last
    # appended entry; otherwise append a new entry.
    if matched_keywords > 1:
        label[-1] = (z.tolist())
        matched_token[-1] = tok
        matched_kwords[-1].append(kword)
    else:
        label.append(z.tolist())
        matched_token.append(tok)
        matched_kwords.append([kword])
else:
    # The keyword was not found in this sentence.
    un_matched_kwords.append(kword)
Id_list_ = []
sentences_ = []
key_ = []
labels_ = []
un_mat = []
un_matched_reviews = 0

for Id in tqdm(train_df.Id.unique()):
    sentence = data_joining(data_dict[Id])
    labels = train_df.label[train_df.Id == Id].tolist()[0].split("|")
    # form_labels() returns the matched sentences (s), keywords (k), token
    # labels (l) and any unmatched keywords (loop and call signature assumed).
    s, k, l, un_matched = form_labels(sentence=sentence, labels_list=labels)
    if len(s) == 0:
        un_matched_reviews += 1
        un_mat.append(un_matched)
    else:
        sentences_.append(s)
        key_.append(k)
        labels_.append(l)
        Id_list_.append([Id]*len(l))

print("")
print(f" train sentences: {len(train_sentences)}, train label: {len(train_labels)}, train keywords: {len(train_keywords)}, train_id list: {len(train_id_list)}")
train sentences: 52073, train label: 52073, train keywords: 52073, train_id list: 52073
(50772, 5)
Out[10]:
                                         id                                     train_sentences                                kword                                              label  sent_len
52072  ffee2676-a778-4521-b947-e1e420b126c5  my prior research illustrated with use of begi...  ['beginning postsecondary student']  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', ...       105
Out[11]: (5077, 5)
train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)
print(train_df.shape, valid_df.shape)
(4061, 5) (1016, 5)
We have labelled our entire train dataset, and the labelled positions are denoted as B.
However, before we pass the labels to the model we will have to convert them into a
numerical representation.
Here P means padding, which is appended to the train sentence and its label when the
total word count of the sentence is shorter than the MAX_LEN mentioned in config. We will
add the padding in the class form_input()
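As a minimal sketch, here is the label-to-id mapping implied by the target tensor shown further down (O→0, B→1, P→2); the dictionary name tags_2_idx matches the one used later in the model cell, but the exact values here are inferred.

# Assumed mapping for the B-only scheme used in this notebook:
# non-entity tokens -> 0, entity tokens -> 1, padded positions -> 2.
tags_2_idx = {'O': 0, 'B': 1, 'P': 2}

def labels_to_ids(char_labels):
    return [tags_2_idx[c] for c in char_labels]

print(labels_to_ids(['O', 'B', 'B', 'O', 'P']))   # [0, 1, 1, 0, 2]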
def dataset_2_list(df):
    id_list = df.id.values.tolist()
    sentences_list = df.train_sentences.values.tolist()
    keywords_list = df.kword.apply(lambda x: eval(x)).values.tolist()
    # The label column is stored as a string as well, so it is eval'ed the same way (assumed).
    labels_list = df.label.apply(lambda x: eval(x)).values.tolist()
    return id_list, sentences_list, keywords_list, labels_list
One additional point to note is that we have set MAX_LEN to 128 words per sentence.
However, when we tokenize a sentence, BERT's way of splitting out-of-vocabulary and
compound words into multiple sub-words means the number of tokens can be greater than
the number of words. If the number of tokens is less than 128, the extra positions are
padded with "P"; if they exceed MAX_LEN, we truncate them to 128 tokens in the portion
of the code shown below, which is a common practice.
if len(toks) > self.max_length:
    toks = toks[:self.max_length]
    label = label[:self.max_length]
Do you feel there could be information loss due to the truncation? Well, if you check the
make_shorter_sentence(sentence) function, there is a variable overlap=20, so the
beginning of every chunk repeats the last 20 words of the previous chunk. Tokens
truncated from sentence 1 can therefore still be captured in sentence 2.
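Here is a toy illustration of that overlapping split, using MAX_LEN=10 and overlap=3 only so the output stays readable (the notebook itself uses 128 and 20):

words = [f"w{i}" for i in range(25)]
max_length, overlap = 10, 3

# Each chunk starts (max_length - overlap) words after the previous one,
# so consecutive chunks share `overlap` words.
chunks = [words[i: i + max_length] for i in range(0, len(words), max_length - overlap)]
for c in chunks:
    print(c[0], "...", c[-1], f"({len(c)} words)")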
def __len__(self):
    return len(self.sentence)

def __getitem__(self, item):
    # The sub-word tokenization of self.sentence[item] and its label ids are
    # built just above this point (that part of the cell is assumed); `toks`
    # holds the tokens and `label` their numeric labels.
    if len(toks) > self.max_length:
        toks = toks[:self.max_length]
        label = label[:self.max_length]

    ########################################
    # Forming the inputs
    ids = config['tokenizer'].convert_tokens_to_ids(toks)
    tok_type_id = [0] * len(ids)
    att_mask = [1] * len(ids)

    # Padding
    pad_len = self.max_length - len(ids)
    ids = ids + [2] * pad_len
    tok_type_id = tok_type_id + [0] * pad_len
    att_mask = att_mask + [0] * pad_len

    ########################################
    # Forming the label
    if self.data_type != 'test':
        label = label + [2] * pad_len
    else:
        label = 1
# (Parallel train-side call, assumed to mirror the valid-side call below.)
train_prod_input = form_input(ID=final_train_id_list,
                              sentence=final_train_sentences,
                              kword=final_train_keywords,
                              label=final_train_labels,
                              data_type='train')
valid_prod_input = form_input(ID=final_valid_id_list,
                              sentence=final_valid_sentences,
                              kword=final_valid_keywords,
                              label=final_valid_labels,
                              data_type='valid')

train_prod_input_data_loader = DataLoader(train_prod_input,
                                          batch_size=config['batch_size'],
                                          shuffle=True)
valid_prod_input_data_loader = DataLoader(valid_prod_input,
                                          batch_size=config['batch_size'],
                                          shuffle=True)
print("")
print("Input label:")
print(final_train_keywords[ind])
print("")
print("Output:")
train_prod_input[ind]#, valid_prod_input[ind]
Input sentence:
7 college enrollment and completion data come from the beginning postsecondary student longitudinal study 2004 2009 and from the national education longitudinal study for 1988 which followed a sample of eighth graders from 1988 until 2000

Input label:
['national education longitudinal study', 'beginning postsecondary student', 'education longitudinal study']
Output:
Out[17]: {'ids': tensor([ 1021,  2267, 10316,  1998,  6503,  2951,  2272,  2013,  1996,  2927,
                          8466,  8586, 29067,  2854,  3076, 20134,  2817,  2432,  2268,  1998,
                          2013,  1996,  2120,  2495, 20134,  2817,  2005,  2997,  2029,  2628,
                          1037,  7099,  1997,  5964, 23256,  2013,  2997,  2127,  2456,
                          2, 2, ..., 2]),                   # 89 padded positions filled with 2
 'tok_type_id': tensor([0, 0, ..., 0]),                     # all 128 values are 0
 'att_mask': tensor([1, 1, ..., 1, 0, 0, ..., 0]),          # 1 for the 39 real tokens, 0 for the padding
 'target': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
                   1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                   2, 2, ..., 2])}                          # 2 marks the padded positions
As we can see, the input sentence we passed in comes back as a dictionary of
ids, tok_type_id, att_mask, target.
The ids contain numbers which are nothing but the index positions of the tokens in
the BERT vocabulary. If you want to know how the values in ids are obtained, I would
recommend downloading vocab.txt from the HuggingFace repo and checking the index
of each token, counting from 0.
I have shown the token ids for the first 5 words of an example sentence ("2
comparisons with the most") in the screenshot below. The 2s in the ids are nothing but
the padded value.
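Instead of downloading vocab.txt, you can also look the ids up programmatically; a quick sketch using the same tokenizer held in config['tokenizer']:

# Map each sub-word token of the example sentence to its BERT vocabulary index.
toks = config['tokenizer'].tokenize("2 comparisons with the most")
ids = config['tokenizer'].convert_tokens_to_ids(toks)
print(list(zip(toks, ids)))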
The att_mask contains 1 for all the token positions and 0 for all the padded
positions, so the model attends to the tokens and not to the padding.
target, as we saw before, is labelled 1 for the label positions ("survey of doctorate
recipients" in the screenshot example), 0 for the other tokens, and 2 for the padded positions.
Training section
def train_fn(data_loader, model, optimizer):
    # Put the model in training mode and run one pass over the data loader.
    model.train()
    train_loss = 0
    for index, dataset in enumerate(tqdm(data_loader, total=len(data_loader))):
        batch_input_ids = dataset['ids'].to(config['device'], dtype=torch.long)
        batch_att_mask = dataset['att_mask'].to(config['device'], dtype=torch.long)
        batch_tok_type_id = dataset['tok_type_id'].to(config['device'], dtype=torch.long)
        batch_target = dataset['target'].to(config['device'], dtype=torch.long)

        output = model(batch_input_ids,
                       token_type_ids=None,
                       attention_mask=batch_att_mask,
                       labels=batch_target)
        step_loss = output[0]
        prediction = output[1]

        step_loss.sum().backward()
        optimizer.step()
        train_loss += step_loss
        optimizer.zero_grad()

    return train_loss.sum()
def eval_fn(data_loader, model):
    model.eval()
    eval_loss = 0
    predictions = np.array([], dtype=np.int64).reshape(0, config['MAX_LEN'])
    true_labels = np.array([], dtype=np.int64).reshape(0, config['MAX_LEN'])

    with torch.no_grad():
        for index, dataset in enumerate(tqdm(data_loader, total=len(data_loader))):
            batch_input_ids = dataset['ids'].to(config['device'], dtype=torch.long)
            batch_att_mask = dataset['att_mask'].to(config['device'], dtype=torch.long)
            batch_tok_type_id = dataset['tok_type_id'].to(config['device'], dtype=torch.long)
            batch_target = dataset['target'].to(config['device'], dtype=torch.long)

            output = model(batch_input_ids,
                           token_type_ids=None,
                           attention_mask=batch_att_mask,
                           labels=batch_target)
            step_loss = output[0]
            eval_prediction = output[1]

            eval_loss += step_loss

            eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
            actual = batch_target.to('cpu').numpy()

            # Stack this batch's predictions and true labels (the tail of the
            # cell is assumed; it mirrors the initialisation above).
            predictions = np.concatenate((predictions, eval_prediction), axis=0)
            true_labels = np.concatenate((true_labels, actual), axis=0)

    return eval_loss.sum(), predictions, true_labels
model = transformers.BertForTokenClassification.from_pretrained('bert-base-uncased',
                                                                num_labels=len(tags_2_idx))
However, if we had labelled the data in the BIO format, there would be 5 classes
[X, B, I, O, P], so tags_2_idx = {'X': 0, 'B': 1, 'I': 2, 'O': 3, 'P': 4} and
num_labels would have been num_labels = 5.
params = model.parameters()
optimizer = torch.optim.Adam(params, lr=3e-5)

best_eval_loss = 1000000
for i in range(epoch):
    train_loss = train_fn(data_loader=train_data,
                          model=model,
                          optimizer=optimizer)
    eval_loss, eval_predictions, true_labels = eval_fn(data_loader=valid_data,
                                                       model=model)
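The remainder of the loop, where eval_loss is compared against best_eval_loss and the best weights are saved under config['model_name'], is sketched below together with the later reload from config['model_path']; this assumes plain state_dict checkpointing and is illustrative.

    # Inside the epoch loop: keep the weights with the lowest validation loss.
    if eval_loss < best_eval_loss:
        best_eval_loss = eval_loss
        torch.save(model.state_dict(), config['model_name'])

# Later, for inference, reload the saved weights onto the same architecture.
model.load_state_dict(torch.load(config['model_path'], map_location=config['device']))
model.to(config['device'])
model.eval()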
test_text_data = {}
total_files = len(glob.glob(test_data_folder))
# Read each test json into test_text_data keyed by Id (loop body assumed, mirroring the train-set read).
for i, file_path in enumerate(glob.glob(test_data_folder)):
    test_text_data[os.path.basename(file_path).replace('.json', '')] = json.load(open(file_path, 'r'))
    if (i % 1000) == 0:
        print(f"Completed {i}/{total_files}")
print("All files read")
Completed 0/4
All files read
Prediction function
For the prediction part, we pass the tokenized sentences from the test text data to the model.
[1999, 2930, 1019, 1996, 3818, 4118, 2003, 7203, 2011, 20253,
1037, 19241, 23435, 12126, 26718, 2072, 2951, 2275, 2013,
1996, 21901, 4295, 11265, 10976, 9581, 4726, 6349, 4748,
3490, 7809]
Prediction output:
The token ids are passed to the trained model, which outputs the predicted probabilities
for each token; the argmax is taken to pick the class with the maximum probability, and
it looks like this.
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Compare the prediction with the tokenized test input, and collect all the tokens that are
predicted as 1, which gives
[['tensor', 'imaging', 'dt', '##i', 'data', 'set'], ['alzheimer', 'disease', 'ne', '##uro',
'##ima', '##ging', 'initiative', 'ad', '##ni', 'database']].
Combine the sub-words starting with "##", join each element of the list with spaces, and
return them as the entities identified by the model:
"tensor imaging dti data set", "alzheimer disease neuroimaging initiative adni
database"
Note: if you labelled your dataset in the BIO format, you will have to modify the
prediction_fn() function accordingly.
In [24]: # Prediction
def prediction_fn(tokenized_sub_sentence):

    tkns = tokenized_sub_sentence
    indexed_tokens = config['tokenizer'].convert_tokens_to_ids(tkns)
    segments_ids = [0] * len(indexed_tokens)

    tokens_tensor = torch.tensor([indexed_tokens]).to(config['device'])
    segments_tensors = torch.tensor([segments_ids]).to(config['device'])

    model.eval()
    with torch.no_grad():
        logit = model(tokens_tensor,
                      token_type_ids=None,
                      attention_mask=segments_tensors)

        logit_new = logit[0].argmax(2).detach().cpu().numpy().tolist()
        prediction = logit_new[0]

    kword = ''
    kword_list = []

    # The branch conditions below are an assumed reconstruction: they distinguish
    # the first token of a predicted entity, a continuation token, and the end of
    # an entity, which is what the visible fragments of this cell do.
    for k, j in enumerate(prediction):
        if (len(prediction) > 1):
            if (j != 0) & (k == 0):
                # Entity starting at the very first token of the sentence.
                begin = tkns[k]
                kword = begin
            elif (j != 0) & (k >= 1) & (prediction[k-1] == 0):
                # First token of an entity in the middle of the sentence.
                begin = tkns[k]
                previous = tkns[k-1]
                if begin.startswith('##'):
                    kword = previous + begin[2:]
                else:
                    kword = begin
                if k == (len(prediction) - 1):
                    # The entity's first token is also the last token of the sentence.
                    kword_list.append(kword.rstrip().lstrip())
            elif (j != 0) & (k >= 1) & (prediction[k-1] != 0):
                # Continuation token of the current entity.
                inter = tkns[k]
                if inter.startswith('##'):
                    kword = kword + "" + inter[2:]
                else:
                    kword = kword + " " + inter
                if k == (len(prediction) - 1):
                    # The entity runs up to the end of the sentence.
                    kword_list.append(kword.rstrip().lstrip())
            elif (j == 0) & (k >= 1) & (prediction[k-1] != 0):
                # The entity ended at the previous token.
                kword_list.append(kword.rstrip().lstrip())
        else:
            # Single-token input.
            if (j != 0):
                begin = tkns[k]
                kword = begin
                kword_list.append(kword.rstrip().lstrip())
    return kword_list
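A quick usage sketch (the test sentence here is made up):

sub_sentence = "the analysis uses data from the census of agriculture"
tokenized_sub_sentence = config['tokenizer'].tokenize(sub_sentence)
print(prediction_fn(tokenized_sub_sentence))   # e.g. ['census of agriculture'] if the model tags it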
def long_sent_split(long_tokens):
    '''
    Split a long list of tokens into chunks of at most 64 tokens
    so they fit comfortably within BERT's input limit.
    '''
    start = 0
    end = len(long_tokens)
    max_length = 64

    final_long_tok_split = []
    for i in range(start, end, max_length):
        temp = long_tokens[i: (i + max_length)]
        final_long_tok_split.append(temp)
    return final_long_tok_split
def get_test_predictions(data_dict):   # hypothetical wrapper name; the original def line is not shown
    results = {}
    for i, Id in enumerate(data_dict.keys()):
        current_id_predictions = []
        # print(Id)
        sentences = data_joining(data_dict[Id])
        sentence_tokens = sent_tokenize(sentences)

        for sub_sentence in sentence_tokens:
            # Tokenize each sub-sentence before prediction (assumed step).
            tokenized_sub_sentence = config['tokenizer'].tokenize(sub_sentence)

            if len(tokenized_sub_sentence) == 0:
                # The tokenized sentence is empty.
                sub_sentence_prediction_kword_list = []
            else:
                # The tokenized sentence may be longer than 512 tokens, so split
                # it into shorter chunks before prediction.
                long_sent_kword_list = []
                tokenized_sub_sentence_tok_split = long_sent_split(tokenized_sub_sentence)
                for j, sent_tok in enumerate(tokenized_sub_sentence_tok_split):
                    if len(sent_tok) != 0:
                        kword_list = prediction_fn(sent_tok)
                        long_sent_kword_list.append(kword_list)
                flat_long_sent_kword = [item for sublist in long_sent_kword_list for item in sublist]
                sub_sentence_prediction_kword_list = flat_long_sent_kword

            if len(sub_sentence_prediction_kword_list) != 0:
                current_id_predictions = current_id_predictions + sub_sentence_prediction_kword_list

        results[Id] = list(set(current_id_predictions))
    return results
Out[28]:
       Id                                     PredictionString
2      2f392438-e215-4169-bebf-21ac4ff253e1   mathematics|istituti d arte|organization for e...