Code for Named Entity Recognition with BERT+CRF
Date: 2023-12-09 17:03:52
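Below is code for named entity recognition based on BERT+CRF. It uses Hugging Face's Transformers library and TensorFlow 2.0, together with the CRF layer from TensorFlow Addons.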
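First, install the Transformers library, along with TensorFlow and TensorFlow Addons, which the code below imports. TensorFlow Addons only works with the TensorFlow 2.x releases it was built against, so check its compatibility table before choosing versions:
```
pip install transformers tensorflow tensorflow-addons
```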
Then the following code implements named entity recognition based on BERT+CRF:
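```
import numpy as np
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertModel
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow_addons.layers import CRF
from tensorflow_addons.text import crf_log_likelihood

MAX_LEN = 128  # maximum sequence length in tokens; adjust to your data

# Load the BERT model and tokenizer (the fast tokenizer provides word_ids() for label alignment)
bert_model = TFBertModel.from_pretrained('bert-base-chinese')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

# Load the data: one "word tag" pair per line, with a blank line between sentences
def load_data(path):
    sentences, labels = [], []
    words, tags = [], []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                word, tag = line.split()
                words.append(word)
                tags.append(tag)
            elif words:
                sentences.append(words)
                labels.append(tags)
                words, tags = [], []
    if words:  # keep the last sentence when the file has no trailing blank line
        sentences.append(words)
        labels.append(tags)
    return sentences, labels

train_sentences, train_labels = load_data('train.txt')
test_sentences, test_labels = load_data('test.txt')

# Build the tag vocabulary from the training set; 'O' always gets id 0
tag_names = ['O'] + sorted({t for tags in train_labels for t in tags} - {'O'})
tag2id = {tag: i for i, tag in enumerate(tag_names)}
id2tag = dict(enumerate(tag_names))
num_tags = len(tag_names)

# Convert the data to BERT input format and align one label id with every token
def encode(sentences, labels):
    enc = tokenizer(sentences, is_split_into_words=True, truncation=True,
                    padding='max_length', max_length=MAX_LEN, return_tensors='np')
    # [CLS], [SEP] and padding positions keep id 0 ('O')
    y = np.zeros((len(sentences), MAX_LEN), dtype=np.int32)
    for i, tags in enumerate(labels):
        for j, word_idx in enumerate(enc.word_ids(batch_index=i)):
            if word_idx is not None:  # word pieces inherit the tag of their word
                y[i, j] = tag2id.get(tags[word_idx], 0)
    x = {k: enc[k].astype(np.int32)
         for k in ('input_ids', 'attention_mask', 'token_type_ids')}
    return x, y

train_x, train_y = encode(train_sentences, train_labels)
test_x, test_y = encode(test_sentences, test_labels)

# Build the model: BERT -> dropout -> linear emission scores -> CRF
class BertCrfTagger(Model):
    def __init__(self, bert, num_tags):
        super().__init__()
        self.bert = bert
        self.dropout = Dropout(0.1)
        self.emissions = Dense(num_tags)  # linear scores; no ReLU before the CRF
        self.crf = CRF(num_tags)

    def call(self, inputs, training=False):
        hidden = self.bert(input_ids=inputs['input_ids'],
                           attention_mask=inputs['attention_mask'],
                           token_type_ids=inputs['token_type_ids'],
                           training=training)[0]
        scores = self.emissions(self.dropout(hidden, training=training))
        mask = tf.cast(inputs['attention_mask'], tf.bool)
        # Returns [decoded_tags, potentials, sequence_lengths, transition_matrix]
        return self.crf(scores, mask=mask)

    def crf_loss(self, x, y, training):
        # Negative log-likelihood of the gold tag sequences under the CRF
        _, potentials, seq_len, transitions = self(x, training=training)
        log_likelihood, _ = crf_log_likelihood(potentials, y, seq_len, transitions)
        return -tf.reduce_mean(log_likelihood)

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            loss = self.crf_loss(x, y, training=True)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {'loss': loss}

    def test_step(self, data):
        x, y = data
        return {'loss': self.crf_loss(x, y, training=False)}

    def predict_step(self, data):
        x = data[0] if isinstance(data, tuple) else data
        return self(x, training=False)[0]  # Viterbi-decoded tag ids only

model = BertCrfTagger(bert_model, num_tags)
model.compile(optimizer=Adam(learning_rate=1e-5))

# Train the model
history = model.fit(
    x=train_x,
    y=train_y,
    validation_data=(test_x, test_y),
    batch_size=32,
    epochs=10
)

# Predict on the test set
test_pred = model.predict(test_x, batch_size=32)

# Print the results on the test set, skipping [CLS], [SEP] and padding
for i, words in enumerate(test_sentences):
    n = int(test_x['attention_mask'][i].sum())
    print(' '.join(words))
    print('True:', [id2tag[int(t)] for t in test_y[i, 1:n - 1]])
    print('Pred:', [id2tag.get(int(t), 'O') for t in test_pred[i, 1:n - 1]])
```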
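In this code, we use Hugging Face's Transformers library to load the BERT model and tokenizer and to convert the data into BERT's input format. We build the BERT+CRF named entity recognition model with the Keras API of TensorFlow 2.0, with a CRF layer as the final layer. Finally, we train the model with the Keras API, run it on the test set, and print the predictions.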
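To make the role of the final CRF layer more concrete, here is a minimal NumPy sketch of Viterbi decoding, the kind of search a CRF layer performs at prediction time. It is an illustration under simplifying assumptions, not the TensorFlow Addons implementation: the function name `viterbi_decode`, the three-tag scheme, and the toy scores are all made up for this example, and start/end boundary scores are left out.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions:   (seq_len, num_tags) per-token tag scores
    transitions: (num_tags, num_tags); transitions[i, j] scores tag i -> tag j
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()  # best score of any path ending in each tag
    backpointers = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in tag i, followed by tag j at step t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Walk the backpointers from the best final tag to recover the path
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1], float(score.max())


# Toy example with hypothetical tags: 0 = O, 1 = B-PER, 2 = I-PER
emissions = np.array([[2.0, 1.5, 0.0],   # token 0: O scores slightly above B-PER
                      [0.0, 0.0, 2.0]])  # token 1: clearly I-PER
transitions = np.zeros((3, 3))
transitions[0, 2] = -10.0                # strongly penalize the invalid O -> I-PER

greedy = emissions.argmax(axis=1).tolist()
path, score = viterbi_decode(emissions, transitions)
print('greedy :', greedy)
print('viterbi:', path, score)
```

Picking the best tag per token independently gives O followed by I-PER, which is not a valid BIO sequence. Because the transition scores enter the search, the decoded path starts with B-PER instead. This is the main reason to put a CRF on top of BERT rather than a per-token softmax.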
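Note that in practice the code has to be adapted to the situation at hand. For example, make sure `num_tags` matches the number of labels in your dataset, adjust the `load_data` function to your dataset's file format, and so on.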
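As one possible adaptation, the sketch below parses a multi-column, CoNLL-style file and derives the tag vocabulary and `num_tags` from the data instead of hard-coding them. The helper names `parse_tagged_lines` and `build_tag_vocab`, the column layout (token first, tag last), and the sample lines are assumptions for illustration; your dataset may be laid out differently.

```python
def parse_tagged_lines(lines):
    """Group 'token ... tag' lines into sentences; blank lines separate sentences."""
    sentences, labels = [], []
    words, tags = [], []
    for line in lines:
        parts = line.split()
        if parts:
            words.append(parts[0])   # first column: token
            tags.append(parts[-1])   # last column: tag; columns in between are ignored
        elif words:
            sentences.append(words)
            labels.append(tags)
            words, tags = [], []
    if words:                        # input without a trailing blank line
        sentences.append(words)
        labels.append(tags)
    return sentences, labels


def build_tag_vocab(labels):
    """Map every tag to an integer id, reserving id 0 for 'O'."""
    names = ['O'] + sorted({t for seq in labels for t in seq} - {'O'})
    return {name: i for i, name in enumerate(names)}


# Hypothetical three-column sample: token, part-of-speech, entity tag
sample = [
    "李 NR B-PER", "明 NR I-PER", "在 P O", "北 NS B-LOC", "京 NS I-LOC", "",
    "你 PN O", "好 VA O",
]
sentences, labels = parse_tagged_lines(sample)
tag2id = build_tag_vocab(labels)
num_tags = len(tag2id)
print(sentences)
print(tag2id, num_tags)
```

To read from disk, pass an open file object instead of the list, since iterating over a file yields its lines. Reserving id 0 for `O` keeps padding positions and unknown tags on the "outside" label.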