Python code for BERT-based named entity recognition
Date: 2023-11-09 09:06:59
Below is an example of BERT-based named entity recognition in Python:
```python
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import tokenization  # tokenization.py from the official BERT repository

# Load the BERT model from TensorFlow Hub
module_url = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

# Load the matching WordPiece tokenizer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

# Load the data (the classic "ner_dataset.csv" layout:
# columns "Sentence #", "Word", "POS", "Tag")
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
sentences = data.groupby("Sentence #")["Word"].apply(list).tolist()
sentence_tags = data.groupby("Sentence #")["Tag"].apply(list).tolist()
tag_values = sorted(data["Tag"].unique())
tag2idx = {t: i for i, t in enumerate(tag_values)}

# Define the model inputs
max_seq_length = 128
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# NER is a token-level task: add a classification layer over every
# position of the sequence output, not only the [CLS] vector
out = tf.keras.layers.Dense(len(tag_values), activation="softmax")(sequence_output)

# Compile the model
model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Convert one sentence (words plus per-word tags) into BERT input
# features, aligning each WordPiece with its word's tag
def convert_example(words, tags):
    tokens, label_ids = ["[CLS]"], [tag2idx["O"]]
    for word, tag in zip(words, tags):
        pieces = tokenizer.tokenize(word) or ["[UNK]"]
        tokens.extend(pieces)
        label_ids.extend([tag2idx[tag]] * len(pieces))
    # Truncate long sentences and close with [SEP]
    tokens = tokens[: max_seq_length - 1] + ["[SEP]"]
    label_ids = label_ids[: max_seq_length - 1] + [tag2idx["O"]]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    # The mask marks real tokens with 1 and padding with 0, so it
    # must be built before padding
    mask = [1] * len(token_ids) + [0] * (max_seq_length - len(token_ids))
    token_ids += [0] * (max_seq_length - len(token_ids))
    label_ids += [tag2idx["O"]] * (max_seq_length - len(label_ids))
    return token_ids, mask, [0] * max_seq_length, label_ids

# Convert the whole dataset to BERT input format
input_ids, input_masks, all_segment_ids, labels = [], [], [], []
for words, tags in zip(sentences, sentence_tags):
    ids, mask, segs, lab = convert_example(words, tags)
    input_ids.append(ids)
    input_masks.append(mask)
    all_segment_ids.append(segs)
    labels.append(lab)

# Train the model
history = model.fit(
    [np.array(input_ids), np.array(input_masks), np.array(all_segment_ids)],
    np.array(labels),
    validation_split=0.2, epochs=3, batch_size=32)

# Predict tags for raw text
def predict_tags(text):
    words = text.split()
    ids, mask, segs, _ = convert_example(words, ["O"] * len(words))
    probs = model.predict([np.array([ids]), np.array([mask]), np.array([segs])])[0]
    pred = probs.argmax(axis=-1)
    n = sum(mask)
    return [tag_values[i] for i in pred[1 : n - 1]]  # drop [CLS]/[SEP]

# Test the model
test_text = "John Smith works at Microsoft in Seattle."
print(predict_tags(test_text))
```
This code loads the BERT model and its WordPiece tokenizer, defines the TensorFlow input tensors, and adds a token-level softmax classification layer over BERT's sequence output, which is what named entity recognition requires. The model is compiled with sparse categorical crossentropy loss and the Adam optimizer. The data is loaded from a CSV file, reassembled into sentences with per-word tags, and converted into BERT input format before training. Finally, the code runs the model on a sample sentence and prints the predicted tag for each word.
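To see the input format in isolation, here is a minimal sketch of the padding-and-masking step using a toy whitespace tokenizer and made-up vocabulary IDs in place of BERT's WordPiece tokenizer (a simplification for illustration only):

```python
# Minimal sketch of BERT-style input construction. The whitespace
# tokenizer and the vocabulary IDs below are illustrative assumptions,
# not real BERT WordPiece behavior or vocabulary entries.
MAX_SEQ_LENGTH = 8

def toy_convert(text, vocab):
    tokens = ["[CLS]"] + text.split() + ["[SEP]"]
    tokens = tokens[:MAX_SEQ_LENGTH]          # truncate long inputs
    token_ids = [vocab[t] for t in tokens]
    # The mask is 1 for real tokens, 0 for padding -- built BEFORE padding
    input_mask = [1] * len(token_ids) + [0] * (MAX_SEQ_LENGTH - len(token_ids))
    token_ids = token_ids + [0] * (MAX_SEQ_LENGTH - len(token_ids))
    segment_ids = [0] * MAX_SEQ_LENGTH        # single-sentence input: all zeros
    return token_ids, input_mask, segment_ids

vocab = {"[CLS]": 101, "[SEP]": 102, "John": 5, "works": 6, "here": 7}
ids, mask, segs = toy_convert("John works here", vocab)
print(ids)   # [101, 5, 6, 7, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Building the mask before padding is the important detail: a mask computed after padding would be all ones, and BERT's attention would then attend to the padding positions.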