Generate sentence vectors with BERT in one line of code; use BERT for text classification and text similarity
Generating a sentence vector with a BERT model is straightforward. The steps are as follows:
1. Install the BERT-related Python library by running pip install pytorch-pretrained-bert in a terminal.
2. Import the required libraries and load the pre-trained BERT model, as shown below:
from pytorch_pretrained_bert import BertTokenizer, BertModel
import torch
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
3. Define the input text and split it into tokens with the BERT tokenizer. Then convert the tokens into BERT vocabulary ids and wrap them in a PyTorch tensor:
text = "Here is some text to encode"
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
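Note that BERT is pretrained with special [CLS] and [SEP] markers around every sequence, so adding them before converting to ids usually gives better sentence vectors; a minimal variant of the step above:
tokenized_text = ['[CLS]'] + tokenizer.tokenize(text) + ['[SEP]']
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])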
4. Pass tokens_tensor to the BERT model to obtain the hidden states of every layer. Take the last layer's hidden state of each token and mean-pool them into a single vector, which serves as the sentence vector:
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor)
# encoded_layers is a list with one tensor per layer, each of shape [1, seq_len, 768].
# Stack them into a single tensor: [num_layers, 1, seq_len, 768].
token_embeddings = torch.stack(encoded_layers, dim=0)
# Remove dimension 1, the "batches": [num_layers, seq_len, 768].
token_embeddings = torch.squeeze(token_embeddings, dim=1)
# Keep only the last layer: [seq_len, 768].
token_embeddings = token_embeddings[-1]
# Mean-pool over tokens to obtain a single 768-dimensional sentence vector.
sentence_embedding = torch.mean(token_embeddings, dim=0)
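Wrapping these steps in a helper makes the "one line of code" usage from the title concrete. encode_sentence below is a hypothetical convenience function, a minimal sketch that assumes the tokenizer and model loaded above:
def encode_sentence(text):
    # Tokenize, map to vocabulary ids, and add a batch dimension.
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))])
    with torch.no_grad():
        layers, _ = model(ids)
    # Mean-pool the last layer over tokens -> one 768-dimensional vector.
    return layers[-1].squeeze(0).mean(dim=0)

sentence_embedding = encode_sentence('Here is some text to encode')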
As for using BERT for text classification and text similarity, the standard approach is fine-tuning. The steps are as follows:
1. Prepare the training and test sets, for example in the format sketched below.
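A minimal sketch of what training_data could look like for a binary classification task; the 'text' and 'labels' keys and the example sentences are assumptions made purely for illustration:
training_data = [
    {'text': 'I like to play football', 'labels': 1},  # hypothetical positive example
    {'text': 'I dislike rainy weather', 'labels': 0},  # hypothetical negative example
]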
2. Load the pre-trained BERT model and replace its output layer with a head for the task at hand:
from pytorch_pretrained_bert import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
3. Train the model. A standard PyTorch optimizer such as Adam works fine; once training is done, the model can be saved (see the sketch after the loop).
from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=1e-5)
num_epochs = 3  # assumed value; tune for your task
model.train()
for epoch in range(num_epochs):
    for data in training_data:
        optimizer.zero_grad()
        text = data['text']
        label = data['labels']
        tokens = tokenizer.tokenize(text)
        indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
        tokens_tensor = torch.tensor([indexed_tokens])
        labels_tensor = torch.tensor([label])
        # With labels provided, BertForSequenceClassification returns the loss directly.
        loss = model(tokens_tensor, labels=labels_tensor)
        loss.backward()
        optimizer.step()
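After training, the weights can be saved and the classifier used for inference; a minimal sketch, where the file name bert_cls.bin is an arbitrary choice:
torch.save(model.state_dict(), 'bert_cls.bin')
# Predict a label for a new sentence (no labels passed -> the model returns logits).
model.eval()
test_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('Football is my favorite sport'))
with torch.no_grad():
    logits = model(torch.tensor([test_ids]))
predicted_label = torch.argmax(logits, dim=1).item()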
4. For text similarity, the fine-tuned BERT encoder can be used to compute the cosine similarity between sentence vectors. Inside BertForSequenceClassification the encoder is exposed as model.bert:
from scipy.spatial.distance import cosine
text1 = 'I like to play football'
text2 = 'Football is my favorite sport'
tokens1 = tokenizer.tokenize(text1)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokens1)
tokens_tensor1 = torch.tensor([indexed_tokens1])
tokens2 = tokenizer.tokenize(text2)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokens2)
tokens_tensor2 = torch.tensor([indexed_tokens2])
model.eval()
with torch.no_grad():
    # model.bert is the underlying encoder of the fine-tuned classifier.
    encoded_layers1, _ = model.bert(tokens_tensor1)
    encoded_layers2, _ = model.bert(tokens_tensor2)
# Mean-pool the last layer of each sentence into a single 768-dimensional vector.
sentence_embedding1 = encoded_layers1[-1].squeeze(0).mean(dim=0)
sentence_embedding2 = encoded_layers2[-1].squeeze(0).mean(dim=0)
# scipy's cosine() is a distance, so similarity = 1 - distance.
similarity_score = 1 - cosine(sentence_embedding1, sentence_embedding2)
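The resulting similarity_score lies in [-1, 1]; the closer it is to 1, the more similar the two sentences are.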