The code above triggers this warning: the weights of BertModel were not initialized from the model checkpoint, which means the model should be trained on its downstream task and the trained model then used for prediction and inference. The suggested steps are: 1. Collect training data for your downstream task. 2. Modify the BERT configuration as needed. 3. Load the pretrained model and fine-tune it on your task. 4. Evaluate the trained model on validation and test data. 5. Use the trained model for prediction and inference. How should the code above be optimized? Please give the optimized code.
As the warning indicates, the model needs to be fine-tuned, so fine-tuning code has to be added. The optimized code is given below:
```
import jieba
import torch
from transformers import BertTokenizer, BertModel, BertConfig, AdamW, get_linear_schedule_with_warmup
# Path to the custom vocabulary (its line count gives the vocabulary size)
vocab_path = "output/user_vocab.txt"
count = 0
with open(vocab_path, 'r', encoding='utf-8') as file:
    for line in file:
        count += 1
user_vocab = count
# Seed words
seed_words = ['姓名']
# Load the Weibo text data
text_data = []
with open("output/weibo_data(small).txt", "r", encoding="utf-8") as f:
for line in f:
text_data.append(line.strip())
# Load the BERT tokenizer with the custom vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', vocab_file=vocab_path)
config = BertConfig.from_pretrained("bert-base-chinese", vocab_size=user_vocab)
# Load the BERT model (weights with mismatched sizes, e.g. the embeddings for the custom vocabulary, are re-initialized)
model = BertModel.from_pretrained('bert-base-chinese', config=config, ignore_mismatched_sizes=True)
# Fine-tuning code added below
# Fine-tuning hyperparameters
epochs = 3
batch_size = 32
learning_rate = 2e-5
warmup_steps = 100
max_length = 128
# Define the optimizer and learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate, correct_bias=False)
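# total_steps approximates the number of optimizer updates over all epochs (the last partial batch is ignored)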
total_steps = len(text_data) * epochs // batch_size
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
# Convert the text data into tensors
def create_tensors(texts, labels):
    input_ids = []
    attention_masks = []
    for text in texts:
        encoded_dict = tokenizer.encode_plus(
            text,
            add_special_tokens=True,     # add special tokens such as [CLS] and [SEP]
            max_length=max_length,       # maximum sequence length
            padding='max_length',        # pad shorter sequences up to max_length
            truncation=True,             # truncate longer sequences to max_length
            return_attention_mask=True,  # return the attention mask
            return_tensors='pt'          # return PyTorch tensors
        )
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)
    return input_ids, attention_masks, labels
# Define the fine-tuning function
def fine_tune(text_data):
    model.train()
    for epoch in range(epochs):
        for i in range(0, len(text_data), batch_size):
            batch_texts = text_data[i:i+batch_size]
            # labels are dummies (all 0); they are not used by the objective below
            input_ids, attention_masks, labels = create_tensors(batch_texts, [0] * len(batch_texts))
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_masks)
            last_hidden_state = outputs[0]
            # use the [CLS] hidden state as the sentence representation
            pooled_output = last_hidden_state[:, 0, :]
            # model is a plain BertModel, so its pooler is model.pooler (not model.bert.pooler);
            # broadcast the shapes so the similarity is computed against each pooler weight column
            logits = torch.cosine_similarity(
                pooled_output.unsqueeze(1),                # (batch, 1, hidden)
                model.pooler.dense.weight.T.unsqueeze(0),  # (1, hidden, hidden)
                dim=2
            )
            loss = torch.mean(1 - logits)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
# Run the fine-tuning
fine_tune(text_data)
# Build the privacy word lexicon
privacy_words = set()
privacy_words_sim = set()
# Encode the seed words once so every candidate word can be compared against them
model.eval()
seed_encoded_layers = []
with torch.no_grad():
    for seed_word in seed_words:
        seed_tokens = ["[CLS]", seed_word, "[SEP]"]
        seed_token_ids = tokenizer.convert_tokens_to_ids(seed_tokens)
        seed_outputs = model(torch.tensor([seed_token_ids]))
        # keep the hidden-state sequence; index 0 below is the seed's [CLS] vector
        seed_encoded_layers.append(seed_outputs[0][0])
for text in text_data:
    words = jieba.lcut(text.strip())
    tokens = ["[CLS]"] + words + ["[SEP]"]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids = [0] * len(token_ids)
    # Convert to tensors and encode the sentence with the BERT model
    token_tensor = torch.tensor([token_ids])
    segment_tensor = torch.tensor([segment_ids])
    with torch.no_grad():
        outputs = model(token_tensor, token_type_ids=segment_tensor)
        encoded_layers = outputs[0]
    # For each word, compute its cosine similarity with the seed words
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        if word in seed_words:
            continue
        if len(word) <= 1:
            continue
        sim_scores = []
        for j in range(len(seed_encoded_layers)):
            # compare the seed's [CLS] vector with the hidden state of the i-th token
            sim_scores.append(torch.cosine_similarity(seed_encoded_layers[j][0], encoded_layers[0][i], dim=0).item())
        cos_sim = sum(sim_scores) / len(sim_scores)
        print(cos_sim, word)
        if cos_sim >= 0.5:
            privacy_words.add(word)
            privacy_words_sim.add((word, cos_sim))
# Write out the privacy word lexicon
with open("output/privacy_words.txt", "w", encoding="utf-8") as f1:
    for word in privacy_words:
        f1.write(word + '\n')
with open("output/privacy_words_sim.txt", "w", encoding="utf-8") as f2:
    for word, cos_sim in privacy_words_sim:
        f2.write(word + "\t" + str(cos_sim) + "\n")
```
In the optimized code, fine-tuning has been added. Specifically, the fine-tuning hyperparameters (epochs, batch_size, learning_rate, warmup_steps and max_length) are defined, along with an optimizer and a learning-rate scheduler. The create_tensors function converts the text data into tensors, and the fine_tune function performs the fine-tuning and is then invoked. Finally, the cosine similarity between each word and the seed words is computed to build the privacy word lexicon, and the results are written to the output files.
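If the fine-tuned model is also meant to be reused for the prediction and inference step mentioned in the question, a minimal sketch of saving and reloading it could look like the following. It assumes the `model` and `tokenizer` objects from the script above; the directory name `output/finetuned_bert` is only an illustrative assumption, not part of the original code.
```
# Minimal sketch: persist the fine-tuned model and custom tokenizer for later inference.
# Assumes `model` and `tokenizer` come from the script above;
# "output/finetuned_bert" is a hypothetical directory name.
from transformers import BertModel, BertTokenizer

save_dir = "output/finetuned_bert"
model.save_pretrained(save_dir)      # writes the weights and config.json
tokenizer.save_pretrained(save_dir)  # writes the custom vocabulary files

# Later, for prediction/inference:
reloaded_tokenizer = BertTokenizer.from_pretrained(save_dir)
reloaded_model = BertModel.from_pretrained(save_dir)
reloaded_model.eval()
```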