Choosing a corpus and word vectors, with complete code for pretrained models
Date: 2024-02-14 21:07:39
Code for choosing the corpus and training word vectors:
1. Corpus
```python
import nltk
# Download the Brown corpus (only needed once)
nltk.download('brown')
# Load the corpus as a sequence of tokenized sentences
from nltk.corpus import brown
sentences = brown.sents()
```
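Before training word vectors, the tokenized sentences are often normalized. The following is a minimal sketch of one such step, lowercasing tokens and dropping tokens made only of punctuation. The helper `normalize_sentences` and the `sample` data are hypothetical; `sample` stands in for `brown.sents()` so the block runs without downloading anything.

```python
import string

def normalize_sentences(sentences):
    """Lowercase tokens and drop tokens made only of punctuation."""
    punct = set(string.punctuation)
    cleaned = []
    for sent in sentences:
        tokens = [tok.lower() for tok in sent
                  if not all(ch in punct for ch in tok)]
        if tokens:  # skip sentences that end up empty
            cleaned.append(tokens)
    return cleaned

# Hypothetical sample in the same list-of-token-lists shape as brown.sents()
sample = [
    ["The", "Fulton", "County", "Grand", "Jury", "said", "."],
    ["``", "''"],
]
print(normalize_sentences(sample))
```

Whether to lowercase or strip punctuation depends on the downstream task, so treat this as one possible choice rather than a required step.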
2. Word vectors
```python
import gensim
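# Train word vectors (gensim >= 4.0 uses vector_size; it was called size in 3.x)
model = gensim.models.Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)
# Save the trained model
model.save('word2vec.model')
# Load the trained model
model = gensim.models.Word2Vec.load('word2vec.model')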
```
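Trained word vectors are usually compared with cosine similarity, which is what gensim's `model.wv.similarity` reports. The sketch below computes it by hand in plain Python to show what that number means. The function name `cosine_similarity` and the three toy vectors are hypothetical; real vectors would come from `model.wv["word"]`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors for illustration only
king = [1.0, 2.0, 3.0]
queen = [2.0, 4.0, 6.0]   # same direction as king
apple = [3.0, 0.0, -1.0]  # orthogonal to king

print(cosine_similarity(king, queen))
print(cosine_similarity(king, apple))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.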
Complete code for the pretrained models:
1. BERT
```python
from transformers import BertTokenizer, BertModel
import torch
# Load the pretrained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Input text
text = "Hello, my dog is cute"
# Tokenize and convert to PyTorch tensors
inputs = tokenizer(text, return_tensors='pt')
# Run a forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
# Per-token vectors, shape (batch, tokens, hidden size)
last_hidden_states = outputs.last_hidden_state
```
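`last_hidden_state` holds one vector per token, so a single sentence vector needs a pooling step. One common option (not the only one) is to average the token vectors while ignoring padding positions. The sketch below shows that idea on plain Python lists so it runs without loading a model; `mean_pool`, `hidden`, and `mask` are hypothetical names, and with tensors the equivalent would be a masked sum divided by the sum of the attention mask.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of tokens whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical 2-dimensional hidden states for 3 positions; the last is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)
```

Masking matters once inputs are batched and padded; without it the padding vectors would pull the average away from the real tokens.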
2. GPT (GPT-2)
```python
from transformers import GPT2Tokenizer, GPT2Model
import torch
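# Load the pretrained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
# Input text
text = "Hello, my dog is cute"
# Tokenize and convert to PyTorch tensors
inputs = tokenizer(text, return_tensors='pt')
# Run a forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
# Per-token vectors, shape (batch, tokens, hidden size)
last_hidden_states = outputs.last_hidden_state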
```
3. XLNet
```python
from transformers import XLNetTokenizer, XLNetModel
import torch
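# Load the pretrained model and tokenizer (XLNetTokenizer needs the sentencepiece package)
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')
# Input text
text = "Hello, my dog is cute"
# Tokenize and convert to PyTorch tensors (calling the tokenizer directly replaces encode_plus)
inputs = tokenizer(text, return_tensors='pt')
# Run a forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
# Per-token vectors, shape (batch, tokens, hidden size)
last_hidden_states = outputs.last_hidden_state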
```