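(2) Create a `Vocab` class:
- In `__init__`, build the vocabulary by sorting words by frequency.
- In `__init__`, define the variable `idx_to_token` to map indices to tokens.
- In `__init__`, define the variable `token_to_idx` to map tokens to indices.
- Define a `convert_token_to_indices` method that converts a sequence of tokens into a sequence of indices.
- Define a `convert_indices_to_tokens` method that performs the inverse conversion, from indices back to tokens.
- Define a `__len__` method that returns the size of the vocabulary.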
Posted: 2023-06-18 09:02:47
Below is an example implementation:
```python
from collections import Counter


class Vocab:
    def __init__(self, texts, min_freq=0):
        # Count word frequencies over the whitespace-tokenized texts
        counter = Counter()
        for text in texts:
            counter.update(text.split())
        # Sort by frequency, highest first (ties keep first-seen order)
        self.vocab_freq = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        # Keep only words whose frequency is at least min_freq
        self.vocab_freq = [item for item in self.vocab_freq if item[1] >= min_freq]
        # Build the vocabulary
        self.vocab = [item[0] for item in self.vocab_freq]
        # Index 0 is reserved for padding
        self.pad_index = 0
        # Index 1 is reserved for unknown tokens
        self.unk_index = 1
        # Map indices to tokens
        self.idx_to_token = ['[PAD]', '[UNK]'] + self.vocab
        # Map tokens to indices
        self.token_to_idx = {token: idx for idx, token in enumerate(self.idx_to_token)}

    def convert_token_to_indices(self, tokens):
        # Convert a token sequence to indices; unseen tokens map to unk_index
        return [self.token_to_idx.get(token, self.unk_index) for token in tokens]

    def convert_indices_to_tokens(self, indices):
        # Convert an index sequence back to tokens
        return [self.idx_to_token[index] for index in indices]

    def __len__(self):
        # Vocabulary size, including the [PAD] and [UNK] entries
        return len(self.idx_to_token)
```
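A note on ordering: the implementation sorts only by frequency, so words with the same count keep the order in which they were first seen (Python's `sorted` is stable, also with `reverse=True`). If an ordering that does not depend on input order is preferred, one option is to break ties alphabetically. The sketch below is only an illustration; `sort_by_frequency` is a hypothetical helper, not part of the class above.

```python
from collections import Counter


def sort_by_frequency(counter):
    # Hypothetical helper: highest frequency first, ties broken alphabetically
    return sorted(counter.items(), key=lambda item: (-item[1], item[0]))


counter = Counter("hello world hello python python is is".split())

# Frequency-only sort, as in the class above: ties keep first-seen order
stable = sorted(counter.items(), key=lambda x: x[1], reverse=True)
print(stable)         # [('hello', 2), ('python', 2), ('is', 2), ('world', 1)]

# Tie-broken sort: the result no longer depends on input order
deterministic = sort_by_frequency(counter)
print(deterministic)  # [('hello', 2), ('is', 2), ('python', 2), ('world', 1)]
```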
Usage:
```python
texts = [
    'hello world',
    'hello python',
    'python is a programming language',
    'programming is fun',
]
vocab = Vocab(texts, min_freq=2)
print('Vocabulary:', vocab.vocab)
print('Index to token:', vocab.idx_to_token)
print('Token to index:', vocab.token_to_idx)
tokens = 'hello python is fun'.split()
indices = vocab.convert_token_to_indices(tokens)
print('Tokens:', tokens)
print('Indices:', indices)
print('Converted back:', vocab.convert_indices_to_tokens(indices))
print('Vocabulary size:', len(vocab))
```
Output:
```
Vocabulary: ['hello', 'python', 'is', 'programming']
Index to token: ['[PAD]', '[UNK]', 'hello', 'python', 'is', 'programming']
Token to index: {'[PAD]': 0, '[UNK]': 1, 'hello': 2, 'python': 3, 'is': 4, 'programming': 5}
Tokens: ['hello', 'python', 'is', 'fun']
Indices: [2, 3, 4, 1]
Converted back: ['hello', 'python', 'is', '[UNK]']
Vocabulary size: 6
```
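The class reserves index 0 for `[PAD]` but never uses it. Padding usually comes in when several index sequences of different lengths are batched together. As a rough sketch of that step, assuming a pad index of 0 as in the class above, a hypothetical `pad_sequences` helper (not part of the original answer) could look like this:

```python
def pad_sequences(sequences, pad_index=0, max_len=None):
    # Hypothetical helper: right-pad (or truncate) every sequence to max_len.
    # When max_len is not given, the longest sequence sets the length.
    if max_len is None:
        max_len = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]
        padded.append(seq + [pad_index] * (max_len - len(seq)))
    return padded


# Index sequences such as those returned by convert_token_to_indices
batch = [[5, 2, 1], [3]]
padded = pad_sequences(batch)
print(padded)  # [[5, 2, 1], [3, 0, 0]]
```

Because the pad index maps back to `[PAD]` in `idx_to_token`, padded positions remain recognizable when the indices are converted back to tokens.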