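Use Python to put the most frequent words into a feature dictionary, with the feature words of all 31 classes stored in a single file. Then one-hot vectorize the documents of the 31 classes in the training set: represent every sentence of every document as a vector of the same length as the feature dictionary, iterate over the feature dictionary, and mark a position 1 if the sentence contains that feature word and 0 otherwise.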
Time: 2024-05-14 22:16:37 Views: 255
You can use `collections.Counter` from Python's standard library to count word frequencies, and then put the most frequent words into the feature dictionary. Example code:
```python
import os
import collections

all_word_counts = collections.Counter()
dict_file = 'feature_dict.txt'

# Count word frequencies over all training texts.
# Assumes train_data/ contains one sub-directory per class.
classes = os.listdir('train_data')
for classname in classes:
    class_path = os.path.join('train_data', classname)
    files = os.listdir(class_path)
    for filename in files:
        filepath = os.path.join(class_path, filename)
        with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
            # Lower-case so the counts match the lower-cased text used during vectorization
            text = f.read().lower()
        # Assumes the text is already segmented into whitespace-separated words
        words = text.split()
        all_word_counts.update(words)

# Take the N most frequent words
N = 1000
most_common_words = [word for (word, count) in all_word_counts.most_common(N)]

# Write the feature dictionary to a file, one word per line
with open(dict_file, 'w', encoding='utf-8') as f:
    f.write('\n'.join(most_common_words))
```
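The question asks that the feature words of all 31 classes end up in one file. The code above ranks words over the whole corpus, so a small class may contribute few or no words. A minimal sketch of a per-class alternative follows, assuming the text is already split into whitespace-separated words. The names `build_feature_dict`, `docs_by_class`, and `top_k` are hypothetical and not part of the original answer, and the toy data stands in for the real training files:

```python
from collections import Counter


def build_feature_dict(docs_by_class, top_k):
    """Merge the top_k most frequent words of every class into one ordered list."""
    features = []
    seen = set()
    for classname in sorted(docs_by_class):
        counts = Counter()
        for text in docs_by_class[classname]:
            counts.update(text.lower().split())
        for word, _ in counts.most_common(top_k):
            # Keep each word once, even if several classes share it
            if word not in seen:
                seen.add(word)
                features.append(word)
    return features


# Toy data standing in for the per-class training files (hypothetical)
docs_by_class = {
    "sports": ["ball goal ball team", "team ball"],
    "tech": ["code bug code bug", "code team"],
}
feature_words = build_feature_dict(docs_by_class, top_k=2)
print(feature_words)
```

The merged list can be written to `feature_dict.txt` with `'\n'.join(feature_words)` exactly as above. Its length is at most `top_k` times the number of classes, because words shared between classes are stored only once.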
Then the training texts can be converted to one-hot vectors with the following code, which builds one vector per file:
```python
import os
import numpy as np

dict_file = 'feature_dict.txt'

# Read the feature dictionary and map each feature word to its vector index
with open(dict_file, 'r', encoding='utf-8') as f:
    feature_dict = f.read().split()
word_to_index = {word: i for i, word in enumerate(feature_dict)}

# Build one 0/1 vector per training file, grouped by class.
# Assumes the class directories are named '0' to '30'.
num_classes = 31
class_vectors = []
for classname in range(num_classes):
    class_vectors.append([])
    class_path = os.path.join('train_data', str(classname))
    files = os.listdir(class_path)
    for filename in files:
        filepath = os.path.join(class_path, filename)
        with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
            text = f.read().lower()
        words = text.split()
        vector = np.zeros(len(feature_dict))
        for word in words:
            index = word_to_index.get(word)
            if index is not None:
                vector[index] = 1  # the file contains this feature word
        class_vectors[classname].append(vector)

# Convert each class's list of vectors into a 2-D numpy array
class_vectors = [np.array(vec_list) for vec_list in class_vectors]
```
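The code above produces one vector per file, whereas the question asks for one vector per sentence. A minimal sketch of the per-sentence variant follows. It assumes that sentences end at common Chinese or English punctuation or at a newline, and that words within a sentence are whitespace-separated. The function `sentences_to_vectors` and the sample text are hypothetical, and the split rule is deliberately rough (it would also split at a decimal point, for example):

```python
import re


def sentences_to_vectors(text, feature_words):
    """Return the sentences and one 0/1 row per sentence (one column per feature word)."""
    # Rough sentence split on Chinese/English end punctuation and newlines
    parts = re.split(r"[。！？!?.\n]+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    rows = []
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        # Walk the feature dictionary: 1 if the sentence contains the word, else 0
        rows.append([1 if word in tokens else 0 for word in feature_words])
    return sentences, rows


feature_words = ["ball", "team", "code"]
sentences, matrix = sentences_to_vectors(
    "The team lost the ball. We write code!", feature_words
)
for sentence, row in zip(sentences, matrix):
    print(row, sentence)
```

Each row has the same length as the feature dictionary, so the rows of one document can be stacked with `np.array(matrix)` if a numpy array is needed later.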