Building One-hot Word Vectors with sklearn
Date: 2023-06-14 20:06:13
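sklearn's CountVectorizer and TfidfVectorizer can be used to build one-hot-style (bag-of-words) word vectors. Strictly speaking, CountVectorizer produces term counts and TfidfVectorizer produces TF-IDF weights; a pure 0/1 presence vector is obtained with CountVectorizer(binary=True).
First, use CountVectorizer to convert the text into term-count vectors. The code is as follows: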
```python
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())
```
The output is:
```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
```
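Here, vectorizer.get_feature_names_out() (called get_feature_names() before scikit-learn 1.0) returns the vocabulary, i.e. every distinct token found in the corpus, in column order. By default the text is lowercased and punctuation is dropped. fit_transform returns a sparse matrix; X.toarray() converts it to a dense array in which each row is one document and each entry is the count of the corresponding word in that document.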
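Since the counts above are not one-hot in the strict sense, the following is a minimal sketch of two stricter readings, assuming the same corpus: a 0/1 presence vector per document via `binary=True`, and a one-hot vector per word taken from an identity matrix indexed by `vocabulary_`. The names `binary_vectorizer`, `word_one_hot`, `first_vec`, and the sentence in `repeated` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# binary=True records presence (1) or absence (0) instead of raw counts.
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each word to its column index.
vocab = binary_vectorizer.vocabulary_

# One-hot vector per word: row i of the identity matrix has a single 1 at column i.
word_one_hot = np.eye(len(vocab), dtype=int)
first_vec = word_one_hot[vocab['first']]

# Illustrative sentence (an assumption) in which "document" occurs twice,
# to show where counts and binary indicators differ.
repeated = ['This document is the second document.']
counts = CountVectorizer().fit_transform(repeated).toarray()
binary = CountVectorizer(binary=True).fit_transform(repeated).toarray()

print(X_binary[0])   # presence vector of the first document
print(first_vec)     # one-hot vector of the word "first"
print(counts, binary)
```

In this particular corpus no word repeats within a document, so the binary matrix equals the count matrix; the two only diverge for text like the illustrative sentence, where "document" is counted twice but flagged once.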
Next, use TfidfVectorizer to compute the TF-IDF value of each word. The code is as follows:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()  # defaults: smoothed IDF, L2-normalized rows
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray().round(4))  # rounded to 4 decimals for readability
```
The output is:
```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.     0.4698 0.5803 0.3841 0.     0.     0.3841 0.     0.3841]
 [0.     0.428  0.     0.3499 0.     0.6705 0.3499 0.     0.3499]
 [0.5118 0.     0.     0.2671 0.5118 0.     0.2671 0.5118 0.2671]
 [0.     0.4698 0.5803 0.3841 0.     0.     0.3841 0.     0.3841]]
```
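As can be seen, TfidfVectorizer outputs TF-IDF weights rather than counts. Words that occur in every document, such as "this", "is", and "the", receive the minimum IDF of 1 under scikit-learn's default smoothed formula, so within each row their weights are lower than those of rarer words such as "first" or "second".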
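To make the weighting concrete, here is a rough sketch that recomputes the TF-IDF matrix by hand and compares it with TfidfVectorizer. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, raw counts as TF), under which the IDF is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts (TF) per document.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears.
df = (counts > 0).sum(axis=0)

# Smoothed IDF as used by scikit-learn's defaults.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then L2-normalize each row.
tfidf = counts * idf
tfidf_manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))  # True under the default settings
print(idf.round(4))
```

Words present in all four documents get df = n, so their IDF is ln(1) + 1 = 1, the smallest value possible; this is why "this", "is", and "the" end up with the lowest nonzero weights in each row.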