Word Vectorization in Python
Date: 2023-06-22 14:22:26
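In Python, many libraries can be used for word vectorization, that is, turning text into numeric vectors. Two of the most popular are Scikit-learn and Gensim.
The steps for word vectorization with Scikit-learn are as follows:
1. Import the library and define the dataset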
```python
from sklearn.feature_extraction.text import CountVectorizer
# Dataset: four short example documents
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
```
2. Instantiate a CountVectorizer object
```python
vectorizer = CountVectorizer()
```
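3. Vectorize the dataset (`fit_transform` learns the vocabulary and returns a sparse document-term matrix of word counts)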
```python
X = vectorizer.fit_transform(corpus)
```
4. Get the vocabulary
```python
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()
```
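To make concrete what `X` and `vocab` hold after the steps above, the following plain-Python sketch reproduces the counting by hand. It assumes CountVectorizer's default settings (lowercasing, tokens made of two or more word characters, vocabulary sorted alphabetically); the names `tokenize`, `counts`, and `matrix` are illustrative and not part of Scikit-learn.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Assumed tokenization: lowercase, keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(document):
    """Split a document into lowercase word tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(document.lower())

# One Counter per document: token -> number of occurrences.
counts = [Counter(tokenize(doc)) for doc in corpus]

# Sorted vocabulary; each word's position is its column index.
vocab = sorted(set().union(*counts))

# Dense document-term matrix: one row per document, one column per word.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary word, so the repeated word "second" shows up as a count of 2 in the second row. With Scikit-learn itself, `X.toarray()` gives the dense form of the sparse matrix for comparison.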
The steps for word vectorization with Gensim are as follows:
1. Import the library and define the dataset
```python
import gensim
from gensim import corpora
# Dataset: four short example documents
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
```
2. Convert the dataset to a bag-of-words model
```python
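# Lowercase and split on whitespace (punctuation stays attached, e.g. 'document.')
texts = [document.lower().split() for document in corpus]
# Map each unique token to an integer id
dictionary = corpora.Dictionary(texts)
# Represent each document as a list of (token_id, count) pairs
corpus_bow = [dictionary.doc2bow(text) for text in texts]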
```
3. Use TfidfModel to convert the bag-of-words model into TF-IDF vectors
```python
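# Fit the TF-IDF model on the bag-of-words corpus
tfidf = gensim.models.TfidfModel(corpus_bow)
# Apply the model; the result is evaluated lazily as it is iterated
corpus_tfidf = tfidf[corpus_bow]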
```
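To illustrate what the TF-IDF step computes, here is a minimal plain-Python sketch of the weighting. It assumes the default scheme described in Gensim's documentation (raw term count times log base 2 of N/df, followed by L2 normalization), and it keys each vector by token rather than by integer id for readability. The helper `tfidf_vector` is hypothetical and not part of Gensim.

```python
import math
from collections import Counter

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Same whitespace tokenization as in the Gensim steps above.
texts = [document.lower().split() for document in corpus]
num_docs = len(texts)

# Document frequency: in how many documents each token appears.
df = Counter(token for text in texts for token in set(text))

def tfidf_vector(tokens):
    """Return {token: weight} for one document (hypothetical helper)."""
    tf = Counter(tokens)
    # Assumed weighting: raw count times log2(N / df).
    weights = {t: n * math.log2(num_docs / df[t]) for t, n in tf.items()}
    # Tokens found in every document get weight 0 and are dropped.
    weights = {t: w for t, w in weights.items() if w > 0.0}
    # Scale to unit length (L2 norm) so document length does not dominate.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

vectors = [tfidf_vector(text) for text in texts]
for vec in vectors:
    print(vec)
```

Under this weighting, "the" appears in all four documents and drops out, while the repeated, document-specific token "second" receives the largest weight in the second document. In Gensim, iterating over `corpus_tfidf` yields each document as a list of `(token_id, weight)` pairs.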