How to use TfidfVectorizer()
Date: 2024-06-08 16:10:36
Natural language processing with Python
`TfidfVectorizer` is a scikit-learn class that converts a collection of text documents into a TF-IDF feature matrix. Here is a simple usage example:
``` python
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the text collection
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Create the TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then transform the corpus
X = vectorizer.fit_transform(corpus)

# Print the result (a sparse matrix of shape 4 x 9)
print(X)
```
The output is:
```
(0, 1) 0.4698
(0, 2) 0.5803
(0, 3) 0.3841
(0, 6) 0.3841
(0, 8) 0.3841
(1, 1) 0.6876
(1, 3) 0.2811
(1, 5) 0.5386
(1, 6) 0.2811
(1, 8) 0.2811
(2, 0) 0.5118
(2, 3) 0.2671
(2, 4) 0.5118
(2, 6) 0.2671
(2, 7) 0.5118
(2, 8) 0.2671
(3, 1) 0.4698
(3, 2) 0.5803
(3, 3) 0.3841
(3, 6) 0.3841
(3, 8) 0.3841
```
Each line reads `(document index, term index) weight`. The values are rounded to four decimals and sorted by term index here; `print(X)` shows full precision, and the entry order and header layout can differ between scikit-learn and SciPy versions.
`TfidfVectorizer()` accepts many parameters, which can be set to suit your needs. For example:
``` python
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2)
```
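Here, `stop_words='english'` removes scikit-learn's built-in English stop words, `max_df=0.95` ignores terms that appear in more than 95% of the documents, and `min_df=2` ignores terms that appear in fewer than 2 documents. For both bounds, a float is read as a proportion of documents and an integer as an absolute document count. For the full parameter list, see the official scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html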