Removing stop words and building a text TF-IDF matrix with CountVectorizer and TfidfTransformer
Date: 2024-03-25 19:41:50
Here is a simple example that shows how to use CountVectorizer and TfidfTransformer to build a TF-IDF matrix for a set of texts while removing stop words:
```python
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Download the NLTK stop-word corpus (only needed the first time)
nltk.download('stopwords')

# Collect the text data
text_data = [
    "This is an example of text data",
    "We will use it to demonstrate how to build a TF-IDF matrix",
    "The first step is to preprocess the data",
    "We need to remove stop words, punctuation, and convert all text to lowercase",
    "Then we can use CountVectorizer to create a bag of words representation of the text",
    "Finally, we can use TfidfTransformer to compute the TF-IDF matrix"
]

# Load the English stop words as a list; recent scikit-learn versions reject a set here
stop_words = stopwords.words('english')

# Build the CountVectorizer (it lowercases text and drops punctuation by default)
vectorizer = CountVectorizer(stop_words=stop_words)

# Extract features: a sparse document-term count matrix
X = vectorizer.fit_transform(text_data)

# Build the TfidfTransformer and convert the counts to TF-IDF weights
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)

# Print the TF-IDF matrix as a dense array
print(tfidf.toarray())
```
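In the code above, we use the stop-word list from the NLTK library to remove stop words from the text. CountVectorizer then extracts features from the text and produces a term-count matrix, with one row per document and one column per remaining word. Next, TfidfTransformer converts that count matrix into the TF-IDF matrix. Finally, we print the TF-IDF matrix as a dense array.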
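To make the transformation step less of a black box, here is a minimal plain-Python sketch of the weighting that TfidfTransformer applies with its default settings: a smoothed idf of ln((1 + n) / (1 + df)) + 1, multiplied by the raw count, followed by L2 normalization of each row. The function name `tfidf_matrix`, the toy documents, and the whitespace tokenizer are assumptions made for illustration; CountVectorizer tokenizes with a regular expression, so results on real text can differ.

```python
import math
from collections import Counter


def tfidf_matrix(docs, stop_words=frozenset()):
    """Return (vocabulary, rows) using smoothed idf and L2 row normalization.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # Simplified tokenization: lowercase, split on whitespace, drop stop words.
    tokenized = [
        [w for w in doc.lower().split() if w not in stop_words] for doc in docs
    ]
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))

    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        # L2-normalize the row; guard against an all-zero row.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, rows = tfidf_matrix(docs, stop_words={"the"})

print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In this toy corpus, "dog" appears in only one document while "sat" appears in two, so within the second document "dog" receives the larger weight. That is the property that makes TF-IDF useful for picking out distinctive words.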