Removing stop words and building a TF-IDF matrix for text with CountVectorizer and TfidfTransformer()

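Sure. Below is a simple example showing how to build a TF-IDF matrix for text with CountVectorizer and TfidfTransformer while removing stop words:

```python
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Download the NLTK stop-word corpus if it is not already available locally
nltk.download('stopwords')

# Collect the text data
text_data = [
    "This is an example of text data",
    "We will use it to demonstrate how to build a TF-IDF matrix",
    "The first step is to preprocess the data",
    "We need to remove stop words, punctuation, and convert all text to lowercase",
    "Then we can use CountVectorizer to create a bag of words representation of the text",
    "Finally, we can use TfidfTransformer to compute the TF-IDF matrix",
]

# Load the English stop words. CountVectorizer expects a list
# (or the string 'english'), so the list is not converted to a set.
stop_words = stopwords.words('english')

# Build the CountVectorizer; it drops the stop words while tokenizing
vectorizer = CountVectorizer(stop_words=stop_words)

# Extract features: X is the sparse term-count matrix
X = vectorizer.fit_transform(text_data)

# Build the TfidfTransformer and convert the counts to TF-IDF weights
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)

# Print the TF-IDF matrix as a dense array
print(tfidf.toarray())
```

In the code above, the stop-word list comes from the NLTK library and is passed to CountVectorizer, which removes those words during tokenization. By default, CountVectorizer also lowercases the text and ignores punctuation, so no separate preprocessing step is needed here. Calling `fit_transform` on the CountVectorizer yields the term-count matrix for the documents. The TfidfTransformer then converts that count matrix into the TF-IDF matrix. Finally, `toarray()` converts the sparse result to a dense array for printing. Note that CountVectorizer and TfidfTransformer are classes rather than functions: each is instantiated first and then fitted to the data.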
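To make clearer what the count-then-transform pipeline computes, here is a minimal pure-Python sketch of the same idea. It assumes the default TfidfTransformer settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The `STOP_WORDS` set and the helper names `tokenize` and `tfidf_matrix` are hypothetical and exist only for illustration; this is not scikit-learn's implementation.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only)
STOP_WORDS = {"this", "is", "an", "of", "the", "to", "we", "a"}


def tokenize(doc):
    # Lowercase, keep tokens of 2+ word characters (the same pattern as
    # CountVectorizer's default), then drop stop words.
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_matrix(docs):
    # Term counts per document (the "bag of words" step)
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)

    # Document frequency: number of documents containing each term
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Smoothed inverse document frequency (assumed default behaviour)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        # L2-normalize each document vector; leave all-zero rows as zeros
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


docs = [
    "This is an example of text data",
    "The first step is to preprocess the data",
]
vocab, rows = tfidf_matrix(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

A term such as "data" that appears in every document receives a lower weight than a term unique to one document, which is the effect the TF-IDF weighting is meant to produce.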
