
Python CountVectorizer

Date: 2024-05-03 08:19:28

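`CountVectorizer` is a text feature extractor in the scikit-learn library that converts text into vectors of word counts. Specifically, it tokenizes the input text (lowercasing it by default), optionally removes stop words, counts how many times each word occurs in each document, and finally turns each document into a vector. The following simple example shows how to use `CountVectorizer` to extract features from text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Input documents
corpus = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Learn the vocabulary and extract the features
X = vectorizer.fit_transform(corpus)

# Print the feature vectors (X is a sparse matrix, so convert it to a dense array)
print(X.toarray())
```

The output is:

```
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
```

As you can see, each document is converted into a vector of length 9, with one element per word in the learned vocabulary (in alphabetical order: and, document, first, is, one, second, the, third, this). Each element is the number of times that word occurs in the document. `CountVectorizer` also provides a number of parameters, for example `stop_words` to set a stop-word list and `max_features` to limit the maximum number of features.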
