
Python code for computing term frequency and inverse document frequency on a 2-D word-vector matrix

Date: 2024-05-12 10:16:39 Views: 96

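The following Python code computes term frequency and inverse document frequency (IDF) for a corpus represented as a 2-D document-term matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Represent the corpus as a 2-D count matrix (rows: documents, columns: terms)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Print the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(vectorizer.get_feature_names_out())

# Term frequency: total count of each term across the whole corpus
counts = X.toarray()
word_freq = np.sum(counts, axis=0)
print(word_freq)

# Document frequency: number of documents in which each term appears
doc_freq = np.sum(counts > 0, axis=0)

# Smoothed inverse document frequency: ln((N + 1) / (df + 1)) + 1
idf = np.log((len(corpus) + 1) / (doc_freq + 1)) + 1
print(idf)
```

Here `corpus` is the list of documents and `vectorizer` converts them into term counts. `X` is the resulting count matrix in sparse form, and `get_feature_names_out()` returns the vocabulary in column order. `word_freq` sums each column to give the total number of occurrences of each term, `doc_freq` counts the documents that contain each term, and `idf` applies the smoothed formula `ln((N + 1) / (df + 1)) + 1`, where `N` is the number of documents. Adding 1 to both numerator and denominator avoids division by zero, and the trailing `+ 1` keeps terms that appear in every document from receiving a weight of zero.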
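To make the arithmetic easier to check by hand, the same quantities can be sketched with only the standard library. This is a minimal illustration, not part of the original answer: the helper `tokenize` and the variable names are assumptions introduced here, and the regular expression is meant to approximate `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(d) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})

# Term frequency: total occurrences of each term across the corpus
term_freq = Counter(t for doc in docs for t in doc)

# Document frequency: each term is counted at most once per document
doc_freq = Counter(t for doc in docs for t in set(doc))

# Smoothed IDF, same formula as above: ln((N + 1) / (df + 1)) + 1
n_docs = len(docs)
idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) + 1 for t in vocab}

for t in vocab:
    print(f"{t:10s} tf={term_freq[t]} df={doc_freq[t]} idf={idf[t]:.4f}")
```

The term `second` shows the difference between the two counts: it occurs twice in total but in only one document, so its term frequency is 2 while its document frequency is 1. The term `the` appears in all four documents, so its smoothed IDF is `ln(5/5) + 1 = 1`.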
