
Extracting Keywords from English txt Files with TF-IDF in Python

Date: 2023-05-30 22:07:49

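The following example code extracts keywords from an English txt file with TF-IDF in Python:

```python
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the text file
with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Tokenize (lowercased so tokens match the lowercase stop-word list)
tokens = nltk.word_tokenize(text.lower())

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

# Compute TF-IDF values on the filtered tokens
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform([' '.join(tokens)])

# Extract keywords: rank features by TF-IDF score, highest first
# (get_feature_names_out requires scikit-learn 1.0 or later)
feature_names = tfidf.get_feature_names_out()
scores = tfs.toarray()[0]
top_n = 10
top_features = [feature_names[i] for i in scores.argsort()[::-1][:top_n]]

# Print the keywords
print(top_features)
```

The steps are as follows:

1. Read the text file with Python's built-in `open` function.
2. Tokenize the text with NLTK's `word_tokenize` function.
3. Remove stop words: load the English stop-word list from NLTK's `stopwords` corpus, then filter the tokens against it.
4. Compute the TF-IDF values with scikit-learn's `TfidfVectorizer` class, passing it the filtered tokens joined back into one string.
5. Extract the keywords: get the feature-word list with `get_feature_names_out`, sort the features by TF-IDF score, and keep the top n.
6. Print the keyword list with Python's built-in `print` function.

Note that when only one document is fitted, the IDF factor is the same for every word, so the ranking is effectively by term frequency. IDF only starts to distinguish words once the vectorizer is fitted on several documents.

NLTK and scikit-learn must be installed before running the code. Both can be installed with pip (the package is named `scikit-learn`, not `sklearn`):

```bash
pip install nltk scikit-learn
```

NLTK's stop-word list and the tokenizer model used by `word_tokenize` also need to be downloaded, which can be done with the following Python code:

```python
import nltk
nltk.download('stopwords')
# Tokenizer data for word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
```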
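
The example above passes a single document to `fit_transform`, which hides what the IDF part of TF-IDF contributes. As a rough illustration, here is a minimal pure-Python sketch over a small multi-document corpus. The names `tokenize` and `tfidf_keywords`, the tiny stop-word set, and the sample sentences are all assumptions made for this sketch. The score is the raw term count times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which mirrors scikit-learn's default `smooth_idf=True` formula but skips its L2 normalization (normalization does not change the ranking within one document).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption; NLTK's list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}

def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tfidf_keywords(docs, doc_index, top_n=3):
    """Return the top_n (word, score) pairs for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Term frequency: raw counts within the target document.
    tf = Counter(tokenized[doc_index])
    scores = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical sample corpus: "data" appears in every document, "python" in one.
docs = [
    "Python data tools make Python data analysis simple.",
    "Data pipelines move data between systems.",
    "Good data beats clever algorithms.",
]
print([word for word, _ in tfidf_keywords(docs, 0, top_n=2)])  # ['python', 'data']
```

In the first document "python" and "data" both occur twice, but "data" appears in every document, so its IDF is lower and "python" ranks first. With a single document this distinction disappears.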
