I have a document named "2015.txt" that has already been tokenized and had its stop words removed, and I want to extract keywords from it with TF-IDF. There is another document that contains many sub-documents, each beginning with "本篇文章为:"; this document should serve as the TF-IDF corpus for extracting the keywords of "2015.txt". Please implement this in Python, starting from the step of reading the corpus.
Sure, here is sample Python code that extracts keywords with TF-IDF against a corpus:
``` python
import math
from collections import Counter

import jieba

# Read the corpus: each sub-document starts with "本篇文章为:"
corpus_file = 'corpus.txt'
corpus = []
with open(corpus_file, 'r', encoding='utf-8') as f:
    content = f.read()
    docs = content.split('本篇文章为:')
    for doc in docs[1:]:  # docs[0] is whatever precedes the first marker, skip it
        corpus.append(doc.strip())

# Tokenize and remove stop words
def cut_and_remove_stopwords(text):
    stopwords = set()
    with open('stopwords.txt', 'r', encoding='utf-8') as f:
        for line in f:
            stopwords.add(line.strip())
    words = jieba.cut(text)
    return [word for word in words if word not in stopwords]

# Compute the TF-IDF value of one word in the document against the corpus
def calculate_tf_idf(word, doc_words, corpus):
    tf = doc_words.count(word) / len(doc_words)
    # Add-one smoothing in the IDF avoids a division by zero when the word
    # does not appear in any corpus document
    df = sum(1 for d in corpus if word in d)
    idf = math.log((1 + len(corpus)) / (1 + df))
    return tf * idf

# Read the document to be processed
doc_file = '2015.txt'
with open(doc_file, 'r', encoding='utf-8') as f:
    doc = f.read()

# Tokenize and remove stop words
# (if 2015.txt is already space-separated tokens, doc.split() works instead)
doc_words = cut_and_remove_stopwords(doc)

# Compute the TF-IDF value of each distinct word
tf_idf = {}
for word in set(doc_words):
    tf_idf[word] = calculate_tf_idf(word, doc_words, corpus)

# Take the 10 words with the highest TF-IDF values as keywords
keywords = Counter(tf_idf).most_common(10)
print(keywords)
```
Here, `corpus.txt` is the corpus containing multiple sub-documents, each beginning with "本篇文章为:", and `stopwords.txt` is the stop-word list with one stop word per line. The IDF value is computed with the `log` function from Python's built-in `math` module. Finally, the 10 words with the highest TF-IDF values are taken as keywords and printed.
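If you prefer not to compute TF-IDF by hand, the same workflow can be done with scikit-learn's `TfidfVectorizer`. Below is a minimal sketch, assuming scikit-learn is installed; it reuses the `corpus` list and the `cut_and_remove_stopwords` function defined above, fits the IDF weights on the corpus only, and then scores "2015.txt":
``` python
# Sketch: corpus-based TF-IDF keyword extraction with scikit-learn
# (assumes `corpus` and `cut_and_remove_stopwords` from the code above)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    tokenizer=cut_and_remove_stopwords,  # reuse the jieba-based tokenizer
    token_pattern=None,                  # silence the unused-pattern warning
    lowercase=False,
)
vectorizer.fit(corpus)                   # learn IDF weights from the corpus only

with open('2015.txt', 'r', encoding='utf-8') as f:
    doc = f.read()

scores = vectorizer.transform([doc]).toarray()[0]
terms = vectorizer.get_feature_names_out()

# Take the 10 terms with the highest TF-IDF weights as keywords
top10 = sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)[:10]
print(top10)
```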
I hope this sample code helps!