Extracting keywords from a large English txt file with TF-IDF in Python
Date: 2023-06-02 17:01:38
The following Python code extracts keywords from a large English txt file using TF-IDF:
```python
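import math
import string
from collections import Counter


# Read a text file and return its contents as one string
def read_file(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    return text


# Tokenize: lowercase the text and split it on whitespace
def tokenize(text):
    words = text.lower().split()
    # Strip leading and trailing punctuation
    words = [word.strip(string.punctuation) for word in words]
    # Drop tokens that contain digits and tokens shorter than two characters
    words = [word for word in words if not any(c.isdigit() for c in word) and len(word) > 1]
    return words


# Count how many times each word occurs
def count_words(words):
    word_counts = Counter(words)
    return word_counts


# Term frequency: each word's share of all tokens in the document
def compute_word_frequency(word_counts):
    total_count = sum(word_counts.values())
    word_freqs = {word: count / total_count for word, count in word_counts.items()}
    return word_freqs


# Document frequency: the number of documents that contain the word.
# Each document should be a set of tokens so the membership test is fast.
def compute_document_frequency(word, documents):
    count = sum(1 for document in documents if word in document)
    return count


# Inverse document frequency
def compute_inverse_document_frequency(word, documents):
    N = len(documents)
    df = compute_document_frequency(word, documents)
    # Smoothed IDF instead of plain log(N/df): avoids division by zero when
    # df == 0 and stays positive when a word occurs in every document
    idf = math.log((1 + N) / (1 + df)) + 1
    return idf


# Compute TF-IDF for one word; word_freqs maps each word to its term frequency
def compute_tf_idf(word, word_freqs, documents):
    tf = word_freqs[word]
    idf = compute_inverse_document_frequency(word, documents)
    tf_idf = tf * idf
    return tf_idf


# Extract keywords.
# corpus_filenames is an optional addition (not in the original interface):
# other text files to include in the corpus used for IDF.
def extract_keywords(filename, num_keywords=10, corpus_filenames=()):
    # Read and tokenize the target file
    text = read_file(filename)
    words = tokenize(text)
    # Count the words and convert the counts to term frequencies
    word_counts = count_words(words)
    word_freqs = compute_word_frequency(word_counts)
    # Build the corpus as token sets. If it holds only this one file, every
    # word gets the same IDF and the ranking reduces to term frequency.
    documents = [set(words)]
    documents += [set(tokenize(read_file(name))) for name in corpus_filenames]
    tf_idfs = {word: compute_tf_idf(word, word_freqs, documents) for word in word_counts}
    # Keep the num_keywords words with the highest TF-IDF
    keywords = sorted(tf_idfs.items(), key=lambda x: x[1], reverse=True)[:num_keywords]
    return [keyword[0] for keyword in keywords]


# Test
if __name__ == '__main__':
    filename = 'data.txt'
    keywords = extract_keywords(filename, num_keywords=10)
    print(keywords)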
```
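In this code, `read_file` reads the text file and `tokenize` splits the text into lowercase words, stripping punctuation and dropping tokens that contain digits or are only one character long. `count_words` counts how often each word occurs, and `compute_word_frequency` converts those counts into each word's frequency within the document. `compute_document_frequency` counts how many documents in the collection contain a word, `compute_inverse_document_frequency` computes the word's inverse document frequency, `compute_tf_idf` multiplies the two into a TF-IDF score, and `extract_keywords` ties the steps together to extract the keywords.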
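To make the roles of these quantities concrete, here is a minimal sketch that applies the textbook calculation, term frequency multiplied by log(N/df), to three tiny in-memory documents instead of a file. The sample sentences and the helper name `tf_idf_scores` are made up for illustration. It also shows a property worth knowing: with this unsmoothed formula, a word that occurs in every document has an IDF of log(1) = 0, so the corpus must contain more than one document for any score to be nonzero.

```python
import math
from collections import Counter

# Hypothetical sample corpus: three already-tokenized documents
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]


def tf_idf_scores(doc, corpus):
    """Return {word: tf * log(N / df)} for one tokenized document in corpus."""
    counts = Counter(doc)
    total = len(doc)
    doc_sets = [set(d) for d in corpus]  # sets make membership tests fast
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # df is at least 1 here because doc itself is part of corpus
        df = sum(1 for s in doc_sets if word in s)
        scores[word] = tf * math.log(n_docs / df)
    return scores


scores = tf_idf_scores(docs[0], docs)
for word, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

In this corpus "the" scores exactly 0 because it appears in all three documents, while "sat", "on", and "mat" rank highest because they occur only in the first document.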
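To test it, pass the path of the text file to `extract_keywords`; the number of keywords to extract can be set with `num_keywords`. The function returns a list containing the `num_keywords` words with the highest TF-IDF values.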
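For a file too large to read comfortably with a single `f.read()` call, the word counts can be accumulated one line at a time instead. The sketch below is one possible way to do this, not part of the code above: the function name `count_words_streaming` is hypothetical, and it assumes a UTF-8 text file with line breaks. It applies the same filter rules as `tokenize`.

```python
import string
from collections import Counter


def count_words_streaming(filename):
    """Count tokens line by line so the whole file is never held in memory."""
    counts = Counter()
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:  # a file object yields one line at a time
            for word in line.lower().split():
                word = word.strip(string.punctuation)
                # Same filter as tokenize: no digits, at least two characters
                if len(word) > 1 and not any(c.isdigit() for c in word):
                    counts[word] += 1
    return counts
```

The result is a `Counter`, so something like `count_words_streaming('data.txt').most_common(10)` would list the ten most frequent tokens, and the counts can be passed on to a term-frequency step in place of counts built from a fully loaded file.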