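If `texts` here is a large number of texts stored in an Excel file, how should the code be improved?
Time: 2024-02-18 16:03:08 Views: 103
If `texts` holds many texts, it is best to store them in an Excel file and read them with the `pandas.read_excel` method. Here is an improved version of the code: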
```python
import os

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


# Load the stop words (one per line) from stopwords.txt in the working directory
def get_stopwords():
    stopwords_file = os.path.join(os.getcwd(), 'stopwords.txt')
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


# Segment a text with jieba and drop stop words; tokens are joined with spaces
def cut_words(text, stopwords):
    words = jieba.cut(text)
    return ' '.join(w for w in words if w.strip() and w not in stopwords)


# Extract keywords with TF-IDF: print the topK highest-weighted terms of each text
def tfidf_extract(texts, topK):
    stopwords = get_stopwords()  # read the file once, not once per text
    corpus = [cut_words(text, stopwords) for text in texts]
    # TfidfVectorizer already applies the TF-IDF weighting, so no extra
    # TfidfTransformer is needed. Its default token_pattern skips
    # single-character tokens.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)
    # Requires scikit-learn >= 1.0; older versions use get_feature_names()
    words = vectorizer.get_feature_names_out()
    for i in range(tfidf.shape[0]):
        weights = tfidf[i].toarray().ravel()
        # Column indices of the largest weights in this text's row
        top_k = weights.argsort()[::-1][:topK]
        print("Keywords:")
        print(' '.join(words[j] for j in top_k if weights[j] > 0))
        print()


if __name__ == '__main__':
    # Read the Excel file; the texts are expected in a column named 'text'
    df = pd.read_excel('texts.xlsx')
    # Empty cells are read as NaN, so drop them before segmentation
    texts = df['text'].dropna().astype(str).tolist()
    topK = 3
    tfidf_extract(texts, topK)
```
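In this improved code, we first use the `pandas.read_excel` method to read the texts stored in the Excel file into a DataFrame object, then convert the `text` column into a list with `tolist`. We then pass that list to the `tfidf_extract` function, which uses TF-IDF to extract the keywords of each text.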
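One practical detail when the texts come from a spreadsheet: `pandas.read_excel` returns empty cells as NaN and numeric cells as numbers, and `jieba.cut` expects a string, so it helps to clean the column before segmentation. Below is a minimal sketch of that step. The helper name `clean_text_column` is hypothetical (it is not part of the answer above), and the DataFrame is built in memory as a stand-in for `pd.read_excel('texts.xlsx')` so the example runs without a file.

```python
import pandas as pd


def clean_text_column(df, column="text"):
    """Return the non-empty values of `column` as a list of stripped strings.

    Hypothetical helper for illustration only.
    """
    if column not in df.columns:
        raise KeyError(f"column {column!r} not found; available: {list(df.columns)}")
    # Drop missing cells, force everything to str, trim surrounding whitespace
    series = df[column].dropna().astype(str).str.strip()
    # Discard cells that contained only whitespace
    return series[series != ""].tolist()


# In-memory stand-in for pd.read_excel('texts.xlsx')
df = pd.DataFrame({"text": ["今天天气很好", None, "  ", 123, "我喜欢自然语言处理"]})
texts = clean_text_column(df)
print(texts)
```

The resulting list can be passed to `tfidf_extract(texts, topK)` in place of `df['text'].tolist()`. Raising a `KeyError` that lists the available columns is a design choice here, since a misnamed header is a common cause of failure with spreadsheets.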