Python: import a folder, randomly select 10 news articles from it, strip spaces and newlines from the selected articles, and extract 10 keywords using LSI- and LDA-based keyword extraction
Posted: 2023-12-03 18:44:01
First, we import the Python libraries we need, including os, random, re, and gensim:
```python
import os
import random
import re
from gensim import corpora, models
```
Next, we define a function that randomly selects 10 news articles from a folder and strips spaces and newlines from each one:
```python
def get_news(folder_path):
    news_list = []
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        if not os.path.isfile(file_path):
            continue  # skip subdirectories
        with open(file_path, 'r', encoding='utf-8') as f:
            # strip newlines and spaces from the article text
            news_content = f.read().replace('\n', '').replace(' ', '')
        news_list.append(news_content)
    # raises ValueError if the folder holds fewer than 10 articles
    selected_news = random.sample(news_list, 10)
    return selected_news
```
This function takes a folder path, walks every file in that folder, strips spaces and newlines from each file's content, and stores the results in a list. It then randomly selects 10 articles from the list and returns them.
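One caveat worth noting: `random.sample` raises a `ValueError` when the folder contains fewer than 10 files. A small defensive variant (the name `sample_news` is ours, not from the original answer) can sample the file paths first and cap the sample size:

```python
import os
import random

def sample_news(folder_path, k=10):
    """Randomly pick up to k text files' contents from folder_path."""
    paths = [os.path.join(folder_path, name) for name in os.listdir(folder_path)]
    paths = [p for p in paths if os.path.isfile(p)]  # ignore subdirectories
    picked = random.sample(paths, min(k, len(paths)))  # never ask for more than exist
    news = []
    for p in picked:
        with open(p, 'r', encoding='utf-8') as f:
            # same cleanup as the original: drop newlines and spaces
            news.append(f.read().replace('\n', '').replace(' ', ''))
    return news
```

Sampling paths instead of file contents also avoids reading files that are never selected.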
Finally, we define a function that extracts keywords using LSI- and LDA-based topic models:
```python
def extract_keywords(news_list):
    stop_words = []  # stop-word list
    with open('stopwords.txt', 'r', encoding='utf-8') as f:
        for line in f:
            stop_words.append(line.strip())
    # tokenize on non-letter characters; drop empty tokens and stop words
    texts = [[word for word in re.split('[^a-zA-Z]', news.lower())
              if word and word not in stop_words] for news in news_list]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    # LDA: find each document's dominant topic and collect its top words
    lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=5)
    lda_corpus = lda[corpus_tfidf]
    lda_keywords = []
    for doc in lda_corpus:
        top_topic = max(doc, key=lambda x: x[1])[0]
        lda_keywords.append([word for word, _ in lda.show_topic(top_topic)])
    # LSI: same idea, but rank topics by absolute weight (LSI weights can be negative)
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
    lsi_corpus = lsi[corpus_tfidf]
    lsi_keywords = []
    for doc in lsi_corpus:
        top_topic = max(doc, key=lambda x: abs(x[1]))[0]
        lsi_keywords.append([word for word, _ in lsi.show_topic(top_topic)])
    # merge, keep alphabetic words only, and return up to 10 of them
    keywords = set()
    for kw in lda_keywords + lsi_keywords:
        for word in kw:
            if word.isalpha():
                keywords.add(word)
    return list(keywords)[:10]
```
This function takes a list of news articles, each a string. It first reads the stop-word file into a list, then tokenizes each article and drops stop words and non-letter characters. Next, it builds a gensim dictionary and converts each article into a bag-of-words vector. It applies TF-IDF weighting, runs an LDA model over the weighted corpus, and extracts each article's keywords from its dominant topic; an LSI model is applied in the same way. Finally, the keywords from both models are merged into a set and up to 10 of them are returned. Note that a Python set is unordered, so which 10 words survive the final slice is arbitrary.
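To make the bag-of-words step concrete: `dictionary.doc2bow` turns each token list into `(token_id, count)` pairs. Here is a minimal pure-Python sketch of that idea (our illustration only; gensim's actual `Dictionary` assigns ids by its own scheme):

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(token2id, text):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(text)
    return sorted((token2id[t], c) for t, c in counts.items())

texts = [['cat', 'dog', 'cat'], ['dog', 'fish']]
token2id = build_dictionary(texts)   # {'cat': 0, 'dog': 1, 'fish': 2}
print(doc2bow(token2id, texts[0]))   # [(0, 2), (1, 1)]
```

TF-IDF, LDA, and LSI in the pipeline above all operate on exactly this sparse `(id, weight)` representation.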
The complete code:
```python
import os
import random
import re
from gensim import corpora, models
def get_news(folder_path):
    news_list = []
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        if not os.path.isfile(file_path):
            continue  # skip subdirectories
        with open(file_path, 'r', encoding='utf-8') as f:
            news_content = f.read().replace('\n', '').replace(' ', '')
        news_list.append(news_content)
    # raises ValueError if the folder holds fewer than 10 articles
    selected_news = random.sample(news_list, 10)
    return selected_news

def extract_keywords(news_list):
    stop_words = []  # stop-word list
    with open('stopwords.txt', 'r', encoding='utf-8') as f:
        for line in f:
            stop_words.append(line.strip())
    texts = [[word for word in re.split('[^a-zA-Z]', news.lower())
              if word and word not in stop_words] for news in news_list]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=5)
    lda_corpus = lda[corpus_tfidf]
    lda_keywords = []
    for doc in lda_corpus:
        top_topic = max(doc, key=lambda x: x[1])[0]
        lda_keywords.append([word for word, _ in lda.show_topic(top_topic)])
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
    lsi_corpus = lsi[corpus_tfidf]
    lsi_keywords = []
    for doc in lsi_corpus:
        top_topic = max(doc, key=lambda x: abs(x[1]))[0]
        lsi_keywords.append([word for word, _ in lsi.show_topic(top_topic)])
    keywords = set()
    for kw in lda_keywords + lsi_keywords:
        for word in kw:
            if word.isalpha():
                keywords.add(word)
    return list(keywords)[:10]

if __name__ == '__main__':
    folder_path = 'news'
    news_list = get_news(folder_path)
    for i, news in enumerate(news_list):
        print(f'News {i + 1}: {news}')
    keywords = extract_keywords(news_list)
    print('Keywords:', keywords)
```
Note that the code above reads a stop-word file named stopwords.txt, which must sit in the same directory as the script. The file contains one stop word per line.
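One further caveat: the `[^a-zA-Z]` split keeps only Latin letters, so on Chinese news text every token is discarded and `texts` comes out empty. A proper segmenter such as jieba is the usual fix; as a dependency-free sketch (this variant is our assumption, not part of the original answer), one can at least pull out CJK runs and Latin words with a regex:

```python
import re

# matches runs of CJK ideographs, or runs of Latin letters
TOKEN_RE = re.compile(r'[\u4e00-\u9fff]+|[a-zA-Z]+')

def rough_tokens(text):
    """Crude tokenizer: keeps CJK runs and Latin words, drops everything else."""
    return TOKEN_RE.findall(text.lower())

print(rough_tokens('Python编程,提取10个关键词!'))  # ['python', '编程', '提取', '个关键词']
```

Each CJK run stays joined as a single token here, since real word segmentation needs a dictionary-based tool like jieba; but even this crude version yields non-empty documents for the gensim pipeline.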