Generate code to scrape news data from Tencent News (https://new.qq.com/) for the Top News, Beijing, Finance, Technology, Entertainment, International, Military, Games, and Lifestyle channels, collecting 100 items with images per channel. Fields to capture: url, title, publish time, news source, publish location, and article content. Data-cleaning requirements: normalize the publish time to the standard format yyyy-mm-dd HH:MM:SS, and keep only the location part of the publish-location field. Save the text data with a header row as CSV, store the collected results in a SQLite database, and build a word cloud from the collected content.
Here is a Python scraper that implements these requirements. You will need to install requests, beautifulsoup4, pandas, jieba, and wordcloud (sqlite3 ships with the Python standard library).
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import jieba
from wordcloud import WordCloud
import sqlite3

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}


def get_news(url):
    """Scrape the news list of one channel page and return a list of dicts."""
    res = requests.get(url, headers=HEADERS)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    news_list = []
    for news in soup.select('.newslist li'):
        news_dict = {}
        news_dict['url'] = news.select_one('a')['href']
        news_dict['title'] = news.select_one('a').text.strip()
        # Normalize yyyymmdd to yyyy-mm-dd and pad a time component so the
        # field matches the required yyyy-mm-dd HH:MM:SS format.
        news_dict['time'] = re.sub(r'(\d{4})(\d{2})(\d{2})', r'\1-\2-\3',
                                   news.select_one('.time').text.strip()) + ' 00:00:00'
        news_dict['source'] = news.select_one('.s-p').text.strip()
        # Keep only the "X省X市" location part of the source text.
        news_dict['place'] = re.sub(r'.*?(\S+省\S+市).*', r'\1',
                                    news.select_one('.s-p').text.strip())
        news_dict['content'] = get_content(news_dict['url'])
        news_list.append(news_dict)
    return news_list


def get_content(url):
    """Fetch one article page and return its body text."""
    res = requests.get(url, headers=HEADERS)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    content = ''
    for p in soup.select('.content-article p'):
        content += p.text.strip()
    return content


def save_to_csv(news_list):
    """Write the collected items to news.csv with a header row."""
    df = pd.DataFrame(news_list, columns=['url', 'title', 'time', 'source', 'place', 'content'])
    df.to_csv('news.csv', index=False, encoding='utf-8-sig')


def save_to_database(news_list):
    """Store the collected items in a SQLite database (news.db)."""
    conn = sqlite3.connect('news.db')
    cursor = conn.cursor()
    cursor.execute('CREATE TABLE IF NOT EXISTS news '
                   '(url TEXT, title TEXT, time TEXT, source TEXT, place TEXT, content TEXT)')
    for news in news_list:
        cursor.execute('INSERT INTO news (url, title, time, source, place, content) '
                       'VALUES (?, ?, ?, ?, ?, ?)',
                       (news['url'], news['title'], news['time'], news['source'],
                        news['place'], news['content']))
    conn.commit()
    cursor.close()
    conn.close()


def generate_wordcloud(news_list):
    """Segment all article text with jieba and render a word cloud image."""
    content = ''.join(news['content'] for news in news_list)
    word_list = ' '.join(jieba.cut(content, cut_all=False))
    # font_path must point to a font with Chinese glyphs (msyh.ttc is the
    # Microsoft YaHei font shipped with Windows); adjust it for your system.
    wc = WordCloud(font_path='msyh.ttc', width=800, height=400, background_color='white')
    wc.generate(word_list)
    wc.to_file('wordcloud.png')


if __name__ == '__main__':
    # Channel pages to crawl (the Beijing channel from the request is not listed here).
    url_list = ['https://new.qq.com/ch/topnews', 'https://new.qq.com/ch/finance',
                'https://new.qq.com/ch/tech', 'https://new.qq.com/ch/ent',
                'https://new.qq.com/ch/world', 'https://new.qq.com/ch/milite',
                'https://new.qq.com/ch/game', 'https://new.qq.com/ch/life']
    news_list = []
    for url in url_list:
        # Keep at most 100 items per channel.
        news_list.extend(get_news(url)[:100])
    save_to_csv(news_list)
    save_to_database(news_list)
    generate_wordcloud(news_list)
```
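Note that the code above does not enforce the "items with an image" requirement. A minimal sketch of such a filter, assuming each list item exposes its thumbnail as an `<img>` tag inside the `.newslist li` element (an assumption about the page markup, not something the original code verifies), could be plugged into the loop in `get_news`:

```python
def has_image(news_item):
    """Return True if the parsed <li> element contains an <img> thumbnail (assumed markup)."""
    return news_item.select_one('img') is not None

# Inside get_news, skip items without a picture:
# for news in soup.select('.newslist li'):
#     if not has_image(news):
#         continue
```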
This scraper crawls the news data from the channel pages listed above, saves the results to a CSV file and a SQLite database, and generates a word cloud image showing the key terms of the collected articles.
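To spot-check a run, you can read the CSV export back with pandas and count the rows stored in SQLite. A small verification sketch, assuming the file names news.csv and news.db used above:

```python
import sqlite3

import pandas as pd

# Preview the first few rows of the CSV export.
df = pd.read_csv('news.csv', encoding='utf-8-sig')
print(df.head())

# Count how many records were written to the SQLite database.
conn = sqlite3.connect('news.db')
count = conn.execute('SELECT COUNT(*) FROM news').fetchone()[0]
print(f'{count} rows stored in news.db')
conn.close()
```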