Python 3: Scraping news headlines from Sina, NetEase, Toutiao, and UC
Posted: 2023-12-28 17:16:19
Below is a simple Python 3 crawler example that scrapes news headlines and their links from four sites: Sina, NetEase (163), Toutiao, and UC:
```python
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent helps avoid being rejected by some news sites.
headers = {'User-Agent': 'Mozilla/5.0'}

# Sina News
sina_url = 'https://news.sina.com.cn/'
sina_res = requests.get(sina_url, headers=headers, timeout=10)
sina_res.encoding = 'utf-8'
sina_soup = BeautifulSoup(sina_res.text, 'html.parser')
for news in sina_soup.select('.news-item h2 a'):
    print(news.text, news['href'])

# NetEase News (Python identifiers cannot start with a digit, so "163_url" is invalid)
netease_url = 'https://news.163.com/'
netease_res = requests.get(netease_url, headers=headers, timeout=10)
netease_res.encoding = 'gbk'
netease_soup = BeautifulSoup(netease_res.text, 'html.parser')
for news in netease_soup.select('.news_default a'):
    print(news.text, news['href'])

# Toutiao (note: the feed is rendered with JavaScript, so a plain GET may return few or no links)
toutiao_url = 'https://www.toutiao.com/'
toutiao_res = requests.get(toutiao_url, headers=headers, timeout=10)
toutiao_res.encoding = 'utf-8'
toutiao_soup = BeautifulSoup(toutiao_res.text, 'html.parser')
for news in toutiao_soup.select('.title-box a'):
    # Toutiao links are relative, so prepend the site root
    print(news.text, 'https://www.toutiao.com' + news['href'])

# UC News
uc_url = 'https://www.uc.cn/'
uc_res = requests.get(uc_url, headers=headers, timeout=10)
uc_res.encoding = 'utf-8'
uc_soup = BeautifulSoup(uc_res.text, 'html.parser')
for news in uc_soup.select('.news-list a'):
    print(news.text, news['href'])
```
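The snippet above only collects headlines and links; to get the article *content* you have to follow each link and parse the article page itself. A minimal, hedged sketch of that second step (the `#artibody p` selector is an assumption about one site's markup; each site wraps its article body differently, so the selector must be adjusted per site):

```python
from bs4 import BeautifulSoup

def extract_article_text(html: str, selector: str = "p") -> str:
    """Extract readable paragraph text from an article page.

    The default selector grabs all <p> tags; pass a site-specific
    selector (e.g. the hypothetical '#artibody p') for cleaner results.
    """
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.select(selector)]
    # Drop empty paragraphs and join the rest with newlines
    return "\n".join(p for p in paragraphs if p)

# Usage: follow one of the headline links collected above, e.g.
#   article_res = requests.get(news['href'], timeout=10)
#   article_res.encoding = article_res.apparent_encoding
#   print(extract_article_text(article_res.text, '#artibody p'))
```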
This crawler uses the requests and BeautifulSoup libraries: it fetches each site's HTML page, then extracts the news headlines and links with CSS selectors. Two caveats apply. First, the CSS selectors reflect the page structure at the time of writing and will break whenever a site changes its markup. Second, each response must be decoded with the site's actual character encoding (UTF-8 for most of these sites, GBK for NetEase here), otherwise the output will be garbled.
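Hard-coding each site's encoding is brittle: requests can guess it for you via `res.apparent_encoding`, or you can fall back through the encodings Chinese news sites commonly use. A minimal sketch of the fallback approach (the UTF-8-then-GBK order is an assumption, and it is only a heuristic, since a byte string that happens to be valid UTF-8 will decode as UTF-8 even if it was meant as GBK):

```python
def smart_decode(raw: bytes) -> str:
    """Try UTF-8 first, then GBK; replace undecodable bytes as a last resort.

    Heuristic only: some GBK byte sequences are also valid UTF-8, so a
    header- or meta-tag-declared charset should win when available.
    """
    for enc in ("utf-8", "gbk"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 but substitute bad bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")

# Usage (hypothetical): decode the raw bytes instead of trusting res.text
#   res = requests.get(url, timeout=10)
#   html = smart_decode(res.content)
```

With requests specifically, setting `res.encoding = res.apparent_encoding` before reading `res.text` achieves a similar effect using its built-in charset detection.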