Scraping Baidu Tieba data with a Python crawler
Date: 2023-11-29 11:06:11
To scrape Baidu Tieba data, we can use Python's requests and BeautifulSoup libraries. The steps are as follows:
1. Import the requests and BeautifulSoup libraries
```python
import requests
from bs4 import BeautifulSoup
```
2. Build the URL and send the request
```python
# A browser-like User-Agent helps avoid being served a stripped-down page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
url = 'https://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0'
response = requests.get(url, headers=headers, timeout=10)
```
Here, the kw parameter specifies the name of the Tieba forum to scrape, and pn is a paging offset, not a page count: each page holds 50 threads, so pn=0 is the first page, pn=50 the second, and so on.
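Since the pn offset advances by 50 per page, the URL for any page can be built with a small helper (build_tieba_url is a name chosen here for illustration, not part of any library):

```python
from urllib.parse import urlencode

def build_tieba_url(kw, page):
    # pn is a 0-based offset that grows by 50 per page (50 threads per page)
    params = {'kw': kw, 'ie': 'utf-8', 'pn': 50 * (page - 1)}
    return 'https://tieba.baidu.com/f?' + urlencode(params)

print(build_tieba_url('python', 1))  # pn=0 for the first page
print(build_tieba_url('python', 3))  # pn=100 for the third page
```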
3. Parse the HTML and extract the data
```python
soup = BeautifulSoup(response.text, 'html.parser')
# Match on a single class token; bs4 checks each class in a multi-valued
# class attribute individually, which is more robust than an exact string
post_list = soup.find_all('li', class_='j_thread_list')
for post in post_list:
    title = post.find('a', class_='j_th_tit')
    if title is None:  # skip ads and other non-thread entries
        continue
    author = post.find('span', class_='tb_icon_author')
    reply_num = post.find('span', class_='threadlist_rep_num')
    print('Title:', title.text.strip())
    print('Author:', author.text.strip() if author else '')
    print('Replies:', reply_num.text.strip() if reply_num else '')
```
Here, we use the find_all method to locate all thread entries, then use the find method to extract each thread's title, author, and reply count and print them.
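The extraction logic can be exercised offline against a small HTML fragment that mimics Tieba's thread-list markup (the fragment below is invented for illustration; the real page is larger and its markup may change):

```python
from bs4 import BeautifulSoup

# Minimal invented fragment mimicking the thread-list markup used above
html = '''
<ul>
  <li class="j_thread_list clearfix">
    <a class="j_th_tit">Getting started with Python</a>
    <span class="tb_icon_author">alice</span>
    <span class="threadlist_rep_num">12</span>
  </li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
rows = []
for post in soup.find_all('li', class_='j_thread_list'):
    rows.append({
        'title': post.find('a', class_='j_th_tit').text.strip(),
        'author': post.find('span', class_='tb_icon_author').text.strip(),
        'replies': post.find('span', class_='threadlist_rep_num').text.strip(),
    })
print(rows)
```

This also shows that passing a single class name to find_all matches elements whose class attribute contains that token among others.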
The complete code is as follows:
```python
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent helps avoid being served a stripped-down page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
url = 'https://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
post_list = soup.find_all('li', class_='j_thread_list')
for post in post_list:
    title = post.find('a', class_='j_th_tit')
    if title is None:  # skip ads and other non-thread entries
        continue
    author = post.find('span', class_='tb_icon_author')
    reply_num = post.find('span', class_='threadlist_rep_num')
    print('Title:', title.text.strip())
    print('Author:', author.text.strip() if author else '')
    print('Replies:', reply_num.text.strip() if reply_num else '')
```
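Building on the pn offset described earlier, extending the script to multiple pages is straightforward. A sketch is below; the one-second pause between requests is a conservative courtesy choice, not a documented Tieba requirement, and the fetch=False dry-run mode is added here so the URL logic can be checked without network access:

```python
import time
import requests

def crawl_pages(kw, num_pages, fetch=True):
    """Yield (url, html) for each thread-list page of forum `kw`."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    for page in range(num_pages):
        url = f'https://tieba.baidu.com/f?kw={kw}&ie=utf-8&pn={50 * page}'
        if fetch:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            yield url, response.text
            time.sleep(1)  # polite pause between requests
        else:
            yield url, None  # dry run: just report the URLs to be fetched

# Dry run: list the first three page URLs without hitting the network
for url, _ in crawl_pages('python', 3, fetch=False):
    print(url)
```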