How to batch-scrape Xiaohongshu post tags with Python and save them with pandas
1. Import the required libraries
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
2. Define the scraping function
```python
def get_tags(url):
    # Send a browser-like User-Agent so the request is less likely to be rejected
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Select the tag links inside the note's tag section
    tags = soup.select('.note-tag-wrap .note-tag-item a')
    tag_list = []
    for tag in tags:
        tag_list.append(tag.text.strip())
    return tag_list
```
This function takes the URL of a Xiaohongshu post and returns a list of that post's tags.
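As a quick sanity check, you can call the function on a single post URL first. This is a minimal usage sketch; whether it returns anything depends on the CSS selector above actually matching the HTML the server sends back, so an empty list is possible:

```python
# Minimal usage sketch: fetch the tags for one post.
# An empty list means the selector matched nothing on the fetched page.
sample_url = 'https://www.xiaohongshu.com/discovery/item/5f7c9b15000000000105d939'
print(get_tags(sample_url))
```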
3. Loop over the URL list, fetch the tag data, and save it
```python
url_list = ['https://www.xiaohongshu.com/discovery/item/5f7c9b15000000000105d939',
            'https://www.xiaohongshu.com/discovery/item/5fc9c7f50000000001066f47',
            'https://www.xiaohongshu.com/discovery/item/5fa764c2000000000105b0e5']
tag_data = []
for url in url_list:
    tags = get_tags(url)
    # Join each post's tags into one comma-separated string so every row has
    # exactly one value; posts can have different numbers of tags, and a
    # ragged list of lists would not fit a single 'Tags' column
    tag_data.append(','.join(tags))
df = pd.DataFrame(tag_data, columns=['Tags'])
df.to_csv('tags.csv', index=False)
```
This loops over the URL list, fetches each post's tags, and joins them into a single comma-separated string per post (posts can have different numbers of tags, so this keeps the rows uniform). pandas then turns the collected strings into a one-column DataFrame and writes it out as a CSV file.
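If you later need the tags as individual values again, a short sketch (assuming the comma-joined, single-column format written above) can load the CSV back and split each row into a list:

```python
import pandas as pd

# Read the CSV written above and split each comma-joined string back into a
# Python list; rows with no tags are read as NaN, so treat them as empty lists
df = pd.read_csv('tags.csv')
df['Tags'] = df['Tags'].fillna('').apply(lambda s: s.split(',') if s else [])
print(df['Tags'].tolist())
```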