
Scraping Weibo content with Python

Date: 2023-10-06 18:05:08 Views: 103

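To scrape Weibo content, you can use the Python libraries requests and BeautifulSoup, with a `requests.Session` to simulate a user login. Below is a basic example (it assumes the target is the posts published by the Weibo user "Python"):

```python
import requests
from bs4 import BeautifulSoup

# Simulate a user login; the session stores the cookies it receives
login_url = 'https://passport.weibo.cn/signin/login'
data = {
    'username': 'your_username',
    'password': 'your_password',
    'savestate': '1',
    'entry': 'mweibo',
    'mainpageflag': '1'
}
session = requests.Session()
session.post(login_url, data=data, timeout=10)

# Fetch the user's posts
url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
# Reuse the session so the login cookies are sent automatically
response = session.get(url, headers=headers, timeout=10)
response.raise_for_status()
json_data = response.json()

# Not every card carries a post, so skip cards without 'mblog'
cards = json_data.get('data', {}).get('cards', [])
for card in cards:
    mblog = card.get('mblog')
    if mblog:
        text = mblog.get('text')
        if text:
            # The post body is an HTML fragment; strip the tags
            soup = BeautifulSoup(text, 'html.parser')
            print(soup.get_text())
```

The code first simulates a user login so that the session holds the login cookies, then sends a GET request with requests and reads the posts from the JSON response. Finally, it uses BeautifulSoup to parse the HTML in each post and extract the body text.

Note that Weibo is sensitive to crawlers: crawling too frequently may get the account or IP banned, so keep the request rate low. You must also comply with the laws and regulations that apply to web crawling and must not scrape illegally.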
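The advice about crawl frequency can be made concrete with a small helper that pauses between consecutive requests. This is only a minimal sketch: `throttled_map`, `min_delay`, and `jitter` are hypothetical names introduced here, and the default delay values are arbitrary assumptions, not limits published by Weibo.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def throttled_map(
    fetch: Callable[[T], R],
    items: Iterable[T],
    min_delay: float = 2.0,   # assumed value, tune for your own use
    jitter: float = 1.0,      # assumed value, adds randomness to the pause
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[R]:
    """Call fetch(item) for each item, pausing between consecutive calls."""
    for index, item in enumerate(items):
        if index > 0:
            # Wait min_delay plus a random extra of up to `jitter` seconds
            # before every call except the first.
            sleep(min_delay + random.uniform(0, jitter))
        yield fetch(item)
```

With the session from the example above, this could be used as `for response in throttled_map(lambda u: session.get(u, headers=headers, timeout=10), urls): ...`, where `urls` is whatever list of pages you intend to fetch. The `sleep` parameter exists only so the pause can be replaced in tests.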
