Complete code for scraping Douban Movie Top 250 data with the requests library and regular expressions

Time: 2024-11-15 22:23:14 Views: 110

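In Python, the `requests` library can send the HTTP request that fetches the page content, and the returned HTML is then parsed to extract the data we need. Note that the simple example below does its parsing with BeautifulSoup, not with the `re` (regular expression) module. It extracts the title and link of each movie on the first page of Douban Movie Top 250, since only the base URL is requested:

```python
import requests
from bs4 import BeautifulSoup

# Request URL
url = "https://movie.douban.com/top250"

# A browser-like User-Agent; Douban may reject the default one sent by requests
headers = {"User-Agent": "Mozilla/5.0"}

# Send the GET request
response = requests.get(url, headers=headers, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content (the 'lxml' parser requires the lxml package)
    soup = BeautifulSoup(response.text, 'lxml')

    # Find the movie list entries
    movie_list = soup.select('.item')

    # Extract the movie info (assuming each entry contains a title and a link)
    movies_data = []
    for movie in movie_list:
        title_element = movie.find('span', {'class': 'title'})
        link_element = movie.find('a', href=True)

        title = title_element.text.strip() if title_element else ''
        link = link_element['href'] if link_element else ''

        movies_data.append({'title': title, 'link': link})

    # Print or save the data
    for movie in movies_data:
        print(f"Title: {movie['title']}")
        print(f"Link: {movie['link']}\n")
else:
    print("Request failed, status code:", response.status_code)
```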
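Since the title asks for regular expressions, here is a minimal sketch of the same extraction done with `re` instead of BeautifulSoup. Several things in it are assumptions rather than facts from the text above: that each entry's markup looks like `<div class="hd"><a href="..."><span class="title">...</span>`, that the list is paged through a `start` query parameter in steps of 25, and that a browser-like `User-Agent` is enough to get a normal response. The names `ITEM_PATTERN`, `parse_movies`, and `crawl_top250` are made up for illustration. If the site's HTML differs, the pattern will need adjusting.

```python
import re

# Assumed markup for one entry (not confirmed against the live site):
#   <div class="hd"> <a href="LINK" ...> <span class="title">TITLE</span> ...
ITEM_PATTERN = re.compile(
    r'<div class="hd">\s*<a href="([^"]+)"[^>]*>\s*<span class="title">([^<]+)</span>'
)


def parse_movies(html):
    """Return a list of {'title', 'link'} dicts found in one page of HTML."""
    return [
        {"title": title.strip(), "link": link}
        for link, title in ITEM_PATTERN.findall(html)
    ]


def crawl_top250(base_url="https://movie.douban.com/top250",
                 page_size=25, total=250):
    """Fetch every page and collect the parsed entries.

    The `start` paging parameter and the page size are assumptions.
    """
    # Third-party dependency, imported here so parse_movies() can be
    # used on saved HTML without requests installed.
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}
    movies = []
    for start in range(0, total, page_size):
        response = requests.get(
            base_url, params={"start": start}, headers=headers, timeout=10
        )
        response.raise_for_status()  # stop on a non-2xx response
        movies.extend(parse_movies(response.text))
    return movies
```

Calling `crawl_top250()` returns the combined list, which can be printed the same way as `movies_data` above. Keeping the parsing in its own function means the pattern can be checked against a saved page before any requests are sent.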