
Scraping the Douban movie dataset with Python

Date: 2023-10-02 21:12:57 Views: 166

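Scraping the Douban movie dataset takes a small Python crawler. The steps are as follows:

1. Install the requests and BeautifulSoup4 libraries

```shell
pip install requests
pip install beautifulsoup4
```

2. Send an HTTP request to fetch the page content

```python
import requests

url = 'https://movie.douban.com/top250'
# A browser-like User-Agent; Douban tends to reject the default requests one
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
html = response.text
```

3. Parse the page content with BeautifulSoup

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
```

4. Find and extract the data you need

```python
movies = []
for item in soup.select('.item'):
    # Some entries have no quote, so guard against a missing element
    quote = item.select_one('.quote')
    movie = {
        'title': item.select_one('.title').text.strip(),
        'score': item.select_one('.rating_num').text.strip(),
        'quote': quote.text.strip() if quote else '',
    }
    movies.append(movie)
```

5. Save the data to a file

```python
import json

with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=4)
```

The complete code is as follows:

```python
import json

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# A browser-like User-Agent; Douban tends to reject the default requests one
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')

movies = []
for item in soup.select('.item'):
    # Some entries have no quote, so guard against a missing element
    quote = item.select_one('.quote')
    movies.append({
        'title': item.select_one('.title').text.strip(),
        'score': item.select_one('.rating_num').text.strip(),
        'quote': quote.text.strip() if quote else '',
    })

# ensure_ascii=False keeps the Chinese text readable in the output file
with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=4)
```

When the script finishes, a file named `movies.json` appears in the current directory. It contains the title, rating, and one-line quote of each film on the fetched page. Because the script requests only the first page, that is the first 25 entries of the Top 250 list rather than all 250.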
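
One request returns a single page of the list, so covering all 250 entries means repeating it once per page. The following is a minimal sketch, assuming the list accepts a `start` query parameter that advances in steps of 25. The parameter name, the page size, the one-second delay, and the helper names `page_params`, `fetch_with_requests`, and `crawl` are all assumptions for illustration and are not part of the original answer.

```python
import time

BASE_URL = 'https://movie.douban.com/top250'
HEADERS = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like header
PAGE_SIZE = 25   # assumed number of films per page
TOTAL = 250      # size of the Top 250 list


def page_params(total=TOTAL, page_size=PAGE_SIZE):
    """Yield the query parameters for each page: start=0, 25, ..., 225."""
    for start in range(0, total, page_size):
        yield {'start': start}


def fetch_with_requests(url, params):
    """Fetch one page and return its HTML.

    requests is imported here so the paging logic itself has no hard
    dependency on it and can be exercised offline.
    """
    import requests
    response = requests.get(url, params=params, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(fetch_html, parse_html, delay=1.0):
    """Fetch every page, parse it, and return the combined list of movies.

    fetch_html(url, params) returns a page's HTML as a string;
    parse_html(html) returns a list of movie dicts for that page.
    """
    movies = []
    for params in page_params():
        html = fetch_html(BASE_URL, params)
        movies.extend(parse_html(html))
        time.sleep(delay)  # pause between requests to stay polite
    return movies
```

To use it, wrap the extraction loop from step 4 in a function that takes the HTML string and returns the list of dicts, then call `crawl(fetch_with_requests, that_function)`. Passing the fetcher and parser in as arguments keeps the paging loop separate from the network and from the page layout, so either can change without touching the loop.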
