Scraping a Douban movie dataset with Python
Date: 2023-10-02 11:12:57
Scraping movie data from Douban takes a small Python crawler. The steps are as follows:
1. Install the requests and BeautifulSoup4 libraries
```shell
pip install requests
pip install beautifulsoup4
```
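2. Send an HTTP request to fetch the page content
```python
import requests
url = 'https://movie.douban.com/top250'
# Douban usually rejects the default requests User-Agent, so send a
# browser-like one (the exact string below is only an example value)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on a non-2xx status
html = response.text
```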
3. Parse the page content with BeautifulSoup
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
```
4. Find and extract the data you need
```python
movies = []
for item in soup.select('.item'):
    movie = {}
    movie['title'] = item.select('.title')[0].text.strip()
    movie['score'] = item.select('.rating_num')[0].text.strip()
    # Some entries have no quote, so guard against a missing element
    quote = item.select_one('.quote')
    movie['quote'] = quote.text.strip() if quote else ''
    movies.append(movie)
```
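Before running the selectors against the live site, it can help to try them on a small inline snippet. The sketch below does that. The markup and the movie data are made up and heavily simplified (only the class names `item`, `title`, `rating_num`, and `quote` come from the step above), and the helpers `text_or_none` and `parse_movies` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Made-up, simplified markup imitating two list entries.
# The second entry deliberately has no quote element.
SAMPLE_HTML = """
<ol>
  <li><div class="item">
    <span class="title">Sample Movie A</span>
    <span class="rating_num">9.0</span>
    <p class="quote"><span>A sample one-line quote.</span></p>
  </div></li>
  <li><div class="item">
    <span class="title">Sample Movie B</span>
    <span class="rating_num">8.5</span>
  </div></li>
</ol>
"""


def text_or_none(item, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = item.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_movies(html):
    """Extract title, score, and quote from every element with class 'item'."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        {
            'title': text_or_none(item, '.title'),
            'score': text_or_none(item, '.rating_num'),
            'quote': text_or_none(item, '.quote'),
        }
        for item in soup.select('.item')
    ]


movies = parse_movies(SAMPLE_HTML)
```

Using `select_one` with a `None` check means an entry without a quote yields `None` instead of raising `IndexError`, which is what indexing `[0]` into an empty result would do.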
5. Save the data to a file
```python
import json
with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=4)
```
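To see what `ensure_ascii=False` changes, the following sketch writes a made-up record to a temporary file and reads it back. The sample record and the temporary path are assumptions for illustration only; a real run would use the scraped `movies` list and `movies.json`.

```python
import json
import os
import tempfile

# Made-up sample record containing non-ASCII text
movies = [{'title': '示例电影', 'score': '9.0', 'quote': '一句示例短评。'}]

# Write to a temporary directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'movies.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=4)

# Read the raw text back and parse it again
with open(path, encoding='utf-8') as f:
    raw = f.read()
loaded = json.loads(raw)

# For comparison: the default (ensure_ascii=True) escapes non-ASCII characters
escaped = json.dumps(movies)
```

With `ensure_ascii=False` the file keeps the Chinese characters readable; with the default setting they are written as `\uXXXX` escapes. Both forms parse back to the same data.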
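The complete code is as follows:
```python
import requests
from bs4 import BeautifulSoup
import json

url = 'https://movie.douban.com/top250'
# Douban usually rejects the default requests User-Agent, so send a
# browser-like one (the exact string below is only an example value)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on a non-2xx status
html = response.text

soup = BeautifulSoup(html, 'html.parser')
movies = []
for item in soup.select('.item'):
    movie = {}
    movie['title'] = item.select('.title')[0].text.strip()
    movie['score'] = item.select('.rating_num')[0].text.strip()
    # Some entries have no quote, so guard against a missing element
    quote = item.select_one('.quote')
    movie['quote'] = quote.text.strip() if quote else ''
    movies.append(movie)

with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=4)
```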
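When the script finishes, a file named `movies.json` appears in the current directory. It contains the title, rating, and one-line quote of each movie on the first page of the Douban Top 250 list (25 entries).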
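A single request to the URL above returns only one page of 25 entries; the remaining pages are reached through the `start` query parameter (`?start=25`, `?start=50`, and so on). The sketch below shows one way to walk all ten pages. The names `page_urls` and `scrape_all` are hypothetical, and `fetch_page` stands for any callable that takes a URL and returns a list of movie dicts, for example the request and parsing steps wrapped in a function.

```python
import time

BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # entries per page
TOTAL = 250      # entries in the whole list


def page_urls(base_url=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Yield one URL per page: ?start=0, ?start=25, ..., ?start=225."""
    for start in range(0, total, page_size):
        yield f'{base_url}?start={start}'


def scrape_all(fetch_page, delay=1.0):
    """Collect the movies from every page.

    fetch_page: hypothetical callable, URL -> list of movie dicts.
    delay: seconds to wait after each request, to avoid hammering the site.
    """
    movies = []
    for url in page_urls():
        movies.extend(fetch_page(url))
        time.sleep(delay)
    return movies
```

Passing the fetch function in as an argument keeps the paging logic separate from the network code, so it can be checked with a stub before any real request is sent.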