Scraping the Douban Movie Top 250 with Python and storing the data in a database
Date: 2024-01-08 09:04:16
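Sure. This can be done with Python's requests, BeautifulSoup, and pymysql libraries.
First, request the Douban Movie Top 250 page and get its HTML text. Note that this URL returns only the first 25 entries; the remaining pages are addressed with a `start` query parameter. The code is as follows: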
```python
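import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Douban tends to reject requests that lack a browser-like User-Agent,
# so send one explicitly (the exact string here is only an example).
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on a non-2xx status
html = response.text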
```
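The request above covers only the first page. As a rough sketch of how all ten pages might be collected, the helper below assumes that the list is paginated in steps of 25 through a `start` query parameter, which is how the site's pager links appear to work. The names `page_urls` and `fetch_all` are hypothetical, and `session` is expected to be a `requests.Session` (or any object with a compatible `get` method).

```python
import time
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, page_size=25):
    """Build one URL per result page (assumes a `start` offset parameter)."""
    return [
        BASE_URL + '?' + urlencode({'start': start})
        for start in range(0, total, page_size)
    ]


def fetch_all(session, headers=None, delay=1.0):
    """Download every page with the given session and return the HTML texts."""
    pages = []
    for url in page_urls():
        response = session.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # stop on the first failed page
        pages.append(response.text)
        time.sleep(delay)  # pause between requests to stay polite
    return pages
```

With `requests` installed, a call such as `fetch_all(requests.Session(), headers={'User-Agent': 'Mozilla/5.0'})` would return ten HTML strings, each of which can be parsed in the same way as a single page.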
Next, parse the HTML text with BeautifulSoup and extract the movie information we need. The code is as follows:
```python
soup = BeautifulSoup(html, 'html.parser')
movies = []
for movie in soup.select('.item'):
    title = movie.select('.title')[0].text  # movie title
    score = movie.select('.rating_num')[0].text  # rating
    info = movie.select('.bd p')[0].text.strip()  # description block
    lines = info.split('\n')
    director_and_cast = lines[0].strip()  # director and cast
    year_and_region = lines[-1].strip()  # release year and region
    movies.append({
        'title': title,
        'score': score,
        'director_and_cast': director_and_cast,
        'year_and_region': year_and_region,
    })
```
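The `year_and_region` string produced above is still a single line that also carries the genres. As a minimal sketch of how it could be broken into fields, the helper below assumes the line looks like `1994 / 美国 / 犯罪 剧情`, with the fields separated by slashes that appear to be padded with non-breaking spaces on the page. `split_year_region` is a hypothetical name, and entries that do not match this layout are returned unsplit.

```python
def split_year_region(line):
    """Split a 'year / region / genres' line into separate fields."""
    # Normalise non-breaking spaces to plain spaces before splitting.
    parts = [p.strip() for p in line.replace('\xa0', ' ').split('/')]
    if len(parts) < 3:
        # Unexpected layout: keep the raw text rather than guess.
        return {'year': line.strip(), 'region': '', 'genres': []}
    return {
        # Usually a single year; anything before the last two fields
        # is kept together in case an entry lists several years.
        'year': ' / '.join(parts[:-2]),
        'region': parts[-2],
        'genres': parts[-1].split(),
    }
```

Treating the last field as genres and the second-to-last as the region keeps the helper tolerant of entries with more than one year, at the cost of leaving multi-country regions as one string.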
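Finally, store the scraped movie information in a database. Here we use the pymysql library to connect to MySQL and insert the records into a table named `movies`. The code is as follows: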
```python
import pymysql

conn = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    database='test',
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)
# The statement is the same for every row, so define it once.
sql = '''
    INSERT INTO movies (title, score, director_and_cast, year_and_region)
    VALUES (%s, %s, %s, %s)
'''
try:
    with conn.cursor() as cursor:
        for movie in movies:
            cursor.execute(sql, (
                movie['title'],
                movie['score'],
                movie['director_and_cast'],
                movie['year_and_region']
            ))
    conn.commit()  # commit once after all rows are inserted
finally:
    conn.close()  # always release the connection
```
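The row-by-row loop above could also be written as a single batch. The sketch below is one possible variant, not the original author's code: it assumes `conn` is an open pymysql connection and `movies` is the list of dictionaries built earlier, and the names `to_rows`, `INSERT_SQL`, and `save_movies` are hypothetical. It uses the cursor's `executemany` method and rolls the transaction back if any row fails.

```python
INSERT_SQL = (
    'INSERT INTO movies (title, score, director_and_cast, year_and_region) '
    'VALUES (%s, %s, %s, %s)'
)


def to_rows(movies):
    """Convert the scraped dictionaries into tuples in column order."""
    return [
        (
            m['title'],
            float(m['score']),  # the score is scraped as text; store a number
            m['director_and_cast'],
            m['year_and_region'],
        )
        for m in movies
    ]


def save_movies(conn, movies):
    """Insert all rows in one transaction; roll back if anything fails."""
    rows = to_rows(movies)
    try:
        with conn.cursor() as cursor:
            cursor.executemany(INSERT_SQL, rows)
        conn.commit()
    except Exception:
        conn.rollback()  # leave the table unchanged on failure
        raise
    return len(rows)
```

Calling `save_movies(conn, movies)` returns the number of rows that were handed to the database, which makes it easy to check that all scraped entries were written.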
The complete code is as follows:
```python
import requests
from bs4 import BeautifulSoup
import pymysql

url = 'https://movie.douban.com/top250'
# Douban tends to reject requests that lack a browser-like User-Agent,
# so send one explicitly (the exact string here is only an example).
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on a non-2xx status
html = response.text

soup = BeautifulSoup(html, 'html.parser')
movies = []
for movie in soup.select('.item'):
    title = movie.select('.title')[0].text  # movie title
    score = movie.select('.rating_num')[0].text  # rating
    info = movie.select('.bd p')[0].text.strip()  # description block
    lines = info.split('\n')
    director_and_cast = lines[0].strip()  # director and cast
    year_and_region = lines[-1].strip()  # release year and region
    movies.append({
        'title': title,
        'score': score,
        'director_and_cast': director_and_cast,
        'year_and_region': year_and_region,
    })

conn = pymysql.connect(
    host='localhost',
    user='root',
    password='password',
    database='test',
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)
# The statement is the same for every row, so define it once.
sql = '''
    INSERT INTO movies (title, score, director_and_cast, year_and_region)
    VALUES (%s, %s, %s, %s)
'''
try:
    with conn.cursor() as cursor:
        for movie in movies:
            cursor.execute(sql, (
                movie['title'],
                movie['score'],
                movie['director_and_cast'],
                movie['year_and_region']
            ))
    conn.commit()  # commit once after all rows are inserted
finally:
    conn.close()  # always release the connection
```
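Note that the database connection settings need to be adjusted to your environment. You also need to create a table named `movies` in the database beforehand to store the movie information. The table structure is as follows: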
```sql
CREATE TABLE movies (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    score FLOAT,
    director_and_cast VARCHAR(255),
    year_and_region VARCHAR(255)
);
```