Code for scraping the Douban Top 250 list page with urllib, bs4, re, and pymysql
Date: 2024-06-10 12:09:08
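The following example uses urllib, bs4, re, and pymysql to scrape movie information from the first page of the Douban Top 250 list and save it to a MySQL table: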
```python
import re
import urllib.request

import pymysql
from bs4 import BeautifulSoup

# Connect to the database (replace the credentials with your own)
db = pymysql.connect(host='localhost', user='root', password='password',
                     database='test', charset='utf8mb4')

# Fetch the HTML of the list page
url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
req = urllib.request.Request(url=url, headers=headers)
with urllib.request.urlopen(req) as response:
    html = response.read().decode('utf-8')

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
movie_list = soup.find('ol', class_='grid_view').find_all('li')

# `rank` is a reserved word in MySQL 8.0, so it is quoted with backticks.
# The %s placeholders let pymysql escape the values, which avoids broken
# statements (and SQL injection) when a title contains a quote character.
sql = ("INSERT INTO movies(`rank`, title, rating, link, year, types) "
       "VALUES (%s, %s, %s, %s, %s, %s)")

# Walk the movie list, extract each movie's fields, and save them
try:
    with db.cursor() as cursor:
        for movie in movie_list:
            # Rank, title, rating, and link of the movie
            rank = movie.find('em').get_text()
            title = movie.find('span', class_='title').get_text()
            rating = movie.find('span', class_='rating_num').get_text()
            link = movie.find('a')['href']
            # Use regular expressions to pull the year and type information
            info = movie.find('p', class_='').get_text()
            year = re.findall(r'\d{4}', info)[0]   # first four-digit number
            types = re.findall(r'\s/\s(.*)', info)[0]   # text after the first " / "
            # Insert the row into the database
            cursor.execute(sql, (rank, title, rating, link, year, types))
    db.commit()
finally:
    # Close the database connection
    db.close()
```
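The script above requests only `https://movie.douban.com/top250`, and its regular expressions take the first match found anywhere in the info text. The sketch below shows two small helpers that could complement it, using only the standard library. It rests on assumptions that are not in the original: that the list is paged with a `start` query parameter in steps of 25, and that the info text contains a line shaped like `year / country / genres`. The function names `page_urls` and `parse_info` and the sample string are hypothetical.

```python
import re
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, page_size=25):
    """Build one URL per list page (assumes paging via a `start` parameter)."""
    return [f"{BASE_URL}?{urlencode({'start': start})}"
            for start in range(0, total, page_size)]


def parse_info(info):
    """Split a 'year / country / genres' line into (year, country, genres).

    Returns None when no such line is found, instead of raising IndexError.
    """
    # A four-digit year, then two " / "-separated fields on the same line
    match = re.search(r'(\d{4})[^/\n]*/\s*([^/\n]+?)\s*/\s*([^\n]+)', info)
    if match is None:
        return None
    year, country, genres = match.groups()
    return year, country.strip(), genres.strip().split()


if __name__ == '__main__':
    # Hypothetical sample mimicking the info paragraph of one list item
    sample = ("\n  导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0"
              "主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
              "  1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n  ")
    print(page_urls()[:2])
    print(parse_info(sample))
```

Each URL from `page_urls()` could be fetched in a loop with the same `Request` and `urlopen` calls as above, ideally with a short pause between requests. Checking the return value of `parse_info` for `None` lets the loop skip an entry with an unexpected layout rather than stop on it.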
Hope this helps!