I want to build a big-data graduation project: crawl douban movie page data using the Edge browser and connect it to a MySQL database. Could you give concrete code?
Time: 2024-11-25 10:15:40
For a big-data graduation project like this, you can write the code in the following steps:
First, install the necessary libraries: Selenium for browser automation, requests and BeautifulSoup for fetching and parsing HTML, and pymysql for interacting with the MySQL database.
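These can be installed with pip (package names as published on PyPI; `lxml` is included because the parsing code below uses the lxml parser):

```shell
pip install selenium requests beautifulsoup4 pymysql lxml
```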
```python
# Import libraries
import requests
from bs4 import BeautifulSoup
import pymysql

# Connect to the database
def connect_mysql():
    db_config = {
        'host': 'localhost',
        'user': 'your_username',
        'password': 'your_password',
        'database': 'douban_movie',
        'charset': 'utf8mb4',
    }
    try:
        conn = pymysql.connect(**db_config)
        return conn
    except Exception as e:
        print(f"Error connecting to MySQL: {e}")
        return None

# Crawl the page
def crawl_douban(url):
    # douban tends to reject requests without a browser-like User-Agent
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'lxml')
        # Parse the movie info according to the HTML structure
        movie_data = parse_html(soup)  # custom function that parses the HTML
        return movie_data
    print(f"Failed to fetch page, status code: {response.status_code}")
    return []

# Parse the HTML
def parse_html(soup):
    # Example: look up movie entries with class 'movie_item'
    # (a placeholder -- adjust the tag/class names to the page's real structure)
    movies = soup.find_all('div', class_='movie_item')
    data = []
    for movie in movies:
        title = movie.find('h3').text.strip()
        rating = movie.find('span', class_='rating_num').text.strip()
        link = movie.find('a')['href']
        data.append({'title': title, 'rating': rating, 'link': link})
    return data

# Store the data in the database
def store_to_db(movie_data, conn):
    if not conn:
        return
    try:
        with conn.cursor() as cursor:
            insert_sql = """
                INSERT INTO movie_table (title, rating, link)
                VALUES (%s, %s, %s)
            """
            for movie in movie_data:
                # Pass the parameters as an explicit tuple rather than dict.values()
                cursor.execute(insert_sql, (movie['title'], movie['rating'], movie['link']))
        conn.commit()
        print("Data stored successfully.")
    except Exception as e:
        print(f"Error storing data: {e}")
    finally:
        conn.close()

if __name__ == "__main__":
    url = "https://movie.douban.com/top250"
    db_conn = connect_mysql()
    if db_conn:
        movies = crawl_douban(url)
        store_to_db(movies, db_conn)
```
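The INSERT statements above assume a `movie_table` already exists. A possible schema is sketched below; the column names match the INSERT statement, but the types and sizes are assumptions you should adjust to your data:

```python
# Assumed schema for movie_table -- column types and sizes are guesses.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS movie_table (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    rating VARCHAR(10),
    link VARCHAR(512)
)
"""

def create_table(conn):
    # Run the DDL once on an open pymysql connection
    with conn.cursor() as cursor:
        cursor.execute(CREATE_TABLE_SQL)
    conn.commit()
```

Call `create_table(db_conn)` once before the first `store_to_db`.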
In this example you need to tailor the `parse_html` function to the actual HTML structure of the douban movie page, and remember to replace the database connection details in `connect_mysql`. When running the script, watch for network problems and for changes in the page structure, since the page may be updated over time.
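The question also asks about driving the Edge browser. For static pages, requests alone is enough; but if a page renders its content with JavaScript, you can fetch it through Selenium instead. A minimal sketch in Selenium 4 style (it requires the selenium package and an installed Edge browser, which is why it is not wired into the script above):

```python
def fetch_with_edge(url):
    # Imports are local so the rest of the script still runs without Selenium installed
    from selenium import webdriver
    from selenium.webdriver import EdgeOptions

    options = EdgeOptions()
    options.add_argument("--headless=new")  # run without opening a browser window
    driver = webdriver.Edge(options=options)
    try:
        driver.get(url)
        return driver.page_source  # the fully rendered HTML
    finally:
        driver.quit()
```

The returned HTML can be handed to `BeautifulSoup(html, 'lxml')` and `parse_html` exactly like `response.content` above.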