Scraping Douban movie data, saving it to a local CSV file, and importing the data into MySQL
You can follow the steps below to complete this task:
1. Install the Scrapy crawling framework and the pymysql MySQL client library:
```
pip install scrapy
pip install pymysql
```
2. Create a Scrapy project:
```
scrapy startproject douban_movie
```
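This command generates the standard Scrapy project skeleton; the spider from the next step goes under the inner spiders/ package. With recent Scrapy versions the layout looks roughly like this:
```
douban_movie/
├── scrapy.cfg
└── douban_movie/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```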
3. Write the spider. In the douban_movie/spiders directory, create a file named douban_spider.py that crawls the Douban Movie Top 250 pages and saves the results to a local CSV file:
```python
import csv

import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = ['https://movie.douban.com/top250']
    custom_settings = {
        # Douban blocks Scrapy's default user agent, so spoof a browser
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        # Douban's robots.txt disallows these pages; disable the check for this exercise
        'ROBOTSTXT_OBEY': False,
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Start a fresh CSV file with a header row
        with open('douban_movie.csv', mode='w', newline='', encoding='utf-8') as file:
            csv.writer(file).writerow(['title', 'rating', 'comment'])

    def parse(self, response):
        for movie in response.css('.item'):
            item = {
                'title': movie.css('.title::text').get(),
                'rating': movie.css('.rating_num::text').get(),
                'comment': movie.css('.quote span::text').get(),
            }
            # Append each movie to the local CSV file
            with open('douban_movie.csv', mode='a', newline='', encoding='utf-8') as file:
                csv.writer(file).writerow([item['title'], item['rating'], item['comment']])
            yield item

        # Follow the pagination link until the last page
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
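As a side note, Scrapy's built-in feed exports can produce the CSV instead of the manual file handling inside parse. A minimal sketch for douban_movie/settings.py (the FEEDS option is standard in Scrapy 2.1+; the file name here is just an example):
```python
# douban_movie/settings.py
# Let Scrapy's feed exporter write every yielded item to CSV,
# overwriting the file on each run instead of appending.
FEEDS = {
    'douban_movie.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
        'overwrite': True,
    },
}
```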
4. Import the results into the MySQL database. In the douban_movie directory, create a file named mysql_pipeline.py that loads the data from the CSV file into MySQL:
```python
import csv

import pymysql


class MysqlPipeline:
    def __init__(self):
        # Connect to the local MySQL server
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password='password',
            db='douban_movie',
            charset='utf8mb4'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one movie record and commit
        self.cursor.execute(
            "INSERT INTO movie(title, rating, comment) VALUES (%s, %s, %s)",
            (item['title'], item['rating'], item['comment'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


if __name__ == '__main__':
    # Reuse one connection for the whole import rather than opening one per row
    pipeline = MysqlPipeline()
    with open('douban_movie.csv', mode='r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for row in reader:
            pipeline.process_item(
                {'title': row[0], 'rating': row[1], 'comment': row[2]}, None
            )
    pipeline.close_spider(None)
```
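The class above already has the shape of a regular Scrapy item pipeline, so as an alternative to the CSV detour you could let Scrapy insert rows into MySQL while crawling by registering it in settings.py. A sketch, assuming mysql_pipeline.py sits inside the inner douban_movie package:
```python
# douban_movie/settings.py
# Route every yielded item through MysqlPipeline (priority 300)
ITEM_PIPELINES = {
    'douban_movie.mysql_pipeline.MysqlPipeline': 300,
}
```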
5. Run the spider, then import the data:
```
scrapy crawl douban
python mysql_pipeline.py
```
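To confirm the import worked, a quick row count is enough. A small check script, reusing the same connection settings as mysql_pipeline.py:
```python
import pymysql

# Count the rows imported into the movie table
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='password', db='douban_movie', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM movie")
    print(cursor.fetchone()[0])  # expect 250 for the full Top 250 list
conn.close()
```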
Note: before running mysql_pipeline.py, you need to create the MySQL database and the movie table. This can be done with the following SQL statements:
```
CREATE DATABASE douban_movie CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE douban_movie;
CREATE TABLE movie (
    id INT(11) NOT NULL AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    rating FLOAT NOT NULL,
    comment VARCHAR(255),
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```