How can a scrapy-redis distributed crawler write its scraped data into a MySQL database?
Hello! To write data scraped by a Scrapy-Redis distributed crawler into a MySQL database, you can follow these steps:
1. Install the PyMySQL library: pip install pymysql
2. Add the database configuration to the Scrapy project's settings.py file:
```
MYSQL_HOST = 'localhost'    # MySQL server host
MYSQL_DBNAME = 'test'       # database name; change as needed
MYSQL_USER = 'root'         # database user; change as needed
MYSQL_PASSWORD = '123456'   # database password; change as needed
MYSQL_PORT = 3306           # database port
```
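Since this is a scrapy-redis project, the same settings.py also needs the scrapy-redis scheduler and a Redis connection. A minimal sketch using scrapy-redis's documented setting names (adjust REDIS_URL to your Redis instance):
```
# Route scheduling and deduplication through Redis so multiple
# crawler processes can share one request queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue in Redis between runs so crawls can be paused and resumed
SCHEDULER_PERSIST = True
# Redis connection; change to match your environment
REDIS_URL = 'redis://localhost:6379'
```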
3. Define the fields to be stored in items.py, for example:
```
import scrapy


class NewsItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
```
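The pipeline below inserts into an existing table, so create it first. Here is a minimal one-off sketch that builds a news table matching these fields; the table name and column types are assumptions, so adapt them to your data:
```
import pymysql

# One-off setup script: create the target table (assumed name: news)
db = pymysql.connect(host='localhost', port=3306, user='root',
                     password='123456', db='test')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS news (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        url VARCHAR(512),
        date VARCHAR(64)
    )
""")
db.commit()
db.close()
```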
4. Write a MySQL storage pipeline in pipelines.py:
```
import pymysql


class MySQLPipeline(object):
    def __init__(self, mysql_host, mysql_port, mysql_user, mysql_password, mysql_dbname):
        self.host = mysql_host
        self.port = mysql_port
        self.user = mysql_user
        self.password = mysql_password
        self.dbname = mysql_dbname

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mysql_host=crawler.settings.get('MYSQL_HOST'),
            mysql_port=crawler.settings.get('MYSQL_PORT'),
            mysql_user=crawler.settings.get('MYSQL_USER'),
            mysql_password=crawler.settings.get('MYSQL_PASSWORD'),
            mysql_dbname=crawler.settings.get('MYSQL_DBNAME')
        )

    def open_spider(self, spider):
        # One connection per spider run
        self.db = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            db=self.dbname
        )
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        table = 'news'  # table name; change as needed
        # Column names come from the item fields; the values are passed as
        # parameters so pymysql escapes them safely.
        sql = f'INSERT INTO {table} ({keys}) VALUES ({values})'
        try:
            self.cursor.execute(sql, tuple(data.values()))
            self.db.commit()
        except pymysql.MySQLError:
            # Roll back the failed insert so the connection stays usable
            self.db.rollback()
            raise
        return item
```
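The pipeline receives items from your spider. For reference, here is a minimal sketch of what the scrapy-redis spider side might look like; the spider name, redis_key, and CSS selectors are hypothetical placeholders:
```
from scrapy_redis.spiders import RedisSpider
from myproject.items import NewsItem


class NewsSpider(RedisSpider):
    name = 'news'
    # Start URLs are popped from this Redis list instead of start_urls
    redis_key = 'news:start_urls'

    def parse(self, response):
        item = NewsItem()
        # Hypothetical selectors -- replace with ones matching your site
        item['title'] = response.css('h1::text').get()
        item['url'] = response.url
        item['date'] = response.css('.date::text').get()
        yield item
```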
5. Enable the pipeline in settings.py:
```
ITEM_PIPELINES = {
    'myproject.pipelines.MySQLPipeline': 300,
}
```
With this in place, whenever Scrapy scrapes an item, MySQLPipeline will write it into the MySQL database.
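One last step for distributed runs: scrapy-redis spiders sit idle until start URLs appear in Redis, so seed the queue before (or after) launching the worker processes. A sketch using the redis-py client, with the key matching the hypothetical redis_key above:
```
import redis

# Seed the crawl: every scrapy-redis worker pops URLs from this list
r = redis.Redis(host='localhost', port=6379)
r.lpush('news:start_urls', 'https://example.com/news')
```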