Scraping images with Scrapy and storing them in MySQL
Date: 2023-08-28 20:19:46
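To scrape images with Scrapy and store them in MySQL, follow these steps:
1. Create a MySQL database connection in the Scrapy project (the PyMySQL library works for this).
2. Create an Item class to hold the image URL, the image title, and any related fields.
3. In the spider, use XPath or CSS selectors to extract the image URL and title, and pass them to the pipelines through the Item.
4. In the pipeline, download the image with the requests library and save it to the local file system.
5. Use a Python MySQL library to insert the image path and other metadata into the MySQL database.
Here is a simple example: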
```python
from io import BytesIO

import pymysql
import requests
from PIL import Image
from scrapy.exceptions import DropItem


class MySQLPipeline:
    """Download each item's image and record its title and file name in MySQL."""

    def __init__(self, db_host, db_port, db_user, db_password, db_name):
        self.db_host = db_host
        self.db_port = db_port
        self.db_user = db_user
        self.db_password = db_password
        self.db_name = db_name
        self.conn = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py.
        return cls(
            db_host=crawler.settings.get('DB_HOST'),
            db_port=crawler.settings.getint('DB_PORT'),
            db_user=crawler.settings.get('DB_USER'),
            db_password=crawler.settings.get('DB_PASSWORD'),
            db_name=crawler.settings.get('DB_NAME'),
        )

    def open_spider(self, spider):
        # Open one connection per crawl and reuse it for every item.
        self.conn = pymysql.connect(
            host=self.db_host,
            port=self.db_port,
            user=self.db_user,
            password=self.db_password,
            database=self.db_name,
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor,
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # Download the image to the local file system.
            image_url = item['image_urls'][0]
            image_title = item['title']
            image_extension = image_url.split('.')[-1]
            image_name = f'{image_title}.{image_extension}'
            # Swap the thumbnail URL for the full-size one (site-specific).
            image_url = image_url.replace('thumb180', 'large')
            # Constructing a scrapy.Request does not fetch anything by itself,
            # so download the bytes with requests (this call blocks).
            response = requests.get(image_url, timeout=30)
            response.raise_for_status()
            image = Image.open(BytesIO(response.content))
            image.save(image_name, quality=95)
            # Insert the image metadata into the database.
            sql = "INSERT INTO images (title, path) VALUES (%s, %s)"
            self.cursor.execute(sql, (image_title, image_name))
            self.conn.commit()
        except Exception as e:
            spider.logger.error("Failed to process %s: %s", item.get('image_urls'), e)
            raise DropItem(f"Error processing item: {item.get('image_urls')}")
        # Return the item so that later pipelines still receive it.
        return item
```
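The pipeline above builds the file name directly from the item's title and from whatever follows the last dot in the URL, which can break when the title contains characters such as `/` or when the URL carries a query string. One possible way to harden that step is sketched below. The helper name `build_image_name` and the `default_ext` fallback are hypothetical and not part of the original answer; only the standard library is used.

```python
import os
import re
from urllib.parse import urlparse


def build_image_name(title, url, default_ext="jpg"):
    """Return a file-system-safe '<title>.<ext>' name for an image URL."""
    # Take the extension from the URL path only, ignoring ?query and #fragment.
    ext = os.path.splitext(urlparse(url).path)[1].lstrip(".").lower()
    if not ext:
        ext = default_ext  # assumption: fall back to jpg when the URL has no extension
    # Collapse whitespace and characters that are invalid in file names into "_".
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", (title or "").strip()).strip("_")
    if not safe_title:
        safe_title = "untitled"
    return f"{safe_title}.{ext}"
```

In `process_item`, this could replace the two lines that compute `image_extension` and `image_name`, for example `image_name = build_image_name(item['title'], image_url)`.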
Add the following configuration to the project's settings.py:
```python
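# Lower numbers run first. Scrapy's built-in ImagesPipeline also downloads
# every URL in image_urls into IMAGES_STORE.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'myproject.pipelines.MySQLPipeline': 2,
}
IMAGES_STORE = '/path/to/your/images/folder'
# MySQL connection parameters read by MySQLPipeline.from_crawler.
DB_HOST = 'localhost'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = 'password'
DB_NAME = 'database_name'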
```
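The pipeline's `INSERT INTO images (title, path)` statement assumes that an `images` table already exists in the database named by `DB_NAME`. The original answer does not show its schema, so the following is only a sketch: the `id` column, the column types, and the helper name `ensure_images_table` are assumptions chosen for illustration.

```python
# Assumed schema: only the title and path columns are implied by the pipeline.
CREATE_IMAGES_TABLE = """
CREATE TABLE IF NOT EXISTS images (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    path VARCHAR(512) NOT NULL
) DEFAULT CHARSET = utf8mb4
"""


def ensure_images_table(conn):
    """Create the images table if it is missing, using a DB-API connection."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_IMAGES_TABLE)
        conn.commit()
    finally:
        cursor.close()
```

One place to call it would be at the end of `open_spider`, as `ensure_images_table(self.conn)`, so the table is in place before the first item arrives.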
In the spider, use ImageItem to hold the image URL and title:
```python
from scrapy import Spider

from myproject.items import ImageItem


class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Take the first <img> on the page and read its src and alt attributes.
        image_link = response.css('img::attr(src)').get()
        image_title = response.css('img::attr(alt)').get()
        if not image_link:
            return
        item = ImageItem()
        # src may be relative, so resolve it against the page URL.
        item['image_urls'] = [response.urljoin(image_link)]
        item['title'] = image_title
        yield item
```
With this in place, running the spider downloads the images and inserts their information into the MySQL database.