Storing data crawled by Scrapy in Elasticsearch
To store data crawled by Scrapy in Elasticsearch, you can combine the scrapy-redis extension (Redis-backed scheduling and deduplication) with the official elasticsearch-py client.
Step 1: install both packages:
```
pip install scrapy-redis
pip install elasticsearch
```
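Before wiring anything into Scrapy, it is worth confirming that the client can reach the cluster (keep the client's major version in line with your Elasticsearch server's). A minimal check, assuming a local single-node cluster on the default port with security disabled:
```
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
# info() returns cluster name and version metadata if the node is reachable
print(es.info())
```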
Step 2: add the following configuration to the project's settings.py file:
```
# Use the Scrapy-Redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the Redis-backed duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Allow pausing and resuming: the request records kept in Redis are not cleared
SCHEDULER_PERSIST = True
# Redis connection settings
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Elasticsearch connection settings (mapping types were removed in
# Elasticsearch 7+, so no separate "type" setting is needed)
ELASTICSEARCH_SERVERS = ['http://localhost:9200/']
ELASTICSEARCH_INDEX = 'my_index'
# Register the pipeline from step 3, or nothing will be stored
# (replace "myproject" with your project's package name)
ITEM_PIPELINES = {
    'myproject.pipelines.ElasticsearchPipeline': 300,
}
```
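Elasticsearch will auto-create my_index on the first write, but you can also create it up front with explicit field mappings. A sketch, assuming the placeholder url/title fields used by the spider below:
```
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
if not es.indices.exists(index='my_index'):
    es.indices.create(index='my_index', mappings={
        'properties': {
            'url': {'type': 'keyword'},
            'title': {'type': 'text'},
        }
    })
```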
Step 3: add the following code to the project's pipelines.py file:
```
from elasticsearch import Elasticsearch
from scrapy.utils.project import get_project_settings


class ElasticsearchPipeline(object):
    def __init__(self):
        settings = get_project_settings()
        self.es = Elasticsearch(settings['ELASTICSEARCH_SERVERS'])
        self.index = settings['ELASTICSEARCH_INDEX']

    def process_item(self, item, spider):
        # Index each scraped item as one document. With elasticsearch-py 8.x
        # the document is passed as document=; on 7.x use body=dict(item).
        # doc_type is gone: mapping types were removed in Elasticsearch 7+.
        self.es.index(index=self.index, document=dict(item))
        return item
```
There is no need to wrap this in a subclass of scrapy_redis.pipelines.RedisPipeline: if you also want serialized items pushed into Redis, register both pipelines in ITEM_PIPELINES and Scrapy will run them in priority order.
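Indexing one document per item costs one HTTP round trip per item. If throughput matters, buffering items and writing them with the bulk helper from elasticsearch-py is a common pattern; in the sketch below, the class name and the batch size of 100 are arbitrary choices:
```
from elasticsearch import Elasticsearch, helpers
from scrapy.utils.project import get_project_settings


class BulkElasticsearchPipeline(object):
    BATCH_SIZE = 100  # arbitrary example value

    def open_spider(self, spider):
        settings = get_project_settings()
        self.es = Elasticsearch(settings['ELASTICSEARCH_SERVERS'])
        self.index = settings['ELASTICSEARCH_INDEX']
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append({'_index': self.index, '_source': dict(item)})
        if len(self.buffer) >= self.BATCH_SIZE:
            helpers.bulk(self.es, self.buffer)
            self.buffer = []
        return item

    def close_spider(self, spider):
        # Flush whatever is still buffered when the spider shuts down
        if self.buffer:
            helpers.bulk(self.es, self.buffer)
```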
Finally, add a spider like the following under the project's spiders directory:
```
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'my_spider'
    redis_key = 'my_spider:start_urls'

    def parse(self, response):
        # Parse the page and yield items; everything yielded here flows
        # through ITEM_PIPELINES into Elasticsearch (fields are placeholders)
        yield {'url': response.url, 'title': response.css('title::text').get()}
```
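Note that a RedisSpider ignores start_urls defined in code: it blocks until URLs appear in the list named by redis_key. Seed that list with redis-cli (lpush my_spider:start_urls https://example.com) or, equivalently, from Python using the redis package that scrapy-redis already depends on:
```
import redis

r = redis.Redis(host='localhost', port=6379)
# Push a start URL; the waiting spider picks it up and begins crawling
r.lpush('my_spider:start_urls', 'https://example.com')
```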
With all of this in place, Scrapy will store the crawled data in Elasticsearch.
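To verify that documents actually landed, run a quick match_all query against the index (assuming elasticsearch-py 8.x, where the query is a top-level keyword argument):
```
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
resp = es.search(index='my_index', query={'match_all': {}})
print(resp['hits']['total'])
```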