Storing data crawled by Scrapy in Elasticsearch using elasticsearch-dsl
Posted: 2023-07-16 10:16:37
elasticsearch-dsl makes it straightforward to store data crawled by Scrapy in Elasticsearch.
First, install elasticsearch-dsl in the Python environment used by the Scrapy project:
```
pip install elasticsearch-dsl
```
Next, add the following to the project's settings.py file:
```
ELASTICSEARCH_HOST = 'localhost'
ELASTICSEARCH_PORT = 9200
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'my_index'
ELASTICSEARCH_TYPE = 'my_type'
```
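These settings specify the Elasticsearch host, port, username, password, index name, and type name. The type name only matters for Elasticsearch 6.x and earlier, since mapping types were deprecated in 7.0 and removed in 8.0.
Next, add the following code to the project's pipelines.py file: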
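As a rough illustration of how these settings could map onto connection arguments, the sketch below builds the keyword arguments from a plain dict. The helper name `build_connection_kwargs` is hypothetical and not part of Scrapy or elasticsearch-dsl. It assumes a plain `http` endpoint and leaves credentials out when the username is empty.

```python
def build_connection_kwargs(settings):
    """Hypothetical helper: turn the ELASTICSEARCH_* settings into connection kwargs."""
    host = settings.get("ELASTICSEARCH_HOST", "localhost")
    port = settings.get("ELASTICSEARCH_PORT", 9200)
    # Assumption: plain HTTP; a TLS-enabled cluster would need https:// instead.
    kwargs = {"hosts": [f"http://{host}:{port}"]}
    user = settings.get("ELASTICSEARCH_USERNAME")
    if user:
        # Only include credentials when a username is actually configured.
        kwargs["http_auth"] = (user, settings.get("ELASTICSEARCH_PASSWORD", ""))
    return kwargs


if __name__ == "__main__":
    print(build_connection_kwargs({"ELASTICSEARCH_HOST": "localhost", "ELASTICSEARCH_PORT": 9200}))
```

The resulting dict could then be passed on as `connections.create_connection(**kwargs)`.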
```
from elasticsearch_dsl import Date, Document, Integer, Text
from elasticsearch_dsl.connections import connections
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


# Document replaces DocType, which recent elasticsearch-dsl releases no longer provide.
class MyItem(Document):
    title = Text()
    content = Text()
    publish_date = Date()
    view_count = Integer()

    class Index:
        # Only the index name is configured; mapping types are gone in current Elasticsearch.
        name = settings.get('ELASTICSEARCH_INDEX')


class ElasticsearchPipeline:
    def __init__(self):
        host = settings.get('ELASTICSEARCH_HOST')
        port = settings.get('ELASTICSEARCH_PORT')
        user = settings.get('ELASTICSEARCH_USERNAME')
        password = settings.get('ELASTICSEARCH_PASSWORD')
        # Send credentials only when a username is configured.
        auth = {'http_auth': (user, password)} if user else {}
        self.es = connections.create_connection(hosts=[f'http://{host}:{port}'], **auth)

    def process_item(self, item, spider):
        # Copy the scraped fields into a document and index it.
        my_item = MyItem(
            title=item['title'],
            content=item['content'],
            publish_date=item['publish_date'],
            view_count=item['view_count'],
        )
        my_item.save(using=self.es)
        return item
```
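This defines a MyItem class containing the fields to be stored in Elasticsearch. The ElasticsearchPipeline class handles processing and storage: its `__init__` method opens the connection to Elasticsearch, and its `process_item` method builds a MyItem object and calls `save` to write it to Elasticsearch.
Finally, enable ElasticsearchPipeline by adding the following to the project's settings.py file: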
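Scraped values usually arrive as strings, while MyItem declares a date field and an integer field. The sketch below shows one way the fields might be coerced before a MyItem is built. The helper name `normalize_item`, the `YYYY-MM-DD HH:MM:SS` date format, and the comma-separated counter format are all assumptions made for illustration, not something the original pipeline specifies.

```python
from datetime import datetime


def normalize_item(item):
    """Hypothetical helper: coerce scraped strings into the types MyItem declares."""
    publish_date = item["publish_date"]
    if isinstance(publish_date, str):
        # Assumption: dates are scraped as 'YYYY-MM-DD HH:MM:SS'.
        publish_date = datetime.strptime(publish_date.strip(), "%Y-%m-%d %H:%M:%S")
    # Assumption: counters may look like '1,024'; drop separators before converting.
    view_count = int(str(item["view_count"]).replace(",", "").strip())
    return {
        "title": item["title"].strip(),
        "content": item["content"].strip(),
        "publish_date": publish_date,
        "view_count": view_count,
    }


if __name__ == "__main__":
    raw = {"title": " Hello ", "content": "Body\n",
           "publish_date": "2023-07-16 10:16:37", "view_count": "1,024"}
    print(normalize_item(raw))
```

Inside `process_item`, the document could then be created with `MyItem(**normalize_item(item))`.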
```
ITEM_PIPELINES = {
    'my_project.pipelines.ElasticsearchPipeline': 300,
}
```
With this in place, crawled items are stored in Elasticsearch automatically.