How do I use Scrapy with MongoDB in Python to build a 12306 crawler?
The basic steps for building a 12306 crawler with Scrapy and MongoDB in Python are as follows:
1. **Install dependencies**:
- Install Scrapy: `pip install scrapy`
- Install the MongoDB driver used by the pipeline below: `pip install pymongo`
2. **Create a Scrapy project**:
```bash
scrapy startproject my_12306_spider
```
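This generates Scrapy's standard project skeleton; the files edited in the steps below all live inside the inner package:
```
my_12306_spider/
├── scrapy.cfg
└── my_12306_spider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
```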
3. **Configure settings.py**:
- Register the MongoDB item pipeline (defined in step 4) and the connection settings it reads:
```python
ITEM_PIPELINES = {
    'my_12306_spider.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017/mydb'
MONGO_COLLECTION_NAME = 'train_data'
```
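Before wiring everything together, it can help to confirm that MongoDB is actually reachable at the configured URI. A minimal check with pymongo, assuming a local `mongod` on the default port:
```python
from pymongo import MongoClient

# Fails fast with a connection error if mongod is not running
client = MongoClient('mongodb://localhost:27017/mydb')
print(client.server_info()['version'])
```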
4. **Define the Item and Pipeline**:
- In `items.py`, define the fields you want to scrape:
```python
import scrapy

class TrainInfoItem(scrapy.Item):
    # One Field per piece of data to extract
    train_no = scrapy.Field()
    departure_time = scrapy.Field()
    destination = scrapy.Field()
    ...
```
- In `pipelines.py`, add a `MongoDBPipeline` that stores each Item in MongoDB:
```python
from pymongo import MongoClient
from scrapy.exceptions import DropItem

class MongoDBPipeline:
    def __init__(self, uri, collection_name):
        self.client = MongoClient(uri)
        # get_default_database() uses the database named in the URI ('mydb')
        self.collection = self.client.get_default_database()[collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this and passes in the values from settings.py
        return cls(crawler.settings.get('MONGO_URI'),
                   crawler.settings.get('MONGO_COLLECTION_NAME'))

    def process_item(self, item, spider):
        if not item.get('train_no'):
            raise DropItem('Missing train_no')
        self.collection.insert_one(dict(item))
        return item
```
Scrapy builds the pipeline through `from_crawler`, so the connection details stay in `settings.py` instead of being hard-coded.
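As a quick sanity check (a sketch assuming the files above are in place, MongoDB is running, and you run it from the project root), the pipeline can be exercised directly, outside a crawl; the field values here are made-up examples:
```python
from my_12306_spider.pipelines import MongoDBPipeline
from my_12306_spider.items import TrainInfoItem

# Bypass from_crawler and construct the pipeline by hand for a one-off test
pipeline = MongoDBPipeline('mongodb://localhost:27017/mydb', 'train_data')
item = TrainInfoItem(train_no='G101', departure_time='06:44',
                     destination='Shanghai Hongqiao')
pipeline.process_item(item, spider=None)
```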
5. **Write the Spider**:
- In the `spiders` directory, create `my_12306_spider.py` defining the start URLs and a parse callback (a tip on finding working selectors follows this step):
```python
import scrapy
from my_12306_spider.items import TrainInfoItem

class My12306Spider(scrapy.Spider):
    name = "my_12306"
    start_urls = ["https://www.12306.cn/"]

    def parse(self, response):
        # Replace '...' with selectors matching the real 12306 page structure
        for data in response.css('...'):
            item = TrainInfoItem()
            item['train_no'] = data.css('...').get()
            # fill in departure_time, destination, etc. the same way
            yield item
```
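To work out real selectors before editing the spider, Scrapy's interactive shell is handy:
```bash
scrapy shell "https://www.12306.cn/"
```
Inside the shell, try `response.css(...)` expressions until one returns the data you expect, then copy it into `parse()`.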
6. **Run the spider**:
```bash
scrapy crawl my_12306
```
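After a run, you can verify that items actually landed in MongoDB with a short pymongo query (same URI and collection name as in `settings.py`):
```python
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/mydb')
collection = client.get_default_database()['train_data']
print('documents stored:', collection.count_documents({}))
for doc in collection.find().limit(3):
    print(doc)  # each document is one scraped TrainInfoItem
```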
Remember to replace the example CSS selectors and field names with the elements actually present on the 12306 pages; note that 12306 renders much of its content with JavaScript, so the data you need may only be available through the site's XHR/API responses rather than the static HTML. Also follow the site's rules, respect user privacy, and use the crawler responsibly.