Scrapy + MongoDB Example
Below is a simple example that uses Scrapy and MongoDB to crawl the Douban Movie Top 250 list and store the results in a MongoDB database:
1. Install Scrapy and pymongo:
```
pip install scrapy pymongo
```
2. Create a Scrapy project:
```
scrapy startproject douban
```
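This generates a project skeleton roughly like the following (the exact files vary slightly between Scrapy versions); the later steps edit `settings.py`, `items.py`, and `pipelines.py`, and add a spider under `spiders/`:
```
douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```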
3. Configure the MongoDB connection in `settings.py` (these are custom keys read by the item pipeline shown right after this step):
```
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'douban'
MONGODB_COLLECTION = 'movies'
```
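Scrapy does not write to MongoDB on its own; the settings above only become useful once an item pipeline reads them, and the original snippet omits that pipeline. Below is a minimal sketch for `douban/pipelines.py`, assuming the (hypothetical) class name `MongoDBPipeline`:
```
import pymongo


class MongoDBPipeline:
    # Class name is an assumption; any name works as long as ITEM_PIPELINES matches it.

    def open_spider(self, spider):
        # Read the custom MongoDB settings defined in settings.py
        settings = spider.settings
        self.client = pymongo.MongoClient(
            settings.get('MONGODB_HOST', 'localhost'),
            settings.getint('MONGODB_PORT', 27017),
        )
        db = self.client[settings.get('MONGODB_DBNAME', 'douban')]
        self.collection = db[settings.get('MONGODB_COLLECTION', 'movies')]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each scraped movie as one document
        self.collection.insert_one(dict(item))
        return item
```
The pipeline also has to be enabled in `settings.py`:
```
ITEM_PIPELINES = {
    'douban.pipelines.MongoDBPipeline': 300,
}
```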
4. In `items.py` (already generated by `startproject`), define the fields to scrape:
```
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()
    year = scrapy.Field()
    country = scrapy.Field()
    category = scrapy.Field()
```
5. Create a file named `douban_spider.py` in the `spiders/` directory and define the spider:
```
from douban.items import DoubanItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DoubanSpider(CrawlSpider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    rules = (
        # Follow links to individual movie pages and parse them
        Rule(LinkExtractor(allow=r'subject/\d+/$'), callback='parse_item'),
        # Follow the pagination links of the Top 250 list
        Rule(LinkExtractor(allow=r'top250\?start=\d+'), follow=True),
    )

    def parse_item(self, response):
        item = DoubanItem()
        item['title'] = response.css('h1 span::text').get()
        item['rating'] = response.css('strong.rating_num::text').get()
        item['director'] = response.css('a[rel="v:directedBy"]::text').get()
        item['actors'] = response.css('a[rel="v:starring"]::text').getall()
        # The year is rendered as "(1994)"; keep only the digits
        item['year'] = response.css('span.year::text').re_first(r'\d{4}')
        # The release-date text looks like "1994-09-10(加拿大)"; take the region in parentheses
        item['country'] = response.css('span[property="v:initialReleaseDate"]::text').re_first(r'\((.+?)\)')
        item['category'] = response.css('span[property="v:genre"]::text').getall()
        yield item
```
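Note: Douban often rejects requests that carry Scrapy's default User-Agent, so the crawl may return 403 responses unless a browser-like User-Agent (and, politely, a download delay) is configured. A possible addition to `settings.py`; the UA string below is only illustrative:
```
# Illustrative values; any recent browser UA string works
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 1
```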
6. Run the spider:
```
scrapy crawl douban
```
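As a quick sanity check that is independent of the database, the same run can also export the items to a file (the `-O` flag overwrites the output file; available in Scrapy 2.x):
```
scrapy crawl douban -O movies.json
```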
7. View the data in the MongoDB shell (`mongo`, or `mongosh` on newer installations):
```
> use douban
> db.movies.find().pretty()
```
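The same check can also be done from Python with pymongo; a small sketch using the database and collection names configured above:
```
import pymongo

client = pymongo.MongoClient('localhost', 27017)
movies = client['douban']['movies']

# How many movies were stored, and a peek at one document
print(movies.count_documents({}))
print(movies.find_one({}, {'_id': 0, 'title': 1, 'rating': 1}))
```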