Code for scraping the Douban Top 250 movies with Scrapy
Date: 2023-09-05 13:11:44
Douban Movie Top250, Scrapy crawling
Sure, here is the code for scraping the Douban Top 250 movies with Scrapy:
First, create a new Scrapy project from the command line:
```
scrapy startproject douban_movies
```
Next, from the project root directory, create a spider named "movies":
```
cd douban_movies
scrapy genspider movies movie.douban.com
```
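The generated project also contains a "settings.py" file. Douban has been reported to reject requests that carry Scrapy's default User-Agent, so a fragment along these lines may be needed before the crawl returns any data. `USER_AGENT` and `DOWNLOAD_DELAY` are standard Scrapy settings; the User-Agent string and the one-second delay below are illustrative assumptions, not values from the original answer.

```python
# douban_movies/settings.py (fragment)

# Browser-like User-Agent; the exact string is an assumption for illustration.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/120.0.0.0 Safari/537.36'
)

# Wait about one second between requests to keep the load on the site low.
DOWNLOAD_DELAY = 1
```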
Then open "items.py" (it sits in the "douban_movies" package directory inside the project root) and define the data fields to scrape:
```python
import scrapy


class DoubanMovieItem(scrapy.Item):
    # Movie title
    title = scrapy.Field()
    # Rating score
    rating = scrapy.Field()
    # Director
    director = scrapy.Field()
    # Cast
    actors = scrapy.Field()
    # Genre
    genre = scrapy.Field()
    # Release date (the list page is assumed to show only the year)
    release_date = scrapy.Field()
```
Next, write the crawl logic in "spiders/movies.py":
```python
import scrapy

from douban_movies.items import DoubanMovieItem


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for movie in response.css('.item'):
            # Assumed layout of the info paragraph: "导演: ... 主演: ..." on the
            # first line and "year / country / genre" on the last line.
            # A <br> may split it into several text nodes, so use getall().
            text = ''.join(movie.css('.bd p:nth-child(1)::text').getall())
            lines = [line.strip() for line in text.split('\n') if line.strip()]
            credits = lines[0] if lines else ''
            facts = [p.strip() for p in lines[-1].split('/')] if len(lines) > 1 else []

            item = DoubanMovieItem()
            item['title'] = movie.css('.title::text').get()
            item['rating'] = movie.css('.rating_num::text').get()
            item['director'] = credits.split('导演: ')[-1].split('主演: ')[0].strip()
            # Left empty when the "主演: " label is missing or truncated.
            item['actors'] = credits.split('主演: ')[-1].strip() if '主演: ' in credits else ''
            item['genre'] = facts[-1] if facts else ''
            item['release_date'] = facts[0] if facts else ''
            yield item

        # Follow the "next page" link until there is none left.
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
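The last lines of `parse` rely on `response.follow` accepting a relative link and resolving it against the page it was found on. As a rough illustration of what that resolution amounts to, the sketch below uses the standard library's `urljoin`. The `href` values and the `resolve` helper are hypothetical, chosen to resemble a query-only next-page link; they are not taken from the original answer.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = 'https://movie.douban.com/top250'

# Hypothetical href values, as they might appear in the ".next a" link.
hrefs = ['?start=25&filter=', '?start=50&filter=']


def resolve(base, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


for href in hrefs:
    next_url = resolve(page_url, href)
    print(next_url)
    # The next response would come from this URL, so it becomes the new base.
    page_url = next_url
```

Because the link is resolved on every page, the spider does not need to know in advance how many pages the list has; it stops once the selector returns `None`.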
Finally, run the following command from the command line to start the spider:
```
scrapy crawl movies -o movies.csv
```
This saves the scraped data to a CSV file named "movies.csv".
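To check the export, the file can be read back with Python's `csv` module. The following is a minimal sketch, assuming the CSV header matches the item field names; `load_movies` is a hypothetical helper, and the sample row is made up so the snippet runs without a real export.

```python
import csv
import io

# Stand-in for the contents of movies.csv; the row is invented for illustration.
sample_csv = (
    'title,rating,director,actors,genre,release_date\n'
    'Example Movie,9.0,Director Name,Actor One / Actor Two,Drama,1994\n'
)


def load_movies(fileobj):
    """Read exported rows into dictionaries and convert rating to float."""
    movies = []
    for row in csv.DictReader(fileobj):
        row['rating'] = float(row['rating']) if row['rating'] else None
        movies.append(row)
    return movies


movies = load_movies(io.StringIO(sample_csv))
print(len(movies), movies[0]['title'], movies[0]['rating'])

# With a real export, the same helper could be used like this:
# with open('movies.csv', newline='', encoding='utf-8') as f:
#     movies = load_movies(f)
```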