Scrapy crawler project in practice: Douban, with the crawl results shown
Time: 2023-12-09 15:05:33
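For this hands-on project, we scrape the title, rating, director, cast, genres, and release date of each movie on Douban Movies. The spider code and a sample of the scraped results follow.
Spider code: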
```python
import scrapy


class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Follow the link to each movie's detail page
        for movie_link in response.css('.hd a::attr(href)').getall():
            yield response.follow(movie_link, callback=self.parse_movie)
        # Follow the next-page link; response.follow also resolves a relative href
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_movie(self, response):
        # Movie title
        name = response.css('h1 span::text').get()
        # Movie rating
        rating = response.css('.rating_num::text').get()
        # Director: links in the span that follows the "导演" label
        directors = response.xpath(
            '//span[text()="导演"]/following-sibling::span[1]/a/text()').getall()
        director = directors[0] if directors else None
        # Cast: assumes the page lists actors under a separate "主演" label
        cast = response.xpath(
            '//span[text()="主演"]/following-sibling::span[1]/a/text()').getall()
        # Genre: text of the first span after the "类型:" label
        genres = response.xpath(
            '//span[text()="类型:"]/following-sibling::span[1]/text()').get()
        # Release date: text of the first span after the "上映日期:" label
        release_date = response.xpath(
            '//span[text()="上映日期:"]/following-sibling::span[1]/text()').get()
        # Emit the scraped item
        yield {
            'name': name,
            'rating': rating,
            'director': director,
            'cast': cast,
            'genres': genres,
            'release_date': release_date,
        }
```
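The pagination step in `parse` turns the next-page `href` into a new request, and a request needs an absolute URL. Here is a minimal sketch of that resolution step using only the standard library. It assumes (this is not confirmed by the source) that the list page emits a query-only relative link such as `?start=25&filter=`, and `resolve_next_page` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Resolve a possibly relative next-page href against the current page URL."""
    # urljoin returns absolute hrefs unchanged and resolves relative ones
    return urljoin(page_url, href)


# Assumed example href; the real value depends on the page markup
next_url = resolve_next_page('https://movie.douban.com/top250', '?start=25&filter=')
print(next_url)
```

Inside a Scrapy callback, `response.urljoin(href)` or `response.follow(href, ...)` performs the same resolution against `response.url`.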
Crawl results:
```json
{
"name": "肖申克的救赎",
"rating": "9.7",
"director": "弗兰克·德拉邦特",
"cast": ["蒂姆·罗宾斯", "摩根·弗里曼", "鲍勃·冈顿"],
"genres": "犯罪 剧情",
"release_date": "1994-09-10(多伦多电影节) / 1994-10-14(美国)"
}
{
"name": "霸王别姬",
"rating": "9.6",
"director": "陈凯歌",
"cast": ["张国荣", "张丰毅", "巩俐"],
"genres": "剧情 爱情 同性",
"release_date": "1993-01-01(中国香港)"
}
...
```
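The items above keep every field as a string, so it can help to normalize them before analysis. The following is a minimal sketch and not part of the original spider. `normalize_item` and `parse_release_dates` are hypothetical helper names, and the release-date parsing assumes the `date(region)` entries separated by ` / ` seen in the sample output.

```python
import re

# Matches one "YYYY-MM-DD(region)" entry; accepts ASCII or full-width parentheses
_RELEASE_RE = re.compile(r'(\d{4}-\d{2}-\d{2})\s*[((]([^))]*)[))]')


def parse_release_dates(text):
    """Split a release_date string into a list of (date, region) tuples."""
    return _RELEASE_RE.findall(text or '')


def normalize_item(item):
    """Return a copy of a scraped item with typed rating, genres, and release dates."""
    result = dict(item)
    # Rating arrives as text such as "9.6"; missing ratings become None
    result['rating'] = float(item['rating']) if item.get('rating') else None
    # Genres arrive as one space-separated string
    result['genres'] = (item.get('genres') or '').split()
    result['release_date'] = parse_release_dates(item.get('release_date'))
    return result


sample = {
    'name': '霸王别姬',
    'rating': '9.6',
    'genres': '剧情 爱情 同性',
    'release_date': '1993-01-01(中国香港)',
}
print(normalize_item(sample))
```

Returning a copy rather than mutating the input keeps the raw scraped item available for comparison.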
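As the output shows, the spider scraped the title, rating, director, cast, genres, and release date for the movies in the Douban Movie Top 250 and returned the results in JSON format.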