Scraping the Douban Movie Top 250 with a Python crawler: rank, watched count, want-to-watch count (code + CSV file)
Posted: 2024-10-20 10:06:00
In Python, the Scrapy framework is a common choice for web scraping, including Douban movie data. Below is a basic walkthrough with code examples. Note, however, that Douban has anti-scraping measures, so before running this for real, check its API policy and any access restrictions that may apply.
First, install the required libraries (if not already installed):
```bash
pip install scrapy requests beautifulsoup4 pandas lxml
```
Then create a new Scrapy project:
```bash
scrapy startproject douban_movie_spider
cd douban_movie_spider
```
Next, define the data structure for the CSV export in `items.py`:
```python
# items.py
import scrapy


class DoubanMovieItem(scrapy.Item):
    rank = scrapy.Field()                 # ranking position (1-250)
    title = scrapy.Field()                # movie title
    viewed_count = scrapy.Field()         # number of people who have rated/watched it
    want_to_watch_count = scrapy.Field()  # number of people who want to watch it
```
Implement the spider logic in `spiders/movie_spider.py`:
```python
# spiders/movie_spider.py
import scrapy
from bs4 import BeautifulSoup

from ..items import DoubanMovieItem


class MovieSpider(scrapy.Spider):
    name = 'douban_top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        # Locate the list of movies on the page
        movie_list = soup.find('ol', class_='grid_view')
        for item in movie_list.find_all('li'):
            # The rank is the <em> in the picture block;
            # span.rating_num would give the score, not the rank
            rank = item.find('em').text
            title = item.find('span', class_='title').text
            # The last <span> in the star block reads like "2842171人评价"
            viewed = item.find('div', class_='star').find_all('span')[-1].text
            yield DoubanMovieItem(
                rank=rank,
                title=title,
                viewed_count=viewed,
                # The want-to-watch count is not shown on the list page; it
                # would require a follow-up request to each film's detail page
                want_to_watch_count='',
            )
        # Follow the next page, if there is one
        next_page = response.css('span.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
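As an alternative to chasing the "next" link, the result pages can also be generated up front. The Top 250 list paginates through a `?start=` query parameter in steps of 25 films per page (an assumption worth verifying against the live site), so the ten page URLs are:

```python
# Generate the 10 page URLs of the Top 250 list.
# Assumes ?start=N pagination with 25 films per page.
BASE_URL = 'https://movie.douban.com/top250'

start_urls = [f'{BASE_URL}?start={page * 25}' for page in range(10)]

print(start_urls[0])   # first page  (start=0)
print(start_urls[-1])  # last page   (start=225)
```

Assigning this list to the spider's `start_urls` would make the pagination explicit, at the cost of hard-coding the page count.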
Finally, configure the CSV export in `settings.py` (on Scrapy 2.1+ the `FEEDS` dict is the preferred form, but the legacy keys below still work):
```python
# settings.py
# Douban tends to block the default Scrapy user agent,
# so identify as a regular browser
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36')
DOWNLOAD_DELAY = 1  # throttle requests to be polite

# Write the scraped items to a CSV file
FEED_FORMAT = 'csv'
FEED_URI = 'output/douban_movies.csv'
FEED_EXPORT_ENCODING = 'utf-8'
```
Run the spider:
```bash
scrapy crawl douban_top250
```
This starts the crawl and saves the results to the CSV file `output/douban_movies.csv`.
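Since the rating counts come out of the crawl as strings like `2842171人评价`, a small post-processing pass can pull out the bare numbers (the `\d+` from the title). A sketch using only the standard library; the sample rows below are made up for illustration, and in practice you would open `output/douban_movies.csv` instead:

```python
import csv
import io
import re

# Made-up sample in the same shape as the exported CSV
sample_csv = """rank,title,viewed_count,want_to_watch_count
1,肖申克的救赎,2842171人评价,
2,霸王别姬,2103990人评价,
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # Extract the leading digits from strings like "2842171人评价"
    match = re.search(r'\d+', row['viewed_count'])
    row['viewed_count'] = int(match.group()) if match else None

print(rows[0]['title'], rows[0]['viewed_count'])
```

The same cleaned rows could then be loaded into pandas for the actual analysis.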