Scraping every movie on Douban with a crawler
Date: 2023-10-17 15:58:51 Views: 48
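Hello! Before answering, I need to confirm what you are asking for. Do you want me to write a crawler that collects information on every movie listed on Douban? If so, please be aware that Douban uses anti-scraping measures, and scraping the site without Douban's authorization or permission may be unlawful. I therefore recommend reading Douban's relevant rules and obtaining the necessary authorization or permission before scraping anything.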
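One concrete starting point for reviewing a site's rules is its robots.txt file. The sketch below uses the standard-library `urllib.robotparser` to test whether a URL is permitted for a given user agent. The rules in `SAMPLE_ROBOTS_TXT`, the helper name `is_allowed`, and the user agent `my-crawler` are hypothetical and chosen for illustration; they are not Douban's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration only; these are NOT
# Douban's actual rules. Read the real file before relying on the result.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://movie.douban.com/top250"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://movie.douban.com/search"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches and parses the real file. robots.txt is only one signal and does not replace the site's terms of use or explicit permission.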
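Related questions
Scraping the Douban movie rankings
The steps for scraping the Douban movie rankings are as follows:
1. Import the requests and BeautifulSoup libraries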
```python
import requests
from bs4 import BeautifulSoup
```
2. Send the request and fetch the HTML page
```python
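url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent; the site may reject the default requests one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text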
```
3. Parse the HTML page and extract the data
```python
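soup = BeautifulSoup(html, 'html.parser')
# Find every <div class="hd"> element; each one wraps a movie title link.
movies = soup.find_all('div', class_='hd')
for movie in movies:
    # The first <span> inside the link holds the movie title.
    title = movie.a.span.text.strip()
    print(title)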
```
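In the code above, we parse the HTML page with BeautifulSoup and call find_all() to locate every div tag whose class attribute is 'hd'. We then iterate over those tags, extract each movie title, and print it.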
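To try the extraction step without sending any request, the sketch below runs the same find_all() logic against a small hand-written HTML fragment. The fragment, the helper name `extract_titles`, and the guard against missing tags are assumptions added for illustration; the live page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the structure the steps above rely on.
# It is an assumption for illustration, not a copy of Douban's markup.
SAMPLE_HTML = """
<div class="hd"><a href="#"><span class="title">Movie A</span><span class="other">/ Alt A</span></a></div>
<div class="hd"><a href="#"><span class="title"> Movie B </span></a></div>
<div class="bd">not a title block</div>
"""


def extract_titles(html: str) -> list[str]:
    """Return the text of the first <span> inside each <div class="hd"> link."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for block in soup.find_all("div", class_="hd"):
        # Skip blocks that lack the expected <a><span> nesting instead of raising.
        if block.a is not None and block.a.span is not None:
            titles.append(block.a.span.text.strip())
    return titles


print(extract_titles(SAMPLE_HTML))
```

Wrapping the logic in a function that takes HTML text makes it possible to check the parsing separately from the network request.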
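Scraping Douban's annual movie list
The steps and code for scraping Douban's annual movie list with a Python crawler are as follows:
1. Import the requests and lxml libraries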
```python
import requests
from lxml import etree
```
2. Set the request headers to mimic a browser
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
```
3. Send the request and fetch the page source
```python
url = 'https://movie.douban.com/annual/2020'
response = requests.get(url, headers=headers)
html = response.text
```
4. Parse the page source with XPath and extract the movie information
```python
selector = etree.HTML(html)
# Select every movie entry inside the list container.
movies = selector.xpath('//div[@class="list"]/div[@class="item"]')
for movie in movies:
    # Each xpath() call returns a list; [0] takes the first match
    # and raises IndexError if the field is missing.
    name = movie.xpath('.//div[@class="title"]/a/text()')[0]
    director = movie.xpath('.//div[@class="bd"]/p[1]/text()')[0]
    actors = movie.xpath('.//div[@class="bd"]/p[2]/text()')[0]
    score = movie.xpath('.//div[@class="bd"]/p[3]/span[@class="rating_nums"]/text()')[0]
    print(name, director, actors, score)
```
5. The complete code:
```python
import requests
from lxml import etree

# Browser-like request headers.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Fetch the page source.
url = 'https://movie.douban.com/annual/2020'
response = requests.get(url, headers=headers)
html = response.text

# Parse the source and print the fields of each movie entry.
selector = etree.HTML(html)
movies = selector.xpath('//div[@class="list"]/div[@class="item"]')
for movie in movies:
    name = movie.xpath('.//div[@class="title"]/a/text()')[0]
    director = movie.xpath('.//div[@class="bd"]/p[1]/text()')[0]
    actors = movie.xpath('.//div[@class="bd"]/p[2]/text()')[0]
    score = movie.xpath('.//div[@class="bd"]/p[3]/span[@class="rating_nums"]/text()')[0]
    print(name, director, actors, score)
```
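The script above indexes every XPath result with `[0]`, which raises IndexError as soon as one entry lacks a field. A minimal sketch of a more tolerant variant follows, run against a hand-written fragment rather than the live page. The fragment and the helper names `first_text` and `parse_movies` are assumptions for illustration; the XPath expressions are the ones used above, and whether they match the real page is not verified here.

```python
from lxml import etree

# Hand-written fragment mirroring the class names used in the XPath
# expressions above; an assumption for illustration, not Douban's real markup.
SAMPLE_HTML = """
<div class="list">
  <div class="item">
    <div class="title"><a href="#">Movie A</a></div>
    <div class="bd"><p>Director A</p><p>Actor A</p>
      <p><span class="rating_nums">8.5</span></p></div>
  </div>
  <div class="item">
    <div class="title"><a href="#">Movie B</a></div>
    <div class="bd"><p>Director B</p></div>
  </div>
</div>
"""


def first_text(node, path):
    """Return the first stripped text match for path, or None if nothing matches."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else None


def parse_movies(html):
    """Collect one dict per movie entry, tolerating missing fields."""
    selector = etree.HTML(html)
    rows = []
    for movie in selector.xpath('//div[@class="list"]/div[@class="item"]'):
        rows.append({
            "name": first_text(movie, './/div[@class="title"]/a/text()'),
            "director": first_text(movie, './/div[@class="bd"]/p[1]/text()'),
            "actors": first_text(movie, './/div[@class="bd"]/p[2]/text()'),
            "score": first_text(movie, './/div[@class="bd"]/p[3]/span[@class="rating_nums"]/text()'),
        })
    return rows


for row in parse_movies(SAMPLE_HTML):
    print(row)
```

Returning None for a missing field keeps the loop running and leaves it to the caller to decide how to handle incomplete entries.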