Scraping Douban's Annual Movie List with a Python Crawler
Date: 2023-12-17 14:27:56
The following steps and code show how to scrape Douban's annual movie list with a Python crawler:
1. Import the requests and lxml libraries
```python
import requests
from lxml import etree
```
2. Set the request headers so the request looks like it comes from a browser
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}
```
3. Send the request and fetch the page source
```python
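# Annual list page for 2020; `headers` is the dict defined in step 2.
url = 'https://movie.douban.com/annual/2020'
response = requests.get(url, headers=headers)
# response.text is the decoded page source as a string.
html = response.text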
```
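The request in step 3 takes the response body as it comes, without checking the HTTP status or setting a timeout. A slightly more defensive variant might look like the sketch below. The function name `fetch_html`, the optional `session` parameter, and the 10-second timeout are illustrative assumptions, not part of the original code.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}


def fetch_html(url, session=None, timeout=10):
    """Return the page source for `url`, raising on HTTP error statuses.

    `session` (hypothetical parameter) is any object with a requests-style
    get() method, such as requests.Session(); it defaults to the requests module.
    """
    client = session if session is not None else requests
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `html = fetch_html('https://movie.douban.com/annual/2020')` would stand in for the three lines of step 3, and a blocked or failed request would surface as an exception rather than as an empty result later on.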
4. Parse the page source with XPath and extract the movie information
```python
# Build an element tree from the page source fetched in step 3.
selector = etree.HTML(html)
# Each movie is expected in a div.item inside div.list.
movies = selector.xpath('//div[@class="list"]/div[@class="item"]')
for movie in movies:
    # [0] takes the first match and raises IndexError if the node is missing.
    name = movie.xpath('.//div[@class="title"]/a/text()')[0]
    director = movie.xpath('.//div[@class="bd"]/p[1]/text()')[0]
    actors = movie.xpath('.//div[@class="bd"]/p[2]/text()')[0]
    score = movie.xpath('.//div[@class="bd"]/p[3]/span[@class="rating_nums"]/text()')[0]
    print(name, director, actors, score)
```
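Indexing each XPath result with `[0]` raises `IndexError` as soon as one movie lacks a field, or when the page markup differs from what the expressions expect. The sketch below shows one way to make the extraction tolerant of missing nodes. The helper names `first_text` and `parse_movies` and the inline `SAMPLE_HTML` are hypothetical; the sample only mirrors the structure that the XPath expressions in step 4 assume and is not taken from the real page.

```python
from lxml import etree

# Hypothetical markup that mirrors the structure the XPath expressions assume.
SAMPLE_HTML = """
<div class="list">
  <div class="item">
    <div class="title"><a href="#">Movie A</a></div>
    <div class="bd">
      <p>Director A</p>
      <p>Actor A / Actor B</p>
      <p><span class="rating_nums">8.5</span></p>
    </div>
  </div>
  <div class="item">
    <div class="title"><a href="#">Movie B</a></div>
    <div class="bd"><p>Director B</p></div>
  </div>
</div>
"""


def first_text(node, path, default=''):
    """Return the first text match for `path`, stripped, or `default` if none."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


def parse_movies(html):
    """Return one dict per movie; missing fields become empty strings."""
    selector = etree.HTML(html)
    rows = []
    for movie in selector.xpath('//div[@class="list"]/div[@class="item"]'):
        rows.append({
            'name': first_text(movie, './/div[@class="title"]/a/text()'),
            'director': first_text(movie, './/div[@class="bd"]/p[1]/text()'),
            'actors': first_text(movie, './/div[@class="bd"]/p[2]/text()'),
            'score': first_text(
                movie, './/div[@class="bd"]/p[3]/span[@class="rating_nums"]/text()'),
        })
    return rows


for row in parse_movies(SAMPLE_HTML):
    print(row['name'], row['director'], row['actors'], row['score'])
```

Returning a list of dicts instead of printing inside the loop also makes it easy to tell when nothing matched: an empty list suggests that the fetched source does not contain the expected `div` structure.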
5. The complete code:
```python
import requests
from lxml import etree

# Browser-like User-Agent so the request is not sent with the default client string.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}

# Fetch the page source.
url = 'https://movie.douban.com/annual/2020'
response = requests.get(url, headers=headers)
html = response.text

# Parse the source and print one line per movie.
selector = etree.HTML(html)
movies = selector.xpath('//div[@class="list"]/div[@class="item"]')
for movie in movies:
    # [0] takes the first match and raises IndexError if the node is missing.
    name = movie.xpath('.//div[@class="title"]/a/text()')[0]
    director = movie.xpath('.//div[@class="bd"]/p[1]/text()')[0]
    actors = movie.xpath('.//div[@class="bd"]/p[2]/text()')[0]
    score = movie.xpath('.//div[@class="bd"]/p[3]/span[@class="rating_nums"]/text()')[0]
    print(name, director, actors, score)
```