Write a Python program to scrape the Douban Movie Top 250
Time: 2023-12-31 12:05:13
Sure, here are the steps to scrape the Douban Movie Top 250 in Python:
1. Import the required libraries
```python
import requests
from bs4 import BeautifulSoup
```
2. Define a function that fetches a page
```python
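def get_html(url):
    # Send a browser-like User-Agent, since the default one may be rejected
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise on HTTP errors instead of parsing an error page
    return response.text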
```
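Because the scraper requests ten pages in a row, a single transient failure would otherwise abort the whole run. The sketch below shows one way to retry a fetch a few times before giving up. The helper name `fetch_with_retry` and its `attempts` and `delay` parameters are hypothetical and not part of the original answer; it accepts any callable, so `get_html` could be passed in as `fetch`.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying after any exception, up to `attempts` calls in total."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # deliberately broad for this sketch
            last_error = exc
            # Wait before the next try, but not after the final failure
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

With this in place, `fetch_with_retry(get_html, url)` could stand in for a direct `get_html(url)` call.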
3. Define a function that parses the page
```python
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Each movie is an <li> inside <ol class="grid_view">
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    result = []
    for movie in movie_list:
        # find() returns the first span.title in the entry
        name = movie.find('span', class_='title').text
        rating = float(movie.find('span', class_='rating_num').text)
        # The fourth span in div.star holds the vote count; [:-3] drops its three-character suffix
        comments = int(movie.find('div', class_='star').find_all('span')[3].text[:-3])
        result.append((name, rating, comments))
    return result
```
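The `[:-3]` slice in `parse_html` assumes the vote-count text always ends in exactly three non-digit characters (a string shaped like `123456人评价`). A regular expression is somewhat more tolerant of stray whitespace or a changed suffix. The following is only a sketch: the helper name `parse_vote_count` and the sample strings are assumptions for illustration, not part of the original answer.

```python
import re


def parse_vote_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


print(parse_vote_count('123456人评价'))  # 123456
print(parse_vote_count('no votes yet'))  # None
```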
4. Define a function that saves the data to a file
```python
def save_data(data):
    with open('top250.txt', 'w', encoding='utf-8') as f:
        for item in data:
            # One movie per line: name, rating, comment count, separated by tabs
            f.write(item[0] + '\t' + str(item[1]) + '\t' + str(item[2]) + '\n')
```
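If the results are going to be opened in a spreadsheet or loaded by another program, CSV may be more convenient than a tab-separated text file. The sketch below is one possible variant using the standard `csv` module; the function name `save_data_csv`, the default file name `top250.csv`, and the header names are assumptions, not part of the original answer.

```python
import csv


def save_data_csv(data, path='top250.csv'):
    """Write (name, rating, comments) tuples to a CSV file with a header row."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'rating', 'comments'])
        writer.writerows(data)
```

Calling `save_data_csv(data)` in place of `save_data(data)` would produce the same rows, with the `csv` module quoting any titles that contain commas.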
5. Main block: scrape all 250 entries and save them to a file
```python
if __name__ == '__main__':
    url = 'https://movie.douban.com/top250'
    data = []
    # The list spans 10 pages of 25 movies; ?start= is the offset of the first movie
    for i in range(0, 10):
        html = get_html(url + '?start=' + str(i * 25))
        result = parse_html(html)
        data += result
    save_data(data)
```
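The main loop pages through the list by passing offsets 0, 25, ..., 225 in the `start` query parameter. To make that arithmetic explicit, the page URLs can be generated up front, as in the sketch below. The helper name `build_page_urls` and its `total` and `page_size` parameters are hypothetical, introduced only for illustration.

```python
def build_page_urls(base_url, total=250, page_size=25):
    """Return one URL per page, with the item offset in the `start` query parameter."""
    return [f'{base_url}?start={offset}' for offset in range(0, total, page_size)]


urls = build_page_urls('https://movie.douban.com/top250')
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225
```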
That completes the Douban Movie Top 250 scraper; the data is saved to the file top250.txt.