Writing a web crawler in Python
Date: 2023-12-09 17:26:31
Sure, here is a simple example of a Python crawler that scrapes the Douban Movie Top 250 list:
```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://movie.douban.com/top250'


def get_html(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    # A browser-like User-Agent; some sites reject the default requests one.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException as e:
        print(e)
        return None


def parse_html(html):
    """Return a list with one dict per movie found on a Top 250 page."""
    soup = BeautifulSoup(html, 'html.parser')
    movies = []
    for li in soup.find_all('li'):
        # Only <li> elements that contain a div.hd are movie entries.
        if li.find('div', class_='hd'):
            movie_name = li.find('span', class_='title').get_text()
            movie_rank = li.find('em').get_text()
            movie_info = li.find('div', class_='bd').find('p').get_text().strip()
            movie_rating = li.find('span', class_='rating_num').get_text()
            # Text of the last <span> inside the div.star block.
            movie_comment = li.find('div', class_='star').find_all('span')[-1].get_text()
            movie = {'rank': movie_rank, 'name': movie_name, 'info': movie_info,
                     'rating': movie_rating, 'comment': movie_comment}
            movies.append(movie)
    return movies


def main():
    movies = []
    for i in range(10):
        # Each page lists 25 movies; `start` is the offset of the first one.
        url = BASE_URL + '?start=' + str(i * 25)
        html = get_html(url)
        if html is None:
            continue  # skip pages that failed to download
        movies += parse_html(html)
    for movie in movies:
        print(movie)


if __name__ == '__main__':
    main()
```
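This program uses the requests library to send HTTP requests and fetch the pages of the Douban Movie Top 250 list, 25 movies per page across 10 pages. It then parses the HTML with BeautifulSoup to extract each movie's title, rank, description, rating, and the text of the last `<span>` in the `star` block (stored under `comment`; on Douban this appears to be the rating count rather than a review). Finally, it prints the collected movie records to the console.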
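The page loop in `main()` can also be split into two small helpers, which makes the pagination explicit and keeps a failed download from stopping the crawl. The following is only a minimal sketch, not part of the original answer: the names `page_urls` and `crawl`, the `delay` parameter, and the one-second pause between requests are all assumptions made for illustration.

```python
import time
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(base_url=BASE_URL, page_size=25, pages=10):
    """Build one URL per results page (hypothetical helper).

    `start` is the offset of the first item on the page: 0, 25, 50, ...
    """
    return [f"{base_url}?{urlencode({'start': page * page_size})}"
            for page in range(pages)]


def crawl(urls, fetch, parse, delay=1.0):
    """Fetch and parse each URL in order (hypothetical helper).

    `fetch(url)` should return HTML text or None on failure;
    `parse(html)` should return a list of records.
    Pages whose fetch returned None are skipped.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # assumed pause between consecutive requests
        html = fetch(url)
        if html is None:
            continue
        results.extend(parse(html))
    return results
```

With the functions from the program above, this would be called as `crawl(page_urls(), get_html, parse_html)`. Passing `fetch` and `parse` as arguments also means the loop can be exercised with stand-in functions, without touching the network.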