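Fetch the Chinese and English titles, Hong Kong/Taiwan titles, director, release year, genres, and rating of the Douban Top 250 films, and save the data to a file.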
Posted: 2023-10-20 09:30:05
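Sure. This task combines web scraping with data storage; here is an approach along with code.
First, we need to crawl the Douban Top 250 list page and collect the details of each film. Python's requests and BeautifulSoup libraries are enough for this. The code is as follows: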
```python
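import time

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'  # first list page only (25 films)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.select('.item')

for movie in movies:
    # Chinese title and rating come from the list page
    name = movie.select_one('.title').get_text(strip=True)
    score = movie.select_one('.rating_num').get_text(strip=True)

    # Follow the link to the film's detail page
    detail_url = movie.select_one('.hd a')['href']
    detail_response = requests.get(detail_url, headers=headers, timeout=10)
    detail_soup = BeautifulSoup(detail_response.text, 'html.parser')

    # Director, release year and genres.
    # NOTE: these selectors assume Douban's detail-page markup and may need adjusting.
    director_tag = detail_soup.find('a', rel='v:directedBy')
    director = director_tag.get_text(strip=True) if director_tag else ''
    year_tag = detail_soup.select_one('h1 .year')
    year = year_tag.get_text(strip=True).strip('()') if year_tag else ''
    genres = [g.get_text(strip=True) for g in detail_soup.find_all('span', property='v:genre')]

    # Hong Kong / Taiwan titles sit in the "又名:" field, tagged (港) or (台)
    alias_label = detail_soup.find('span', class_='pl', string='又名:')
    sibling = alias_label.next_sibling if alias_label else None
    aliases = sibling.strip() if isinstance(sibling, str) else ''
    hk_name = ' / '.join(a for a in aliases.split(' / ') if '(港' in a or '(台' in a)

    # Print the result
    print(name, score, director, year, genres, hk_name)
    time.sleep(1)  # pause between requests to go easy on the site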
```
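The "又名:" field on a detail page holds Hong Kong, Taiwan and other alternative titles in a single string. Below is a minimal sketch of separating them. It assumes the aliases are joined by " / " and that regional titles end in (港), (台) or (港/台). `split_aliases` is a hypothetical helper, not part of the answer above, and the sample string is only illustrative.

```python
import re

# Assumed alias format: "<title>(港)", "<title>(台)" or "<title>(港/台)";
# both ASCII and full-width parentheses are accepted.
_REGION = re.compile(r'^(?P<title>.+?)[((](?P<region>港/台|港|台)[))]$')


def split_aliases(alias_text):
    """Group the aliases in a '又名' string by region tag."""
    names = {'港': [], '台': [], 'other': []}
    for alias in alias_text.split(' / '):
        alias = alias.strip()
        if not alias:
            continue
        match = _REGION.match(alias)
        if match is None:
            # No region tag: keep it as a generic alternative title
            names['other'].append(alias)
            continue
        # "(港/台)" means the same title is used in both regions
        for region in match.group('region').split('/'):
            names[region].append(match.group('title'))
    return names


print(split_aliases('月黑高飞(港) / 刺激1995(台) / 地狱诺言'))
# → {'港': ['月黑高飞'], '台': ['刺激1995'], 'other': ['地狱诺言']}
```

Splitting on " / " with the surrounding spaces, rather than on a bare "/", keeps a "(港/台)" tag in one piece.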
Next, we need to save the collected data to a file. Python's csv module works well for this. The code is as follows:
```python
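import csv
import time

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

with open('top250.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Chinese title', 'English title', 'HK/Taiwan title', 'Director', 'Year', 'Genres', 'Rating'])
    # The list spans 10 pages of 25 films each: ?start=0, 25, ..., 225
    for start in range(0, 250, 25):
        response = requests.get(url, params={'start': start}, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        for movie in soup.select('.item'):
            # Chinese title and rating come from the list page
            name = movie.select_one('.title').get_text(strip=True)
            score = movie.select_one('.rating_num').get_text(strip=True)

            # Follow the link to the film's detail page
            detail_url = movie.select_one('.hd a')['href']
            detail_response = requests.get(detail_url, headers=headers, timeout=10)
            detail_soup = BeautifulSoup(detail_response.text, 'html.parser')

            # Director, release year and genres.
            # NOTE: these selectors assume Douban's detail-page markup and may need adjusting.
            director_tag = detail_soup.find('a', rel='v:directedBy')
            director = director_tag.get_text(strip=True) if director_tag else ''
            year_tag = detail_soup.select_one('h1 .year')
            year = year_tag.get_text(strip=True).strip('()') if year_tag else ''
            genres = [g.get_text(strip=True) for g in detail_soup.find_all('span', property='v:genre')]

            # English (original) title: the heading is assumed to read "<Chinese title> <original title>",
            # so it is empty for films whose only title is the Chinese one
            title_tag = detail_soup.select_one('h1 span')
            full_title = title_tag.get_text(strip=True) if title_tag else ''
            en_name = full_title.replace(name, '', 1).strip()

            # Hong Kong / Taiwan titles sit in the "又名:" field, tagged (港) or (台)
            alias_label = detail_soup.find('span', class_='pl', string='又名:')
            sibling = alias_label.next_sibling if alias_label else None
            aliases = sibling.strip() if isinstance(sibling, str) else ''
            hk_name = ' / '.join(a for a in aliases.split(' / ') if '(港' in a or '(台' in a)

            # Write one CSV row; genres are joined so the cell is not a Python list repr
            writer.writerow([name, en_name, hk_name, director, year, '/'.join(genres), score])
            time.sleep(1)  # pause between requests to go easy on the site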
```
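One detail about the genre column: `csv.writer` converts non-string values with `str()`, so a Python list passed straight to `writerow` lands in the cell as its literal repr. Below is a minimal sketch of joining the list on write and splitting it again on read. The "/" separator, the helper names `encode_genres` and `decode_genres`, and the hand-typed sample row are all assumptions for illustration.

```python
import csv
import io

GENRE_SEP = '/'    # assumed separator; any character absent from genre names would do
GENRE_COLUMN = 5   # position of the genre column in the seven-column row layout


def encode_genres(genres):
    """Turn a list of genre names into one CSV cell."""
    return GENRE_SEP.join(genres)


def decode_genres(cell):
    """Turn a CSV cell back into a list of genre names."""
    return cell.split(GENRE_SEP) if cell else []


# Round trip through an in-memory file, using an illustrative sample row
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['肖申克的救赎', 'The Shawshank Redemption', '月黑高飞(港) / 刺激1995(台)',
                 '弗兰克·德拉邦特', '1994', encode_genres(['犯罪', '剧情']), '9.7'])

buffer.seek(0)
row = next(csv.reader(buffer))
print(decode_genres(row[GENRE_COLUMN]))
# → ['犯罪', '剧情']
```

Keeping the genres in a single delimited cell keeps the file at one row per film, which is usually easier to open in a spreadsheet than a variable number of genre columns.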
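That covers fetching the Douban Top 250 film details and saving them to a file. Keep in mind that the list spans 10 pages (`?start=0` through `?start=225`), so every page has to be requested to get all 250 films. Hope this helps!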