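Help me write Python code that scrapes movie data and analyzes duplicate and missing values
Date: 2023-06-08 13:07:13
Sure. The following Python code scrapes movie data and then analyzes duplicate and missing values: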
```
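import pandas as pd
import requests
from bs4 import BeautifulSoup


# Scrape the movie data
def get_movie_data():
    url = 'https://www.imdb.com/chart/top'
    # Some sites reject the default requests User-Agent, so send a browser-like one
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(r.text, 'html.parser')
    # These selectors assume the table layout with td.titleColumn / td.posterColumn cells
    movies = soup.select('td.titleColumn')
    crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
    ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name="ir"]')]
    data = []
    # zip avoids an IndexError if the three lists differ in length
    for cell, movie_crew, movie_rating in zip(movies, crew, ratings):
        # Collapse the whitespace in the cell text and drop the dots
        movie_title = ' '.join(cell.get_text().split()).replace('.', '')
        data.append({'Title': movie_title, 'Crew': movie_crew, 'Rating': movie_rating})
    return data


# Build the DataFrame (explicit columns keep it valid even if nothing was scraped)
data = pd.DataFrame(get_movie_data(), columns=['Title', 'Crew', 'Rating'])
# Ratings are scraped as strings; convert them to numbers (unparseable values become NaN)
data['Rating'] = pd.to_numeric(data['Rating'], errors='coerce')
# Duplicate analysis
duplicates = data[data.duplicated()]
print('Number of duplicate rows:', len(duplicates))
# Missing-value analysis
missing = data.isnull().sum()
print('Missing values per column:')
print(missing)
# Show the first rows
print(data.head())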
```
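Note: this code is for reference only. Adjust the selectors and the cleaning steps to match the site's actual structure and the characteristics of its data.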
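Because the scraping step depends on the live page, it can help to try the duplicate and missing-value analysis on a small in-memory table first. The following is a minimal sketch under stated assumptions: the helper name `summarize_quality` and the sample rows are hypothetical and are not taken from any real site. It counts fully duplicated rows and missing values per column, then shows one possible cleanup.

```python
import numpy as np
import pandas as pd


def summarize_quality(df, key_columns=None):
    """Hypothetical helper: count duplicate rows and missing values in df."""
    missing = df.isnull().sum()
    return {
        # Rows that repeat an earlier row (optionally compared on key_columns only)
        "duplicate_rows": int(df.duplicated(subset=key_columns).sum()),
        # Number of missing values in each column
        "missing_per_column": {col: int(n) for col, n in missing.items()},
    }


# Made-up sample rows standing in for scraped results
sample = pd.DataFrame(
    {
        "Title": ["Movie A", "Movie B", "Movie B", "Movie C"],
        "Crew": ["Director X", "Director Y", "Director Y", None],
        "Rating": [9.2, 9.0, 9.0, np.nan],
    }
)

report = summarize_quality(sample)
print(report)

# One possible cleanup: drop exact duplicates, then drop rows without a rating
cleaned = sample.drop_duplicates().dropna(subset=["Rating"]).reset_index(drop=True)
print(cleaned)
```

Whether to drop or fill missing values depends on the later analysis. Dropping is used here only because it is the simplest option to demonstrate.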