Write Python code for me that scrapes the Lianjia website, cleans the data, and visualizes it
Posted: 2024-03-25 17:40:31
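Sure. Below is a basic scraper, followed by starting points for data cleaning and visualization that you will need to adapt to your specific needs.
1. Scraping the Lianjia website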
```python
import requests
from bs4 import BeautifulSoup

url = 'https://bj.lianjia.com/zufang/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Fetch the listing page; fail early on HTTP errors or a hung connection
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Each listing is expected to sit in a div with this class
house_list = soup.find_all('div', class_='content__list--item--main')
for house in house_list:
    title = house.find('p', class_='content__list--item--title').text.strip()
    price = house.find('span', class_='content__list--item-price').text.strip()
    # The description is '/'-separated; the second field is taken as the area
    area = house.find('p', class_='content__list--item--des').text.strip().split('/')[1].strip()
    print(title, price, area)
```
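The scraper takes the second `/`-separated field of the description as the area, which breaks if a listing omits or reorders a field. A minimal sketch of a less position-dependent lookup follows, assuming the area field always carries the `㎡` unit; the helper name `find_area` and the sample strings are hypothetical and made up for illustration.

```python
def find_area(description, marker="㎡"):
    """Return the first '/'-separated segment containing the area marker, or None."""
    for segment in description.split("/"):
        segment = segment.strip()
        if marker in segment:
            return segment
    return None


# Hypothetical description strings, not taken from the live site
print(find_area("Chaoyang-Wangjing / 89.00㎡ / South / 2 bed 1 bath"))
print(find_area("89.00㎡ / South"))
print(find_area("no area in this string"))
```

Returning `None` instead of raising lets the caller decide whether to skip the listing or keep it with a missing area.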
2. Data cleaning
```python
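import pandas as pd

# Placeholder values made up for illustration; replace them with the lists scraped in step 1
titles = ['Listing A', 'Listing B', 'Listing A']
prices = ['5000 元/月', '7200 元/月', '5000 元/月']
areas = ['89.00㎡', '120.50㎡', '89.00㎡']
df = pd.DataFrame({'Title': titles, 'Price': prices, 'Area': areas})

# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Extract the leading number from each field (raw strings avoid an invalid-escape warning)
df['Price'] = df['Price'].str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)
df['Area'] = df['Area'].str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)
# Drop rows with missing values, including fields where no number was found
df.dropna(inplace=True)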
```
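Note that a pattern like `(\d+)` keeps only the digits before a decimal point, so an area of `89.50㎡` would become `89`. The sketch below shows one way to pull out the first integer or decimal with the standard `re` module. The helper name `first_number` and the sample inputs are assumptions for illustration, and taking the lower bound of a price range is a design choice, not something the answer specifies.

```python
import re

# Matches an integer or a decimal such as 5000 or 89.50
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    if text is None:
        return None
    match = _NUMBER.search(str(text))
    return float(match.group()) if match else None


# Hypothetical raw values as they might come out of the scraper
for raw in ["89.50㎡", "5000-6000 元/月", "面议", None]:
    print(repr(raw), "->", first_number(raw))
```

With pandas, the same helper could be applied per column via `df['Area'].map(first_number)`.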
3. Data visualization
```python
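import matplotlib.pyplot as plt

# Bar chart: number of listings at each price (each plot gets its own figure so they don't overlap)
price_counts = df['Price'].value_counts().sort_index()
plt.figure()
plt.bar(price_counts.index, price_counts.values)
# Scatter plot: area vs. price
plt.figure()
plt.scatter(df['Area'], df['Price'])
# Line chart: mean area at each price
plt.figure()
df.groupby('Price')['Area'].mean().plot()
plt.show()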
```
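Counting listings per exact price tends to give one thin bar per distinct value, which is hard to read. As a rough sketch, prices can be grouped into fixed-width buckets before plotting; the helper name `bin_prices`, the default bucket width of 1000, and the sample prices are assumptions chosen for illustration.

```python
from collections import Counter


def bin_prices(prices, width=1000):
    """Count prices per bucket of size `width`; keys are the bucket lower bounds."""
    if width <= 0:
        raise ValueError("width must be positive")
    counts = Counter(int(p // width) * width for p in prices)
    return dict(sorted(counts.items()))


# Hypothetical monthly rents, made up for illustration
bins = bin_prices([4500, 4999, 5000, 7200], width=1000)
print(bins)
```

The resulting keys and counts can then be passed to `plt.bar`, with the `width` argument set close to the bucket size so the bars are visible.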
The above is a simple example; modify and extend it to fit your specific needs.