Python crawler: scraping 400 second-hand housing listings from Lianjia Suzhou
Sure, you can scrape second-hand housing information from Lianjia Suzhou by following these steps:
1. Import the required libraries: requests, lxml, pandas, and time
```python
import requests
from lxml import etree
import pandas as pd
import time
```
2. Construct the request headers and the URL template
```python
url = "https://su.lianjia.com/ershoufang/pg{}/"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
}
```
3. Define a function that extracts the listing information from one page
```python
def get_info(url):
    # Request the page and parse it with lxml
    res = requests.get(url, headers=headers)
    res.encoding = res.apparent_encoding
    html = etree.HTML(res.text)
    # These XPaths follow the page structure assumed by the original code;
    # adjust them if Lianjia's markup has changed
    titles = html.xpath('//div[@class="title"]/a/text()')
    areas = html.xpath('//div[@class="houseInfo"]/a[1]/text()')
    layouts = html.xpath('//div[@class="houseInfo"]/a[2]/text()')
    sizes = html.xpath('//div[@class="houseInfo"]/text()[3]')
    years = html.xpath('//div[@class="houseInfo"]/text()[4]')
    prices = html.xpath('//div[@class="totalPrice"]/span/text()')
    # Build one dict per listing (assumes all lists have the same length)
    infos = []
    for i in range(len(titles)):
        info = {}
        info['title'] = titles[i]
        info['area'] = areas[i]
        info['layout'] = layouts[i]
        info['size'] = sizes[i]
        info['year'] = years[i]
        info['price'] = prices[i]
        infos.append(info)
    return infos
```
4. Scrape multiple pages and save the results to a CSV file
```python
infos = []
for i in range(1, 5):
    # Fill the page number into the URL template (use a separate variable so
    # the template itself is not overwritten on the first iteration)
    page_url = url.format(i)
    infos += get_info(page_url)
    time.sleep(1)  # pause between requests to reduce the risk of being blocked
df = pd.DataFrame(infos)
df.to_csv("lianjia.csv", index=False, encoding='utf-8-sig')
```
The code above scrapes the first 4 pages of listings and saves the results to lianjia.csv. Lianjia typically shows about 30 listings per page, so to collect roughly 400 listings you would need to crawl around 14 pages (for example, range(1, 15)).
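To confirm how many listings were actually saved, a quick sanity check reading back the lianjia.csv produced above might look like this:

```python
import pandas as pd

# Read back the CSV written by the scraper and report how many rows it holds
df = pd.read_csv("lianjia.csv", encoding='utf-8-sig')
print(f"Saved {len(df)} listings")
print(df.head())  # preview the first few rows
```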
Note: to avoid triggering anti-crawling measures, add a reasonable delay between requests, as time.sleep(1) does above. If you run into CAPTCHAs or other blocks while scraping, consider using proxy IPs or simulating a login.
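As a rough illustration of those two suggestions, the sketch below adds a randomized delay and routes requests through a proxy; the proxy address is a placeholder that you would need to replace with a working one:

```python
import random
import time

import requests

# Placeholder proxy address -- replace with a real proxy before use
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

def fetch(url, headers):
    # Wait a random 1-3 seconds between requests to look less like a bot
    time.sleep(random.uniform(1, 3))
    # Route the request through the proxy and fail fast if the connection hangs
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```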