Python crawler: scraping 400 second-hand housing listings from Lianjia Suzhou
Sure, you can scrape second-hand housing information from Lianjia Suzhou by following these steps:
1. Import the required libraries: requests, lxml, pandas, and time
```python
import requests
from lxml import etree
import pandas as pd
import time
```
2. Construct the request headers and the URL template
```python
url = "https://su.lianjia.com/ershoufang/pg{}/"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
}
```
3. Define a function that extracts the listing information from one page
```python
def get_info(url):
    # Request the page and parse it with lxml
    res = requests.get(url, headers=headers)
    res.encoding = res.apparent_encoding
    html = etree.HTML(res.text)
    # These XPaths follow the page structure assumed by the original code;
    # adjust them if Lianjia's markup has changed
    titles = html.xpath('//div[@class="title"]/a/text()')
    areas = html.xpath('//div[@class="houseInfo"]/a[1]/text()')
    layouts = html.xpath('//div[@class="houseInfo"]/a[2]/text()')
    sizes = html.xpath('//div[@class="houseInfo"]/text()[3]')
    years = html.xpath('//div[@class="houseInfo"]/text()[4]')
    prices = html.xpath('//div[@class="totalPrice"]/span/text()')
    # Build one dict per listing (assumes all lists have the same length)
    infos = []
    for i in range(len(titles)):
        info = {}
        info['title'] = titles[i]
        info['area'] = areas[i]
        info['layout'] = layouts[i]
        info['size'] = sizes[i]
        info['year'] = years[i]
        info['price'] = prices[i]
        infos.append(info)
    return infos
```
4. Scrape multiple pages and save the results to a CSV file
```python
infos = []
for i in range(1, 5):
    # Fill the page number into the URL template (use a separate variable so
    # the template itself is not overwritten on the first iteration)
    page_url = url.format(i)
    infos += get_info(page_url)
    time.sleep(1)  # pause between requests to reduce the risk of being blocked
df = pd.DataFrame(infos)
df.to_csv("lianjia.csv", index=False, encoding='utf-8-sig')
```
The code above scrapes the first 4 pages of listings and saves the results to lianjia.csv. Lianjia typically shows about 30 listings per page, so to collect roughly 400 listings you would need to crawl around 14 pages (for example, range(1, 15)).
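To confirm how many listings were actually saved, a quick sanity check reading back the lianjia.csv produced above might look like this:

```python
import pandas as pd

# Read back the CSV written by the scraper and report how many rows it holds
df = pd.read_csv("lianjia.csv", encoding='utf-8-sig')
print(f"Saved {len(df)} listings")
print(df.head())  # preview the first few rows
```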
Note: to avoid triggering anti-crawling measures, add a reasonable delay between requests, as time.sleep(1) does above. If you run into CAPTCHAs or other blocks while scraping, consider using proxy IPs or simulating a login.
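As a rough illustration of those two suggestions, the sketch below adds a randomized delay and routes requests through a proxy; the proxy address is a placeholder that you would need to replace with a working one:

```python
import random
import time

import requests

# Placeholder proxy address -- replace with a real proxy before use
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

def fetch(url, headers):
    # Wait a random 1-3 seconds between requests to look less like a bot
    time.sleep(random.uniform(1, 3))
    # Route the request through the proxy and fail fast if the connection hangs
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```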