Crawling Detailed Province, City, and District Codes from the National Bureau of Statistics with Python
Date: 2023-11-05 19:52:08
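The following Python code crawls the detailed province, city, and district data published by the National Bureau of Statistics: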
```python
import re
from urllib.parse import urljoin

import requests

# Request headers
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'http://www.stats.gov.cn',
    'Referer': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

# URL of the province list
province_url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html'

# Assumed page markup (check it against the live pages before relying on it):
#   index page:  <td><a href='11.html'>北京市<br/></a></td>
#   lower pages: <tr class='countytr'><td><a href='01/110101.html'>110101000000</a></td>
#                <td><a href='01/110101.html'>东城区</a></td></tr>
# The patterns accept both single and double quotes around attribute values.
PROVINCE_RE = re.compile(r"<a href=['\"](\d{2})\.html['\"]>(.*?)</a>", re.S)
ROW_RE = re.compile(r"<tr class=['\"](city|county|town|village)tr['\"]>(.*?)</tr>", re.S)
CELL_RE = re.compile(r"<td>(.*?)</td>", re.S)
HREF_RE = re.compile(r"href=['\"]([^'\"]+\.html)['\"]")
TAG_RE = re.compile(r"<[^>]+>")

# Leading code digits printed per level, matching the original slices
# (the city length of 4 is an assumption; the original had no city branch)
CODE_LENGTH = {'city': 4, 'county': 6, 'town': 9, 'village': 12}


def fetch(url):
    """Download a page and decode it as GBK."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    response.encoding = 'gbk'
    return response.text


# Fetch the province list
def get_province():
    """Return (code, name) pairs parsed from the index page."""
    html = fetch(province_url)
    return [(code, TAG_RE.sub('', label).strip())
            for code, label in PROVINCE_RE.findall(html)]


# Fetch city, district, and street data
def get_data(url):
    """Print the code and name of each row on a page; return the child page URLs."""
    child_urls = []
    for level, row in ROW_RE.findall(fetch(url)):
        cells = [TAG_RE.sub('', cell).strip() for cell in CELL_RE.findall(row)]
        if len(cells) < 2:
            continue
        # The first cell holds the code and the last cell the name; village rows
        # are assumed to carry an extra classification column in between.
        print(cells[0][:CODE_LENGTH[level]], cells[-1])
        link = HREF_RE.search(row)
        if link:
            # Links are relative to the current page, so resolve them against it
            child_urls.append(urljoin(url, link.group(1)))
    return child_urls


# Crawl all levels
def spider():
    for province_code, province_name in get_province():
        print(province_code, province_name)
        # Walk down from the province page: city -> district -> street -> village
        pending = [urljoin(province_url, province_code + '.html')]
        while pending:
            pending.extend(get_data(pending.pop()))


if __name__ == '__main__':
    spider()
```
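The code first defines the request headers and the URL of the province list; get_province() then fetches that page and extracts each province's code and name.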
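
The province-parsing step can be tried offline before sending any requests. The sketch below is illustrative only: `SAMPLE_INDEX` is a hand-written fragment that assumes the index page marks provinces up as `<a href='11.html'>北京市<br/></a>`, and `parse_provinces` is a hypothetical helper, not part of the code above.

```python
import re

# Hand-written sample of the index page markup (an assumption, not fetched from the site)
SAMPLE_INDEX = (
    "<tr class='provincetr'>"
    "<td><a href='11.html'>北京市<br/></a></td>"
    "<td><a href='12.html'>天津市<br/></a></td>"
    "</tr>"
)

# Matches <a href='NN.html'>name</a> with either quote style
PROVINCE_RE = re.compile(r"<a href=['\"](\d{2})\.html['\"]>(.*?)</a>", re.S)


def parse_provinces(html):
    """Return a list of (code, name) tuples found in the index page HTML."""
    result = []
    for code, label in PROVINCE_RE.findall(html):
        name = re.sub(r"<[^>]+>", "", label).strip()  # drop inner tags such as <br/>
        result.append((code, name))
    return result


if __name__ == "__main__":
    for code, name in parse_provinces(SAMPLE_INDEX):
        print(code, name)
```

Restricting the pattern to two-digit file names is a design choice: it keeps unrelated links on the page (navigation, footer) out of the result. If the live markup differs from the sample, only `PROVINCE_RE` should need adjusting.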
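Next comes the URL scheme for the city, district, and street pages; get_data() requests such a page and prints the code and name found in each row.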
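
How the page URLs nest is easiest to see by resolving links step by step. This is a minimal sketch under the assumption that every page links to its children with relative `href` values; the four sample hrefs and the helper `resolve_chain` are hypothetical and chosen only to illustrate the pattern.

```python
from urllib.parse import urljoin

INDEX_URL = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html'

# Assumed relative links, one per level: province, city, district, street
SAMPLE_HREFS = ['11.html', '11/1101.html', '01/110101.html', '01/110101001.html']


def resolve_chain(start_url, hrefs):
    """Resolve each relative href against the page reached by the previous one."""
    urls = []
    current = start_url
    for href in hrefs:
        current = urljoin(current, href)  # replaces the last path segment of current
        urls.append(current)
    return urls


if __name__ == "__main__":
    for url in resolve_chain(INDEX_URL, SAMPLE_HREFS):
        print(url)
```

Resolving with `urljoin` avoids hand-building paths from code prefixes, which is where off-by-one-level mistakes tend to creep in.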
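Finally, spider() ties the steps together: it fetches the province list, then for each province fetches the city data and continues down to the district and street pages where they exist.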
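
The traversal itself can be exercised without network access. In the sketch below, `FAKE_PAGES` is a made-up in-memory stand-in for the site and `crawl` is a hypothetical helper; it walks breadth-first, which is one possible visiting order rather than the order the code above necessarily uses.

```python
from collections import deque

# Made-up stand-in for the site: page key -> rows of (code, name, child page key or None)
FAKE_PAGES = {
    "index": [("11", "北京市", "11")],
    "11": [("1101", "市辖区", "11/1101")],
    "11/1101": [("110101", "东城区", "11/01/110101")],
    "11/01/110101": [("110101001", "东华门街道", None)],
}


def crawl(start, load_rows):
    """Breadth-first walk: yield (code, name) for every row reachable from start."""
    queue = deque([start])
    seen = {start}
    while queue:
        page = queue.popleft()
        for code, name, child in load_rows(page):
            yield code, name
            # Queue each child page once so repeated links are not fetched twice
            if child is not None and child not in seen:
                seen.add(child)
                queue.append(child)


if __name__ == "__main__":
    for code, name in crawl("index", lambda page: FAKE_PAGES.get(page, [])):
        print(code, name)
```

Passing the page loader in as a function keeps the traversal testable: in a real run, `load_rows` would download and parse a page instead of reading from a dictionary.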