Scraping National Bureau of Statistics Data with Python
Date: 2023-12-25 21:30:15
The steps below show how to scrape data from the National Bureau of Statistics website with Python:
1. Import the required libraries:
```python
import requests
from bs4 import BeautifulSoup
```
2. Send an HTTP request to fetch the page content:
```python
url = "http://data.stats.gov.cn/easyquery.htm?cn=A01"
response = requests.get(url, timeout=10)  # a timeout avoids hanging on slow responses
```
3. Parse the page content with BeautifulSoup:
```python
soup = BeautifulSoup(response.text, 'html.parser')
```
4. Locate and extract the target data:
```python
data = soup.find('div', class_='center_list_contlist').text  # class name depends on the page's current markup
```
5. Clean the extracted text:
```python
data = data.strip().replace('\n', '')
```
6. Print the result:
```python
print(data)
```
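Since the live page's layout may differ from what the example assumes, the parsing and cleaning steps above can be checked offline against a small HTML snippet that mimics the assumed structure (the `center_list_contlist` class name is taken from the example and is an assumption about the page's markup):

```python
from bs4 import BeautifulSoup

# Minimal HTML standing in for the fetched page; the class name
# 'center_list_contlist' is assumed from the example above.
html = '''
<div class="center_list_contlist">
  GDP data
  2023 statistics
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
data = soup.find('div', class_='center_list_contlist').text
data = data.strip().replace('\n', '')
print(data)
```

The cleaning step removes all interior newlines but keeps other whitespace, which may or may not be what you want for tabular data.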
Related questions
Scraping province/city/district data from the National Bureau of Statistics with Python
The requests and BeautifulSoup libraries can be used to scrape province, city, and district information from the National Bureau of Statistics. Example code:
```python
import requests
from bs4 import BeautifulSoup
url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html'
response = requests.get(url)
response.encoding = 'gbk'
soup = BeautifulSoup(response.text, 'html.parser')
province_list = soup.select('.provincetr a')
for province in province_list:
    province_name = province.text.strip()
    province_code = province['href'][:2]
    print(province_code, province_name)
    city_url = f'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/{province_code}.html'
    city_response = requests.get(city_url)
    city_response.encoding = 'gbk'
    city_soup = BeautifulSoup(city_response.text, 'html.parser')
    city_list = city_soup.select('.citytr a')
    for city in city_list:
        city_name = city.text.strip()
        city_code = city['href'][3:5]
        print('\t', city_code, city_name)
        county_url = f'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/{province_code}/{city_code}.html'
        county_response = requests.get(county_url)
        county_response.encoding = 'gbk'
        county_soup = BeautifulSoup(county_response.text, 'html.parser')
        county_list = county_soup.select('.countytr a') or county_soup.select('.towntr a')
        for county in county_list:
            county_name = county.text.strip()
            county_code = county['href'][5:9]
            print('\t\t', county_code, county_name)
```
The code above prints the name and code of each province, city, and district. Note that the page encoding must be set to `gbk`, since these pages are not served as UTF-8.
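To see how the `select('.provincetr a')` and href-slicing steps behave, here is an offline sketch on a miniature table. The `provincetr` row class comes from the code above; the exact row markup and the `<code>.html` href format are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# A miniature province table; the row markup and href format are
# assumptions modeled on the selectors and slices used above.
html = '''
<table>
  <tr class="provincetr">
    <td><a href="11.html">Beijing</a></td>
    <td><a href="13.html">Hebei</a></td>
  </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('.provincetr a'):
    # href like '11.html' -> first two characters are the province code
    print(a['href'][:2], a.text.strip())
```

The same slicing pattern (`[3:5]`, `[5:9]`) in the city and county loops depends on the href formats of those deeper pages, so verify them against the live markup before relying on the offsets.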
Detailed Python code for scraping provinces, cities, and districts from the National Bureau of Statistics
The following code scrapes detailed province/city/district data from the National Bureau of Statistics:
```python
import requests

# request headers
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'http://www.stats.gov.cn',
    'Referer': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

# URL of the province list page
province_url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html'

# fetch the province list
def get_province():
    response = requests.get(province_url, headers=headers)
    response.encoding = 'gbk'
    province_list = response.text.split('<td><a href="')[1:]
    for province in province_list:
        province_code = province.split('.html">')[0]
        province_name = province.split('.html">')[1].split('</a></td>')[0]
        print(province_code, province_name)

# URL template for city, district, and sub-district pages
city_url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/{}/{}/{}.html'

# fetch city, district, and sub-district data
def get_data(url):
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    data_list = response.text.split('<tr class="')[1:]
    for data in data_list:
        if 'countytr' in data:
            code = data.split('<td>')[0].split('href="')[1].split('.html')[0][-6:]
            name = data.split('<td>')[1].split('</td>')[0]
            print(code, name)
        elif 'towntr' in data:
            code = data.split('<td>')[0].split('href="')[1].split('.html')[0][-9:]
            name = data.split('<td>')[1].split('</td>')[0]
            print(code, name)
        elif 'villagetr' in data:
            code = data.split('<td>')[0].split('</a>')[0][-12:]
            name = data.split('<td>')[2].split('</td>')[0]
            print(code, name)

# crawl everything
def spider():
    # fetch the province list
    response = requests.get(province_url, headers=headers)
    response.encoding = 'gbk'
    province_list = response.text.split('<td><a href="')[1:]
    for province in province_list:
        province_code = province.split('.html">')[0]
        province_name = province.split('.html">')[1].split('</a></td>')[0]
        print(province_code, province_name)
        # fetch city data
        city_code = province_code[:2]
        city_url_now = city_url.format(city_code, province_code, city_code + province_code)
        get_data(city_url_now)
        # fetch district and sub-district data
        if city_code in {'11', '12', '31', '50'}:
            # municipalities directly under the central government
            area_code = province_code[:6]
            area_url_now = city_url.format(city_code + province_code[:2], area_code, area_code)
            get_data(area_url_now)
        else:
            city_list_url = city_url.format(city_code + province_code[:2], city_code + province_code, city_code + province_code)
            response = requests.get(city_list_url, headers=headers)
            response.encoding = 'gbk'
            city_list = response.text.split('<td><a href="')[1:]
            for city in city_list:
                city_code_now = city.split('.html">')[0][-4:]
                city_url_now = city_url.format(city_code + province_code[:2], city_code_now, city_code + city_code_now)
                get_data(city_url_now)

if __name__ == '__main__':
    spider()
```
The code first defines the request headers and the URL of the province list page; get_province() scrapes the province list.
It then defines the URL template for city, district, and sub-district pages, and get_data() scrapes that data.
Finally, spider() scrapes the province list, fetches the city data for each province code, decides whether district and sub-district data also need to be fetched, and retrieves it. Note that this approach extracts fields by splitting raw HTML strings, so it is tightly coupled to the exact page markup and will break if the site changes.
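As an alternative to the brittle string-splitting in get_data(), the same per-row extraction can be sketched with BeautifulSoup, which the earlier answer already uses. The sample `countytr` row markup here is an assumption for illustration, not a guarantee of the live page's structure:

```python
from bs4 import BeautifulSoup

# A sample county row; the 'countytr' class comes from the code above,
# but the cell layout is an assumption for this offline demonstration.
html = '''
<table>
  <tr class="countytr">
    <td><a href="01/110101.html">110101000000</a></td>
    <td><a href="01/110101.html">Dongcheng District</a></td>
  </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
for row in soup.select('.countytr'):
    cells = row.find_all('td')
    code = cells[0].get_text(strip=True)  # first cell: numeric division code
    name = cells[1].get_text(strip=True)  # second cell: division name
    print(code, name)
```

Selecting by row class and reading cell text survives incidental markup changes (attribute order, quoting style) that would break fixed string offsets.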