Python Job-Site Data Analysis Source Code
Date: 2023-11-05 14:54:02
Below is an example of using Python to scrape a job-listing site and analyze the results:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Scraper: parse one listing page into rows of [title, salary, company, location]
def get_job_list(url):
    # A browser-like User-Agent is commonly needed to avoid being blocked (assumption)
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    jobs = soup.find_all('div', {'class': 'job-primary'})
    job_list = []
    for job in jobs:
        # A missing element makes find() return None, so .text raises AttributeError
        try:
            title = job.find('div', {'class': 'job-title'}).text.strip()
        except AttributeError:
            title = ''
        try:
            salary = job.find('span', {'class': 'red'}).text.strip()
        except AttributeError:
            salary = ''
        try:
            company = job.find('div', {'class': 'company-text'}).a.text.strip()
        except AttributeError:
            company = ''
        try:
            location = job.find('div', {'class': 'job-address'}).text.strip()
        except AttributeError:
            location = ''
        job_list.append([title, salary, company, location])
    return job_list
# Crawl the first 10 result pages
url = 'https://www.zhipin.com/c100010000/?query=Python&page={}'
job_list = []
for i in range(1, 11):
    url_page = url.format(i)
    job_list += get_job_list(url_page)
# Convert the results to a DataFrame
df = pd.DataFrame(job_list, columns=['Title', 'Salary', 'Company', 'Location'])
# Data cleaning: drop duplicates, then split salary ranges into numeric bounds
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df['Salary_min'] = df.Salary.apply(lambda x: x.split('-')[0] if '-' in x else x)
df['Salary_max'] = df.Salary.apply(lambda x: x.split('-')[1] if '-' in x else x)
# Normalize strings like '15K' or '20K以上' ("20K and above") to plain numbers
df['Salary_min'] = df.Salary_min.apply(lambda x: x.replace('K', '000').replace('以上', ''))
df['Salary_max'] = df.Salary_max.apply(lambda x: x.replace('K', '000').replace('以上', ''))
# errors='coerce' turns unparsable values (e.g. empty strings) into NaN
df['Salary_min'] = pd.to_numeric(df.Salary_min, errors='coerce')
df['Salary_max'] = pd.to_numeric(df.Salary_max, errors='coerce')
df['Salary_avg'] = (df.Salary_min + df.Salary_max) / 2
# Keep only the city part of locations like '北京·朝阳区'
df['Location'] = df.Location.apply(lambda x: x.split('·')[0])
# Analysis: job counts and average salary, grouped by location
location_group = df.groupby('Location')['Title'].count().reset_index().sort_values(by='Title', ascending=False)
salary_group = df.groupby('Location')['Salary_avg'].mean().reset_index().sort_values(by='Salary_avg', ascending=False)
# Print the top 10 of each ranking
print('Job postings by location:\n', location_group.head(10))
print('\nAverage salary by location:\n', salary_group.head(10))
```
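The salary-cleaning steps can be checked in isolation, without any network access. Here is a minimal sketch of that logic as a standalone helper; the function name `parse_salary` and the sample strings are illustrative, not part of the original script:

```python
import pandas as pd

def parse_salary(s):
    # Ranges look like '15K-25K'; open-ended values look like '20K以上' ("and above")
    if '-' in s:
        low, high = s.split('-')
    else:
        low = high = s
    # 'K' -> thousands; strip the '以上' suffix
    low = low.replace('K', '000').replace('以上', '')
    high = high.replace('K', '000').replace('以上', '')
    # errors='coerce' maps empty/unparsable strings to NaN instead of raising
    return pd.to_numeric(low, errors='coerce'), pd.to_numeric(high, errors='coerce')

print(parse_salary('15K-25K'))  # (15000, 25000)
print(parse_salary('20K以上'))  # (20000, 20000)
```

Testing the parser on a handful of known strings like this is a quick way to catch format changes on the site before they silently corrupt the analysis.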
This example uses the `requests` and `BeautifulSoup` libraries to fetch and parse the pages, uses `pandas` to load the rows into a DataFrame for cleaning and processing, and finally runs the analysis with the DataFrame's `groupby()` method. The output is a ranking of locations by number of postings and a ranking by average salary.
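The grouping step can also be demonstrated on a tiny hand-built DataFrame. The sample rows below are hypothetical, mimicking the scraped columns:

```python
import pandas as pd

# Hypothetical sample data with the same columns the scraper produces
df = pd.DataFrame({
    'Title': ['Python工程师', '数据分析师', 'Python开发'],
    'Location': ['北京', '上海', '北京'],
    'Salary_avg': [20000.0, 18000.0, 24000.0],
})

# Number of postings per location
location_group = df.groupby('Location')['Title'].count().reset_index()
# Mean of the per-posting average salary, per location
salary_group = df.groupby('Location')['Salary_avg'].mean().reset_index()

print(location_group)  # 北京 has 2 postings, 上海 has 1
print(salary_group)    # 北京 averages 22000.0, 上海 averages 18000.0
```

`reset_index()` turns the group keys back into an ordinary column, which makes the result easier to sort and print as a table.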