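Write a crawler that scrapes big-data job postings from Lagou, taking some measures against its anti-crawling checks
Time: 2024-02-26 10:55:54 Views: 26
Sure. Below is example code that tries to reduce the chance of being blocked by Lagou's anti-crawling mechanism: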
```python
import requests
import json
import time
from fake_useragent import UserAgent
# Query-string parameters; adjust as needed
params = {
'city': '全国',
'needAddtionalResult': 'false',
'isSchoolJob': 0
}
# Request headers
headers = {
'User-Agent': UserAgent().random,
'Referer': 'https://www.lagou.com/jobs/list_%E5%A4%A7%E6%95%B0%E6%8D%AE?labelWords=&fromSearch=true&suginput=',
'Origin': 'https://www.lagou.com',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'X-Anit-Forge-Code': '0',
'X-Anit-Forge-Token': 'None',
'X-Requested-With': 'XMLHttpRequest'
}
# Cookies copied from a browser session; these sample values expire, so replace them with your own
cookies = {
'user_trace_token': '20180920153734-45f1f8b8-bdca-11e8-bb9f-525400f775ce',
'LGUID': '20180920153734-45f1fbff-bdca-11e8-bb9f-525400f775ce',
'JSESSIONID': 'ABAAABAACDBABJB3FA1A09C4B08C5E2C8B8F7BDD6D3F11A',
'_gat': '1',
'PRE_UTM': '',
'PRE_HOST': '',
'PRE_SITE': 'https%3A%2F%2Fwww.lagou.com%2F',
'PRE_LAND': 'https%3A%2F%2Fwww.lagou.com%2F',
'_gid': 'GA1.2.1630791676.1537426655',
'index_location_city': '%E5%85%A8%E5%9B%BD',
'TG-TRACK-CODE': 'index_navigation',
'_ga': 'GA1.2.2005406544.1537426655',
'LGRID': '20180920164319-5d5c3a5d-bdd0-11e8-bb9f-525400f775ce'
}
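# Request one results page and return the parsed JSON, or None on failure
def get_page(page):
    data = {
        'pn': str(page),   # page number
        'kd': '大数据'      # search keyword ("big data")
    }
    # Pick a fresh random User-Agent for every request instead of reusing one
    request_headers = {**headers, 'User-Agent': UserAgent().random}
    try:
        response = requests.post('https://www.lagou.com/jobs/positionAjax.json', headers=request_headers, params=params, data=data, cookies=cookies, timeout=10)
        if response.status_code == 200:
            content = response.content.decode('utf-8')
            result = json.loads(content)
            # Use .get() so a response without a 'success' key counts as a failure
            if result.get('success'):
                return result
    except requests.exceptions.RequestException as e:
        print('Request failed:', e)
    except ValueError as e:
        # The body was not valid UTF-8 JSON (for example, an HTML verification page)
        print('Could not parse response:', e)
    return None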
# Print the key fields of each job posting
def parse_result(result):
    for job in result['content']['positionResult']['result']:
        print('Job title: ', job['positionName'])
        print('Company: ', job['companyFullName'])
        print('Location: ', job['city'])
        print('Salary range: ', job['salary'])
        print('Posted: ', job['createTime'])
        print('------------------------------------')
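# Main function: fetch pages 1 to 9, pausing between requests
def main():
    for page in range(1, 10):
        result = get_page(page)
        if result:
            parse_result(result)
        # Wait after every request, whether or not it succeeded
        time.sleep(5)

if __name__ == '__main__':
    main()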
```
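The code above takes the following measures against anti-crawling checks:
1. Generate a random User-Agent with fake_useragent instead of sending a fixed one;
2. Set Referer to the search results page the request would normally come from, so it does not look out of place for this endpoint;
3. Set Origin to the site's own domain, so it matches the domain being requested;
4. Set X-Anit-Forge-Code and X-Anit-Forge-Token, to avoid being rejected by the anti-forgery check;
5. Send cookies copied from a browser session, so the request looks like it comes from an existing visitor (the sample values expire and need to be replaced with fresh ones);
6. Set a request timeout (10 seconds), so a request cannot hang indefinitely;
7. Pause between requests (5 seconds), so requests are not sent too frequently.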
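Measures 6 and 7 could be taken a step further with a randomized pause and a retry that backs off after a failure. The following is only a minimal sketch and not part of the original answer: the names `jittered_delay`, `backoff_delay`, and `fetch_with_retry` are hypothetical, the default values are illustrative assumptions, and it assumes a fetch function such as `get_page` above that returns `None` on failure.

```python
import random
import time


def jittered_delay(base=5.0, jitter=2.0, rng=random):
    """Return a pause of `base` seconds plus up to `jitter` random seconds."""
    return base + rng.uniform(0, jitter)


def backoff_delay(attempt, base=5.0, cap=60.0):
    """Return a pause that doubles with each retry (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(fetch, page, max_retries=3, sleep=time.sleep):
    """Call fetch(page); if it returns None, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        result = fetch(page)
        if result is not None:
            return result
        # Only wait when another attempt will actually follow
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    return None
```

Under these assumptions, `main` could call `fetch_with_retry(get_page, page)` instead of `get_page(page)` and `time.sleep(jittered_delay())` instead of `time.sleep(5)`, so that the interval between requests is less regular and repeated failures slow the crawler down rather than hammering the server.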