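How do I write the code when anti-crawler checks take 'Referer' into account?
Posted: 2024-03-27 16:41:56 Views: 20
If a site uses the 'Referer' header as part of its anti-crawler checks, you can add a 'Referer' field to the request headers to tell the server which page the request came from. The code looks like this: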
```python
import requests

url = "http://example.com"
headers = {
    # Page the request claims to come from
    "Referer": "http://example.com",
    # Browser-like User-Agent string
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get(url, headers=headers)
```
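In the code above, we first define the request URL and the headers. The 'Referer' field is set to the address of the page the request is supposed to come from, and the 'User-Agent' field is set to a browser User-Agent string so that the server is less likely to identify the request as a crawler. We then send a GET request with the requests library, passing the headers as an argument.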
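When many different pages are requested, the Referer usually should not be one hard-coded string. The following is a minimal sketch, assuming the site accepts its own origin (scheme plus host) as a Referer; the helper name `build_headers` and the constant `DEFAULT_UA` are hypothetical and not part of the original answer.

```python
from urllib.parse import urlsplit

# Browser-like User-Agent string, same as in the example above
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)


def build_headers(url, referer=None, user_agent=DEFAULT_UA):
    """Return request headers for `url`, including a Referer.

    If no explicit referer is given, fall back to the origin of `url`
    (an assumption: some sites may require a more specific page instead).
    """
    if referer is None:
        parts = urlsplit(url)
        referer = f"{parts.scheme}://{parts.netloc}/"
    return {"Referer": referer, "User-Agent": user_agent}
```

The result can be passed straight to requests, for example `requests.get(url, headers=build_headers(url))`. If the site checks for a specific referring page, pass it explicitly through the `referer` argument.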
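Related questions
Douban crawler code that handles anti-crawler measures
Here is an example of a Douban crawler that includes countermeasures against anti-crawler mechanisms: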
```python
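import time

import requests
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random,
    'Referer': 'https://www.douban.com/',
    # No hard-coded 'Host' header: requests derives it from each URL, and a fixed
    # 'www.douban.com' value would be wrong for the accounts.douban.com login request.
}
session = requests.Session()

# First send the login request so that the session stores the login cookies
login_url = 'https://accounts.douban.com/passport/login'
data = {
    'username': 'your_username',
    'password': 'your_password',
}
session.post(login_url, headers=headers, data=data)

# Reuse the session cookies for the following requests
url = 'https://www.douban.com/movies'
response = session.get(url, headers=headers)

# Handle anti-crawler measures such as a captcha
if 'captcha_image' in response.text:
    # An optical character recognition (OCR) library could be used here to read the captcha
    captcha_image_url = 'https://www.douban.com/misc/captcha?id=xxx'  # URL of the captcha image
    captcha_solution = 'your_solution'  # result of recognizing the captcha
    # Add the captcha information to the request headers
    headers['captcha-id'] = 'xxx'  # captcha ID, taken from the response
    headers['captcha-solution'] = captcha_solution
    # Send the request again, this time with the captcha information
    response = session.get(url, headers=headers)

# Handle other anti-crawler measures, such as rate limits
time.sleep(3)  # pause for 3 seconds to mimic human behavior

# Continue here by parsing the page and extracting the data you need
# ...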
```
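The fixed `time.sleep(3)` above pauses only once. When several pages are fetched in a row, a perfectly regular interval is itself easy to spot, so a common variation is a randomized pause between requests. The sketch below illustrates that idea; the names `polite_delay` and `fetch_all` and the default values are assumptions chosen for illustration, and `fetch` stands for whatever function actually performs the request.

```python
import random
import time


def polite_delay(base=3.0, jitter=1.5, rng=random):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0, jitter)


def fetch_all(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing a random interval between calls.

    `sleep` is injectable so the pacing logic can be exercised without waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, a jittered pause before each later one
            sleep(polite_delay())
        results.append(fetch(url))
    return results
```

With the session from the example above, this could be called as `fetch_all(urls, lambda u: session.get(u, headers=headers))`.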
Write a Python crawler that scrapes QQ Music songs
Sure, here is a simple Python crawler that scrapes songs from QQ Music:
```python
import json
import time  # needed by get_song_url for the timestamp parameter

import requests


def get_song_list(keyword, page):
    """
    Get the list of songs for a search.
    :param keyword: search keyword
    :param page: page number
    :return: list of songs, or None if the request fails
    """
    url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
    headers = {
        'referer': 'https://y.qq.com/portal/search.html',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    params = {
        'ct': '24',
        'qqmusic_ver': '1298',
        'new_json': '1',
        'remoteplace': 'sizer.yqq.song_next',
        'searchid': '64405487069162918',
        't': '0',
        'aggr': '1',
        'cr': '1',
        'catZhida': '1',
        'lossless': '0',
        'flag_qc': '0',
        'p': page,
        'n': '20',
        'w': keyword,
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0'
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        song_list = []
        data = json.loads(response.text)
        for song in data['data']['song']['list']:
            song_info = {
                'song_name': song['name'],
                'singer': song['singer'][0]['name'],  # first listed singer only
                'album': song['album']['name'],
                'interval': song['interval'],
                'song_id': song['mid']
            }
            song_list.append(song_info)
        # Return only after every song on the page has been collected
        return song_list
    else:
        return None


def get_song_url(song_id):
    """
    Get the playback link of a song.
    :param song_id: song ID
    :return: playback link of the song, or None if the request fails
    """
    url = 'https://u.y.qq.com/cgi-bin/musicu.fcg'
    headers = {
        'referer': 'https://y.qq.com/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    params = {
        # Millisecond timestamp appended to the request name (requires `import time`)
        '-': 'getplaysongvkey' + str(int(round(time.time() * 1000))),
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0',
        'data': '{"req":{"module":"CDN.SrfCdnDispatchServer","method":"GetCdnDispatch","param":{"guid":"1535846080","calltype":0,"userip":""}},"req_0":{"module":"vkey.GetVkeyServer","method":"CgiGetVkey","param":{"guid":"1535846080","songmid":["' + song_id + '"],"songtype":[0],"uin":"0","loginflag":1,"platform":"20"}},"comm":{"uin":0,"format":"json","ct":24,"cv":0}}'
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        data = json.loads(response.text)
        if data['code'] == 0:
            return data['req_0']['data']['midurlinfo'][0]['purl']
        else:
            return None
    else:
        return None


if __name__ == '__main__':
    keyword = '周杰伦'
    page = 1
    # get_song_list returns None on a non-200 response, so fall back to an empty list
    song_list = get_song_list(keyword, page) or []
    for song in song_list:
        song_url = get_song_url(song['song_id'])
        print(song['song_name'], song['singer'], song['album'], song_url)
```
With the code above, we can fetch the list of songs for a given keyword and get the playback link for each song.
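The `data` parameter in `get_song_url` is assembled by string concatenation, which is hard to read and produces invalid JSON if the song ID ever contains a quote. One possible alternative is to build the same structure as a dictionary and serialize it with `json.dumps`. In the sketch below the field values are copied from the string in the code above; the helper name `build_vkey_payload` and the `guid` default argument are assumptions for illustration.

```python
import json


def build_vkey_payload(song_mid, guid="1535846080"):
    """Serialize the request body used by get_song_url as a JSON string."""
    payload = {
        "req": {
            "module": "CDN.SrfCdnDispatchServer",
            "method": "GetCdnDispatch",
            "param": {"guid": guid, "calltype": 0, "userip": ""},
        },
        "req_0": {
            "module": "vkey.GetVkeyServer",
            "method": "CgiGetVkey",
            "param": {
                "guid": guid,
                "songmid": [song_mid],
                "songtype": [0],
                "uin": "0",
                "loginflag": 1,
                "platform": "20",
            },
        },
        "comm": {"uin": 0, "format": "json", "ct": 24, "cv": 0},
    }
    # Compact separators keep the query string short; json.dumps handles escaping
    return json.dumps(payload, separators=(",", ":"))
```

Inside `get_song_url`, the `'data'` entry of `params` could then be written as `'data': build_vkey_payload(song_id)`.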