Python web crawler source code
Date: 2023-11-08 10:50:50
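Fetching web page data is one of a crawler's core tasks, and Python's standard urllib package can handle it. The urllib.request module sends both GET and POST requests: urllib.request.urlopen issues a GET request by default, and issues a POST request when it is given a data argument (form fields are encoded with urllib.parse.urlencode and converted to bytes). The response body is read with response.read(), which returns bytes that must be decoded into text. Passing a timeout to urlopen makes a slow request raise an exception that can be caught instead of blocking indefinitely. A urllib.request.Request object can also carry custom request headers, such as User-Agent, to mimic a browser. The following is a simple Python crawler source code example: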
```python
import socket
import urllib.error
import urllib.parse
import urllib.request

# Send a GET request and decode the response body
response = urllib.request.urlopen("https://www.example.com/")
html = response.read().decode("utf-8")

# Send a POST request; passing data switches urlopen from GET to POST
data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
result = response.read().decode("utf-8")

# Handle timeouts; the very short timeout here is meant to trigger the error
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    html = response.read().decode("utf-8")
except (urllib.error.URLError, socket.timeout) as e:
    # URLError also covers other network failures, so print the actual cause
    print("Request failed or timed out:", e)

# Set request headers to mimic a browser
url = "https://www.example.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
```
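The separate steps above (GET or POST, custom headers, timeout handling) can also be combined into one reusable helper. The following is a minimal sketch rather than a definitive implementation: the names `build_request`, `fetch`, and `DEFAULT_HEADERS`, the User-Agent string, and the 10-second default timeout are all assumptions made for illustration and do not come from the original example.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical default header; replace it with whatever the target site expects.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)"}


def build_request(url, form=None, headers=None):
    """Build a Request: POST when form fields are given, otherwise GET."""
    data = None
    if form is not None:
        # urlopen expects the POST body as bytes
        data = urllib.parse.urlencode(form).encode("utf-8")
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, data=data, headers=merged)


def fetch(url, form=None, headers=None, timeout=10.0):
    """Return the decoded response body, or None if the request fails."""
    request = build_request(url, form, headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (urllib.error.URLError, socket.timeout) as exc:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too
        print(f"Request to {url} failed: {exc}")
        return None
```

A call such as `html = fetch("https://www.example.com/")` performs a GET, while `fetch("http://httpbin.org/post", form={"hello": "world"})` performs a POST. Returning None on failure keeps the calling loop of a crawler simple, at the cost of hiding the exact error type from the caller.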