Scraping the web page self.url
You can scrape a web page with Python's requests library. A minimal example:
```python
import requests

url = "https://www.example.com"
response = requests.get(url)              # send a GET request
html = response.content.decode("utf-8")  # decode the raw bytes as UTF-8
print(html)
```
Here, url is the address of the page you want to scrape, requests.get(url) sends a GET request for it, response.content returns the raw response body as bytes, decode("utf-8") converts those bytes to a string, and the result is printed.
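For anything beyond a quick test, it is worth adding a timeout and a status check. The sketch below is a hedged variant of the same request; the 10-second timeout is an arbitrary choice, not something the original answer specifies:
```python
import requests

url = "https://www.example.com"
# timeout keeps the call from hanging forever; the value is an assumption
response = requests.get(url, timeout=10)
response.raise_for_status()                      # raise on 4xx/5xx responses
response.encoding = response.apparent_encoding   # guess encoding from the body
print(response.text)   # decoded text, equivalent to content.decode(...)
```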
Related questions
```
Traceback (most recent call last):
  File "D:\Python\pythonProject_pras\004_抓取知乎图片.py", line 27, in <module>
    urllib.request.urlretrieve(content_list[n], local)
  File "D:\Python\python39\lib\urllib\request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "D:\Python\python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python\python39\lib\urllib\request.py", line 501, in open
    req = Request(fullurl, data)
  File "D:\Python\python39\lib\urllib\request.py", line 320, in __init__
    self.full_url = url
  File "D:\Python\python39\lib\urllib\request.py", line 346, in full_url
    self._parse()
  File "D:\Python\python39\lib\urllib\request.py", line 375, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '" style='
```
How do I fix this error?
This error occurs because some strings in `content_list` are not valid URLs (here, the leftover HTML fragment `" style=`). You can filter the list with a regular expression so that only valid URLs remain, or check that the data in `content_list` was extracted correctly. Also make sure the network connection is working so the URLs can actually be reached.
Here is one possible fix, using a regular expression to filter the URLs:
```python
import re
import time
import urllib.request

# ... other code ...

# Keep only strings that start with http:// or https://
valid_urls = [url for url in content_list if re.match(r'^https?://', url)]

n = 0
while n < len(valid_urls):
    i = time.time()            # current timestamp, used as the file name
    local = "%s.jpg" % i
    urllib.request.urlretrieve(valid_urls[n], local)
    print("ID: " + str(i))
    n = n + 1
```
This ensures that only URLs starting with `http://` or `https://` are kept and downloaded.
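If you want stricter validation than a prefix match, urllib.parse can check that each string actually parses into a scheme and a host. This is a sketch, not part of the original answer, and the sample `content_list` below is hypothetical:
```python
from urllib.parse import urlparse

def is_valid_url(s: str) -> bool:
    """Return True when the string parses as an http(s) URL with a host."""
    parts = urlparse(s)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# hypothetical sample data for illustration
content_list = ['https://example.com/a.jpg', '" style=', 'http://example.com/b.jpg']
valid_urls = [u for u in content_list if is_valid_url(u)]
print(valid_urls)  # ['https://example.com/a.jpg', 'http://example.com/b.jpg']
```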
```python
from mitmproxy import http
from mitmproxy.tools.main import mitmdump

class AddHeader:
    def __init__(self):
        self.num = 0

    def request(self, flow: http.HTTPFlow) -> None:
        self.num += 1
        flow.request.headers["My-Header"] = f"Request {self.num}"
        print("Request Headers: ")
        for name, value in flow.request.headers.items():
            print(f"{name}: {value}")

    def response(self, flow: http.HTTPFlow) -> None:
        self.num += 1
        flow.response.headers["My-Header"] = f"Response {self.num}"
        print("Response Headers: ")
        for name, value in flow.response.headers.items():
            print(f"{name}: {value}")

addons = [
    AddHeader()
]

if __name__ == "__main__":
    mitmdump(['-p', '8081', '-s', __file__])
```
I am using the code above to capture traffic. How can I save the captured content to a local file? Please complete the code for me; what I need to save is the request, not the response.
You can write the captured request data to a local file directly inside the request hook; no extra library is needed. Add a `save_request` helper and call it from the `request` method of the AddHeader class:
```python
from mitmproxy import http
from mitmproxy.tools.main import mitmdump

def save_request(flow: http.HTTPFlow, num: int) -> None:
    """Write the request line, headers and (for POST) body to a text file."""
    with open(f"request_{num}.txt", "w") as f:
        f.write(f"{flow.request.method} {flow.request.url}\n")
        for name, value in flow.request.headers.items():
            f.write(f"{name}: {value}\n")
        f.write("\n")
        if flow.request.method == "POST":
            f.write(flow.request.content.decode("utf-8"))

class AddHeader:
    def __init__(self):
        self.num = 0

    def request(self, flow: http.HTTPFlow) -> None:
        self.num += 1
        flow.request.headers["My-Header"] = f"Request {self.num}"
        print("Request Headers: ")
        for name, value in flow.request.headers.items():
            print(f"{name}: {value}")
        save_request(flow, self.num)  # persist this request to disk

    def response(self, flow: http.HTTPFlow) -> None:
        self.num += 1
        flow.response.headers["My-Header"] = f"Response {self.num}"
        print("Response Headers: ")
        for name, value in flow.response.headers.items():
            print(f"{name}: {value}")

addons = [
    AddHeader()
]

if __name__ == "__main__":
    mitmdump(['-p', '8081', '-s', __file__])
```
In the code above, the `save_request` function writes each captured request as text to a file named `request_<num>.txt`, where `<num>` is the sequence number of the request. If the request uses the POST method, the request body is written to the file as well. Note that `self.num` is not visible inside a module-level function, so the sequence number is passed in explicitly as a parameter.
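If one file per request produces too many files, a variant that appends every request to a single log file is sketched below; `requests.log` is an assumed file name and the separator line is arbitrary:
```python
from mitmproxy import http

class RequestLogger:
    """Append every captured request to one log file (sketch; requests.log is an assumed name)."""

    def request(self, flow: http.HTTPFlow) -> None:
        with open("requests.log", "a", encoding="utf-8") as f:
            f.write(f"{flow.request.method} {flow.request.url}\n")
            for name, value in flow.request.headers.items():
                f.write(f"{name}: {value}\n")
            # get_text() decodes the body using the declared charset, if any
            body = flow.request.get_text()
            if body:
                f.write("\n" + body + "\n")
            f.write("-" * 60 + "\n")  # arbitrary separator between requests

addons = [RequestLogger()]
```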