Change the following code so that it crawls announcements from the Shanghai Stock Exchange website:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.159 Safari/537.36"
    }


    def download_pdf(url, code, num, date):
        print(f'Start downloading data/{code}_{date}_{num}.pdf')
        resp = requests.get(url, headers=headers)
        with open(f'E:/深交所pdf/{code}_{date}_{num}.pdf', 'wb') as f:
            f.write(resp.content)
        resp.close()
        print(f'E:/深交所pdf/{code}_{date}_{num}.pdf downloaded!')


    if __name__ == '__main__':
        domain = 'http://www.sse.cn'
        with ThreadPoolExecutor(30) as t:
            with open('target.csv', 'r') as f:
                lines = f.readlines()
            for line in lines:
                param = list(line.split())
                form = {
                    'seDate': [param[3], param[3]],
                    'stock': [param[0]],
                    'channelCode': ['listedNotice_disc'],
                    'pageSize': '50',
                    'pageNum': '1'
                }
                # URL for fetching the file list
                get_file_list_url = 'http://www.sse.com.cn/disclosure/listedinfo/announcement/json/announce_type.json?v=0.9715488799747511'
                resp = requests.post(get_file_list_url, headers=headers, json=form)
                # resp.encoding = 'utf-8'
                # print(resp.json())
                js = resp.json()
                resp.close()
                tot = 0
                for data in js['data']:
                    tot += 1
                    download_url = domain + f'/api/disc/info/download?id={data["id"]}'
                    t.submit(download_pdf, url=download_url, code=param[0], num=tot, date=param[3])
        print("All downloads finished!")
        # doc_id = ''
        # download_url = domain + f'/api/disc/info/download?id={"c998875f-9097-403e-a682-cd0147ce10ae"}'
        # resp = requests.get(download_url, headers=headers)
        # with open(f'{"c998875f-9097-403e-a682-cd0147ce10ae"}.pdf', 'wb') as f:
        #     f.write(resp.content)
        # resp.close()

Correct the following code:

    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor

    url_template = 'https://book.douban.com/tag/编程?start={}&type=T'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}


    def get_book_list(start):
        url = url_template.format(start)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        book_list = soup.find_all('li', class_='subject-item')
        return book_list


    def get_book_info(book):
        title = book.find('div', class_='info').a.get_text().strip()
        rating = book.find('span', class_='rating_nums').get_text().strip()
        return title, rating


    if __name__ == '__main__':
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = []
            for start in range(0, 100, 20):
                futures.append(executor.submit(get_book_list, start))
            books = []
            for future in futures:
                books.extend(future.result())
            futures = []
            for book in books:
                futures.append(executor.submit(get_book_info, book))
            for future in futures:
                title, rating = future.result()
                print(title, rating)

    from concurrent.futures import ThreadPoolExecutor

    url_template = 'https://book.douban.com/tag/编程?start={}&type=T'
    headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/...

    import requests
    from html.parser import HTMLParser
    import argparse
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
    import multiprocessing

    prefix = "save/"
    readed_path = multiprocessing.Manager().Queue()
    cur_path = multiprocessing.Manager().Queue()
    new_path = multiprocessing.Manager().Queue()
    lock = multiprocessing.Lock()


    class MyHttpParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tag = []
            self.href = ""
            self.txt = ""

        def handle_starttag(self, tag, attrs):
            self.tag.append(tag)
            if tag == "a":
                for att in attrs:
                    if att[0] == 'href':
                        self.href = att[1]

        def handle_endtag(self, tag):
            if tag == "a" and len(self.tag) > 2 and self.tag[-2] == "div":
                print("in div, link txt is %s ." % self.txt)
                print("in div, link url is %s ." % self.href)
                if not self.href in readed_path.queue:
                    readed_path.put(self.href)
                    new_path.put(self.href)
            self.tag.pop(-1)

        def handle_data(self, data):
            if len(self.tag) >= 1 and self.tag[-1] == "a":
                self.txt = data


    def LoadHtml(path, file_path):
        if len(file_path) == 0:
            file_path = "/"
        url = f"http://{path}{file_path}"
        try:
            response = requests.get(url)
            print(response.status_code, response.reason, response.raw.version)
            data = response.content.decode("utf-8")
            if response.status_code == 301:
                data = response.headers["Location"]
                if not data in readed_path.queue:
                    new_path.put(data)
                data = ""
            return data
        except Exception as e:
            print(e.args)


    def ParseArgs():
        parser = argparse.ArgumentParser()
        parser.add_argument("-p", "--path", help="domain name")
        parser.add_argument("-d", "--deep", type=int, help="recursion depth")
        args = parser.parse_args()
        return args


    def formatPath(path):
        path = path.removeprefix("https://")
        path = path.removeprefix("http://")
        path = path.removeprefix("//")
        return path


    def doWork(path):
        path = formatPath(path)
        m = path.find("/")
        if m == -1:
            m = len(path)
        data = LoadHtml(path[:m], path[m:])
        with open(prefix + path[:m] + ".html", "w+", encoding="utf-8") as f:
            f.write(data)
        parse.feed(data)


    def work(maxdeep):
        args = ParseArgs()
        cur_path.put(formatPath(args.path))
        readed_path.put(formatPath(args.path))
        parse = MyHttpParser()
        with ProcessPoolExecutor(max_workers=4) as executor:
            for i in range(args.deep):
                size = cur_path.qsize()
                futures = [executor.submit(doWork, cur_path.get()) for _ in range(size)]
                for future in as_completed(futures):
                    try:
                        future.result()
                    except Exception as e:
                        print(e)
                cur_path.queue.clear()
                while not new_path.empty():
                    cur_path.put(new_path.get())
                print(i)


    if __name__ == '__main__':
        work(5)

This code reports: Unresolved reference 'parse'

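This is because in the doWork function you use... You need to either pass the parse variable to doWork as a parameter or define it as a global variable so that doWork can use it. If you choose a global variable, make sure it is defined before it is used.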
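The parameter-passing option can be sketched as follows. This is a minimal illustration, not the asker's full crawler: `LinkParser`, `do_work`, and `work` are hypothetical names, and the pages are plain HTML strings so the example runs without network access. A fresh parser is created for each page, which also avoids sharing parser state between tasks.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical stand-in for MyHttpParser)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def do_work(html, parser):
    # The parser arrives as an argument, so the name resolves inside this
    # function instead of depending on a local variable of work().
    parser.feed(html)
    return parser.links


def work(pages):
    found = []
    for html in pages:
        # One parser per page keeps state from leaking between documents.
        found.extend(do_work(html, LinkParser()))
    return found


if __name__ == "__main__":
    print(work(['<div><a href="/a">A</a></div>', '<p><a href="/b">B</a></p>']))
```

With a process pool, creating the parser inside the worker function is likely the simpler choice, since it avoids pickling the parser object for each submitted task.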

Optimize this code:

    import requests
    from bs4 import BeautifulSoup
    import csv

    # Request URL
    url = "https://pvp.qq.com/web201605/herodetail/527.shtml"
    # Request headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.57"}

    # Send the request
    response = requests.get(url, headers=headers)
    # Parse the HTML
    soup = BeautifulSoup(response.content, "html.parser")

    # Collect the links to all heroes
    hero_links = []
    for hero in soup.select(".herolist > li > a"):
        hero_links.append(hero["href"])

    # Crawl the attributes of each hero
    heroes = []
    for link in hero_links:
        response = requests.get(link, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        # Read the hero attributes
        name = soup.select(".cover-name")[0].text
        survive = soup.select(".")[0].text
        attack = soup.select(".cover-list-bar data-bar2 fl")[0].text
        skill = soup.select(".skill")[0].text
        difficulty = soup.select(".difficulty")[0].text
        # Store the hero attributes
        heroes.append({"name": name, "survive": survive, "attack": attack, "skill": skill, "difficulty": difficulty})

    # Write the data to a CSV file
    with open("heroes.csv", "w", newline="", encoding="utf-8-sig") as csvfile:
        fieldnames = ["name", "survive", "attack", "skill", "difficulty"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        # Write the header row
        writer.writeheader()
        # Write the data rows
        for hero in heroes:
            writer.writerow(hero)

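You can try multithreading or asynchronous requests to improve crawling efficiency, for example by using ThreadPoolExecutor from the concurrent.futures library for multithreaded crawling. You can also optimize the code to cut unnecessary HTTP requests, for example by deduplicating the links in the hero_links list...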
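The two suggestions above (deduplicate links, then fetch them in a thread pool) might look like the sketch below. The helper names `dedupe`, `crawl`, and `fetch_html` are assumptions introduced for illustration. The fetch function is passed in as a parameter, so the pooling logic can be tried with any callable; `fetch_html` uses the standard library's `urllib.request` here only to keep the sketch free of third-party imports.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def dedupe(links):
    """Drop repeated links while keeping first-seen order."""
    return list(dict.fromkeys(links))


def fetch_html(url, timeout=10):
    """Download one page as text (hypothetical helper; swap in requests if preferred)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(links, fetch, max_workers=8):
    """Call fetch(link) for each unique link in a thread pool.

    Returns a dict mapping each link to its result, in input order.
    """
    unique = dedupe(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return dict(zip(unique, pool.map(fetch, unique)))
```

For the hero scraper, this would be called as `crawl(hero_links, fetch_html)`, with each returned page then parsed by BeautifulSoup in the main thread. Threads suit this workload because the time is spent waiting on network I/O rather than on computation.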

    import http.client
    from html.parser import HTMLParser
    import argparse
    from concurrent.futures import ThreadPoolExecutor
    import multiprocessing.pool

    prefix = "save/"
    readed_path = multiprocessing.Manager().list()
    cur_path = multiprocessing.Manager().list()
    new_path = multiprocessing.Manager().list()
    lock = multiprocessing.Lock()


    class MyHttpParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.tag = []
            self.href = ""
            self.txt = ""

        def handle_starttag(self, tag, attrs):
            self.tag.append(tag)
            # print("start tag in list :" + str(self.tag))
            if tag == "a":
                for att in attrs:
                    if att[0] == 'href':
                        self.href = att[1]

        def handle_endtag(self, tag):
            if tag == "a" and len(self.tag) > 2 and self.tag[-2] == "div":
                print("in div, link txt is %s ." % self.txt)
                print("in div, link url is %s ." % self.href)
                lock.acquire()
                if not self.href in readed_path:
                    readed_path.append(self.href)
                    new_path.append(self.href)
                # print("end tag in list :" + str(self.tag))
                lock.release()
            self.tag.pop(-1)

        def handle_data(self, data):
            if len(self.tag) >= 1 and self.tag[-1] == "a":
                self.txt = data


    def LoadHtml(path, file_path):
        if len(file_path) == 0:
            file_path = "/"
        conn = http.client.HTTPConnection(path)
        try:
            conn.request("GET", file_path)
            response = conn.getresponse()
            print(response.status, response.reason, response.version)
            data = response.read().decode("utf-8")
            if response.status == 301:
                data = response.getheader("Location")
                lock.acquire()
                new_path.append(data)
                lock.release()
                data = ""
            # print(data)
            conn.close()
            return data
        except Exception as e:
            print(e.args)


    def ParseArgs():
        # Initialize the parser
        parser = argparse.ArgumentParser()
        # Define the arguments
        parser.add_argument("-p", "--path", help="domain name")
        parser.add_argument("-d", "--deep", type=int, help="recursion depth")
        # Parse
        args = parser.parse_args()
        return args


    def formatPath(path):
        path = path.removeprefix("https://")
        path = path.removeprefix("http://")
        path = path.removeprefix("//")
        return path


    def doWork(path):
        path = formatPath(path)
        m = path.find("/")
        if m == -1:
            m = len(path)
        data = LoadHtml(path[:m], path[m:])
        with open(prefix + path[:m] + ".html", "w+", encoding="utf-8") as f:
            f.write(data)
        parse.feed(data)


    def work(deep, maxdeep):
        if deep > maxdeep:
            return
        args = ParseArgs()
        cur_path.append(formatPath(args.path))
        readed_path.append(formatPath(args.path))
        parse = MyHttpParser()
        e = multiprocessing.Pool(4)
        for i in range(args.deep):
            size = len(cur_path)
            e.map(doWork, cur_path)
            cur_path[:] = []
            for p in new_path:
                cur_path.append(p)
            new_path[:] = []
            print(i)

Optimize this code so that it runs on Windows.

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
    import multiprocessing

    prefix = "save/"
    readed_path = multiprocessing.Manager().Queue()
    cur_path = multiprocessing...
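One common reason multiprocessing code fails on Windows is that worker processes are started by spawning a new interpreter that re-imports the main module, so any pool or Manager created at import time is created again in every worker. A minimal sketch of the usual remedy follows, assuming workers return their results instead of appending to shared Manager lists. `split_host_path` and `crawl_level` are hypothetical names, and no network requests are made.

```python
from concurrent.futures import ProcessPoolExecutor


def split_host_path(url):
    """Split 'http://host/path' into ('host', '/path'); the path defaults to '/'."""
    for prefix in ("https://", "http://", "//"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    host, sep, rest = url.partition("/")
    return host, (sep + rest) if sep else "/"


def crawl_level(urls, workers=4):
    """Process one level of URLs in worker processes and collect the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(split_host_path, urls))


if __name__ == "__main__":
    # On Windows each worker re-imports this module, so pools (and Managers)
    # should only be created under this guard, never at import time.
    print(crawl_level(["http://example.com/a", "https://example.org"]))
```

Worker functions need to be defined at module level so they can be pickled and looked up by name in the child process. Returning values from the workers and merging them in the parent also removes the need for the shared lists and the lock.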
