    import re

    import requests


    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    # The key part of the whole program:
    # it uses regular expressions rather than BeautifulSoup.
    def parsePage(ilt, html):
        try:
            plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
            tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
            for i in range(len(plt)):
                # eval strips the outermost single or double quotes from the string
                price = eval(plt[i].split(':')[1])
                title = eval(tlt[i].split(':')[1])
                ilt.append([price, title])
        except Exception:
            print("")


    def printGoodsList(ilt):
        tplt = "{:4}\t{:8}\t{:16}"
        # Column headers: number, price, product name
        print(tplt.format('序号', '价格', '商品名称'))
        count = 0
        for g in ilt:
            count = count + 1
            print(tplt.format(count, g[0], g[1]))


    def main():
        goods = '书包'  # search keyword ("schoolbag")
        depth = 2
        start_url = 'https://s.taobao.com/search?q=' + goods
        infoList = []
        for i in range(depth):
            try:
                url = start_url + '&s=' + str(44 * i)
                html = getHTMLText(url)
                parsePage(infoList, html)
            except Exception:
                # If parsing one page fails, move on to the next page
                # so the rest of the program still runs.
                continue
        printGoodsList(infoList)


    main()

Time: 2023-06-08 10:04:13 Views: 175

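This is a Python program that fetches a page at a given URL with the requests library and then processes the page content with the re library. The getHTMLText() function takes one argument, url, the address of the page to fetch. Inside the function, requests.get() is called with timeout=30, so an exception is raised if the server does not respond within 30 seconds, and raise_for_status() raises an exception when the server returns an HTTP error status. If the request succeeds, r.encoding is set to r.apparent_encoding (the encoding guessed from the page content) and r.text is returned. If the request fails, the function returns an empty string.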
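To make the regular-expression step concrete, here is a minimal sketch that extracts price and title pairs from a small hand-written string. The sample string and the names `SAMPLE` and `parse_items` are assumptions made for illustration; the content of a real search page may look different. Capture groups are used so the quoted values can be read directly, without calling `eval`.

```python
import re

# Hypothetical sample text that imitates the "raw_title"/"view_price"
# fields the program looks for; it is not real page content.
SAMPLE = (
    '{"raw_title":"Canvas backpack","view_price":"59.00"},'
    '{"raw_title":"Kids schoolbag","view_price":"128.50"}'
)

# The parentheses capture only the value between the quotes.
PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')
TITLE_RE = re.compile(r'"raw_title":"(.*?)"')


def parse_items(html):
    """Return a list of [price, title] pairs found in html."""
    prices = PRICE_RE.findall(html)
    titles = TITLE_RE.findall(html)
    # zip stops at the shorter list, so a missing title cannot raise IndexError.
    return [[price, title] for price, title in zip(prices, titles)]


if __name__ == "__main__":
    for price, title in parse_items(SAMPLE):
        print(price, title)
```

Testing the patterns on a fixed string like this is a quick way to see whether an empty result comes from the regular expressions or from the request itself.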

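Using a proxy IP in a requests crawler: 1. Create a new program that scrapes the Douban Top 250 with requests. 2. Proxy usage: use a free proxy, and use your own proxy. 3. The data must be stored in a database.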

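When a requests crawler needs to reach a site through a proxy IP, you can proceed as follows:

1. Create a new Python program:

```python
import requests
from bs4 import BeautifulSoup


def get_douban_movies_with_proxy(proxy_url=None):
    headers = {'User-Agent': 'Mozilla/5.0'}
    # The target URL is https, so the proxy is registered for both schemes.
    proxies = {'http': proxy_url, 'https': proxy_url} if proxy_url else None
    try:
        # Fetch the first page of the Douban Top 250 through the proxy
        response = requests.get("https://movie.douban.com/top250",
                                headers=headers, proxies=proxies, timeout=30)
        # Check the response status
        response.raise_for_status()
        # The page is HTML, not JSON. The selectors below are assumptions
        # about the page markup and may need adjusting.
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = []
        for item in soup.select('div.item'):
            title = item.select_one('span.title')
            rating = item.select_one('span.rating_num')
            if title and rating:
                movies.append({'title': title.get_text(strip=True),
                               'rate': float(rating.get_text(strip=True))})
        return movies
    except Exception as e:
        print(f"Failed to fetch data: {e}")
        return []


# Example: get a free proxy from a public proxy-list API
def get_free_proxy():
    free_proxy_url = "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http&timeout=10000&country=all"
    response = requests.get(free_proxy_url, timeout=30)
    if response.status_code == 200:
        # Assumption: the API answers with plain text, one "ip:port" per line.
        candidates = response.text.split()
        return f"http://{candidates[0]}" if candidates else None
    print("Could not get a free proxy")
    return None


# Use your own proxy (placeholder value, replace it with a real address)
custom_proxy = "http://your_custom_proxy_url:port"
movies = get_douban_movies_with_proxy(custom_proxy)
```

2. Data storage: to save the scraped data in a database, first install a suitable ORM library such as SQLAlchemy. Assuming SQLite is chosen as the database:

```python
from sqlalchemy import create_engine, Column, Integer, Float, String
from sqlalchemy.orm import declarative_base, sessionmaker  # SQLAlchemy 1.4+

Base = declarative_base()


class Movie(Base):
    __tablename__ = 'douban_movies'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    rating = Column(Float)  # ratings are decimal values


# Create the database engine and the table if it does not exist yet
engine = create_engine('sqlite:///douban_movies.db')
Base.metadata.create_all(engine)

# Create a session
Session = sessionmaker(bind=engine)
session = Session()
movies_to_store = [Movie(title=m['title'], rating=m['rate']) for m in movies]
session.add_all(movies_to_store)
session.commit()
session.close()
```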

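Scrape the titles and corresponding hyperlinks of the first few news items on the SISU home page, http://www.shisu.edu.cn/, and save them to a txt file with one "title hyperlink" pair per line. Suggestion: write it modularly, split into several functions: 1) Fetch the HTML page content: gethtmltxt(url). 2) Extract and filter the data: getContent(html), which stores the scraped data in a dictionary and returns the dictionary. 3) Save the data: saveData(dic), which writes the dictionary's data to a file.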

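Sure, here is an implementation that follows your requirements:

```python
import requests
from bs4 import BeautifulSoup


def gethtmltxt(url):
    """Fetch the page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def getContent(html):
    """Return a dict that maps each news title to its link."""
    newsDict = {}
    soup = BeautifulSoup(html, 'html.parser')
    # Assumes the news list sits in the first <div class="article">;
    # adjust the selector if the page markup differs.
    container = soup.find('div', class_='article')
    if container is None:
        return newsDict
    for news in container.find_all('li'):
        if news.a is None or not news.a.get('href'):
            continue
        title = news.a.get_text(strip=True)
        link = 'http://www.shisu.edu.cn' + news.a['href']
        newsDict[title] = link
    return newsDict


def saveData(dic):
    """Write one "title link" pair per line to news.txt."""
    with open('news.txt', 'w', encoding='utf-8') as f:
        for key, value in dic.items():
            f.write(key + ' ' + value + '\n')


def main():
    url = 'http://www.shisu.edu.cn/'
    html = gethtmltxt(url)
    newsDict = getContent(html)
    saveData(newsDict)


if __name__ == '__main__':
    main()
```

After the program runs, a file named `news.txt` is created in the current directory. It contains the titles and corresponding hyperlinks of the news items found on the SISU home page, one pair per line.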
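The link in `getContent` is built by prefixing the site root to each `href`, which only works when every `href` starts with a slash. As a hedged alternative, the sketch below uses `urljoin` from the standard library, which also copes with absolute links and paths without a leading slash. The helper names `absolute_link` and `format_line` and the example paths are assumptions for illustration, not taken from the real page.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.shisu.edu.cn/'


def absolute_link(href, base=BASE_URL):
    """Resolve href against the site root; absolute URLs pass through unchanged."""
    return urljoin(base, href)


def format_line(title, href):
    """Build one "title link" line in the format written to the txt file."""
    return f"{title.strip()} {absolute_link(href)}"


if __name__ == "__main__":
    # Hypothetical hrefs, one of each kind
    print(format_line("Campus news", "/news/1.htm"))
    print(format_line("Notice", "news/2.htm"))
    print(format_line("External", "https://example.com/a"))
```

A function like `format_line` could replace the string concatenation in `saveData` if the page turns out to mix link styles.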
