Python crawler: scraping Taobao products and storing them in MySQL
### Scraping Taobao Product Data with a Python Crawler and Storing It in MySQL
#### Create the Scrapy Project
The Scrapy framework is recommended for building the crawler. First, make sure Scrapy is installed:
```bash
pip install scrapy
```
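The pipeline shown later also relies on `pymysql` to talk to MySQL, so install it as well:
```bash
pip install pymysql
```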
Next, run the following commands to create a new Scrapy project:
```bash
scrapy startproject taobao_spider
cd taobao_spider
scrapy genspider taobao_items www.taobao.com
```
The commands above initialize a new Scrapy project and generate a basic spider template named `taobao_items` under the `spiders/` directory[^2].
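For reference, the resulting project should look roughly like this (Scrapy's default layout, `__init__.py` files omitted):
```
taobao_spider/
├── scrapy.cfg
└── taobao_spider/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── taobao_items.py
```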
#### Define the Item
Edit the class in `items.py` to define the product fields to extract, such as the title, price, and other attributes.
```python
import scrapy

class TaobaoItem(scrapy.Item):
    title = scrapy.Field()         # product title
    price = scrapy.Field()         # unit price
    sales_volume = scrapy.Field()  # sales count
    shop_name = scrapy.Field()     # shop name
    location = scrapy.Field()      # shipping location
```
#### Configure Settings
Adjust the parameters in `settings.py`: enable the pipeline and set a download delay to avoid triggering anti-crawling mechanisms.
```python
ITEM_PIPELINES = {
    'taobao_spider.pipelines.TaobaoPipeline': 300,
}
DOWNLOAD_DELAY = 1      # wait 1 second between requests
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
ROBOTSTXT_OBEY = False  # do not honor robots.txt
```
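If the fixed delay still triggers rate limiting, Scrapy's built-in AutoThrottle extension can be enabled as well. A sketch; the numbers below are illustrative defaults, not tuned values:
```python
AUTOTHROTTLE_ENABLED = True      # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1     # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10      # upper bound on the delay
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter DOWNLOAD_DELAY by 0.5x-1.5x
```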
#### Develop the Pipeline
Add the processing logic to `pipelines.py`. It receives the items handed over by the spider and saves them into a pre-created MySQL table.
```python
import pymysql

class TaobaoPipeline:
    def __init__(self, connection):
        self.connection = connection

    @classmethod
    def from_crawler(cls, crawler):
        # Open a single connection when the pipeline is created;
        # utf8mb4 is needed to store Chinese product titles correctly.
        return cls(
            connection=pymysql.connect(host='localhost',
                                       user='root',
                                       password='',
                                       database='ecommerce_db',
                                       charset='utf8mb4')
        )

    def process_item(self, item, spider):
        try:
            with self.connection.cursor() as cursor:
                sql = """INSERT INTO products(title, price, sales_volume, shop_name, location)
                         VALUES (%s, %s, %s, %s, %s)"""
                val = (
                    item['title'],
                    float(item['price']),
                    # strip the "人付款" ("people paid") suffix from the sales text
                    int(item['sales_volume'].replace('人付款', '')),
                    item['shop_name'],
                    item['location']
                )
                cursor.execute(sql, val)
            self.connection.commit()
        except Exception as e:
            spider.logger.error(f"Error occurred while inserting data into DB: {e}")
        finally:
            return item

    def close_spider(self, spider):
        self.connection.close()
```
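The pipeline assumes the `products` table already exists. A minimal one-time setup sketch is shown below; the column types are assumptions inferred from the values the pipeline inserts:
```python
import pymysql

# One-time setup: create the database and table the pipeline writes to.
conn = pymysql.connect(host='localhost', user='root', password='', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS ecommerce_db CHARACTER SET utf8mb4")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS ecommerce_db.products (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            price DECIMAL(10, 2),
            sales_volume INT,
            shop_name VARCHAR(255),
            location VARCHAR(100)
        )
    """)
conn.commit()
conn.close()
```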
#### Develop the Spider
Finally, implement the page-parsing rules in the spider script (`taobao_items.py`): locate the target elements with XPath/CSS selectors and fill the extracted data into item instances.
```python
import scrapy
from ..items import TaobaoItem

class TaobaoSpider(scrapy.Spider):
    name = "taobao_items"
    # must match the start URL's domain, or requests get filtered as offsite
    allowed_domains = ["list.tmall.com"]
    start_urls = ['https://list.tmall.com/search_product.htm?q=手机']

    def parse(self, response):
        all_products = response.css(".product-iWrap")
        for product in all_products:
            item = TaobaoItem()  # create a fresh item for each product
            item["title"] = product.xpath('.//div[@class="productTitle"]/a/text()').get().strip()
            item["price"] = product.css(".productPrice em::text").re_first(r'\d+\.\d*')
            item["sales_volume"] = product.css(".item-sell-num::text").get().split(" ")[-1]
            item["shop_name"] = product.css(".storeName a span::text").get()
            item["location"] = "China"
            yield item
        # follow the pagination link, if present
        next_page_url = response.css('#content div.pagination-next-page a::attr(href)').get()
        if next_page_url is not None:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)
```
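Run the spider from the project root, referring to it by its `name` attribute:
```bash
scrapy crawl taobao_items
```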
This covers the complete workflow. Note that before an actual deployment, further refinements are still needed, such as more thorough exception handling and logging.