Scraping JD.com with a Python Crawler
Posted: 2024-01-21 16:14:21
The following are the steps to scrape JD.com with a Python crawler:
1. Import the required libraries and modules:
```python
import scrapy
from scrapy import Request
```
2. Create a Spider class that inherits from scrapy.Spider:
```python
class JDSpider(scrapy.Spider):
    name = 'jd_spider'
    allowed_domains = ['jd.com']
    start_urls = ['https://www.jd.com/']

    def parse(self, response):
        # Write the page-parsing code here
        pass
```
3. In the parse method, write the parsing code and extract the data you need:
```python
def parse(self, response):
    # Extract the product links from the listing page
    product_links = response.css('.gl-item .p-img a::attr(href)').getall()
    for link in product_links:
        # urljoin handles the protocol-relative links (//item.jd.com/...) JD uses
        yield Request(response.urljoin(link), callback=self.parse_product)

def parse_product(self, response):
    # Extract the product fields; a selector may match nothing, so use a
    # default instead of calling .strip() on None
    title = response.css('.sku-name::text').get(default='').strip()
    price = response.css('.p-price .price::text').get(default='').strip()
    image_url = response.css('#spec-img::attr(src)').get()
    # Yield an item so downstream pipelines (step 5) receive the data
    yield {'title': title, 'price': price, 'image_url': image_url}
```
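Scraped price text often carries a currency sign or stray whitespace. A small helper can normalize it to a number before storage; `parse_price` is a hypothetical name of my own, not part of the original spider:

```python
import re

def parse_price(text):
    """Normalize a scraped price string (e.g. '¥59.90' or ' 59.90 ') to a float.

    Hypothetical helper, not from the original tutorial; returns None when
    no numeric value can be found in the input.
    """
    match = re.search(r'\d+(?:\.\d+)?', text or '')
    return float(match.group()) if match else None
```

Calling it in `parse_product` before yielding the item keeps the database column numeric instead of free-form text.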
4. Configure the database connection in settings.py:
```python
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_DATABASE = 'jd_data'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'password'
```
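Scrapy only runs pipelines that are registered in settings.py as well. Assuming the project package is named `jd` (an assumption; substitute your actual project name), the registration looks like:

```python
# Register the pipeline from step 5; 'jd' is an assumed project package name
ITEM_PIPELINES = {
    'jd.pipelines.JDPipeline': 300,  # 0-1000; lower numbers run earlier
}
```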
5. In pipelines.py, write the code that saves the data to the database:
```python
import pymysql

class JDPipeline(object):
    def __init__(self, host, port, database, user, password):
        self.host = host
        self.port = port
        self.database = database
        self.user = user
        self.password = password

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            port=crawler.settings.get('MYSQL_PORT'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            database=self.database,
            user=self.user,
            password=self.password,
            charset='utf8'
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Save the item to the database with a parameterized query
        sql = "INSERT INTO jd_products (title, price) VALUES (%s, %s)"
        self.cursor.execute(sql, (item['title'], item['price']))
        self.conn.commit()
        return item
```
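The INSERT above assumes a `jd_products` table with `title` and `price` columns already exists. The `process_item` logic itself can be sanity-checked without a live MySQL server by isolating it and injecting stub objects; this is a testing sketch of my own, not part of Scrapy or the original tutorial:

```python
# Minimal stubs standing in for a pymysql cursor/connection, so the
# process_item logic can be exercised without a database (illustrative only).
class StubCursor:
    def __init__(self):
        self.executed = []
    def execute(self, sql, params):
        self.executed.append((sql, params))

class StubConn:
    def __init__(self):
        self.commits = 0
    def commit(self):
        self.commits += 1

class PipelineCore:
    """Just the process_item logic from JDPipeline, isolated for testing."""
    def __init__(self, cursor, conn):
        self.cursor = cursor
        self.conn = conn
    def process_item(self, item, spider):
        sql = "INSERT INTO jd_products (title, price) VALUES (%s, %s)"
        self.cursor.execute(sql, (item['title'], item['price']))
        self.conn.commit()
        return item

cursor, conn = StubCursor(), StubConn()
pipeline = PipelineCore(cursor, conn)
result = pipeline.process_item({'title': 'Example', 'price': '59.90'}, spider=None)
```

Passing item values as the second argument to `execute` (rather than formatting them into the SQL string) lets the driver escape them, which matters when titles contain quotes.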
6. Run the spider:
```shell
scrapy crawl jd_spider
```