Use Scrapy integrated with Selenium to crawl stock information from the Shanghai A-share list on Sina Finance, save the scraped data to a CSV file, and output it. Fields to collect: code, name, latest price, buy, sell, previous close, today's open, high, low, volume (lots), turnover (10k CNY), amplitude, turnover rate, P/B ratio, P/E ratio, total share capital, and circulating shares. Target URL: https://vip.stock.finance.sina.com.cn/mkt/#sh_a Requirements: (1) Create a Scrapy project named Stock and open it in PyCharm. (2) In items.py, create a StockItem class that declares the Item fields. (3) Generate a spider file stock.py from the template. (4) In stock.py, set the URL and implement the parse method. (5) In middlewares.py, add a process_request method that fetches pages with Selenium. (6) In settings.py, register the pipeline, the downloader middleware, and the default request headers. (7) Run "scrapy crawl stock -o stock.csv" to export the results to a file.
Date: 2023-06-19 13:03:52
Integrating Scrapy with Selenium to crawl Shanghai A-share stock information from Sina Finance can be done in the following steps:
1. Create the Scrapy project
Open a terminal and run the following command to create the Scrapy project:
```
scrapy startproject Stock
```
2. Declare the Item fields in items.py
Open the Stock project, edit items.py, and add the following code:
```
import scrapy


class StockItem(scrapy.Item):
    code = scrapy.Field()                  # stock code
    name = scrapy.Field()                  # name
    latest_price = scrapy.Field()          # latest price
    buy = scrapy.Field()                   # buy
    sell = scrapy.Field()                  # sell
    yesterday_close = scrapy.Field()       # previous close
    today_open = scrapy.Field()            # today's open
    highest = scrapy.Field()               # high
    lowest = scrapy.Field()                # low
    volume = scrapy.Field()                # volume (lots)
    turnover = scrapy.Field()              # turnover (10k CNY)
    amplitude = scrapy.Field()             # amplitude
    turnover_rate = scrapy.Field()         # turnover rate
    pb_ratio = scrapy.Field()              # P/B ratio
    pe_ratio = scrapy.Field()              # P/E ratio
    total_capital = scrapy.Field()         # total share capital
    circulating_capital = scrapy.Field()   # circulating shares
```
3. Generate the spider file
From inside the Stock project, run the following command (genspider takes a domain, not a full URL):
```
scrapy genspider stock vip.stock.finance.sina.com.cn
```
In the generated stock.py, set start_urls to the target page and implement the parse methods as follows:
```
import scrapy

from Stock.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    # The detail pages live on finance.sina.com.cn, so both domains must be
    # allowed or the follow-up requests get filtered out as offsite
    allowed_domains = ['vip.stock.finance.sina.com.cn', 'finance.sina.com.cn']
    start_urls = ['https://vip.stock.finance.sina.com.cn/mkt/#sh_a']

    def parse(self, response):
        # Collect the code and name of every listed stock
        codes = response.xpath('//div[@id="quotesearch"]/ul[@class="stockUL"]/li/a/text()')
        for code in codes:
            item = StockItem()
            item['code'] = code.extract().split(' ')[0]
            item['name'] = code.extract().split(' ')[1]
            # Build the detail-page URL for this stock
            url = 'https://finance.sina.com.cn/realstock/company/{}/nc.shtml'.format(item['code'])
            # The Selenium middleware below renders every request,
            # so a plain scrapy.Request suffices here
            yield scrapy.Request(url=url, callback=self.parse_stock, meta={'item': item})

    def parse_stock(self, response):
        item = response.meta['item']
        # Read each field from the page's <dt>label</dt><dd>value</dd> pairs
        item['latest_price'] = response.xpath('//div[@class="stock-bets"]/div[@class="price"]/strong/text()').get()
        item['buy'] = response.xpath('//dt[text()="买入"]/following-sibling::dd[1]/text()').get()
        item['sell'] = response.xpath('//dt[text()="卖出"]/following-sibling::dd[1]/text()').get()
        item['yesterday_close'] = response.xpath('//dt[text()="昨收"]/following-sibling::dd[1]/text()').get()
        item['today_open'] = response.xpath('//dt[text()="今开"]/following-sibling::dd[1]/text()').get()
        item['highest'] = response.xpath('//dt[text()="最高"]/following-sibling::dd[1]/text()').get()
        item['lowest'] = response.xpath('//dt[text()="最低"]/following-sibling::dd[1]/text()').get()
        item['volume'] = response.xpath('//dt[text()="成交量"]/following-sibling::dd[1]/text()').get()
        item['turnover'] = response.xpath('//dt[text()="成交额"]/following-sibling::dd[1]/text()').get()
        item['amplitude'] = response.xpath('//dt[text()="振幅"]/following-sibling::dd[1]/text()').get()
        item['turnover_rate'] = response.xpath('//dt[text()="换手率"]/following-sibling::dd[1]/text()').get()
        item['pb_ratio'] = response.xpath('//dt[text()="市净率"]/following-sibling::dd[1]/text()').get()
        item['pe_ratio'] = response.xpath('//dt[text()="市盈率"]/following-sibling::dd[1]/text()').get()
        item['total_capital'] = response.xpath('//dt[text()="总股本"]/following-sibling::dd[1]/text()').get()
        item['circulating_capital'] = response.xpath('//dt[text()="流通股"]/following-sibling::dd[1]/text()').get()
        yield item
```
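The code/name split in parse above assumes each link's text has the form "600000 浦发银行" (code, a space, then the name) — an assumption about Sina's markup worth verifying against the actual page source. The slice logic itself can be sanity-checked in isolation:

```python
def split_code_name(link_text):
    """Split link text assumed to be '<code> <name>' into its two parts."""
    parts = link_text.strip().split(' ')
    return parts[0], parts[1]

# Hypothetical sample value, mirroring item['code'] / item['name'] in parse()
code, name = split_code_name('600000 浦发银行')
print(code, name)  # 600000 浦发银行
```

If the real link text uses a different separator (or none), this split raises an IndexError, so it pays to confirm the format first.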
4. Add the Selenium middleware
Open the Stock project, edit middlewares.py, and add the following code:
```
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time


class SeleniumMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        options = Options()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def spider_closed(self, spider):
        self.driver.quit()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(2)  # crude wait for the page's JavaScript to finish rendering
        # Returning a response here short-circuits Scrapy's own download
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            request=request, encoding='utf-8')
```
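The key to process_request above is that returning an HtmlResponse makes Scrapy skip its own downloader entirely and hand the Selenium-rendered HTML straight to the spider. A minimal sketch of that short-circuit rule, using hypothetical stand-in classes rather than real Scrapy or Selenium objects:

```python
class FakeResponse:
    """Stand-in for scrapy.http.HtmlResponse."""
    def __init__(self, url, body):
        self.url, self.body = url, body

class FakeSeleniumMiddleware:
    def process_request(self, request_url):
        # Pretend the headless browser rendered the page
        return FakeResponse(request_url, '<html>rendered by selenium</html>')

def download(request_url, middleware):
    # Mimics Scrapy's contract: if process_request returns a response,
    # the real download is skipped and that response is used instead
    response = middleware.process_request(request_url)
    if response is not None:
        return response
    return FakeResponse(request_url, '<html>fetched over plain HTTP</html>')

resp = download('https://example.com/stock', FakeSeleniumMiddleware())
print(resp.body)  # <html>rendered by selenium</html>
```

In the real middleware, replacing the fixed time.sleep(2) with Selenium's WebDriverWait on a known element would make the wait both faster and more reliable.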
5. Update settings.py
Open the Stock project, edit settings.py, and add the following code:
```
ITEM_PIPELINES = {
    'Stock.pipelines.StockPipeline': 300,
}

DOWNLOADER_MIDDLEWARES = {
    'Stock.middlewares.SeleniumMiddleware': 543,
}

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept-Language': 'en',
}
```
6. Add the pipeline
Open the Stock project, edit pipelines.py, and add the following code:
```
import csv


class StockPipeline(object):
    def __init__(self):
        # newline='' lets the csv module control line endings itself
        self.file = open('stock.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['代码', '名称', '最新价', '买入', '卖出', '昨收', '今开', '最高', '最低', '成交量/手', '成交额/万', '振幅', '换手率', '市净率', '市盈率', '总股本', '流通股'])

    def process_item(self, item, spider):
        self.writer.writerow([item['code'], item['name'], item['latest_price'], item['buy'], item['sell'], item['yesterday_close'], item['today_open'], item['highest'], item['lowest'], item['volume'], item['turnover'], item['amplitude'], item['turnover_rate'], item['pb_ratio'], item['pe_ratio'], item['total_capital'], item['circulating_capital']])
        return item

    def close_spider(self, spider):
        self.file.close()
```
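The pipeline opens the file with newline='' because the csv module supplies its own line endings; omitting it produces blank rows between records on Windows. The write path can be exercised in memory with io.StringIO (abbreviated header and made-up sample values):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['代码', '名称', '最新价'])       # abbreviated header
writer.writerow(['600000', '浦发银行', '7.50'])   # made-up sample row

lines = buf.getvalue().splitlines()
print(lines[1])  # 600000,浦发银行,7.50
```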
7. Run the spider
From inside the Stock project, run the following command:
```
scrapy crawl stock -o stock.csv
```
The results are saved to stock.csv in the current directory. Note that StockPipeline already writes stock.csv itself, while -o makes Scrapy's feed exporter open the same file; to keep the two writers from colliding, either run plain `scrapy crawl stock` or point -o at a different filename.
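To spot-check the export, the CSV can be read back with csv.DictReader, which maps each row onto the header names (shown here on an in-memory sample rather than the real stock.csv):

```python
import csv
import io

# Made-up sample mirroring the file's abbreviated shape
sample = '代码,名称,最新价\n600000,浦发银行,7.50\n'
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]['代码'], rows[0]['最新价'])  # 600000 7.50
```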