1. Use Scrapy together with Selenium to scrape the stock information of Shanghai A-shares from Sina Finance, save the scraped data to a CSV file, and print it. Fields to scrape: code, name, latest price, buy, sell, previous close, today's open, high, low, volume (lots), turnover (10,000 CNY), amplitude, turnover rate, P/B ratio, P/E ratio, total share capital, and circulating shares. Target URL: https://vip.stock.finance.sina.com.cn/mkt/#sh_a
Requirements:
(1) Create a Scrapy project named Stock and open it in PyCharm.
(2) In items.py, create a StockItem class that declares the Item fields.
(3) Create a spider file stock.py from the template.
(4) In stock.py, set the URL and implement the parse method to extract the data.
(5) In middlewares.py, add a process_request method that uses Selenium to fetch the site.
(6) In settings.py, register the pipeline, the downloader middleware, and the default request headers.
(7) Run the command "scrapy crawl stock -o stock.csv" to export the output to a file.
Happy to help with this! Here are the steps to implement it:
Step 1: Create the Scrapy project
Run the following command to create a Scrapy project named Stock:
```
scrapy startproject Stock
```
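The command generates the standard Scrapy skeleton (shown here for orientation; Stock/Stock/ is the package that all the paths below refer to):
```
Stock/
├── scrapy.cfg
└── Stock/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```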
Step 2: Create the StockItem class
In Stock/Stock/items.py, add the following code:
```python
import scrapy


class StockItem(scrapy.Item):
    # one Field per column of the quotes table
    code = scrapy.Field()                # 代码
    name = scrapy.Field()                # 名称
    latest_price = scrapy.Field()        # 最新价
    buy = scrapy.Field()                 # 买入
    sell = scrapy.Field()                # 卖出
    pre_close = scrapy.Field()           # 昨收
    today_open = scrapy.Field()          # 今开
    highest = scrapy.Field()             # 最高
    lowest = scrapy.Field()              # 最低
    volume = scrapy.Field()              # 成交量/手
    turnover = scrapy.Field()            # 成交额/万
    amplitude = scrapy.Field()           # 振幅
    turnover_rate = scrapy.Field()       # 换手率
    pb_ratio = scrapy.Field()            # 市净率
    pe_ratio = scrapy.Field()            # 市盈率
    total_shares = scrapy.Field()        # 总股本
    circulation_shares = scrapy.Field()  # 流通股
```
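As a quick sanity check (purely illustrative), an Item behaves like a dict restricted to the declared fields, so assigning a field name not listed above raises a KeyError, which catches typos early:
```python
from Stock.items import StockItem

item = StockItem(code='600000', name='浦发银行')
print(dict(item))  # {'code': '600000', 'name': '浦发银行'}
```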
Step 3: Create the spider stock.py
Create a file named stock.py in the Stock/Stock/spiders/ directory (requirement (3) asks for the template route, i.e. running `scrapy genspider stock vip.stock.finance.sina.com.cn` from the project root) and fill it with the following code:
```python
import scrapy
from Stock.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['vip.stock.finance.sina.com.cn']
    start_urls = ['https://vip.stock.finance.sina.com.cn/mkt/#sh_a']

    def parse(self, response):
        # each <tr> of the quotes table is one stock; the response here is
        # the Selenium-rendered page produced by the middleware in step 4
        for row in response.xpath('//table[@id="dataTable"]/tbody/tr'):
            item = StockItem()
            item['code'] = row.xpath('td[1]/a/text()').get()
            item['name'] = row.xpath('td[2]/a/text()').get()
            item['latest_price'] = row.xpath('td[3]/text()').get()
            item['buy'] = row.xpath('td[4]/text()').get()
            item['sell'] = row.xpath('td[5]/text()').get()
            item['pre_close'] = row.xpath('td[6]/text()').get()
            item['today_open'] = row.xpath('td[7]/text()').get()
            item['highest'] = row.xpath('td[8]/text()').get()
            item['lowest'] = row.xpath('td[9]/text()').get()
            item['volume'] = row.xpath('td[10]/text()').get()
            item['turnover'] = row.xpath('td[11]/text()').get()
            item['amplitude'] = row.xpath('td[12]/text()').get()
            item['turnover_rate'] = row.xpath('td[13]/text()').get()
            item['pb_ratio'] = row.xpath('td[14]/text()').get()
            item['pe_ratio'] = row.xpath('td[15]/text()').get()
            item['total_shares'] = row.xpath('td[16]/text()').get()
            item['circulation_shares'] = row.xpath('td[17]/text()').get()
            yield item
```
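If the XPaths return nothing, the usual cause is that the table id or column order on the live page differs from the "dataTable" layout assumed above. A quick way to check offline (a sketch, assuming you have saved the Selenium-rendered page as rendered.html) is to run the same XPath through parsel, the selector library Scrapy itself uses:
```python
from parsel import Selector

with open('rendered.html', encoding='utf-8') as f:
    sel = Selector(text=f.read())

rows = sel.xpath('//table[@id="dataTable"]/tbody/tr')
print(len(rows), 'rows matched')
for row in rows[:3]:
    # spot-check the first and third columns of the first few rows
    print(row.xpath('td[1]/a/text()').get(), row.xpath('td[3]/text()').get())
```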
Step 4: Add the Selenium middleware
In Stock/Stock/middlewares.py (note: the file Scrapy generates is middlewares.py, not middleware.py), add the following code:
```python
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class StockDownloaderMiddleware(object):
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        # Selenium 4 renamed the keyword argument chrome_options to options
        self.browser = webdriver.Chrome(options=chrome_options)

    def process_request(self, request, spider):
        # render the page in the headless browser and hand the resulting
        # HTML back to Scrapy, bypassing its normal downloader
        self.browser.get(request.url)
        body = self.browser.page_source
        return HtmlResponse(self.browser.current_url, body=body,
                            encoding='utf-8', request=request)

    def __del__(self):
        # quit() shuts down the entire browser process, not just one window
        self.browser.quit()
```
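One caveat: page_source is grabbed as soon as get() returns, but the quotes table is filled in by JavaScript afterwards, so it may still be empty at that point. A more robust process_request (a sketch; the 10-second timeout and the dataTable id are assumptions carried over from the spider) waits for the rows to appear first:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def process_request(self, request, spider):
    self.browser.get(request.url)
    # block until at least one table row exists (or raise after 10 s)
    WebDriverWait(self.browser, 10).until(
        EC.presence_of_element_located(
            (By.XPATH, '//table[@id="dataTable"]/tbody/tr')))
    return HtmlResponse(self.browser.current_url,
                        body=self.browser.page_source,
                        encoding='utf-8', request=request)
```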
Step 5: Register the pipeline and the downloader middleware
In Stock/Stock/settings.py, add the following:
```python
ITEM_PIPELINES = {
    'Stock.pipelines.StockPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'Stock.middlewares.StockDownloaderMiddleware': 543,
}
```
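The StockPipeline referenced here is the pass-through class that scrapy startproject already placed in Stock/Stock/pipelines.py, so the crawl runs as-is. A minimal sketch that also strips stray whitespace from the scraped strings could look like this:
```python
class StockPipeline:
    def process_item(self, item, spider):
        # table cells often carry leading/trailing whitespace
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```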
Step 6: Add the default request headers
Still in Stock/Stock/settings.py, add:
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
```
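Requirement (6) also asks for default request headers, and USER_AGENT covers only one of them. A typical DEFAULT_REQUEST_HEADERS block (the values mirror the ones Scrapy's own settings template suggests, with the language swapped to Chinese; adjust as needed) is:
```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
```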
Step 7: Run the spider and export the results
Run the following command to start the crawl and write the items to stock.csv:
```
scrapy crawl stock -o stock.csv
```
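By default the CSV columns follow whatever order the exporter happens to see the fields in; to pin them to the order given in the assignment, you can optionally add FEED_EXPORT_FIELDS to settings.py (the names match the StockItem fields above):
```python
FEED_EXPORT_FIELDS = [
    'code', 'name', 'latest_price', 'buy', 'sell', 'pre_close',
    'today_open', 'highest', 'lowest', 'volume', 'turnover',
    'amplitude', 'turnover_rate', 'pb_ratio', 'pe_ratio',
    'total_shares', 'circulation_shares',
]
```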
With that, the Shanghai A-share quotes from Sina Finance are scraped and the results saved to a CSV file.