1. Use Scrapy together with Selenium to scrape stock information (Shanghai A shares on Sina Finance) and save the scraped data to a csv file. The data to scrape: code, name, latest price, buy, sell, previous close, open, high, low, volume (lots), turnover (10k), amplitude, turnover rate, price-to-book ratio, price-to-earnings ratio, total share capital, circulating shares. Target URL: https://vip.stock.finance.sina.com.cn/mkt/#sh_a Requirements:
(1) Create a Scrapy project named Stock and open it in PyCharm.
(2) In items.py, create a StockItem class and add the Item fields.
(3) Generate a spider file stock.py from the template.
(4) In stock.py, modify the url and the parse method to scrape the data.
(5) In middlewares.py, add a process_request method that uses Selenium to fetch the site.
(6) In settings.py, add the pipeline, the downloader middleware, and the default request headers.
(7) Run "scrapy crawl stock -o stock.csv" to write the output to a file.
This is a very concrete question, and I will walk you through it step by step.
First, install the Scrapy and Selenium modules, plus the matching browser driver, such as the Chrome driver (chromedriver).
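For example, both modules can be installed with pip (this assumes Chrome is installed and a matching chromedriver is available on your PATH):
```
pip install scrapy selenium
```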
Then create a Scrapy project named Stock and open it in PyCharm. Use the following command:
```
scrapy startproject Stock
```
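This generates the standard Scrapy project layout; the files edited in the steps below all live inside the inner Stock/ package:
```
Stock/
    scrapy.cfg
    Stock/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```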
Next, in items.py, create a StockItem class with the Item fields. According to the requirements, the class should include the following fields:
```python
import scrapy

class StockItem(scrapy.Item):
    # One field per column of the Sina Finance table.
    code = scrapy.Field()                # 代码 (code)
    name = scrapy.Field()                # 名称 (name)
    last_price = scrapy.Field()          # 最新价 (latest price)
    buy = scrapy.Field()                 # 买入 (buy)
    sell = scrapy.Field()                # 卖出 (sell)
    yesterday_close = scrapy.Field()     # 昨收 (previous close)
    today_open = scrapy.Field()          # 今开 (open)
    highest = scrapy.Field()             # 最高 (high)
    lowest = scrapy.Field()              # 最低 (low)
    volume = scrapy.Field()              # 成交量/手 (volume, lots)
    turnover = scrapy.Field()            # 成交额/万 (turnover, 10k)
    amplitude = scrapy.Field()           # 振幅 (amplitude)
    turnover_rate = scrapy.Field()       # 换手率 (turnover rate)
    pb_ratio = scrapy.Field()            # 市净率 (P/B ratio)
    pe_ratio = scrapy.Field()            # 市盈率 (P/E ratio)
    total_equity = scrapy.Field()        # 总股本 (total share capital)
    circulating_equity = scrapy.Field()  # 流通股 (circulating shares)
```
Next, generate a spider file stock.py from the template with the following command:
```
scrapy genspider stock vip.stock.finance.sina.com.cn
```
In stock.py, modify start_urls and the parse method to scrape the data. The code is as follows:
```python
import scrapy
from Stock.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['vip.stock.finance.sina.com.cn']
    start_urls = ['https://vip.stock.finance.sina.com.cn/mkt/#sh_a']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # A headless Chrome instance shared with the downloader middleware.
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        options = Options()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def start_requests(self):
        for url in self.start_urls:
            # The 'driver' meta flag tells the middleware to render with Selenium.
            yield scrapy.Request(url=url, callback=self.parse,
                                 meta={'driver': True})

    def parse(self, response):
        # The middleware has already loaded the page, so parse the response directly.
        stocks = response.xpath('//table[@id="dataTable"]/tbody/tr')
        for stock in stocks:
            item = StockItem()
            item['code'] = stock.xpath('./td[1]/a/text()').get()
            item['name'] = stock.xpath('./td[2]/a/text()').get()
            item['last_price'] = stock.xpath('./td[3]/text()').get()
            item['buy'] = stock.xpath('./td[4]/text()').get()
            item['sell'] = stock.xpath('./td[5]/text()').get()
            item['yesterday_close'] = stock.xpath('./td[6]/text()').get()
            item['today_open'] = stock.xpath('./td[7]/text()').get()
            item['highest'] = stock.xpath('./td[8]/text()').get()
            item['lowest'] = stock.xpath('./td[9]/text()').get()
            item['volume'] = stock.xpath('./td[10]/text()').get()
            item['turnover'] = stock.xpath('./td[11]/text()').get()
            item['amplitude'] = stock.xpath('./td[12]/text()').get()
            item['turnover_rate'] = stock.xpath('./td[13]/text()').get()
            item['pb_ratio'] = stock.xpath('./td[14]/text()').get()
            item['pe_ratio'] = stock.xpath('./td[15]/text()').get()
            item['total_equity'] = stock.xpath('./td[16]/text()').get()
            item['circulating_equity'] = stock.xpath('./td[17]/text()').get()
            yield item

    def closed(self, reason):
        # Quit the browser when the spider finishes.
        self.driver.quit()
```
Next, add a process_request method to the StockDownloaderMiddleware class in middlewares.py so that pages are fetched with Selenium. The code is as follows:
```python
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


class StockDownloaderMiddleware(object):
    def __init__(self, timeout=None):
        self.timeout = timeout
        super().__init__()

    @classmethod
    def from_crawler(cls, crawler):
        # Fall back to 20 seconds if SELENIUM_TIMEOUT is not set in settings.py.
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT', 20))

    def process_request(self, request, spider):
        if request.meta.get('driver'):
            spider.driver.get(request.url)
            try:
                # Wait until the stock table rows have been rendered.
                WebDriverWait(spider.driver, self.timeout).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//table[@id="dataTable"]/tbody/tr')))
                return HtmlResponse(url=spider.driver.current_url,
                                    body=spider.driver.page_source,
                                    encoding='utf-8', request=request)
            except TimeoutException:
                return HtmlResponse(url=spider.driver.current_url,
                                    status=500, request=request)
        # Requests without the flag fall through to the default downloader.
        return None
```
Finally, in settings.py, add the pipeline, the downloader middleware, and the default request headers. The code is as follows:
```python
BOT_NAME = 'Stock'
SPIDER_MODULES = ['Stock.spiders']
NEWSPIDER_MODULE = 'Stock.spiders'
ROBOTSTXT_OBEY = False

# Enable the custom Selenium middleware from middlewares.py.
DOWNLOADER_MIDDLEWARES = {
    'Stock.middlewares.StockDownloaderMiddleware': 543,
}
# Timeout (in seconds) used by the middleware's WebDriverWait.
SELENIUM_TIMEOUT = 20

ITEM_PIPELINES = {
    'Stock.pipelines.StockPipeline': 300,
}

# Default export target; "scrapy crawl stock -o stock.csv" has the same effect.
FEED_FORMAT = 'csv'
FEED_URI = 'stock.csv'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```
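ITEM_PIPELINES above refers to the StockPipeline stub that Scrapy generated in pipelines.py. Since the CSV feed exporter handles serialization, a minimal pass-through version is enough; a sketch:
```python
class StockPipeline:
    def process_item(self, item, spider):
        # No transformation needed; the CSV feed exporter serializes the item.
        return item
```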
Now you can run the following command to crawl the data and save it to the csv file:
```
scrapy crawl stock -o stock.csv
```
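Note that Scrapy does not guarantee the CSV column order by default. If you want the columns to follow the order given in the task, you can optionally add FEED_EXPORT_FIELDS to settings.py (a sketch using the field names defined above):
```python
FEED_EXPORT_FIELDS = [
    'code', 'name', 'last_price', 'buy', 'sell', 'yesterday_close',
    'today_open', 'highest', 'lowest', 'volume', 'turnover',
    'amplitude', 'turnover_rate', 'pb_ratio', 'pe_ratio',
    'total_equity', 'circulating_equity',
]
```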
I hope this helps.