Scraping Beijing bus information with Scrapy: crawl https://beijing.8684.cn for each bus line's name (lineName), running time (time), fare (price), operating company (campony), and the up and down routes (upline and downline), and save the results to a MySQL database (database bus_information, table information).
Sure. First, install the Scrapy and mysql-connector-python libraries with the following commands:
```
pip install scrapy
pip install mysql-connector-python
```
Next, create a new Scrapy project:
```
scrapy startproject bus_spider
```
Then change into the project directory and generate a new Spider:
```
scrapy genspider bus beijing.8684.cn
```
Then, write the spider code in spiders/bus.py as follows:
```python
import scrapy


class BusSpider(scrapy.Spider):
    name = 'bus'
    allowed_domains = ['beijing.8684.cn']
    start_urls = ['https://beijing.8684.cn/']

    def parse(self, response):
        # Collect the link to each bus line listed on the start page
        bus_links = response.css('#con_site_1 a::attr(href)').extract()
        for link in bus_links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_bus)

    def parse_bus(self, response):
        # Extract the details of one bus line from its detail page
        lineName = response.css('.bus_i_t1 h1::text').extract_first().strip()
        time = response.css('.bus_i_content p::text')[0].extract().strip()
        price = response.css('.bus_i_content p::text')[1].extract().strip()
        campony = response.css('.bus_i_content p::text')[2].extract().strip()
        upline = response.css('.bus_line_site .bus_line_top .bus_line_txt::text').extract()
        downline = response.css('.bus_line_site .bus_line_bottom .bus_line_txt::text').extract()
        # Yield the item; the MySQL pipeline configured in settings.py saves it
        yield {
            'lineName': lineName,
            'time': time,
            'price': price,
            'campony': campony,
            'upline': ';'.join(upline),
            'downline': ';'.join(downline),
        }
```
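Note that the CSS selectors used above depend on the current markup of beijing.8684.cn and may need adjusting; they can be checked interactively with the Scrapy shell before running a full crawl, for example:
```
scrapy shell "https://beijing.8684.cn/"
>>> response.css('#con_site_1 a::attr(href)').extract()[:5]
```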
In the spider code above, we first use CSS selectors to collect the links to all bus lines and send a Request for each one. Then, in parse_bus, we use CSS selectors to extract the line name, running time, fare, operating company, and the up and down routes, and yield them as an item; the MySQL pipeline configured below writes each item to the database.
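Optionally, the scraped fields can be declared as a Scrapy Item in bus_spider/items.py instead of yielding a plain dict; a minimal sketch (the class name BusItem is only an illustrative choice):
```python
import scrapy


class BusItem(scrapy.Item):
    # One field per column of the information table
    lineName = scrapy.Field()
    time = scrapy.Field()
    price = scrapy.Field()
    campony = scrapy.Field()
    upline = scrapy.Field()
    downline = scrapy.Field()
```
The pipeline below works the same way with a dict or an Item, since both support item['field'] access.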
Next, set the MySQL connection details in settings.py:
```python
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'bus_information'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'password'
```
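Optionally, the crawl can also be throttled in the same settings.py; the values below are only example assumptions, not part of the original answer:
```python
# Optional: slow the crawl down to be gentler on the site (example values)
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```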
Then, write a Pipeline in bus_spider/pipelines.py that saves the data to the MySQL database:
```python
import mysql.connector


class BusSpiderPipeline:
    def __init__(self, mysql_host, mysql_dbname, mysql_user, mysql_password):
        self.mysql_host = mysql_host
        self.mysql_dbname = mysql_dbname
        self.mysql_user = mysql_user
        self.mysql_password = mysql_password

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py
        return cls(
            mysql_host=crawler.settings.get('MYSQL_HOST'),
            mysql_dbname=crawler.settings.get('MYSQL_DBNAME'),
            mysql_user=crawler.settings.get('MYSQL_USER'),
            mysql_password=crawler.settings.get('MYSQL_PASSWORD')
        )

    def open_spider(self, spider):
        # Open one connection for the whole crawl
        self.cnx = mysql.connector.connect(
            user=self.mysql_user,
            password=self.mysql_password,
            host=self.mysql_host,
            database=self.mysql_dbname
        )

    def close_spider(self, spider):
        self.cnx.close()

    def process_item(self, item, spider):
        # Insert one row per scraped bus line
        cursor = self.cnx.cursor()
        add_information = ("INSERT INTO information "
                           "(lineName, time, price, campony, upline, downline) "
                           "VALUES (%s, %s, %s, %s, %s, %s)")
        information_data = (item['lineName'], item['time'], item['price'],
                            item['campony'], item['upline'], item['downline'])
        cursor.execute(add_information, information_data)
        self.cnx.commit()
        cursor.close()
        return item
```
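If a single bad page should not abort the whole crawl, process_item can be hardened against insert failures; the following drop-in replacement for the method inside BusSpiderPipeline is only a sketch:
```python
    def process_item(self, item, spider):
        # Variant that logs and skips rows which fail to insert
        cursor = self.cnx.cursor()
        try:
            cursor.execute(
                "INSERT INTO information "
                "(lineName, time, price, campony, upline, downline) "
                "VALUES (%s, %s, %s, %s, %s, %s)",
                (item['lineName'], item['time'], item['price'],
                 item['campony'], item['upline'], item['downline'])
            )
            self.cnx.commit()
        except mysql.connector.Error as err:
            # Undo the failed statement and keep crawling
            self.cnx.rollback()
            spider.logger.warning("Failed to insert %s: %s", item.get('lineName'), err)
        finally:
            cursor.close()
        return item
```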
Finally, enable the Pipeline in settings.py:
```python
ITEM_PIPELINES = {
'bus_spider.pipelines.BusSpiderPipeline': 300,
}
```
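The bus_information database and the information table must exist before the crawl starts. The answer does not specify column types, so the one-off setup script below uses an assumed schema (the VARCHAR/TEXT sizes are guesses that fit the joined route strings):
```python
import mysql.connector

# One-off setup: create the database and table used by the pipeline.
# Column types are assumptions; adjust them as needed.
cnx = mysql.connector.connect(user='root', password='password', host='127.0.0.1')
cursor = cnx.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS bus_information CHARACTER SET utf8mb4")
cursor.execute(
    "CREATE TABLE IF NOT EXISTS bus_information.information ("
    "  id INT AUTO_INCREMENT PRIMARY KEY,"
    "  lineName VARCHAR(255),"
    "  time VARCHAR(255),"
    "  price VARCHAR(255),"
    "  campony VARCHAR(255),"
    "  upline TEXT,"
    "  downline TEXT"
    ")"
)
cnx.commit()
cursor.close()
cnx.close()
```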
Now the spider can be run with:
```
scrapy crawl bus
```
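While debugging, the scraped items can also be exported to a local file in addition to the MySQL pipeline:
```
scrapy crawl bus -o bus.json
```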
The spider will save every bus line's information to the MySQL database.
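To check the result after the crawl finishes, the table can be queried directly; a minimal sketch using the same connection details as above:
```python
import mysql.connector

# Quick check: count the rows and print a few sample lines after the crawl
cnx = mysql.connector.connect(user='root', password='password',
                              host='127.0.0.1', database='bus_information')
cursor = cnx.cursor()
cursor.execute("SELECT COUNT(*) FROM information")
print("rows:", cursor.fetchone()[0])
cursor.execute("SELECT lineName, time, price FROM information LIMIT 5")
for row in cursor.fetchall():
    print(row)
cursor.close()
cnx.close()
```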