Write a program to crawl the Inspur UPTEC information website (20 points). 1) Using a single thread, crawl the news list and image information from the site's home page. 2) Store the crawled news list and images in a database (MongoDB or Redis). URL: http://www.inspuruptec.com:9090/
Time: 2024-02-16 11:02:34
Sorry, as an AI language model I cannot run a crawler myself, but I can provide a related Python crawler program that you can use as a reference:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pymongo

# Connect to MongoDB; use separate collections for news items and image URLs
client = pymongo.MongoClient(host='localhost', port=27017)
db = client['inspuruptec']
news_collection = db['news']
image_collection = db['images']

BASE_URL = 'http://www.inspuruptec.com:9090/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Crawl the news list on the home page
def get_news_list():
    response = requests.get(BASE_URL, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    for news in soup.select('.news_list li'):
        a = news.select_one('a')
        date = news.select_one('.date')
        if a is None:
            continue
        # Store each news item in MongoDB
        news_collection.insert_one({
            'title': a.get_text(strip=True),
            # urljoin resolves both relative and absolute hrefs correctly,
            # unlike naive string concatenation
            'link': urljoin(BASE_URL, a.get('href', '')),
            'date': date.get_text(strip=True) if date else '',
        })

# Crawl the banner images on the home page
def get_images():
    response = requests.get(BASE_URL, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    for image in soup.select('.banner img'):
        src = urljoin(BASE_URL, image.get('src', ''))
        # Store each image URL in MongoDB
        image_collection.insert_one({'src': src})

if __name__ == '__main__':
    get_news_list()
    get_images()
```
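The assignment also allows Redis as the storage backend. Below is a minimal sketch of that variant, assuming a Redis server on the default local port; the key name `inspuruptec:news` and the JSON-strings-in-a-list layout are my own choices for illustration, not anything dictated by the site:

```python
import json

def serialize_news(item):
    """Serialize one news dict to a JSON string (keeping non-ASCII readable)."""
    return json.dumps(item, ensure_ascii=False)

def save_news_to_redis(items, client):
    """RPUSH each serialized item onto a Redis list (read back later with LRANGE)."""
    for item in items:
        client.rpush('inspuruptec:news', serialize_news(item))

# Usage (requires the redis-py package and a running Redis server):
#   import redis
#   r = redis.Redis(host='localhost', port=6379, db=0)
#   save_news_to_redis(news_items, r)
```

Storing each item as a JSON string keeps the schema flexible, since Redis lists hold flat strings rather than structured documents.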
The crawler uses the requests and BeautifulSoup libraries to send the HTTP request and parse the HTML response. It first connects to MongoDB, then defines two functions, get_news_list and get_images, which crawl the home-page news list and banner images respectively; each news item (title, link, date) is stored in the news collection and each image URL in the images collection. Note that the CSS selectors (`.news_list li`, `.banner img`, `.date`) are assumptions about the page's markup and should be adjusted to match the site's actual HTML.
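The selector-based extraction can be tried offline on a small HTML snippet, which is a useful way to debug the parsing before hitting the live site. The markup below is hypothetical, mirroring the class names the crawler assumes:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://www.inspuruptec.com:9090/'

# Hypothetical markup matching the selectors used above; the real page's
# class names may differ and should be checked in the browser dev tools.
SAMPLE_HTML = """
<ul class="news_list">
  <li><a href="/news/1.html">News One</a><span class="date">2024-01-01</span></li>
  <li><a href="/news/2.html">News Two</a><span class="date">2024-01-02</span></li>
</ul>
"""

def parse_news(html, base_url):
    """Extract title/link/date dicts from a news-list page."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for li in soup.select('.news_list li'):
        a = li.select_one('a')
        date = li.select_one('.date')
        if a is None:
            continue
        items.append({
            'title': a.get_text(strip=True),
            # urljoin turns the relative href into an absolute URL
            'link': urljoin(base_url, a.get('href', '')),
            'date': date.get_text(strip=True) if date else '',
        })
    return items

print(parse_news(SAMPLE_HTML, BASE_URL))
```

Separating the parsing into a pure function like this also makes the crawler testable without a network connection or a database.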