A Modular Implementation of a Simple Web Crawler
When implementing a simple web crawler, the code can be divided into the following modules:
1. URL manager module: maintains the queue of URLs waiting to be crawled and the set of URLs already crawled, and decides whether a given URL should be crawled;
2. HTML downloader module: fetches the HTML page for a given URL and returns it to the spider;
3. HTML parser module: parses an HTML page, extracts the target data and any new URLs, and returns them to the spider;
4. Data storage module: saves the crawled data to a local file or a database.
Below is a modular implementation of a simple crawler:
```python
# url_manager.py
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""
    def __init__(self):
        self.new_urls = set()  # URLs not yet crawled
        self.old_urls = set()  # URLs already crawled

    def has_new_url(self):
        return len(self.new_urls) != 0

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Batch version used by SpiderMain to enqueue parsed links
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
# html_downloader.py
import urllib.request

class HtmlDownloader:
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
# html_parser.py
from bs4 import BeautifulSoup
import re
import urllib.parse

class HtmlParser:
    def parse(self, url, html_content):
        if url is None or html_content is None:
            return None, None
        soup = BeautifulSoup(html_content, 'html.parser')
        new_urls = self._get_new_urls(url, soup)
        new_data = self._get_new_data(url, soup)
        return new_urls, new_data

    def _get_new_urls(self, url, soup):
        # Collect in-site links of the form /item/... and resolve
        # them against the current page URL
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r'^/item/'))
        for link in links:
            new_url = urllib.parse.urljoin(url, link['href'])
            new_urls.add(new_url)
        return new_urls

    def _get_new_data(self, url, soup):
        # These selectors match the layout of Baidu Baike entry pages
        data = {}
        data['url'] = url
        title = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1').get_text()
        data['title'] = title
        summary = soup.find('div', class_='lemma-summary').get_text()
        data['summary'] = summary
        return data
# html_outputer.py
class HtmlOutputer:
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Write the collected records out as a simple HTML table
        with open('output.html', 'w', encoding='utf-8') as fout:
            fout.write('<html>')
            fout.write('<head><meta charset="utf-8"></head>')
            fout.write('<body>')
            fout.write('<table>')
            for data in self.datas:
                fout.write('<tr>')
                fout.write('<td>%s</td>' % data['url'])
                fout.write('<td>%s</td>' % data['title'])
                fout.write('<td>%s</td>' % data['summary'])
                fout.write('</tr>')
            fout.write('</table>')
            fout.write('</body>')
            fout.write('</html>')
# spider_main.py
from url_manager import UrlManager
from html_downloader import HtmlDownloader
from html_parser import HtmlParser
from html_outputer import HtmlOutputer

class SpiderMain:
    def __init__(self):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.outputer = HtmlOutputer()

    def crawl(self, root_url):
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                html_content = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_content)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
            except Exception as e:
                print('crawl failed:', e)
        self.outputer.output_html()
```
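To run the spider, a minimal entry point such as the following can be appended to spider_main.py. The root URL is illustrative, not part of the code above: any page whose outgoing links match the `/item/` pattern expected by HtmlParser would work.
```python
# Hypothetical entry point for spider_main.py; the root URL is an
# illustrative Baidu-Baike-style address chosen for this example.
if __name__ == '__main__':
    root_url = 'https://baike.baidu.com/item/Python'
    spider = SpiderMain()
    spider.crawl(root_url)
```
Note that crawl() loops until the URL queue is empty, which on a large site is effectively unbounded; a real crawler would cap the number of pages visited and throttle its requests.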
In this crawler, each module has a single responsibility and the modules are independent of one another. This modular design keeps the code clear and makes it easy to maintain and extend.
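For example, since SpiderMain only ever calls downloader.download(url), the urllib-based downloader can be replaced by one built on the requests library without touching any other module. Below is a minimal sketch of such a drop-in replacement (the class name RequestsDownloader is made up for illustration, and it assumes requests is installed):
```python
# html_downloader_requests.py -- hypothetical drop-in replacement
# for HtmlDownloader, built on the requests library.
import requests

class RequestsDownloader:
    def download(self, url):
        if url is None:
            return None
        # A timeout keeps one slow server from stalling the whole crawl
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return None
        return response.text
```
Because the interface stays the same, a single download(url) method returning the page content or None, SpiderMain works with either downloader unchanged.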