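Design a simple HTML parser that can parse an HTML string and supports elements such as tags, attributes, and text; show how to run it
Time: 2024-03-12 10:50:23 Views: 78
Sure, I can design a simple HTML parser for you. Here is a basic implementation: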
```python
class HTMLParser:
    """A minimal HTML parser that builds a tree of nested dicts."""

    def __init__(self, html_str):
        self.html_str = html_str

    def parse(self):
        tag_stack = []  # currently open tags, innermost last
        results = []    # top-level nodes
        curr_pos = 0
        while curr_pos < len(self.html_str):
            tag_start = self.html_str.find("<", curr_pos)
            # text runs up to the next tag, or to the end of the input
            text_end = tag_start if tag_start != -1 else len(self.html_str)
            text = self.html_str[curr_pos:text_end].strip()
            if text:
                # attach the text to the innermost open tag, if there is one
                parent = tag_stack[-1]["children"] if tag_stack else results
                parent.append({"type": "text", "content": text})
            # if no tag is found, the remaining input was all text
            if tag_start == -1:
                break
            tag_end = self.html_str.find(">", tag_start)
            if tag_end == -1:
                raise ValueError("Unterminated tag at position {}".format(tag_start))
            tag = self.html_str[tag_start + 1:tag_end].strip()
            if tag.startswith("/"):
                # closing tag: it must match the innermost open tag
                tag_name = tag[1:].strip()
                if not tag_stack:
                    raise ValueError("Unexpected closing tag: {}".format(tag_name))
                last_tag = tag_stack.pop()
                if last_tag["name"] != tag_name:
                    raise ValueError("Mismatched closing tag: expected {}, got {}".format(last_tag["name"], tag_name))
            else:
                # opening tag
                tag_parts = tag.split()
                tag_name = tag_parts[0]
                attrs = {}
                # limitation: splitting on whitespace breaks attribute values that contain spaces
                for part in tag_parts[1:]:
                    key, _, value = part.partition("=")
                    attrs[key] = value.strip('"')
                tag_obj = {"type": "tag", "name": tag_name, "attrs": attrs, "children": []}
                # nest the new tag under the innermost open tag, or at the top level
                parent = tag_stack[-1]["children"] if tag_stack else results
                parent.append(tag_obj)
                tag_stack.append(tag_obj)
            curr_pos = tag_end + 1
        return results
```
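The parser uses a stack to track the tags that are currently open and stores the parsed nodes in a list. Whenever it finds a tag, it checks whether it is a closing tag. If so, it pops the most recently opened tag off the stack and verifies that the names match, raising an error otherwise. If not, it treats the tag as an opening tag, pushes it onto the stack, and records it in the result structure.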
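The stack discipline on its own can be seen in a smaller sketch. The function name `check_nesting` and the token format (a bare tag name for an opening tag, a leading `/` for a closing tag) are assumptions made for illustration and are not part of the parser above.

```python
def check_nesting(tokens):
    """Return True if every closing token matches the most recently opened one."""
    stack = []
    for token in tokens:
        if token.startswith("/"):
            # a closing token must match the innermost open tag
            if not stack or stack.pop() != token[1:]:
                return False
        else:
            stack.append(token)
    # any tag still on the stack was never closed
    return not stack


print(check_nesting(["html", "body", "p", "/p", "/body", "/html"]))  # True
print(check_nesting(["b", "i", "/b", "/i"]))  # False
```

Because only the top of the stack is ever compared, overlapping tags such as `<b><i></b></i>` are rejected at the first closing tag.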
How to run:
```python
import json

html_str = '<html><head><title>My Title</title></head><body><p class="my-class">Some text here</p></body></html>'
parser = HTMLParser(html_str)
parsed_html = parser.parse()
# pretty-print the nested result as indented JSON
print(json.dumps(parsed_html, indent=2))
```
Output:
```
[
  {
    "type": "tag",
    "name": "html",
    "attrs": {},
    "children": [
      {
        "type": "tag",
        "name": "head",
        "attrs": {},
        "children": [
          {
            "type": "tag",
            "name": "title",
            "attrs": {},
            "children": [
              {
                "type": "text",
                "content": "My Title"
              }
            ]
          }
        ]
      },
      {
        "type": "tag",
        "name": "body",
        "attrs": {},
        "children": [
          {
            "type": "tag",
            "name": "p",
            "attrs": {
              "class": "my-class"
            },
            "children": [
              {
                "type": "text",
                "content": "Some text here"
              }
            ]
          }
        ]
      }
    ]
  }
]
```
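The parser turns the HTML string into a nested dictionary structure in which each dictionary represents an element or a text node. You can process these results further as needed.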
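As one example of such further processing, the sketch below walks the nested structure and collects every text node in document order. The helper name `collect_text` and the hand-written `sample_tree` are assumptions for illustration; the tree simply follows the node format shown in the output above.

```python
def collect_text(nodes):
    """Recursively gather the content of every text node, in document order."""
    parts = []
    for node in nodes:
        if node["type"] == "text":
            parts.append(node["content"])
        else:
            # tag nodes carry their nested nodes under "children"
            parts.extend(collect_text(node["children"]))
    return parts


# a hand-written tree in the same format as the parser output
sample_tree = [
    {"type": "tag", "name": "body", "attrs": {}, "children": [
        {"type": "tag", "name": "p", "attrs": {"class": "my-class"}, "children": [
            {"type": "text", "content": "Some text here"},
        ]},
    ]},
]

print(collect_text(sample_tree))  # ['Some text here']
```

The same recursive pattern can be adapted to other tasks, such as finding all tags with a given name or attribute.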