How to remove HTML tags with regular expressions in Python

Time: 2024-10-24 11:17:43 | Views: 48

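In Python, you can combine the `re` module (the standard regular expression library) with the `BeautifulSoup` library to remove HTML tags. First install BeautifulSoup if you have not already:

```bash
pip install beautifulsoup4
```

Then you can write a function like the following to strip the tags. `get_text()` already drops the markup. The regex pass afterwards is only a safety net, and it will also delete any literal text that happens to look like a tag.

```python
import re

from bs4 import BeautifulSoup

# Matches anything that looks like an HTML tag, e.g. <p> or </strong>
TAG_RE = re.compile(r'<.*?>')


def remove_html_tags(html_string):
    # Parse the HTML string; 'html.parser' ships with Python,
    # whereas 'lxml' would require a separate `pip install lxml`
    soup = BeautifulSoup(html_string, 'html.parser')
    # Extract the text content without tags
    text_only = soup.get_text()
    # Clean up once more with the regex, deleting any leftover tag-like text
    cleaned_text = TAG_RE.sub('', text_only)
    return cleaned_text.strip()


# Example
html_content = "<p>Hello, <strong>world!</strong></p>"
cleaned_text = remove_html_tags(html_content)
print(cleaned_text)  # prints "Hello, world!"
```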
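
Since the question asks specifically about regular expressions, here is a minimal sketch that uses only the standard library (`re` and `html`), without BeautifulSoup. The function name `strip_tags_regex` and the three patterns are illustrative assumptions, not part of any library. Regex-based stripping is a heuristic: it works for simple, well-formed snippets, but it can misfire on malformed markup or on a `>` inside an attribute value.

```python
import html
import re

# Hypothetical helper patterns (assumptions for this sketch, not a full HTML parser)
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
SCRIPT_STYLE_RE = re.compile(
    r'<(script|style)\b[^>]*>.*?</\1\s*>', re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r'<[^>]+>')  # [^>] also matches newlines, so multi-line tags are covered


def strip_tags_regex(html_string):
    # Remove comments first so their contents never reach the output
    text = COMMENT_RE.sub('', html_string)
    # Remove <script>/<style> blocks together with their bodies
    text = SCRIPT_STYLE_RE.sub('', text)
    # Remove the remaining tags, keeping the text between them
    text = TAG_RE.sub('', text)
    # Decode entities such as &amp; only after the tags are gone
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', text).strip()


sample = (
    '<div><!-- note --><style>p{color:red}</style>'
    '<p>Hello,\n <b>world</b> &amp; all!</p></div>'
)
print(strip_tags_regex(sample))  # prints "Hello, world & all!"
```

Entities are decoded after the tags are removed so that escaped text such as `&lt;b&gt;` survives as the literal characters `<b>` instead of being mistaken for a tag. For arbitrary real-world pages, a parser-based approach is generally the more robust choice.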
