Text Preprocessing for Text Analysis in Python
Posted: 2023-11-28 16:45:42
Below are some common preprocessing steps for text analysis in Python:
1. Removing stop words
```python
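import string
# The old stop_words module is gone from recent scikit-learn releases; import from feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
text = "This is an example sentence to remove stop words from."
words = text.split()
# Strip surrounding punctuation before the lookup so that "from." matches "from"
filtered_words = [word for word in words
                  if word.strip(string.punctuation).lower() not in ENGLISH_STOP_WORDS]
filtered_text = ' '.join(filtered_words)
print(filtered_text)  # Output: example sentence remove stop words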
```
2. Removing punctuation and digits
```python
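import string
text = "This is an example sentence with 123 numbers and punctuation!@#$"
# Build a table that deletes every ASCII punctuation character and digit
translator = str.maketrans('', '', string.punctuation + string.digits)
text = text.translate(translator)
# Removing "123" leaves two adjacent spaces, so collapse whitespace runs
text = ' '.join(text.split())
print(text)  # Output: This is an example sentence with numbers and punctuation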
```
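Note that `string.punctuation` and `string.digits` cover ASCII only, so curly quotes or full-width punctuation (common in Chinese text) pass through untouched. A minimal sketch of a Unicode-aware variant, assuming you want to drop every character whose Unicode category is punctuation, symbol, or decimal digit; the function name `strip_punct_and_digits` is illustrative, not part of any library:

```python
import unicodedata


def strip_punct_and_digits(text):
    """Drop punctuation (P*), symbols (S*) and decimal digits (Nd)."""
    kept = [
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S"))
        and unicodedata.category(ch) != "Nd"
    ]
    # Collapse the whitespace runs left behind by removed characters
    return " ".join("".join(kept).split())


print(strip_punct_and_digits("Price: 42, “quoted” text, 完成！"))
# Output: Price quoted text 完成
```

Symbols are included because `string.punctuation` itself contains characters such as `$` and `+`, which Unicode classifies as symbols rather than punctuation.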
3. Removing HTML tags
```python
import re
text = "<p>This is an example sentence with <strong>HTML tags</strong>.</p>"
# Non-greedy match for anything between "<" and the next ">"
clean_text = re.sub(r'<[^<]+?>', '', text)
print(clean_text)  # Output: This is an example sentence with HTML tags.
```
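The regex above is fine for simple snippets, but it leaves entities such as `&amp;` undecoded. A minimal sketch using the standard library's `html.parser.HTMLParser`, assuming you only want the text nodes; the names `TagStripper` and `strip_tags` are hypothetical helpers, not library APIs:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_tags(html_text):
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Tom &amp; Jerry in <strong>HTML tags</strong>.</p>"))
# Output: Tom & Jerry in HTML tags.
```

This sketch also keeps the contents of `<script>` and `<style>` elements, so for full web pages you would likely need to skip those explicitly.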
4. Removing special characters and extra whitespace
```python
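import re
text = "  This is an example sentence with special characters #@! and   extra spaces.  "
# Replace each run of non-alphanumeric characters (spaces included) with one space
clean_text = re.sub(r'[^A-Za-z0-9]+', ' ', text).strip()
# The final period is not alphanumeric either, so it is removed as well
print(clean_text)  # Output: This is an example sentence with special characters and extra spaces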
```
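The four steps can be chained into a single function. A minimal sketch, assuming the order shown (tags first, then punctuation and digits, then stop words) and a tiny illustrative stop-word set; `preprocess` and `SAMPLE_STOP_WORDS` are hypothetical names, and in practice you might pass scikit-learn's `ENGLISH_STOP_WORDS` instead:

```python
import re
import string

# Tiny illustrative stop-word set (an assumption, not a standard list)
SAMPLE_STOP_WORDS = frozenset({"this", "is", "an", "with", "and", "to", "from"})

# Translation table that deletes ASCII punctuation and digits
_DELETE_PUNCT_DIGITS = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text, stop_words=SAMPLE_STOP_WORDS):
    """Strip HTML tags, punctuation, digits, stop words and extra spaces."""
    # 1. Replace tags with a space so words on either side do not fuse
    text = re.sub(r"<[^<]+?>", " ", text)
    # 2. Delete punctuation and digits
    text = text.translate(_DELETE_PUNCT_DIGITS)
    # 3. Lowercase and split on any whitespace run
    tokens = text.lower().split()
    # 4. Drop stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Joining with single spaces also removes extra whitespace
    return " ".join(tokens)


print(preprocess("<p>This is an example with 123 numbers and <b>HTML</b>!</p>"))
# Output: example numbers html
```

The order matters: tags have to be stripped before punctuation, because deleting `<` and `>` first would leave the tag names behind as ordinary words.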