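Implement a decision-tree algorithm for sentence splitting: segment the input English text into sentences and output the result (without the sentence-final punctuation), where text is read via input. The English splitting rules are: "!" and "?" are relatively unambiguous; the period "." is quite ambiguous, because it can mark a sentence boundary, an abbreviation like Inc. or Dr., or a number like .02% or 4.3.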
Time: 2023-11-28 10:51:38
Below is a Python implementation of English sentence splitting based on a decision-tree algorithm:
```python
def split_sentences(text):
    # Unambiguous sentence terminators
    end_punctuations = {'!', '?'}
    # Abbreviations whose trailing period does not end a sentence
    abbreviations = {"Inc.", "Dr.", "Mr.", "Mrs.", "Ms.", "Jr.", "Sr.", "e.g.", "i.e.", "vs.", "etc.", "no.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."}

    # Helper: is this character an unambiguous sentence terminator?
    def is_end_punctuation(char):
        return char in end_punctuations

    # Helper: is this token (trailing period included) a known abbreviation?
    def is_abbreviation(word):
        return word in abbreviations

    # Helper: does the period at index i sit directly before a digit, as in 4.3 or .02%?
    def is_number(i):
        return i + 1 < len(text) and text[i + 1].isdigit()

    # Decision tree: at each candidate character, decide whether it ends a sentence
    sentences = []
    start = 0  # index where the current sentence begins
    for i, char in enumerate(text):
        next_char = text[i + 1] if i + 1 < len(text) else ""
        if is_end_punctuation(char):
            # "!" and "?" end a sentence unless more punctuation follows, as in "?!"
            boundary = next_char not in {"!", "?", "."}
        elif char == ".":
            # The ambiguous case: inspect the token ending here and what follows
            word = text[start:i + 1].split()[-1]
            rest = text[i + 1:].lstrip()
            if is_number(i) or is_abbreviation(word):
                boundary = False
            elif next_char == "":
                boundary = True  # period at the very end of the text
            else:
                # Boundary only if whitespace and then an uppercase letter follow
                boundary = next_char.isspace() and rest[:1].isupper()
        else:
            boundary = False
        if boundary:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    # Keep any trailing text that has no terminator
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```
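To make the ambiguous "." case easier to inspect, the period branch of the tree can be sketched in isolation. This is only an illustrative sketch: `classify_period` and the shortened `ABBREVIATIONS` set are hypothetical names introduced here, not part of the answer's code, and the rules are limited to the three cases named in the question.

```python
# Hypothetical helper for illustration; ABBREVIATIONS is an assumed subset.
ABBREVIATIONS = {"Inc.", "Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc."}

def classify_period(text, i):
    """Label the period at text[i] as 'number', 'abbreviation', 'boundary' or 'other'."""
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    # Branch 1: a digit right after the period means a number such as 4.3 or .02%
    if i + 1 < len(text) and text[i + 1].isdigit():
        return "number"
    # Branch 2: the whitespace-delimited token ending here is a known abbreviation
    token = text[:i + 1].split()[-1]
    if token in ABBREVIATIONS:
        return "abbreviation"
    # Branch 3: end of text, or whitespace followed by an uppercase letter
    rest = text[i + 1:]
    if rest == "" or (rest[0].isspace() and rest.lstrip()[:1].isupper()):
        return "boundary"
    return "other"

sample = "Dr. Lee paid 4.3 dollars. Prices rose .02% today."
labels = [(i, classify_period(sample, i)) for i, c in enumerate(sample) if c == "."]
print(labels)
```

Checking the number branch before the abbreviation branch keeps a leading-period number such as .02% from being looked up as a token. A sentence that genuinely ends with an abbreviation (for example "... at Acme Inc.") is not detected as a boundary by rules this simple.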
Usage:
```python
text = input("Enter some text: ")
sentences = split_sentences(text)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}")
```
Example input:
```
Enter some text: Hello! How are you? I'm doing well. What about you?
```
Example output:
```
Sentence 1: Hello!
Sentence 2: How are you?
Sentence 3: I'm doing well.
Sentence 4: What about you?
```
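The question asks for the result without the sentence-final punctuation, while the example output above keeps it. One possible way to close that gap is a small post-processing step; the sketch below assumes a hypothetical helper `strip_terminator` (not part of the answer's code) and uses a hard-coded list in place of the return value of `split_sentences(text)`.

```python
def strip_terminator(sentence):
    """Remove trailing '.', '!' and '?' characters (and surrounding whitespace)."""
    return sentence.strip().rstrip(".!?").rstrip()

# Sample list matching the example output; in practice this would come from
# split_sentences(text).
sentences = ["Hello!", "How are you?", "I'm doing well.", "What about you?"]
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {strip_terminator(sentence)}")
```

Note that this also removes the period of a sentence-final abbreviation ("Acme Inc." becomes "Acme Inc"), so it is a reasonable choice only if that loss is acceptable.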