分词与评价 任务1、利用Jieba对CDIAL-BIAS-race文件进行分词,并抽取词表。 任务2、利用上述词表,编写一个正向最大匹配分词程序。 任务3、编写一个评价程序,以结巴分词结果为标准答案,计算最大匹配分词结果的准确率、召回率、F测度。
时间: 2024-03-23 07:36:15 浏览: 120
任务1代码:
```python
import jieba
# 读取CDIAL-BIAS-race数据集
with open("CDIAL-BIAS-race.txt", "r", encoding="utf-8") as f:
text = f.read()
# 使用结巴分词进行分词,并抽取词表
seg_list = jieba.cut(text)
word_set = set(seg_list)
# 将词表写入文件
with open("CDIAL-BIAS-race-vocab.txt", "w", encoding="utf-8") as f:
f.write("\n".join(word_set))
```
任务2代码:
```python
# 读取CDIAL-BIAS-race数据集和词表
with open("CDIAL-BIAS-race.txt", "r", encoding="utf-8") as f:
text = f.read()
with open("CDIAL-BIAS-race-vocab.txt", "r", encoding="utf-8") as f:
word_set = set(line.strip() for line in f)
# 实现正向最大匹配分词算法
def max_match(text, word_set):
seg_list = []
while text:
for i in range(len(text), 0, -1):
if text[:i] in word_set:
seg_list.append(text[:i])
text = text[i:]
break
else:
seg_list.append(text[0])
text = text[1:]
return seg_list
# 对CDIAL-BIAS-race数据集进行分词
seg_list = max_match(text, word_set)
# 将分词结果写入文件
with open("CDIAL-BIAS-race-seg-maxmatch.txt", "w", encoding="utf-8") as f:
f.write(" ".join(seg_list))
```
任务3代码:
```python
# 读取结巴分词结果和自己实现的正向最大匹配分词结果
with open("CDIAL-BIAS-race-seg.txt", "r", encoding="utf-8") as f:
jieba_result = f.read().split()
with open("CDIAL-BIAS-race-seg-maxmatch.txt", "r", encoding="utf-8") as f:
max_match_result = f.read().split()
# 计算准确率、召回率、F测度
tp = 0
fp = 0
fn = 0
for word in max_match_result:
if word in jieba_result:
tp += 1
else:
fp += 1
for word in jieba_result:
if word not in max_match_result:
fn += 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
# 输出评价结果
print("准确率:", precision)
print("召回率:", recall)
print("F测度:", f_measure)
```
阅读全文
相关推荐
![-](https://img-home.csdnimg.cn/images/20241231044955.png)
![-](https://img-home.csdnimg.cn/images/20241231045021.png)
![-](https://img-home.csdnimg.cn/images/20241231044955.png)
![](https://csdnimg.cn/download_wenku/file_type_ask_c1.png)
![](https://csdnimg.cn/download_wenku/file_type_ask_c1.png)
![](https://csdnimg.cn/download_wenku/file_type_ask_c1.png)
![](https://csdnimg.cn/download_wenku/file_type_ask_c1.png)
![](https://csdnimg.cn/download_wenku/file_type_ask_c1.png)
![-](https://img-home.csdnimg.cn/images/20241226111658.png)
![](https://csdnimg.cn/download_wenku/file_type_ask_c1.png)
![-](https://img-home.csdnimg.cn/images/20241231045021.png)
![-](https://img-home.csdnimg.cn/images/20241231045053.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![pptx](https://img-home.csdnimg.cn/images/20241231044947.png)
![zip](https://img-home.csdnimg.cn/images/20241231045053.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)
![pdf](https://img-home.csdnimg.cn/images/20241231044930.png)