Write the experiment code in a Python environment.

Experiment 1: Segmentation and statistics. Task 1: Using the CDIAL-BIAS-race dataset, segment the file with the jieba tokenizer; output: the segmented file. Task 2: Count the frequencies of the sensitive words (race.txt) in the data that may carry racial or cultural bias.

Experiment 2: Segmentation and evaluation. Task 1: Segment the CDIAL-BIAS-race file with jieba and extract a vocabulary. Task 2: Using that vocabulary, write a forward maximum matching segmenter. Task 3: Write an evaluation program that, taking the jieba segmentation as the gold standard, computes the precision, recall, and F-measure of the maximum matching result.
Date: 2024-03-23 18:36:15
Task 1 code:
```python
import jieba

# Read the CDIAL-BIAS-race dataset
with open("CDIAL-BIAS-race.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Segment with jieba (returns a generator of tokens)
seg_list = jieba.cut(text)

# Write the space-separated segmentation to a file
with open("CDIAL-BIAS-race-seg.txt", "w", encoding="utf-8") as f:
    f.write(" ".join(seg_list))
```
Task 2 code:
```python
import jieba

# Read the CDIAL-BIAS-race dataset and the sensitive-word list
with open("CDIAL-BIAS-race.txt", "r", encoding="utf-8") as f:
    text = f.read()
with open("race.txt", "r", encoding="utf-8") as f:
    sensitive_words = {line.strip() for line in f if line.strip()}

# Segment with jieba and count occurrences of sensitive words
word_count = {}
for word in jieba.cut(text):
    if word in sensitive_words:
        word_count[word] = word_count.get(word, 0) + 1

# Print the frequency of each sensitive word found
for word, count in word_count.items():
    print(word, count)
```
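Experiment 2's Task 1 (extracting a vocabulary) and Task 2 (the forward maximum matching segmenter) have no code in the post, even though the evaluation below assumes a CDIAL-BIAS-race-seg-maxmatch.txt produced by them. A minimal sketch of the matcher; the toy vocabulary and maximum word length here are stand-ins, where in practice the vocabulary would be `set(jieba.cut(text))` from Task 1 and the length cap its longest entry:

```python
def forward_max_match(sentence, vocab, max_len):
    """Greedily take the longest vocabulary word starting at each position;
    fall back to a single character when nothing matches."""
    result = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if size == 1 or word in vocab:
                result.append(word)
                i += size
                break
    return result

# Toy vocabulary standing in for the one extracted from the jieba output
vocab = {"研究", "研究生", "生命", "命", "的", "起源"}
print(" ".join(forward_max_match("研究生命的起源", vocab, 3)))
# → 研究生 命 的 起源
```

Writing `" ".join(forward_max_match(text, vocab, max_len))` to CDIAL-BIAS-race-seg-maxmatch.txt then yields the input expected by the evaluation script.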
Task 3 code:
```python
from collections import Counter

# Read the jieba result (gold standard) and the forward max-match result
with open("CDIAL-BIAS-race-seg.txt", "r", encoding="utf-8") as f:
    jieba_result = f.read().split()
with open("CDIAL-BIAS-race-seg-maxmatch.txt", "r", encoding="utf-8") as f:
    max_match_result = f.read().split()

# Bag-of-words comparison: the multiset intersection matches each occurrence
# at most once, so a repeated word cannot be double-counted as a true positive
tp = sum((Counter(jieba_result) & Counter(max_match_result)).values())
fp = len(max_match_result) - tp
fn = len(jieba_result) - tp

# Guard against empty inputs to avoid division by zero
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f_measure = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)

print("Precision:", precision)
print("Recall:", recall)
print("F-measure:", f_measure)
```
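As a quick sanity check, the precision/recall/F arithmetic can be exercised on a toy pair of segmentations. This sketch compares bags of tokens via a multiset intersection, so a token repeated in one output but not the other is counted correctly; the example sentence and the `prf` helper are illustrative only:

```python
from collections import Counter

def prf(gold_tokens, pred_tokens):
    """Precision/recall/F over bags of tokens (multiset intersection)."""
    tp = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    precision = tp / len(pred_tokens) if pred_tokens else 0.0
    recall = tp / len(gold_tokens) if gold_tokens else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = ["我", "爱", "北京", "天安门"]      # jieba-style reference
pred = ["我", "爱", "北", "京", "天安门"]  # over-split max-match output
p, r, f = prf(gold, pred)
print(p, r, f)  # precision 0.6, recall 0.75, F ≈ 0.667
```

Here 3 of the 5 predicted tokens match the reference, giving precision 3/5, recall 3/4, and F-measure 2/3.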