import jieba

excludes = {"将军", "却说", "这样", "他们", "东汉", "", "然而", "自己", "这个", "没有"}
txt = open("C:\python\三国演义.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
This code uses the jieba library to segment the text of 三国演义 (Romance of the Three Kingdoms), counts how many times each word appears, and prints the 15 most frequent words together with their counts. A dictionary named counts stores the number of occurrences of each word, the items() method turns that dictionary into a list, sort() orders the list by frequency in descending order, and format() aligns the printed output. Note that excludes is an ordinary Python set of words meant to be filtered out of the results; it is not a parameter of jieba, and in this snippet it is defined but never actually applied, so the excluded words can still appear in the output. Also note that the file path should be written with double backslashes or as a raw string (r prefix), otherwise the backslashes may be interpreted as escape sequences and the file will not open.
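A corrected sketch of the same program, assuming the same file location and exclusion list; it applies the excludes filter before sorting and writes the path as a raw string:

import jieba

# Words to drop from the frequency table (a plain Python set, not a jieba parameter).
excludes = {"将军", "却说", "这样", "他们", "东汉", "然而", "自己", "这个", "没有"}

# A raw string keeps the backslashes from being read as escape sequences.
txt = open(r"C:\python\三国演义.txt", "r", encoding="utf-8").read()

counts = {}
for word in jieba.lcut(txt):
    if len(word) == 1 or word in excludes:   # skip single characters and excluded words
        continue
    counts[word] = counts.get(word, 0) + 1

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))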
Related questions
import jieba

excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
The statement import jieba loads the jieba library, a Chinese text segmentation library used to split Chinese sentences into individual words. The second statement defines excludes, a plain Python set of words intended to be filtered out of the results later, in this case "将军", "却说", "荆州", "二人", "不可", "不能", and "如此"; it is not an argument passed to jieba, so the counting code must check it explicitly.
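A minimal sketch of how a set like excludes is typically applied after segmentation; the sample sentence and variable names here are made up for illustration:

import jieba

excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

text = "却说曹操引兵追赶，将军与二人商议，不可轻敌。"   # hypothetical sample text
counts = {}
for word in jieba.lcut(text):
    if len(word) > 1 and word not in excludes:   # keep multi-character words that are not excluded
        counts[word] = counts.get(word, 0) + 1

print(counts)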
Explain the following Python code and give an experiment summary:

import jieba

excludes = {"一个", "两个", "只见", "如何", "那里", "哥哥", "说道", "这里", "出来", "众人"}
txt = open("水浒传.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
This code reads the text of the novel 水浒传 (Water Margin), segments it with the jieba library, counts how many times each word appears, and finally prints the ten most frequent words together with their counts.
The code segments the text with jieba.lcut and stores the result in the words list. A dictionary counts records how many times each word occurs: counts.get(word, 0) + 1 increments the count if the word has been seen before and starts it at 1 otherwise, while single-character tokens are skipped. After counting, the code deletes every word listed in the excludes set, which holds meaningless stop words. Note that del counts[word] raises a KeyError if an excluded word never appears in the text, so counts.pop(word, None) is a safer choice. Finally, counts.items() is converted to a list of (word, count) tuples, sorted in descending order by count, and the ten most frequent words are printed with their counts.
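A small sketch of the safer removal step mentioned above; the counts values here are hypothetical:

counts = {"宋江": 120, "武松": 95, "一个": 300}   # hypothetical frequency table
excludes = {"一个", "两个", "只见"}

# pop with a default removes the word if present and does nothing otherwise,
# so excluded words that never occur in the text do not raise KeyError.
for word in excludes:
    counts.pop(word, None)

print(counts)   # {'宋江': 120, '武松': 95}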
Experiment summary: this code demonstrates the basic steps of mining a text with the jieba segmentation library: removing meaningless stop words, counting how often each word occurs, and sorting and printing the ten most frequent words with their counts. These steps are very common in text processing and natural language processing.
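For comparison, the same count-filter-sort pipeline can be written more compactly with collections.Counter from the standard library; this sketch assumes 水浒传.txt is in the working directory:

import jieba
from collections import Counter

excludes = {"一个", "两个", "只见", "如何", "那里", "哥哥", "说道", "这里", "出来", "众人"}

with open("水浒传.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# Counter builds the frequency table; most_common returns the words sorted by count.
counts = Counter(w for w in jieba.lcut(txt) if len(w) > 1 and w not in excludes)
for word, count in counts.most_common(10):
    print("{0:<10}{1:>5}".format(word, count))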