
Bag-of-Words Preprocessing in Python

Date: 2023-07-11 19:50:43

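The bag-of-words model is a simple model in natural language processing that represents a text by how often each vocabulary word occurs in it, ignoring word order. Below is a simple Python implementation of the preprocessing step:

```python
import re
from collections import Counter

# A small sample stopword list; extend or replace it as needed
STOPWORDS = {"a", "an", "the", "in", "on", "at", "to", "of", "for",
             "with", "by", "that", "this", "these", "those"}

def preprocess_text(text):
    # Convert the string to lowercase
    text = text.lower()
    # Remove every character that is not a letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Split the string into a list of words
    words = text.split()
    # Remove stopwords (optional)
    words = [word for word in words if word not in STOPWORDS]
    # Count the occurrences of each word
    word_counts = Counter(words)
    # Return the word counter
    return word_counts
```

The function takes a string as input and performs the following steps:

1. Converts the string to lowercase
2. Removes every character that is not a letter, digit, or whitespace
3. Splits the string into a list of words
4. Removes stopwords (optional)
5. Counts the occurrences of each word

The function returns a `Counter` object that holds the number of occurrences of each word. You can use this `Counter` object to build word vectors.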
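As a rough illustration of that last step, the sketch below turns several word counters into fixed-length count vectors over a shared vocabulary. The helper names `build_vocabulary` and `vectorize` are hypothetical and not part of the original answer, and the sample counters are made-up data standing in for the output of `preprocess_text`. This is one possible approach, not the only way to do it.

```python
from collections import Counter

def build_vocabulary(counters):
    """Map every distinct word across all counters to a column index (sorted order)."""
    vocab = sorted(set().union(*counters))
    return {word: index for index, word in enumerate(vocab)}

def vectorize(counter, vocabulary):
    """Return a list of counts, one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word, count in counter.items():
        index = vocabulary.get(word)
        if index is not None:  # skip words that are not in the vocabulary
            vector[index] = count
    return vector

# Illustrative counters, standing in for the output of preprocess_text
docs = [Counter({"cat": 2, "sat": 1}), Counter({"dog": 1, "sat": 1})]
vocabulary = build_vocabulary(docs)
vectors = [vectorize(counter, vocabulary) for counter in docs]
```

With these sample counters the vocabulary is `cat`, `dog`, `sat`, so each document becomes a three-element vector. Sorting the vocabulary keeps the column order deterministic between runs; for large corpora a sparse representation would likely be a better fit than plain lists.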