
Chinese data cleaning code

Date: 2023-08-28 09:03:29

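Sure, here is an example of Python code for cleaning Chinese text data. It expects a `data.csv` file with a `text` column and a `stopwords.txt` file containing whitespace-separated stop words:

```python
import re

import jieba
import pandas as pd

# Read the CSV file
data = pd.read_csv('data.csv')

# Drop duplicate rows
data.drop_duplicates(inplace=True)

# Drop rows that contain missing values
data.dropna(inplace=True)

# Segment each text into space-separated words
def cut_words(text):
    return ' '.join(jieba.cut(text))

data['text'] = data['text'].apply(cut_words)

# Remove stop words
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stop_words = set(f.read().split())  # a set keeps membership tests fast

def remove_stopwords(text):
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(filtered_tokens)

data['text'] = data['text'].apply(remove_stopwords)

# Remove punctuation, special characters, and digits
def remove_punctuations(text):
    # \w is Unicode-aware in Python 3, so Chinese characters are kept
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Collapse the extra spaces left behind by removed tokens
    return ' '.join(text.split())

data['text'] = data['text'].apply(remove_punctuations)

# Save the cleaned data
data.to_csv('cleaned_data.csv', index=False)
```

This code uses the jieba library to segment the Chinese text and a stop-word list to remove common words that carry little meaning. It also uses regular expressions to remove punctuation and digits. Note that `dropna()` with no arguments drops a row if any column is missing, not only `text`. You can adjust the code to fit your data and cleaning requirements.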
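To check how the stop-word filter and the two regular expressions behave before running them on a full CSV file, they can be tried on a single string that is already segmented. The following is a minimal sketch using only the standard library. The function name `clean_segmented_text`, the tiny `STOP_WORDS` set, and the sample sentence are made up for illustration and stand in for the jieba output and `stopwords.txt`.

```python
import re

# Hypothetical stop-word set for illustration; a real run would load stopwords.txt.
STOP_WORDS = {"的", "了", "是"}

def clean_segmented_text(text, stop_words=STOP_WORDS):
    """Clean a string whose words are already separated by spaces."""
    # Remove punctuation and special characters. In Python 3, \w matches
    # Unicode word characters, so Chinese characters survive this step.
    text = re.sub(r"[^\w\s]", "", text)
    # Remove runs of digits.
    text = re.sub(r"\d+", "", text)
    # split() with no arguments discards the empty tokens left behind.
    tokens = [token for token in text.split() if token not in stop_words]
    return " ".join(tokens)

# Assumed sample: a sentence as it might look after word segmentation.
sample = "今天 的 天气 真 好 ， 气温 25 度 ！"
print(clean_segmented_text(sample))  # 今天 天气 真 好 气温 度
```

Because `\w` also matches the underscore, a token such as `abc_` keeps its underscore. If that matters for your data, add `_` to the characters being removed.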