Using jieba to segment a CSV file, remove stop words, and vectorize the text, saving each result separately

Date: 2024-03-22 18:37:25 Views: 436

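Sure, you can follow these steps:

1. Import the required libraries

```python
import pandas as pd
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
```

2. Read the CSV file and segment the text

```python
data = pd.read_csv('data.csv')
# Join the tokens with spaces; fillna('') guards against empty cells
data['text_cut'] = data['text'].fillna('').astype(str).apply(lambda x: " ".join(jieba.cut(x)))
```

3. Load the stop-word list and remove stop words

```python
# One stop word per line; a set gives fast membership checks
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}
# Write to a new column so the segmented text in text_cut is preserved
data['text_clean'] = data['text_cut'].apply(lambda x: " ".join(word for word in x.split() if word not in stopwords))
```

4. Vectorize the text with TF-IDF

```python
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(data['text_clean'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features_df = pd.DataFrame(features.toarray(), columns=tfidf.get_feature_names_out())
```

With its default settings, `TfidfVectorizer` ignores single-character tokens, which are common in Chinese. Pass `token_pattern=r"(?u)\b\w+\b"` if you want to keep them. Also note that `toarray()` builds a dense matrix, which can use a lot of memory on a large corpus.

5. Save the segmented result to a CSV file

```python
data.drop(columns=['text_clean']).to_csv('data_cut.csv', index=False)
```

6. Save the result after stop-word removal to a CSV file

```python
data.drop(columns=['text_cut']).to_csv('data_cut_stopwords.csv', index=False)
```

7. Save the vectorized features to a CSV file

```python
features_df.to_csv('data_features.csv', index=False)
```

This completes segmentation, stop-word removal, and vectorization, with each result saved to its own file.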
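To make the vectorization step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_rows` and the sample sentences are made up for illustration. Unlike the scikit-learn default tokenizer, this sketch keeps single-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute TF-IDF rows for a list of space-separated token strings.

    Returns (vocab, rows), where rows[i][j] is the weight of vocab[j]
    in document i.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row; an empty document stays all zeros
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical, already segmented sample sentences
docs = ["天气 很 好", "天气 不 好", "我 喜欢 分词"]
vocab, rows = tfidf_rows(docs)
```

In the first sentence, a term that appears in only one document (such as `很`) receives a larger weight than a term shared by two documents (such as `天气`), which is the effect the idf factor is meant to produce.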
