Python data preprocessing code
Date: 2023-08-30 08:10:48
This is a basic template for data preprocessing in Python:
```python
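import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Load the data
data = pd.read_csv('data.csv')

# Drop duplicate rows
data.drop_duplicates(inplace=True)

# Handle missing values: drop rows missing col1 or col2, fill col3 with its mean
data.dropna(subset=['col1', 'col2'], inplace=True)
# Assign the result back; an in-place fillna on a selected column is unreliable in recent pandas
data['col3'] = data['col3'].fillna(data['col3'].mean())

# Transform: natural log (col4 must be strictly positive; np.log1p is an option if it contains zeros)
data['col4'] = np.log(data['col4'])

# Normalize / standardize: min-max to [0, 1] for col5, zero mean and unit variance for col6
# Note: fitting on the full dataset leaks test-set statistics; for model evaluation, fit on the training split only
# fit_transform returns an (n, 1) array, so flatten it before assigning to a column
data['col5'] = MinMaxScaler().fit_transform(data[['col5']]).ravel()
data['col6'] = StandardScaler().fit_transform(data[['col6']]).ravel()

# Feature extraction: bag-of-words counts from the text column col7
cv = CountVectorizer()
text = cv.fit_transform(data['col7'])  # sparse document-term matrix

# Feature selection: the last column is treated as the label
y = data.iloc[:, -1]
X = data.iloc[:, :-1].select_dtypes(include='number')  # chi2 cannot score text columns
X = X.loc[:, (X >= 0).all()]  # chi2 needs non-negative, NaN-free features and a categorical label
skb = SelectKBest(chi2, k=min(5, X.shape[1]))
X_new = skb.fit_transform(X, y)

# Split into training and test sets, using the selected features
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=0)

# Other data processing steps...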
```
The comments in the template describe each step; add or remove operations to suit your own data.
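One thing the template does not address is ordering: it scales the columns before splitting, so the test rows contribute to the min, max, mean, and standard deviation used for scaling. Below is a minimal sketch of one way to avoid that, by splitting first and fitting the scalers on the training rows only. It is an illustration, not part of the original template: `make_demo_frame`, `split_then_scale`, and the `label` column are hypothetical names, and the synthetic values stand in for `data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def make_demo_frame(n_rows=100, seed=0):
    """Build a small synthetic frame; column names mirror the template, values are made up."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'col5': rng.uniform(0, 50, n_rows),
        'col6': rng.normal(10, 3, n_rows),
        'label': rng.integers(0, 2, n_rows),  # hypothetical binary label
    })


def split_then_scale(frame, label='label', test_size=0.2, seed=0):
    """Split first, then fit the scalers on the training rows only."""
    X = frame.drop(columns=[label])
    y = frame[label]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)

    # Output columns follow the transformer order: col5 first, then col6
    scaler = ColumnTransformer([
        ('minmax', MinMaxScaler(), ['col5']),
        ('standard', StandardScaler(), ['col6']),
    ])
    X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training rows
    X_test_scaled = scaler.transform(X_test)        # test rows reuse those statistics
    return X_train_scaled, X_test_scaled, y_train, y_test


demo = make_demo_frame()
X_train_scaled, X_test_scaled, y_train, y_test = split_then_scale(demo)
print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the scalers only ever see the training rows, scaled test values can fall slightly outside [0, 1]. That is expected and is what an honest evaluation looks like.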