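Data preprocessing code in Python
Date: 2024-09-03 12:04:12 Views: 79
Data preprocessing is an important step in any data analysis or machine learning project. It covers cleaning, transforming, and normalizing data to improve data quality and make the data suitable for model training. Below are some common preprocessing methods in Python, with code examples:
1. Import the required libraries: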
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
```
2. Data cleaning:
- Handle missing values:
```python
df = pd.read_csv('data.csv')
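df.fillna(df.mean(numeric_only=True), inplace=True)  # fill missing values in numeric columns with the column mean
df.dropna(inplace=True)  # drop rows that still contain missing values (e.g. in non-numeric columns)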
```
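Mean imputation is only one option, and it applies only to numeric columns. As a minimal sketch, assuming a hypothetical toy DataFrame with an `age` and a `city` column (neither is from the original), the snippet below counts missing values first, then fills the numeric column with its median and the categorical column with its most frequent value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Count missing values per column before deciding how to treat them.
missing_counts = df.isna().sum()

# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
```

Whether to impute or drop depends on how much data is missing and why. Treat the choice of median and mode here as an illustration rather than a recommendation.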
- Remove duplicate rows:
```python
df.drop_duplicates(inplace=True)
```
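By default `drop_duplicates` removes only rows that are identical in every column. A short sketch of the difference between exact duplicates and duplicates on a key column follows. The `user_id` and `score` columns are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data: user 1 appears twice with identical rows,
# user 3 appears twice with different scores.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "score": [10, 10, 20, 30, 35],
})

# Number of rows that exactly repeat an earlier row.
n_exact = int(df.duplicated().sum())

# Drop rows that are identical across all columns.
exact_dedup = df.drop_duplicates()

# Keep one row per user_id, preferring the last occurrence.
by_key = df.drop_duplicates(subset=["user_id"], keep="last").reset_index(drop=True)
```

Returning a new DataFrame instead of using `inplace=True` keeps the original available for comparison, which can help when checking how many rows were removed.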
3. Data transformation:
- Feature encoding:
```python
df['categorical_column'] = df['categorical_column'].astype('category').cat.codes  # convert a categorical column to integer codes (missing values become -1)
```
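Integer codes imply an ordering between categories, which some models may interpret as meaningful. When the categories have no natural order, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a hypothetical `color` column (not from the original):

```python
import pandas as pd

# Hypothetical toy data with an unordered categorical column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [1.0, 2.0, 3.0, 2.5],
})

# Replace "color" with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

Each category becomes its own column (`color_blue`, `color_green`, `color_red`), so no artificial ordering is introduced, at the cost of a wider table.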
- Feature generation:
```python
df['feature_interaction'] = df['feature1'] * df['feature2']  # create an interaction term from two features
```
4. Data normalization:
- Standardization (rescale the data to mean 0 and standard deviation 1):
```python
scaler = StandardScaler()
df[['feature1', 'feature2']] = scaler.fit_transform(df[['feature1', 'feature2']])
```
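To make explicit what `StandardScaler` computes, the sketch below compares its output with the manual formula `(x - mean) / std` on hypothetical toy values. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data reusing the column names from the example above.
df = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [10.0, 20.0, 30.0, 40.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["feature1", "feature2"]])

# Manual equivalent: subtract the column mean, divide by the population std.
manual = (df - df.mean()) / df.std(ddof=0)
```

After scaling, each column has mean 0 and standard deviation 1 (up to floating-point error), and `scaled` matches `manual`.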
- Min-max normalization (scale a numeric feature to the range 0 to 1):
```python
df['feature'] = (df['feature'] - df['feature'].min()) / (df['feature'].max() - df['feature'].min())
```
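The min-max formula divides by `max - min`, which is zero for a constant column and would produce `NaN` values. A minimal sketch of a guarded version follows. The helper name `min_max_scale` and the choice to return zeros for a constant column are assumptions made for illustration:

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; return all zeros for a constant column."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # Avoid division by zero: a constant column carries no spread to scale.
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range

scaled = min_max_scale(pd.Series([2.0, 4.0, 6.0]))    # values 0.0, 0.5, 1.0
constant = min_max_scale(pd.Series([5.0, 5.0, 5.0]))  # values 0.0, 0.0, 0.0
```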
5. Data splitting:
- Split the data into training and test sets:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)
```
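One common refinement is to split first and then fit the scaler on the training set only, so that statistics from the test set do not leak into preprocessing. The sketch below assumes synthetic data and a binary `target` column, and it adds `stratify` to preserve the class balance. None of these details come from the original:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: two numeric features and a balanced binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature1": rng.normal(size=100),
    "feature2": rng.normal(loc=5.0, scale=2.0, size=100),
    "target": np.repeat([0, 1], 50),
})

X = df.drop("target", axis=1)
y = df["target"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The test set is transformed with the training mean and standard deviation, so its scaled columns will generally not have exactly zero mean. That is expected.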