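My data has a column named STATE whose values are uppercase letter codes such as CA, CT, and MA, but a few values are garbled and some are missing. How can I clean this data with Python?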
Time: 2023-09-23 07:09:05
You can use Python's pandas library to clean the data.
First, read the data and convert the STATE column to uppercase:
```python
import pandas as pd
df = pd.read_csv('data.csv')
df['STATE'] = df['STATE'].str.upper()
```
Next, find the garbled and missing values and replace them with NaN:
```python
import numpy as np
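# Garbled values: keep only strings made up entirely of ASCII letters A-Z.
# na=False makes missing values count as non-matching instead of raising an error.
valid = df['STATE'].str.fullmatch(r'[A-Z]+', na=False)
df['STATE'] = df['STATE'].where(valid, np.nan)
# Missing values: empty strings also fail the pattern, so they are already NaN here.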
```
Finally, drop the rows that contain NaN in STATE:
```python
df.dropna(subset=['STATE'], inplace=True)
```
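To check the steps above without a CSV file, they can be tried on a small in-memory frame. The following is a minimal sketch, not a definitive implementation: the sample values and the `VALUE` column are made up for illustration, and the validity rule (only ASCII letters A-Z) is an assumption about what counts as a clean code. It uses `str.fullmatch` with `na=False` so that missing values are treated as invalid rather than causing an error.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for data.csv (hypothetical values).
df = pd.DataFrame({
    "STATE": ["CA", "ct", np.nan, "M?A", "", "C1", "MA"],
    "VALUE": [1, 2, 3, 4, 5, 6, 7],
})

# Step 1: normalise case; missing values stay missing.
df["STATE"] = df["STATE"].str.upper()

# Step 2: flag rows whose STATE consists only of the letters A-Z.
valid = df["STATE"].str.fullmatch(r"[A-Z]+", na=False)

# Keep the rejected values for inspection before discarding them.
rejected = df.loc[~valid, "STATE"].tolist()

# Step 3: keep only the valid rows.
df = df[valid].reset_index(drop=True)

print(rejected)
print(df)
```

Looking at `rejected` before dropping rows is a cheap way to confirm that the rule is not discarding values that could be repaired instead.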
The complete code is as follows:
```python
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df['STATE'] = df['STATE'].str.upper()
# Keep only values made up entirely of A-Z; NaN and empty strings count as invalid.
valid = df['STATE'].str.fullmatch(r'[A-Z]+', na=False)
df['STATE'] = df['STATE'].where(valid, np.nan)
df.dropna(subset=['STATE'], inplace=True)
```
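A letters-only check still accepts strings that look plausible but are not real codes (for example `ZZ`). If the set of legitimate values is known in advance, a stricter option is to compare against a whitelist. The sketch below assumes a hypothetical `VALID_STATES` set and a hypothetical helper named `clean_state`; neither comes from the original answer, and the set would need to be extended to the codes actually expected in the data.

```python
import pandas as pd

# Hypothetical whitelist: extend to the full set of codes expected in your data.
VALID_STATES = {"CA", "CT", "MA", "NY", "TX"}


def clean_state(df: pd.DataFrame, column: str = "STATE") -> pd.DataFrame:
    """Return a copy that keeps only rows whose `column` is a known code."""
    out = df.copy()
    # Normalise surrounding whitespace and case; missing values stay missing.
    out[column] = out[column].str.strip().str.upper()
    # isin returns False for NaN, so missing values are dropped as well.
    return out[out[column].isin(VALID_STATES)].reset_index(drop=True)


# Made-up sample values for illustration.
sample = pd.DataFrame({"STATE": ["CA", " ct", None, "M@", "", "ZZ", "MA"]})
cleaned = clean_state(sample)
print(cleaned["STATE"].tolist())
```

The helper returns a new frame instead of modifying its input, which makes it easier to compare the data before and after cleaning.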