没有合适的资源?快使用搜索试试~ 我知道了~
首页数据清洗之 数据离散化
数据离散化 数据离散化就是分箱 一把你常用分箱方法是等频分箱或者等宽分箱 一般使用pd.cut或者pd.qcut函数 pandas.cut(x, bins, right=True, labels) x: 数据 bins: 离散化的数目,或者切分的区间 labels: 离散化后各个类别的标签 right: 是否包含区间右边的值 import pandas as pd import numpy as np import os os.getcwd() 'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据' os.chdir('D:\\Jupyter\\noteboo
资源详情
资源评论
资源推荐

数据清洗之数据清洗之 数据离散化数据离散化
数据离散化数据离散化
数据离散化就是分箱
一把你常用分箱方法是等频分箱或者等宽分箱
一般使用pd.cut或者pd.qcut函数
pandas.cut(x, bins, right=True, labels)
x: 数据
bins: 离散化的数目,或者切分的区间
labels: 离散化后各个类别的标签
right: 是否包含区间右边的值
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\Jupyter\notebook\Python数据清洗实战\数据'
os.chdir('D:\Jupyter\notebook\Python数据清洗实战\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
def f(x):
if '$' in str(x):
x = str(x).strip('$')
x = str(x).replace(',', '')
else:
x = str(x).replace(',', '')
return float(x)
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
df.head(5)
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model … Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status
0 Used
mint!!! very low
miles
11412.0
McHenry,
Illinois,
United
States
2013.0 16000.0 Black
Harley-
Davidson
Unspecified Touring … NaN FALSE 8.1 NaN 2427 Private Seller
1 Used Perfect condition 17200.0
Fort
Recovery,
Ohio,
United
States
2016.0 60.0 Black
Harley-
Davidson
Vehicle has
an existing
warranty
Touring … NaN FALSE 100 17 657 Private Seller
2 Used NaN 3872.0
Chicago,
Illinois,
United
States
1970.0 25763.0 Silver/Blue BMW
Vehicle does
NOT have
an existing
warranty
R-
Series
… NaN FALSE 100 NaN 136 NaN
3 Used
CLEAN TITLE
READY TO
RIDE HOME
6575.0
Green
Bay,
Wisconsin,
United
States
2009.0 33142.0 Red
Harley-
Davidson
NaN Touring … NaN FALSE 100 NaN 2920 Dealer
4 Used NaN 10000.0
West
Bend,
Wisconsin,
United
States
2012.0 17800.0 Blue
Harley-
Davidson
NO
WARRANTY
Touring … NaN FALSE 100 13 271 OWNER
5 rows × 22 columns
df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))
# 计算频数
df['Price_bin'].value_counts()
0 6762
1 659
2 50
3 20
4 2
Name: Price_bin, dtype: int64
%matplotlib inline
df['Price_bin'].value_counts().plot(kind='bar')
df['Price_bin'].hist()










安全验证
文档复制为VIP权益,开通VIP直接复制

评论0