Pandas数据分析速查表

需积分: 10 131 浏览量更新于2024-09-10 收藏 407KB PDF 举报

“pandas_cheat_sheet” Pandas是Python中一个强大的数据分析库，它提供了两种主要的数据结构：Series（一维数据结构）和DataFrame（二维数据结构）。本资源是一个Pandas速查表，由Arianne Colton和Sean Chen创建，用于帮助用户快速理解和操作这两个数据结构。 1. Series（一维数据结构） Series是一种类似于数组的对象，包含一组数据（可以是任何NumPy数据类型）以及与之关联的数据标签，也称为索引。如果未指定索引，则会创建一个默认的整数索引，从0开始直到数据长度减1。 - 创建Series： ```python series1 = pd.Series([1, 2], index=['a', 'b']) series1 = pd.Series(dict1) # 如果dict1是{'a': 1, 'b': 2}这样的字典 ``` - 获取Series值： ```python series1.values # 获取所有数值 series1['a'] # 通过索引获取单个值 series1[['b', 'a']] # 通过索引列表获取多个值 ``` - 获取Series索引： ```python series1.index # 获取索引数组 ``` - 设置或获取名称属性： ```python series1.name # 获取Series名称 series1.index.name # 获取索引名称 ``` - 常见操作： - 加法操作：Series之间的加法会自动对齐不同索引的数据。 ```python series1 + series2 ``` - 唯一值：使用`unique()`方法获取Series中的唯一值。 ```python series2 = series1.unique() ``` - 视作有序字典：Series可以被用作很多期望字典参数的函数的替代品。 2. DataFrame（二维数据结构） DataFrame是一个表格型的数据结构，包含有序的列集合，每列可以是不同的值类型。DataFrame可以被理解为一系列Series的字典。 - 创建DataFrame： ```python df = pd.DataFrame.from_dict(data, orient='columns') # 从字典创建 ``` 或者 ```python df = pd.DataFrame({'column1': [1, 2], 'column2': [3, 4]}) ``` - DataFrame操作： - 数据选择：通过列名、行索引或切片访问数据。 ```python df['column1'] # 选择一列 df.loc[0] # 通过行索引选择一行 df.iloc[0, 1] # 通过位置选择单元格 ``` - 数据过滤和条件查询：使用布尔索引。 ```python df[df['column1'] > 1] ``` - 数据操作：包括合并、连接、计算等。 ```python df.add(df2, fill_value=0) # 合并，用0填充缺失值 df.groupby('category').mean() # 按类别求平均值 ``` Pandas的这些基本操作构成了数据分析的基础，其高效性和易用性使得它成为处理和分析大量数据的首选工具。通过理解和掌握这个速查表中的内容，用户可以更有效地进行数据清洗、转换、合并、聚合等任务，从而实现更复杂的数据分析。

Data Analysis with PANDAS

CHEAT SHEET

Created By: arianne Colton and Sean Chen

DATA STruCTurES

DATA STruCTurES ConTinuED

SERIES (1D)

One-dimensional array-like object containing an array of

data (of any NumPy data type) and an associated array

of data labels, called its “index”. If index of data is not

specied, then a default one consisting of the integers 0

through N-1 is created.

Create Series

series1 = pd.Series ([1,

2], index = ['a', 'b'])

series1 = pd.Series(dict1)*

Get Series Values

series1.values

Get Values by Index

series1['a']

series1[['b','a']]

Get Series Index

series1.index

Get Name Attribute

(None is default)

series1.name

series1.index.name

** Common Index

Values are Added

series1 + series2

Unique But Unsorted

series2 = series1.unique()

* Can think of Series as a xed-length, ordered

dict. Series can be substitued into many

functions that expect a dict.

** Auto-align differently-indexed data in arithmetic

operations

DATAFRAME (2D)

Tabular data structure with ordered collections of

columns, each of which can be different value type.

Data Frame (DF) can be thought of as a dict of Series.

Create DF

(from a dict of

equal-length lists

or NumPy arrays)

dict1 = {'state': ['Ohio',

'CA'], 'year': [2000, 2010]}

df1 = pd.DataFrame(dict1)

# columns are placed in sorted order

df1 = pd.DataFrame(dict1,

index = ['row1', 'row2']))

# specifying index

df1 = pd.DataFrame(dict1,

columns = ['year', 'state'])

# columns are placed in your given order

* Create DF

(from nested dict

of dicts)

The inner keys as

row indices

dict1 = {'col1': {'row1': 1,

'row2': 2}, 'col2': {'row1':

3, 'row2': 4} }

df1 = pd.DataFrame(dict1)

* DF has a “to_panel()” method which is the

inverse of “to_frame()”.

** Hierarchical indexing makes N-dimensional

arrays unnecessary in a lot of cases. Aka

prefer to use Stacked DF, not Panel data.

INDEX OBJECTS

Immutable objects that hold the axis labels and other

metadata (i.e. axis name)

• i.e. Index, MultiIndex, DatetimeIndex, PeriodIndex

• Any sequence of labels used when constructing

Series or DF internally converted to an Index.

• Can functions as xed-size set in additional to being

array-like.

HIERARCHICAL INDEXING

Multiple index levels on an axis : A way to work with

higher dimensional data in a lower dimensional form.

MultiIndex :

series1 = Series(np.random.randn(6), index =

[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3,

1, 2, 3]])

series1.index.names = ['key1', 'key2']

Series Partial

Indexing

series1['b'] # Outer Level

series1[:, 2] # Inner Level

DF Partial

Indexing

df1['outerCol3','InnerCol2']

df1['outerCol3']['InnerCol2']

Swaping and Sorting Levels

Swap Level (level

interchanged) *

swapSeries1 = series1.

swaplevel('key1', 'key2')

Sort Level

series1.sortlevel(1)

# sorts according to rst inner level

MiSSing DATA

Python NaN - np.nan(not a number)

Pandas *

NaN or python built-in None mean

missing/NA values

Use pd.isnull(), pd.notnull() or

series1/df1.isnull() to detect missing data.

FILTERING OUT MISSING DATA

dropna() returns with ONLY non-null data, source

data NOT modied.

df1.dropna() # drop any row containing missing value

df1.dropna(axis = 1) # drop any column

containing missing values

df1.dropna(how = 'all') # drop row that are all

missing

df1.dropna(thresh = 3) # drop any row containing

< 3 number of observations

FILLING IN MISSING DATA

df2 = df1.llna(0) # ll all missing data with 0

df1.llna('inplace = True') # modify in-place

Use a different ll value for each column :

df1.llna({'col1' : 0, 'col2' : -1})

Only forward ll the 2 missing values in front :

df1.llna(method = 'fll', limit = 2)

i.e. for column1, if row 3-6 are missing. so 3 and 4 get lled

with the value from 2, NOT 5 and 6.

Get Columns and

Row Names

df1.columns

df1.index

Get Name

Attribute

(None is default)

df1.columns.name

df1.index.name

Get Values

df1.values

# returns the data as a 2D ndarray, the

dtype will be chosen to accomandate all of

the columns

** Get Column as

Series

df1['state'] or df1.state

** Get Row as

Series

df1.ix['row2'] or df1.ix[1]

Assign a column

that doesn’t exist

will create a new

column

df1['eastern'] = df1.state

== 'Ohio'

Delete a column

del df1['eastern']

Switch Columns

and Rows

df1.T

* Dicts of Series are treated the same as Nested

dict of dicts.

** Data returned is a ‘view’ on the underlying

data, NOT a copy. Thus, any in-place

modicatons to the data will be reected in df1.

PANEL DATA (3D)

Create Panel Data : (Each item in the Panel is a DF)

import pandas_datareader.data as web

panel1 = pd.Panel({stk : web.get_data_

yahoo(stk, '1/1/2000', '1/1/2010')

for stk in ['AAPL', 'IBM']})

# panel1 Dimensions : 2 (item) * 861 (major) * 6 (minor)

“Stacked” DF form : (Useful way to represent panel data)

panel1 = panel1.swapaxes('item', 'minor')

panel1.ix[:, '6/1/2003', :].to_frame() *

=> Stacked DF (with hierarchical indexing **) :

# Open High Low Close Volume Adj-Close

# major minor

# 2003-06-01 AAPL

# IBM

# 2003-06-02 AAPL

# IBM

Common Ops :

Swap and Sort **

series1.swaplevel(0,

1).sortlevel(0)

# the order of rows also change

* The order of the rows do not change. Only the

two levels got swapped.

** Data selection performance is much better if

the index is sorted starting with the outermost

level, as a result of calling sortlevel(0) or

sort_index().

Summary Statistics by Level

Most stats functions in DF or Series have a “level”

option that you can specify the level you want on an

axis.

Sum rows (that

have same ‘key2’

value)

df1.sum(level = 'key2')

Sum columns ..

df1.sum(level = 'col3', axis

= 1)

• Under the hood, the functionality provided here

utilizes panda’s “groupby”.

DataFrame’s Columns as Indexes

DF’s “set_index” will create a new DF using one or more

of its columns as the index.

New DF using

columns as index

df2 = df1.set_index(['col3',

'col4']) *

‡

# col3 becomes the outermost index, col4

becomes inner index. Values of col3, col4

become the index values.

* "reset_index" does the opposite of "set_index",

the hierarchical index are moved into columns.

‡

By default, 'col3' and 'col4' will be removed

from the DF, though you can leave them by

option : 'drop = False'.

下载后可阅读完整内容，剩余3页未读，立即下载

身份认证购VIP最低享 7 折!

30元优惠券

qq_42058517

粉丝: 0

Pandas数据分析速查表

Python机器学习速查表：常用包与方法大全

Python编程效率提升50%的cheatsheet

数据科学与机器学习速查表合集

Pandas_Cheat_Sheet.pdf

Pandas_Cheat_Sheet--Python_For_Data_Science

5.Pandas_Cheat_Sheet.pdf

beginners_python_cheat_sheet_pcc_python_

beginners_python_cheat_sheet_pcc_matplotlib_pythonbook_

各类速查表汇总-Python_Seaborn_Cheat_Sheet

python机器学习速查表_cheat_sheet.rar

最新资源