pandas: powerful Python data analysis toolkit, Release 0.25.2
• Intuitive merging and joining data sets
• Flexible reshaping and pivoting of data sets
• Hierarchical labeling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving /
loading data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window
statistics, moving window linear regressions, date shifting and lagging, etc.
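As a rough illustration of several of the features listed above, the following sketch uses small made-up tables. All variable names and data values are hypothetical and chosen only for the example; the pandas calls themselves (pd.merge, DataFrame.pivot, pd.MultiIndex.from_tuples, pd.read_csv, pd.date_range, rolling, shift, resample) are standard parts of the library.

```python
import io

import pandas as pd

# Merging: two illustrative tables that share a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})
merged = pd.merge(left, right, on="key", how="inner")  # keeps keys in both

# Reshaping: pivot a long table into a wide one.
long_df = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "variable": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
wide = long_df.pivot(index="date", columns="variable", values="value")

# Hierarchical labeling: two labels per entry on the index.
hier = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["outer", "inner"]
    ),
)
outer_a = hier.loc["a"]  # selects all entries under outer label "a"

# IO: write a table to CSV text and read it back (in memory, no files).
csv_text = left.to_csv(index=False)
roundtrip = pd.read_csv(io.StringIO(csv_text))

# Time series: date range generation, moving window statistics,
# shifting / lagging, and frequency conversion.
idx = pd.date_range("2019-01-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)
rolling_mean = ts.rolling(window=3).mean()  # first two values are NaN
lagged = ts.shift(1)                        # values moved one day later
two_day_sums = ts.resample("2D").sum()      # daily -> two-day frequency
```

Each step is independent of the others; they are grouped here only to show roughly what the listed features look like in practice.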
Many of these principles are here to address the shortcomings frequently experienced using other languages
/ scientific research environments. For data scientists, working with data is typically divided into multiple
stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into
a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.
Some other notes
• pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code.
However, as with anything else, generalization usually sacrifices performance. So if you focus on one
feature for your application you may be able to create a faster specialized tool.
• pandas is a dependency of statsmodels, making it an important part of the statistical computing
ecosystem in Python.
• pandas has been used extensively in production in financial applications.
3.1.1 Data structures
Dimensions  Name       Description
1           Series     1D labeled homogeneously-typed array
2           DataFrame  General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
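A minimal sketch of the two structures in the table, using made-up labels and values (the names s and df are only illustrative):

```python
import pandas as pd

# Series: a 1D labeled array whose values share a single dtype.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# DataFrame: a 2D labeled table whose columns may have different dtypes.
df = pd.DataFrame(
    {
        "name": ["x", "y", "z"],    # text column
        "score": [1.5, 2.5, 3.5],   # floating-point column
        "count": [1, 2, 3],         # integer column
    },
    index=["a", "b", "c"],
)

value_b = s["b"]                   # label-based lookup in the Series
score_c = df.loc["c", "score"]     # row label, then column label
n_dtypes = df.dtypes.nunique()     # more than one dtype across columns
```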
Why more than one data structure?
The best way to think about the pandas data structures is as flexible containers for lower dimensional data.
For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be
able to insert and remove objects from these containers in a dictionary-like fashion.
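The container view can be sketched as follows, with a small hypothetical DataFrame. Columns are inserted and removed with the same syntax a dict would use, selecting a column yields a Series, and selecting from that Series yields a scalar.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Insert a new column, as with dict item assignment.
df["c"] = df["a"] + df["b"]

# Remove columns, as with dict.pop and del.
removed = df.pop("a")   # returns the removed column as a Series
del df["b"]

has_c = "c" in df       # membership test checks the column labels

col = df["c"]           # a DataFrame holds Series ...
first = col.iloc[0]     # ... and a Series holds scalars
```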
Also, we would like sensible default behaviors for the common API functions which take into account the
typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-
dimensional data, a burden is placed on the user to consider the orientation of the data set when writing
functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters
for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a
particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the
amount of mental effort required to code up data transformations in downstream functions.
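To make the contrast concrete, here is a small sketch assuming a hypothetical data set with dates along the rows and two made-up tickers ("AAA", "BBB") along the columns. With a bare ndarray the caller has to remember which axis is which; with a DataFrame the default reduction is per column and the result carries the labels.

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# ndarray: the caller must know that axis 0 runs over dates.
means_np = values.mean(axis=0)

# DataFrame: the orientation is recorded in the labels.
df = pd.DataFrame(
    values,
    index=pd.date_range("2019-01-01", periods=3, freq="D"),
    columns=["AAA", "BBB"],
)
means = df.mean()          # default: one mean per column, labeled
aaa_mean = means["AAA"]    # look the result up by name, not position

# If the table is stored transposed, the axis can be named
# explicitly and the labeled result is the same.
means_t = df.T.mean(axis="columns")
```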
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows)
and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results
in more readable code:
for col in df.columns:
    series = df[col]
    # do something with series
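A self-contained variant of the loop above, assuming a small made-up DataFrame. The column names and the choice of sum and max as the "something" done with each Series are illustrative only. DataFrame.items() is an alternative that yields (label, Series) pairs directly.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.0, 2.0, 3.0],
    "weight": [10.0, 20.0, 30.0],
})

# Iterate over the column labels and look each column up by name.
column_totals = {}
for col in df.columns:
    series = df[col]
    column_totals[col] = series.sum()

# Equivalent iteration using (label, Series) pairs.
column_maxima = {label: series.max() for label, series in df.items()}
```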
12 Chapter 3. Getting started