pandas 0.25.2 Official Reference Manual: A Powerful Tool for Python Data Analysis


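"pandas official reference manual, version 0.25.2, released by Wes McKinney and the PyData Development Team on October 23, 2019."

pandas is a powerful open-source data processing library for the Python programming language, released under the BSD license. It provides high-performance, easy-to-use data structures and data analysis tools, and covers tasks such as data cleaning, preprocessing, statistical analysis and data visualization.

pandas 0.25.2 contains the following key updates and fixes:

1. **Python 3.8 compatibility**: this release adds compatibility with Python 3.8 (GH28147).
2. **Bug fixes**:
   - **Indexing**: fixed a regression in DataFrame.reindex() where the limit parameter was not respected (GH28631), so reindexing once again honors the limit the user passes.
   - **RangeIndex**: fixed RangeIndex.get_indexer() for a decreasing RangeIndex, where target values could be wrongly reported as missing or present.

The core data structures of pandas are DataFrame, Series and Index. A DataFrame is a two-dimensional tabular structure whose columns can hold different types of data (integers, strings, floats and so on) and which carries both column names and a row index. A Series is similar to a one-dimensional array and always has an associated index. Index objects underlie both structures and are used to label and access the data.

pandas offers data import and export (CSV, Excel, SQL databases and others), data cleaning (missing-value handling, type conversion), data combination (join, merge), time series analysis, and grouping and aggregation. It also supports statistical work such as descriptive statistics and time series analysis.

For machine learning, pandas is a key tool for preparing data. Users can clean data, handle missing values, convert formats and derive the variables needed for feature engineering. Combined with libraries such as Scikit-learn, it supports a complete analysis and modeling workflow. Conditional filtering, slicing and selection of subsets, and efficient merging of multiple data sources are also available, which makes pandas a central part of the Python data science ecosystem.

In short, pandas is a capable data processing library, and the 0.25.2 release improves its stability and compatibility for data analysts and machine learning engineers.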

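The two bug fixes listed in the summary can be illustrated with a minimal sketch. The frame `df`, its index values and the names `filled`, `idx` and `positions` are hypothetical; the snippet only shows the intended behavior of `limit` in `DataFrame.reindex()` and of `get_indexer()` on a decreasing `RangeIndex`, not the code changed in GH28631.

```python
import pandas as pd

# Hypothetical two-row frame with a gap in its integer index.
df = pd.DataFrame({"a": [1.0, 2.0]}, index=[0, 5])

# Forward-fill onto a denser index, filling at most one consecutive new row.
# With limit respected, label 1 is filled and labels 2, 3, 4 stay NaN.
filled = df.reindex(range(6), method="ffill", limit=1)
print(filled)

# A decreasing RangeIndex holding 10, 9, ..., 1.
idx = pd.RangeIndex(10, 0, -1)

# get_indexer returns the position of each target label, or -1 if absent.
positions = idx.get_indexer([10, 9, 1])
print(positions)
```

Here `limit=1` caps the number of consecutive labels that may be filled from the same source row, which is why only one of the four new labels receives a value.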
pandas: powerful Python data analysis toolkit, Release 0.25.2

• Intuitive merging and joining data sets

• Flexible reshaping and pivoting of data sets

• Hierarchical labeling of axes (possible to have multiple labels per tick)

• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format

• Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
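Several of the features listed above can be sketched in a few lines. All frames, column names and values below are hypothetical and chosen only for illustration; IO tools and HDF5 are left out because they need files on disk.

```python
import numpy as np
import pandas as pd

# Merging and joining: keep only keys present in both frames.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})
merged = left.merge(right, on="key", how="inner")

# Reshaping and pivoting: long format to one column per city.
long_df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
    "city": ["X", "Y", "X", "Y"],
    "temp": [1.0, 2.0, 3.0, 4.0],
})
wide = long_df.pivot(index="date", columns="city", values="temp")

# Hierarchical labeling: two index levels on one axis.
levels = pd.MultiIndex.from_product([["A", "B"], [1, 2]], names=["group", "n"])
s = pd.Series(np.arange(4), index=levels)
group_b = s.loc["B"]  # selects the sub-series under the outer label "B"

# Time series: date range generation, moving window statistics, shifting.
days = pd.date_range("2019-10-01", periods=5, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=days)
rolling_mean = ts.rolling(window=2).mean()
lagged = ts.shift(1)
```

The first entry of both `rolling_mean` and `lagged` is NaN, because no earlier observation exists to fill the window or the lag.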

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes

• pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.

• pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.

• pandas has been used extensively in production in financial applications.

3.1.1 Data structures

Dimensions  Name       Description

1           Series     1D labeled homogeneously-typed array

2           DataFrame  General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Why more than one data structure?

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
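The container view can be sketched as follows. The frame and its column names `a` and `b` are hypothetical; the point is only that columns go in and out of a DataFrame much as keys go in and out of a dict.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Insert a new column with dictionary-style assignment.
df["b"] = df["a"] * 10

# A DataFrame holds Series; a Series holds scalars.
col = df["a"]        # one column, returned as a Series
value = col.loc[0]   # one element of that Series, a scalar

# Remove a column in dictionary style; pop returns the removed Series.
removed = df.pop("b")
```

`del df["b"]` would remove the column as well, without returning it.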

Also, we would like sensible default behaviors for the common API functions which take into account the typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in downstream functions.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows)

and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results

in more readable code:

for col in df.columns:
    series = df[col]
    # do something with series
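A runnable version of this loop might look like the sketch below. The frame and its column names `height` and `weight` are hypothetical; "do something" is filled in with a per-column mean purely as an example.

```python
import pandas as pd

# Hypothetical frame with two labeled columns.
df = pd.DataFrame({"height": [1.5, 1.8], "weight": [60.0, 80.0]})

col_means = {}
for col in df.columns:
    series = df[col]                  # each column comes back as a Series
    col_means[col] = series.mean()    # placeholder for "do something"

print(col_means)
```

The loop never mentions axis 0 or axis 1; the column labels carry the meaning.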




3.1.2 Mutability and copying of data

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-

mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a

DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched.

In general we like to favor immutability where sensible.
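A small sketch of these points, using a hypothetical one-column frame: values can be changed in place, a column can be inserted, and a typical method such as `sort_values` returns a new object instead of modifying its input.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

# Value-mutable: overwrite a single cell in place.
df.loc[0, "a"] = 30

# Columns can be inserted into an existing DataFrame.
df["b"] = [1, 2, 3]

# Most methods return a new object and leave the input untouched.
sorted_df = df.sort_values("a")

print(df["a"].tolist())         # original row order is unchanged
print(sorted_df["a"].tolist())  # sorted copy
```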

3.1.3 Getting support

The rst stop for pandas issues and ideas is the Github Issue Tracker. If you have a general question, pandas

community experts can answer through Stack Overow.

3.1.4 Community

pandas is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible. Thanks to all of our contributors.

If you’re interested in contributing, please visit the contributing guide.

pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as

a world-class open-source project, and makes it possible to donate to the project.

3.1.5 Project governance

The governance process that pandas project has used informally since its inception in 2008 is formalized in

Project Governance documents. The documents clarify how decisions are made and how the various elements

of our community interact, including the relationship between open source collaborative development and

work that may be funded by for-profit or non-profit entities.

Wes McKinney is the Benevolent Dictator for Life (BDFL).

3.1.6 Development team

The list of the Core Team members and more detailed information can be found on the people’s page of the

governance repo.

3.1.7 Institutional partners

The information about current institutional partners can be found on pandas website page.

3.1.8 License

BSD 3-Clause License

,→Development Team


Redistribution and use in source and binary forms, with or without

modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this

list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,

this list of conditions and the following disclaimer in the documentation

and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its

contributors may be used to endorse or promote products derived from

this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"

AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE

IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE

FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL

DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR

SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER

CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,

OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE

OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

3.2 10 minutes to pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex

recipes in the Cookbook.

Customarily, we import as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

3.2.1 Object creation

See the Data Structure Intro section.

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s

Out[4]:

0 1.0

1 3.0
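The printed output is cut off in this excerpt. As a self-contained sketch of the same step, the snippet below rebuilds the Series and inspects the properties that follow from the input list: pandas assigns a default integer index, and the presence of `np.nan` makes the values floating point.

```python
import numpy as np
import pandas as pd

# Same values as in the example above.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.index)   # default integer index covering positions 0 to 5
print(s.dtype)   # floating point, because NaN cannot be stored as an integer
print(s.isna())  # True only at the position of the missing value
```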
