Exploring the Foundations of Scientific Computing: Advanced Applications with Elegant SciPy


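Elegant SciPy is a book by Juan Nunez-Iglesias, Stéfan van der Walt, and Harriet Dashnow. It explores the NumPy array, a key tool in scientific computing, and shows how to use SciPy for advanced data processing and analysis. The book teaches you how to:

1. **Understand NumPy arrays**: As the foundation of numerical scientific computing, NumPy arrays provide efficient multidimensional array operations and mathematical functions for handling large amounts of data.
2. **Apply quantile normalization**: This technique forces measurements to follow a specific distribution, which matters for data preprocessing and for comparing samples in experimental design and data analysis.
3. **Represent image regions**: The book introduces the Region Adjacency Graph (RAG) for distinguishing and representing the regions of an image, which supports image analysis and feature extraction.
4. **Use the fast Fourier transform (FFT)**: You learn how to convert temporal or spatial data to the frequency domain, which is central to signal processing and spectral analysis.
5. **Solve sparse matrix problems**: SciPy's sparse module provides tools for the large sparse matrices that arise in applications such as image segmentation.
6. **Apply linear algebra**: The authors show how to use SciPy's packages for matrix computations, including solving linear systems and finding eigenvalues and eigenvectors.
7. **Align (register) images**: The optimization module provides methods for image registration problems, which are especially important in medical imaging and computer vision.
8. **Process streaming data in Python**: Python's streaming tools and the Toolz library make it possible to work through large data sets with online, real-time analysis.

Copyright and publication information: Elegant SciPy is published and copyrighted by O'Reilly Media, and may be purchased for educational, business, or sales promotional use. Electronic editions are also available. The book credits its editors, production editor, proofreader, and indexer.

The book suits not only researchers and engineers but also any developer or data analyst who wants a deeper understanding of scientific computing in Python. Readers gain both practical experience and theoretical background that they can apply to real projects.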

# Create a "slice" of x
y = x[:2]
print(y)

[1 2]

# Set the first element of y to be 6
y[0] = 6
print(y)

[6 2]

Notice that although we edited y, x has also changed, because y was referencing the same data!

# Now the first element in x has changed to 6!
print(x)

[6 2 3]

This does mean you have to be careful with array references. If you want to manipulate the data without touching the original, it’s easy to make a copy:

y = np.copy(x[:2])
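
If you are ever unsure whether two arrays share data, NumPy can tell you. The following is a minimal sketch, assuming x starts out as the three-element array implied by the printed output above; the names view and copy are ours, chosen only for illustration.

```python
import numpy as np

# Assumed starting value, inferred from the printed output in the text
x = np.array([1, 2, 3])

view = x[:2]           # basic slicing returns a view onto x's data
copy = np.copy(x[:2])  # np.copy allocates new, independent memory

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(x, view))
print(np.shares_memory(x, copy))

copy[0] = 6  # changes only the copy; x is untouched
x_after_copy_edit = x.copy()

view[0] = 6  # writes through to x
print(x)
```

Editing the copy leaves x alone, while editing the view changes x, which is the behavior described above.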

Vectorization

Earlier we talked about the speed of operations on arrays. One of the tricks NumPy uses to speed things up is vectorization. Vectorization is where you apply a calculation to each element in an array, without having to use a for loop. In addition to speeding things up, this can result in more natural, readable code. Let’s look at some examples.

x = np.array([1, 2, 3, 4])
print(x * 2)

[2 4 6 8]

Here, we have x, an array of 4 values, and we have implicitly multiplied every element in x by 2, a single value.

y = np.array([0, 1, 2, 1])
print(x + y)

[1 3 5 5]

Now, we have added each element in x to its corresponding element in y, an array of the same shape.

Both of these operations are simple and, we hope, intuitive examples of vectorization. NumPy also makes them very fast, much faster than iterating over the arrays manually. (Feel free to play with this yourself using the %%timeit IPython magic.)
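
Outside IPython, the standard library's timeit module gives a similar comparison. This is a rough sketch rather than a benchmark: the array size, the repeat count, and the helper names double_with_loop and double_vectorized are arbitrary choices of ours, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

x = np.arange(100_000)


def double_with_loop(arr):
    """Double every element one at a time with an explicit Python loop."""
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = arr[i] * 2
    return out


def double_vectorized(arr):
    """Double every element with a single vectorized expression."""
    return arr * 2


# Both approaches must produce the same result
same_result = np.array_equal(double_with_loop(x), double_vectorized(x))

# Total time for 5 calls of each version
loop_time = timeit.timeit(lambda: double_with_loop(x), number=5)
vec_time = timeit.timeit(lambda: double_vectorized(x), number=5)
print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```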

14 | Chapter 1: Elegant NumPy: The Foundation of Scientific Python

This was accomplished by NumPy’s broadcasting rules, which implicitly expand dimensions of size 1 in one array to match the corresponding dimension of the other array. Don’t worry, we will talk about these rules in more detail later in this chapter.

As we will see in the rest of the chapter, as we explore real data, broadcasting is extremely valuable for performing real-world calculations on arrays of data. It allows us to express complex operations concisely and efficiently.
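
To make the idea concrete, here is a small sketch of broadcasting on a made-up table of counts. The numbers and the variable names are invented for illustration and are not taken from the data set used in this chapter.

```python
import numpy as np

# Made-up values: 3 genes (rows) by 3 samples (columns)
counts = np.array([[10.0, 20.0, 30.0],
                   [0.0, 5.0, 15.0],
                   [90.0, 75.0, 55.0]])

# Total counts per sample: a 1D array of shape (3,)
col_totals = counts.sum(axis=0)
print(counts.shape, col_totals.shape)

# (3, 3) / (3,): the 1D array is treated as shape (1, 3) and
# stretched down the rows, so each column is divided by its own total
fractions = counts / col_totals

# Mean count per gene, reshaped to a column of shape (3, 1)
row_means = counts.mean(axis=1)[:, np.newaxis]

# (3, 3) - (3, 1): the size-1 column dimension is stretched across
# the columns, so each row has its own mean subtracted
centered = counts - row_means
print(fractions.sum(axis=0))
print(centered.mean(axis=1))
```

Neither operation needs an explicit loop or a manually tiled array; the size-1 dimensions are expanded implicitly.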

Exploring a gene expression data set

The data set that we’ll be using is an RNAseq experiment of skin cancer samples from The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/). In Chapter 2 we will be using this gene expression data to predict mortality in skin cancer patients, reproducing a simplified version of Figures 5A and 5B of a paper from the TCGA consortium. But first we need to get our heads around the biases in our data, and think about how we could improve it.

Downloading the data

[Links to data!]

We’re first going to use Pandas to read in the table of counts. Pandas is a Python library for data manipulation and analysis, with particular emphasis on tabular and time series data. Here, we will use it to read in tabular data of mixed type. It uses the DataFrame type, which is a flexible tabular format based on the data frame object in R. For example, the data we will read has a column of gene names (strings) and multiple columns of counts (integers), so reading it into a homogeneous array of numbers would be the wrong approach. Although NumPy has some support for mixed data types (called “structured arrays”), it is not primarily designed for this use case, which makes subsequent operations harder than they need to be.

By reading the data in as a Pandas DataFrame we can let Pandas do all the parsing, then extract out the relevant information and store it in a more efficient data type. Here we are just using Pandas briefly to import data. In later chapters we will give you some more insight into the world of Pandas.
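
As a rough illustration of that parse-then-extract pattern, the sketch below reads a tiny made-up table from a string instead of a file. The gene and sample names are hypothetical, and this is only one possible way to pull NumPy arrays out of a DataFrame, not necessarily the approach taken later in the chapter.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for a counts file: gene names in the first column,
# one column of integer counts per sample
csv_text = """,sample_a,sample_b
GENE1,10,0
GENE2,3,7
"""

# Let pandas do the parsing, using the first column as the row index
table = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep the labels separately, and the numbers as a homogeneous array
gene_names = np.array(table.index)
sample_names = list(table.columns)
counts = table.to_numpy()

print(counts.dtype, counts.shape)
```

The strings end up in their own containers, and counts is a plain 2D integer array that NumPy can operate on efficiently.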

import numpy as np
import pandas as pd

# Import TCGA melanoma data
filename = 'data/counts.txt'
with open(filename, 'rt') as f:
    data_table = pd.read_csv(f, index_col=0)  # Parse file with pandas

print(data_table.iloc[:5, :5])


       00624286-41dd-476f-a63b-d2a5f484bb45  TCGA-FS-A1Z0  TCGA-D9-A3Z1  \
A1BG                                1272.36        452.96        288.06
A1CF                                   0.00          0.00          0.00
A2BP1                                  0.00          0.00          0.00
A2LD1                                164.38        552.43        201.83
A2ML1                                 27.00          0.00          0.00

       02c76d24-f1d2-4029-95b4-8be3bda8fdbe  TCGA-EB-A51B
A1BG                                 400.11        420.46
A1CF                                   1.00          0.00
A2BP1                                  0.00          1.00
A2LD1                                165.12         95.75
A2ML1                                  0.00          8.00

We can see that Pandas has kindly pulled out the header row and used it to name the columns. The first column gives the name of the gene, and the remaining columns represent individual samples.

We will also need some corresponding metadata, including the sample information and the gene lengths.

# Sample names

samples = list(data_table.columns)

We will need some information about the lengths of the genes for our normalization. So that we can take advantage of some fancy pandas indexing, we’re going to set the index of the pandas table to be the gene names in the first column.

# Import gene lengths
filename = 'data/genes.csv'
with open(filename, 'rt') as f:
    # Parse file with pandas, index by GeneSymbol
    gene_info = pd.read_csv(f, index_col=0)

print(gene_info.iloc[:5, :])

            GeneID  GeneLength
GeneSymbol
CPA1          1357        1724
GUCY2D        3000        3623
UBC           7316        2687
C11orf95     65998        5581
ANKMY2       57037        2611

Let’s check how well our gene length data matches up with our count data.

print("Genes in data_table: ", data_table.shape[0])

print("Genes in gene_info: ", gene_info.shape[0])

Genes in data_table: 20500

Genes in gene_info: 20503

There are more genes in our gene length data than were actually measured in the experiment. Let’s filter so we only get the relevant genes, and we want to make sure they are in the same order as in our count data. This is where pandas indexing comes

