This was accomplished by NumPy’s broadcasting rules, which implicitly expand
dimensions of size 1 in one array to match the corresponding dimension of the other
array. Don’t worry, we will talk about these rules in more detail later in this chapter.
As we will see in the rest of the chapter, once we start exploring real data, broadcasting is
extremely valuable for performing real-world calculations on arrays of data. It allows us
to express complex operations concisely and efficiently.
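As a minimal sketch of the idea, the example below scales a small table in two directions. The array names and values are made up for illustration and are not taken from the data set used in this chapter.

```python
import numpy as np

# Hypothetical toy data: 3 "genes" (rows) by 2 "samples" (columns).
counts = np.array([[10, 20],
                   [30, 40],
                   [50, 60]])

# One scaling factor per sample, shape (2,).
# It lines up with the last axis of counts, so it is applied to every row.
sample_scale = np.array([1, 10])
scaled_by_sample = counts * sample_scale

# One scaling factor per gene, stored as a column of shape (3, 1).
# The size-1 axis is implicitly expanded to match the 2 columns of counts.
gene_scale = np.array([[1], [2], [3]])
scaled_by_gene = counts * gene_scale

print(scaled_by_sample)
print(scaled_by_gene)
```

In both cases the result has the shape of `counts`, and no explicit loop or copy of the smaller array is needed.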
Exploring a gene expression data set
The data set that we’ll be using is an RNAseq experiment of skin cancer samples from
The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/). In Chapter 2
we will be using this gene expression data to predict mortality in skin cancer
patients, reproducing a simplified version of Figures 5A and 5B of a paper from the
TCGA consortium. But first we need to get our heads around the biases in our data,
and think about how we could improve it.
Downloading the data
[Links to data!]
We’re first going to use Pandas to read in the table of counts. Pandas is a Python
library for data manipulation and analysis, with particular emphasis on tabular and
time series data. Here, we will use it to read in tabular data of mixed type. Pandas uses
the DataFrame type, which is a flexible tabular format based on the data frame object
in R. For example, the data we will read has a column of gene names (strings) and
multiple columns of counts (integers), so reading it into a homogeneous array of
numbers would be the wrong approach. Although NumPy has some support for
mixed data types (called “structured arrays”), it is not primarily designed for this use
case, which makes subsequent operations harder than they need to be.
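To make the point about mixed types concrete, here is a small sketch with an invented three-gene table (the gene and sample names are hypothetical). It compares the per-column types of a DataFrame with what happens when the whole table is forced into a single NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table: gene names (strings) plus integer counts.
toy_table = pd.DataFrame({
    'gene': ['geneA', 'geneB', 'geneC'],
    'sample1': [5, 0, 12],
    'sample2': [7, 3, 9],
})

# Each DataFrame column keeps its own type.
print(toy_table.dtypes)

# Converting the whole table to one array needs a single common type,
# so NumPy falls back to generic Python objects.
as_array = toy_table.to_numpy()
print(as_array.dtype)

# Selecting only the numeric columns gives a proper integer array.
count_array = toy_table[['sample1', 'sample2']].to_numpy()
print(count_array.dtype)
```

The object array still holds the values, but it loses the fast numeric operations that make NumPy worthwhile, which is why the counts are better separated from the gene names.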
By reading the data in as a Pandas DataFrame we can let Pandas do all the parsing,
then extract the relevant information and store it in a more efficient data type.
Here we are just using Pandas briefly to import data. In later chapters we will give you
some more insight into the world of Pandas.
import numpy as np
import pandas as pd
# Import TCGA melanoma data
filename = 'data/counts.txt'
with open(filename, 'rt') as f:
    data_table = pd.read_csv(f, index_col=0) # Parse file with pandas
print(data_table.iloc[:5, :5])
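The following sketch shows what `index_col=0` does and how the parsed table might then be split into plain NumPy arrays. It uses a tiny in-memory table as a stand-in for the real file; the gene names, sample names, and variable names are all hypothetical, and this is one possible way to extract the data rather than necessarily the approach taken later in the chapter.

```python
import io
import pandas as pd

# Hypothetical stand-in for a counts file: the first column holds gene names.
toy_csv = io.StringIO(
    "gene,sampleA,sampleB\n"
    "geneX,4,8\n"
    "geneY,15,16\n"
)

# index_col=0 turns the first column into row labels instead of data.
toy_table = pd.read_csv(toy_csv, index_col=0)

# Pull the labels and the numbers out into separate NumPy arrays.
gene_names = toy_table.index.to_numpy()
sample_names = toy_table.columns.to_numpy()
toy_counts = toy_table.to_numpy()

print(gene_names, sample_names)
print(toy_counts.shape, toy_counts.dtype)
```

Because the gene names live in the index, the remaining columns are all integers, so `toy_counts` comes out as a homogeneous numeric array.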
16 | Chapter 1: Elegant NumPy: The Foundation of Scientific Python