Mastering Bioinformatics Data Skills: Toward Reproducible and Robust Research

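Bioinformatics Data Skills is a practical guide that gives scientists the skills they need to work with large sequencing datasets so that their biological findings are reproducible and robust. Traditional bioinformatics textbooks tend to focus on algorithms and theory; this book is practice-oriented, covering the techniques, tools, and best practices of genomic analysis with an emphasis on data-driven methods. The author concentrates on modern data-processing skills rather than outdated theoretical concepts, so readers can keep up with a fast-moving field. The book is divided into three parts:

1. Ideology: Part I discusses why data skills are the foundation of robust and reproducible bioinformatics analysis. It includes Chapter 1, "How to Learn Bioinformatics," which helps readers understand the nature of the discipline and approach learning it with the right attitude.

2. Prerequisites: Part II covers the fundamentals, including how to set up and manage a bioinformatics project (project organization and version control), strengthening Unix shell skills, working with remote machines, using Git in scientific work, and understanding bioinformatics data in depth.

3. Practice: Part III covers a wide range of hands-on skills: Unix data tools, the basics of the R language, working with different types of data (range data, sequence data, and alignment data), writing bioinformatics scripts, building workflows and running tasks in parallel, and out-of-memory approaches for large datasets (such as Tabix and SQLite).

The book not only teaches readers how to carry out bioinformatics analyses but also stresses the importance of choosing and applying the tool best suited to the job, helping them grow into bioinformaticians who can solve complex problems. Graduate students, postdocs, teachers, and enthusiasts alike can gain valuable practical experience for doing research at a time when the life sciences depend heavily on data skills. Reasonably priced, it is a valuable resource for readers who want to use open source tools to make their research reproducible and robust.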

this new data. Bioinformatics Data Skills is written to provide you with training in these core tools and help you develop these same skills.

The Approach of This Book

Many biologists starting out in bioinformatics tend to equate “learning bioinformatics” with “learning how to run bioinformatics software.” This is an unfortunate and misinformed idea of what bioinformaticians actually do. This is analogous to thinking “learning molecular biology” is just “learning pipetting.” Other than a few simple examples used to generate data in Chapter 11, this book doesn’t cover running bioinformatics software like aligners, assemblers, or variant callers. Running bioinformatics software isn’t all that difficult, doesn’t take much skill, and it doesn’t embody any of the significant challenges of bioinformatics. I don’t teach how to run these types of bioinformatics applications in Bioinformatics Data Skills for the following reasons:

• It’s easy enough to figure out on your own

• The material would go rapidly out of date as new versions of software or entirely new programs are used in bioinformatics

• The original manuals for this software will always be the best, most up-to-date resource on how to run a program

Instead, the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets. Exploring and extracting information from these datasets is the fun part of bioinformatics research. The goal of Bioinformatics Data Skills is to teach you the computational tools and data skills you need to explore these large datasets as you please. These data skills give you freedom; you’ll be able to look at any bioinformatics data (in any format, and files of any size) and begin exploring data to extract biological meaning.

Throughout Bioinformatics Data Skills, I emphasize working in a robust and reproducible manner. I believe these two qualities, reproducibility and robustness, are too often overlooked in modern computational work. By robust, I mean that your work is resilient against silent errors, confounders, software bugs, and messy or noisy data. In contrast, a fragile approach is one that does not decrease the odds of some type of error adversely affecting your results. By reproducible, I mean that your work can be repeated by other researchers and they can arrive at the same results. For this to be the case, your work must be well documented, and your methods, code, and data all need to be available so that other researchers have the materials to reproduce everything. Reproducibility also relies on your work being robust: if a workflow run on a different machine yields a different outcome, it is neither robust nor fully reproducible. I introduce these concepts in more depth in Chapter 2, and these are themes that reappear throughout the book.
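As one small, concrete illustration of guarding against a silent error, the sketch below computes a SHA-256 checksum of a file so that two copies of the same data (say, on two different machines) can be compared. The function name sha256_of_file is hypothetical, and this is one common technique offered as an assumption about how such a check might look, not a procedure taken from this preface.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks (1 MiB by default) so that large
    data files do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digests computed for the same input file on two machines differ, the copies are not identical, so a difference in results may come from the data rather than from the analysis.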

xiv | Preface

Why This Book Focuses on Sequencing Data

Bioinformatics is a broad discipline, and spans subfields like proteomics, metabolomics, structure bioinformatics, comparative genomics, machine learning, and image processing. Bioinformatics Data Skills focuses primarily on handling sequencing data for a few reasons.

First, sequencing data is abundant. Currently, no other “omics” data is as abundant as high-throughput sequencing data. Sequencing data has broad applications across biology: variant detection and genotyping, transcriptome sequencing for gene expression studies, protein-DNA interaction assays like ChIP-seq, and bisulfite sequencing for methylation studies just to name a few examples. The ways in which sequencing data can be used to answer biological questions will only continue to increase.

Second, sequencing data is terrific for honing your data skills. Even if your goal is to analyze other types of data in the future, sequencing data serves as great example data to learn with. Developing the text-processing skills necessary to work with sequencing data will be applicable to working with many other data types.

Third, other subfields of bioinformatics are much more domain specific. The wide availability and declining costs of sequencing have allowed scientists from all disciplines to use genomics data to answer questions in their systems. In contrast, bioinformatics subdisciplines like proteomics or high-throughput image processing are much more specialized and less widespread. Still, if you’re interested in these fields, Bioinformatics Data Skills will teach you useful computational and data skills that will be helpful in your research.

Audience

In my experience teaching bioinformatics to friends, colleagues, and students of an intensive week-long course taught at UC Davis, most people wishing to learn bioinformatics are either biologists, or computer scientists/programmers. Biologists wish to develop the computational skills necessary to analyze their own data, while the programmers and computer scientists wish to apply their computational skills to biological problems. Although these two groups differ considerably in biological knowledge and computational experience, Bioinformatics Data Skills covers material that should be helpful to both.

If you’re a biologist, Bioinformatics Data Skills will teach you the core data skills you need to work with bioinformatics data. It’s important to note that Bioinformatics Data Skills is not a how-to bioinformatics book; such a book on bioinformatics would quickly go out of date or be too narrow in focus to help the majority of biologists. You will need to supplement this book with knowledge of your specific research and system, as well as the modern statistical and bioinformatics methods that your subfield uses. For example, if your project involves aligning sequencing reads to a reference genome, this book won’t tell you the newest and best alignment software for your particular system. But regardless of which aligner you use, you will need to have a thorough understanding of alignment formats and how to manipulate alignment data, a topic covered in Chapter 11. Throughout this book, these general computational and data skills are meant to be a solid, widely applicable foundation on which the majority of biologists can build.

If you’re a computer scientist or programmer, you are likely already familiar with some of the computational tools I teach in this book. While the material presented in Bioinformatics Data Skills may overlap knowledge you already have, you will still learn about the specific formats, tools, and approaches bioinformaticians use in their work. Also, working through the examples in this book will give you good practice in applying your computational skills to genomics data.

The Difficulty Level of Bioinformatics Data Skills

Bioinformatics Data Skills is designed to be a thorough (and in parts, dense) book. When I started writing this book, I decided the greatest misdeed I could do would be to treat bioinformatics as a subject that’s easier than it truly is. Working as a professional bioinformatician, I routinely saw how very subtle issues could crop up and adversely change the outcome of the analysis had they not been caught. I don’t want your bioinformatics work to be incorrect because I’ve made a topic artificially simple. The depth at which I cover topics in Bioinformatics Data Skills is meant to prepare you to catch similar issues in your own work so your results are robust.

The result is that sections of this book are quite advanced and will be difficult for some readers. Don’t feel discouraged! Like most of science, this material is hard, and may take a few reads before it fully sinks in. Throughout the book, I try to indicate when certain sections are especially advanced so that you can skip over these and return to them later.

Lastly, I often use technical jargon throughout the book. I don’t like using jargon, but it’s necessary to communicate technical concepts in computing. Primarily it will help you search for additional resources and help. It’s much easier to Google successfully for “left outer join” than “data merge where null records are included in one table.”
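To make that piece of jargon concrete, here is a minimal sketch of a left outer join using Python's built-in sqlite3 module. The table names (samples, reads), the column names, and the rows are all hypothetical, invented for illustration only.

```python
import sqlite3


def left_outer_join_demo():
    """Join every sample to its read count, keeping samples with no match."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.executescript("""
        CREATE TABLE samples (sample_id TEXT PRIMARY KEY, tissue TEXT);
        CREATE TABLE reads (sample_id TEXT, n_reads INTEGER);
        INSERT INTO samples VALUES ('s1', 'leaf'), ('s2', 'root'), ('s3', 'stem');
        INSERT INTO reads VALUES ('s1', 1200), ('s3', 950);
    """)
    # Every row of the left table (samples) appears in the result; where
    # reads has no matching sample_id, n_reads comes back as NULL (None).
    rows = conn.execute("""
        SELECT samples.sample_id, samples.tissue, reads.n_reads
        FROM samples
        LEFT OUTER JOIN reads ON samples.sample_id = reads.sample_id
        ORDER BY samples.sample_id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in left_outer_join_demo():
        print(row)
```

Sample s2 has no row in reads, yet it still appears in the result with None in the n_reads column. That is the "null records are included" behavior that the longer description tries to spell out.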

Assumptions This Book Makes

Bioinformatics Data Skills is meant to be an intermediate book on bioinformatics. To make sure everyone starts out on the same foot, the book begins with a few simple chapters. In Chapter 2, I cover the basics of setting up a bioinformatics project, and in Chapter 3 I teach some remedial Unix topics meant to ensure that you have a solid grasp of Unix (because Unix is a large component in later chapters). Still, as an intermediate book, I make a few assumptions about you:

You know a scripting language

This is the biggest assumption of the book. Except for a few Python programs and the R material (R is introduced in Chapter 8), this book doesn’t directly rely on using lots of scripting. However, in learning a scripting language, you’ve already encountered many important computing concepts such as working with a text editor, running and executing programs on the command line, and basic programming. If you do not know a scripting language, I would recommend learning Python while reading this book. Books like Bioinformatics Programming Using Python by Mitchell L. Model (O’Reilly, 2009), Learning Python, 5th Edition, by Mark Lutz (O’Reilly, 2013), and Python in a Nutshell, 2nd Edition, by Alex Martelli (O’Reilly, 2006) are great to get started. If you know a scripting language other than Python (e.g., Perl or Ruby), you’ll be prepared to follow along with most examples (though you will need to translate some examples to your scripting language of choice).

You know how to use a text editor

It’s essential that you know your way around a text editor (e.g., Emacs, Vim, TextMate2, or Sublime Text). Using a word processor (e.g., Microsoft Word) will not work, and I would discourage using text editors such as Notepad or OS X’s TextEdit, as they lack syntax highlighting support for common programming languages.

You have basic Unix command-line skills

For example, I assume you know the difference between a terminal and a shell, understand how to enter commands, what command-line options/flags and arguments are, and how to use the up arrow to retrieve your last entered command. You should also have a basic understanding of the Unix file hierarchy (including concepts like your home directory, relative versus absolute directories, and root directories). You should also be able to move about and manipulate the directories and files in Unix with commands like cd, ls, pwd, mv, rm, rmdir, and mkdir. Finally, you should have a basic grasp of Unix file ownership and permissions, and changing these with chown and chmod. If these concepts are unclear, I would recommend you play around in the Unix command line first (carefully!) and consult a good beginner-level book such as Practical Computing for Biologists by Steven Haddock and Casey Dunn (Sinauer, 2010) or UNIX and Perl to the Rescue by Keith Bradnam and Ian Korf (Cambridge University Press, 2012).

You have a basic understanding of biology

Bioinformatics Data Skills is a BYOB book: bring your own biology. The examples don’t require a lot of background in biology beyond what DNA, RNA, proteins, and genes are, and the central dogma of molecular biology. You should also be familiar with some very basic genetics and genomic concepts (e.g., single nucleotide polymorphisms, genotypes, GC content, etc.). All biological examples in the book are designed to be quite simple; if you’re unfamiliar with any topic, you should be able to quickly skim a Wikipedia article and proceed with the example.
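As an illustration of how simple these concepts are computationally, GC content is the fraction of bases in a DNA sequence that are G or C. The sketch below is a minimal Python version; the function name gc_content is hypothetical, and the choice to count only A, C, G, and T (so that ambiguous bases such as N are ignored) is an assumption made here, not a convention stated in the book.

```python
def gc_content(seq):
    """Return the fraction of G and C among the A, C, G, and T bases in `seq`.

    Case is ignored. Any other character (for example N) is left out of
    both the numerator and the denominator; this is an assumption.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total
```

For example, gc_content("ATGCGC") counts four G or C bases out of six, so it returns a value of about 0.667.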

You have a basic understanding of regular expressions

Occasionally, I’ll make use of regular expressions in this book. In most cases, I try to quickly step through the basics of how a regular expression works so that you can get the general idea. If you’ve encountered regular expressions while learning a scripting language, you’re ready to go. If not, I recommend you learn the basics, not because regular expressions are used heavily throughout the book, but because mastering regular expressions is an important skill in bioinformatics. Introducing Regular Expressions by Michael Fitzgerald (O’Reilly) is a great introduction. Nowadays, writing, testing, and debugging regular expressions is easier than ever thanks to online tools like http://regex101.com and http://www.debuggex.com. I recommend using these tools in your own work and when stepping through my regular expression examples.
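For readers who have not met regular expressions yet, the sketch below steps through one small pattern using Python's re module. The region format ("chr1:1000-2000"), the pattern itself, and the function name parse_region are hypothetical examples chosen for illustration; they are not taken from the book.

```python
import re

# re.VERBOSE lets the pattern span several lines with a comment per piece.
REGION = re.compile(r"""
    ^(?P<chrom>[^:\s]+)   # chromosome name: anything except colon/whitespace
    :                     # a literal colon
    (?P<start>\d+)        # start position: one or more digits
    -                     # a literal hyphen
    (?P<end>\d+)$         # end position: one or more digits, then end of text
""", re.VERBOSE)


def parse_region(text):
    """Split a string like 'chr1:1000-2000' into (chrom, start, end)."""
    match = REGION.match(text)
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))
```

Calling parse_region("chr1:1000-2000") returns ("chr1", 1000, 2000), while a string that does not fit the pattern raises ValueError instead of silently producing a wrong result.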

You know how to get help and read documentation

Throughout this book, I try to minimize teaching information that can be found in manual pages, help documentation, or online. This is for three reasons:

• I want to save space and focus on presenting material in a way you can’t find elsewhere

• Manual pages and documentation will always be the best resource for this information

• The ability to quickly find answers in documentation is one of the most important skills you can develop when learning computing

This last point is especially important; you don’t need to remember all arguments of a command or R function; you just need to know where to find this information. Programmers consult documentation constantly in their work, which is why documentation tools like man (in Unix) and help() (in R) exist.

You can manage your computer system (or have a system administrator)

This book does not teach you system administration skills like setting up a bioinformatics server or cluster, managing user accounts, network security, managing disks and disk space, RAID configurations, data backup, and high-performance computing concepts. There simply isn’t the space to adequately cover these important topics. However, these are all very, very important: if you don’t have a system administrator and need to fill that role for your lab or research group, it’s essential for you to master these skills, too. Frankly, system administration skills take years to master and good sysadmins have incredible patience and experience
