An introductory book on Python data analysis: Foundations for Analytics with Python

Foundations for Analytics with Python (Chinese title: 《Python数据分析基础》), written by Clinton W. Brownley, is a professional introduction to data analysis that first appeared as an early release in 2016. Aimed at newcomers to data analysis in the IT industry, it explains how to use Python, a powerful programming language, to analyze and mine data. The author retains all rights, and copies may be legally obtained and used for educational, business, or sales promotional purposes.

The book is published by O'Reilly Media, based in Sebastopol, California. O'Reilly also offers extensive online resources, and readers can find more related titles through Safari Books Online. Corporate and institutional customers can call 800-998-9938 or email corporate@oreilly.com for more information.

Laurel Ruma and Tim McGovern served as the book's editors, overseeing its planning and quality control, while the production editor, proofreaders, and indexer helped ensure the content's professionalism and completeness. The cover was designed by Karen Montgomery, and the illustrations are by Rebecca Demarest.

The first edition was initially published as an early release on April 5, 2016, and an online errata page (http://oreilly.com/catalog/errata.csp?isbn=0636920038375) lets readers and the author continue to update and improve the content.

Foundations for Analytics with Python is a comprehensive text covering Python fundamentals for data work, including data handling, data cleaning, data visualization, statistical analysis, and an introduction to machine learning. Through the book, readers can systematically learn to use Python tools such as pandas, NumPy, matplotlib, and Scikit-Learn on real data analysis projects. Whether you are a student interested in data science or a professional looking to strengthen your data analysis skills, it is a highly practical reference.

Foundations for Analytics with Python, by Clinton W. Brownley. Copyright © 2016 Clinton Brownley. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

Overview of Chapters

Chapter 1, Python Basics
We'll begin by exploring how to create and run a Python script. This chapter focuses on basic Python syntax and the elements of Python that you need to know for later chapters in the book. For example, we'll discuss basic data types such as numbers and strings and how you can manipulate them. We'll also cover the main data containers (i.e., lists, tuples, and dictionaries) and how you use them to store and manipulate your data, as well as how to deal with dates, as dates often appear in business analysis. This chapter also discusses programming concepts such as control flow, functions, and exceptions, as these are important elements for including business logic in your code and gracefully handling errors. Finally, the chapter explains how to get your computer to read a text file, read multiple text files, and write to a CSV-formatted output file. These are important techniques for accessing input data and retaining specific output data that I expand on in later chapters in the book.

Chapter 2, Comma-Separated Values (CSV) Files
This chapter covers how to read and write CSV files. The chapter starts with an example of parsing a CSV input file “by hand,” without Python's built-in csv module. It transitions to an illustration of potential problems with this method of parsing and then presents an example of how to avoid these potential problems by parsing a CSV file with Python's csv module. Next, the chapter discusses how to use three different types of conditional logic to filter for specific rows from the input file and write them to a CSV output file. Then the chapter presents two different ways to filter for specific columns and write them to the output file. After covering how to read and parse a single CSV input file, we'll move on to discussing how to read and process multiple CSV files. The examples in this section include presenting summary information about each of the input files, concatenating data from the input files, and calculating basic statistics for each of the input files. The chapter ends with a couple of examples of less common procedures, including selecting a set of contiguous rows and adding a header row to the dataset.
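To make the Chapter 2 workflow concrete, here is a minimal sketch of reading a CSV file with Python's built-in csv module, filtering for rows that satisfy a condition, and writing them to an output file. The file names, the 'Cost' column, and the 600.0 threshold are illustrative assumptions, not the book's own example data.

    import csv

    input_file = 'supplier_data.csv'      # hypothetical input path
    output_file = 'filtered_output.csv'   # hypothetical output path

    with open(input_file, 'r', newline='') as csv_in, \
            open(output_file, 'w', newline='') as csv_out:
        reader = csv.reader(csv_in)
        writer = csv.writer(csv_out)
        header = next(reader)                 # keep the header row
        writer.writerow(header)
        cost_index = header.index('Cost')     # assumes a 'Cost' column exists
        for row in reader:
            # Conditional logic: keep only rows whose cost exceeds a threshold.
            if float(row[cost_index]) > 600.0:
                writer.writerow(row)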
Chapter 3, Excel Files
Next, we'll cover how to read Excel workbooks with a downloadable, add-in module called xlrd. This chapter starts with an example of introspecting an Excel workbook (i.e., presenting how many worksheets the workbook contains, the names of the worksheets, and the number of rows and columns in each of the worksheets). Because Excel stores dates as numbers, the next section illustrates how to use a set of functions to format dates so they appear as dates instead of as numbers. Next, the chapter discusses how to use three different types of conditional logic to filter for specific rows from a single worksheet and write them to a CSV output file. Then the chapter presents two different ways to filter for specific columns and write them to the output file. After covering how to read and parse a single worksheet, the chapter moves on to discuss how to read and process all worksheets in a workbook and a subset of worksheets in a workbook. The examples in these sections show how to filter for specific rows and columns in the worksheets. After discussing how to read and parse any number of worksheets in a single workbook, the chapter moves on to review how to read and process multiple workbooks. The examples in this section include presenting summary information about each of the workbooks, concatenating data from the workbooks, and calculating basic statistics for each of the workbooks. The chapter ends with a couple of examples of less common procedures, including selecting a set of contiguous rows and adding a header row to the dataset.

Chapter 4, Databases
Here, we'll cover how to carry out basic database operations in Python. The chapter starts with examples that use Python's built-in sqlite3 module so that you don't have to install any additional software. The examples illustrate how to carry out some of the most common database operations, including creating a database and table, loading data in a CSV input file into a database table, updating records in a table using a CSV input file, and querying a table. When you use the sqlite3 module, the database connection details are slightly different from the ones you would use to connect to other database systems like MySQL, PostgreSQL, and Oracle. To show this difference, the second half of the chapter demonstrates how to interact with a MySQL database system. If you don't already have MySQL on your computer, the first step is to download and install MySQL. From there, the examples mirror the sqlite3 examples, including creating a database and table, loading data in a CSV input file into a database table, updating records in a table using a CSV input file, querying a table, and writing query results to a CSV output file. Together, the examples in the two halves of this chapter provide a solid foundation for carrying out common database operations in Python.
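As a taste of the sqlite3 half of Chapter 4, the sketch below creates a table, loads rows from a CSV file into it, and runs a query. It uses an in-memory database and a hypothetical sales.csv with customer, amount, and date columns; these names are assumptions for illustration, not the book's own schema.

    import csv
    import sqlite3

    # An in-memory SQLite database; pass a file path instead to persist the data.
    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE sales (customer TEXT, amount REAL, date TEXT)')

    # Load rows from a CSV input file into the table (file and columns are hypothetical).
    with open('sales.csv', 'r', newline='') as f:
        rows = [(r['customer'], float(r['amount']), r['date'])
                for r in csv.DictReader(f)]
    con.executemany('INSERT INTO sales VALUES (?, ?, ?)', rows)
    con.commit()

    # Query the table with a parameterized WHERE clause and print the results.
    for customer, amount, date in con.execute(
            'SELECT customer, amount, date FROM sales WHERE amount > ?', (100.0,)):
        print(customer, amount, date)

    con.close()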
Chapter 5, Applications
This chapter contains three examples that demonstrate how to combine techniques presented in earlier chapters to tackle three different problems that are representative of some common data processing and analysis tasks. The first application covers how to find specific records in a large collection of Excel and CSV files. As you can imagine, it's a lot more efficient and fun to have a computer search for the records you need than it is to search for them yourself. Opening, searching in, and closing dozens of files isn't fun, and the task becomes more and more challenging as the number of files increases. Because the problem involves searching through CSV and Excel files, this example utilizes a lot of the material covered in Chapters 2 and 3. The second application covers how to group or “bin” data into unique categories and calculate statistics for each of the categories. The specific example is parsing a CSV file of customer service package purchases that shows when customers paid for particular service packages (i.e., Bronze, Silver, or Gold), organizing the data into unique customer names and packages, and adding up the amount of time each customer spent in each package. The example uses two building blocks, creating a function and storing data in a dictionary, which are introduced in Chapter 1 but aren't used in Chapters 2, 3, and 4. It also introduces another new technique: keeping track of the previous row you processed and the row you're currently processing, in order to calculate a statistic based on values in the two rows. These two techniques, grouping or binning data with a dictionary and keeping track of the current row and the previous row, are very powerful capabilities that enable you to handle many common analysis tasks that involve events over time. The third application covers how to parse a text file, group or bin data into categories, and calculate statistics for the categories. The specific example is parsing a MySQL error log file, organizing the data into unique dates and error messages, and counting the number of times each error message appeared on each date. The example reviews how to parse a text file, a technique that briefly appears in Chapter 1. The example also shows how to store information separately in both a list and a dictionary in order to create the header row and the data rows for the output file. This is a reminder that you can parse text files with basic string operations and another good example of how to use a nested dictionary to group or bin data into unique categories.

Chapter 6, Figures and Plots
In this chapter, you'll learn how to create common statistical graphs and plots in Python with four plotting libraries: matplotlib, pandas, ggplot, and seaborn. The chapter begins with matplotlib because it's a long-standing package with lots of documentation (in fact, pandas and seaborn are built on top of matplotlib). The matplotlib section illustrates how to create histograms and bar, line, scatter, and box plots. The pandas section discusses some of the ways pandas simplifies the syntax you need to create these plots and illustrates how to create them with pandas. The ggplot section notes the library's historical relationship with R and the Grammar of Graphics and illustrates how to use ggplot to build some common statistical plots. Finally, the seaborn section discusses how to create standard statistical plots as well as plots that would be more cumbersome to code in matplotlib.

Chapter 7, Descriptive Statistics and Modeling
Here, we'll look at how to produce standard summary statistics and estimate regression and classification models with the pandas and statsmodels packages. pandas has functions for calculating measures of central tendency (e.g., mean, median, and mode), as well as for calculating dispersion (e.g., variance and standard deviation). It also has functions for grouping data, which makes it easy to calculate these statistics for different groups of data. The statsmodels package has functions for estimating many types of regression and classification models. The chapter illustrates how to build multivariate linear regression and logistic classification models based on data in pandas DataFrames and then use the models to predict output values for new input data.
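To illustrate the kind of analysis Chapter 7 describes, here is a small sketch that uses pandas for summary statistics and statsmodels for a linear regression, followed by predictions for new input data. The DataFrame values and column names (sales, ad_spend, region) are made up for the example and are not taken from the book.

    import pandas as pd
    import statsmodels.formula.api as smf

    # A small, made-up dataset standing in for real business data.
    df = pd.DataFrame({
        'sales':    [210.5, 340.0, 295.3, 410.8, 380.2, 265.7],
        'ad_spend': [12.0, 25.5, 18.3, 30.1, 27.6, 15.9],
        'region':   ['east', 'west', 'east', 'west', 'east', 'west'],
    })

    # Descriptive statistics, overall and by group.
    print(df['sales'].describe())
    print(df.groupby('region')['sales'].mean())

    # Estimate a linear regression with the statsmodels formula API and
    # predict output values for new input data.
    model = smf.ols('sales ~ ad_spend', data=df).fit()
    print(model.summary())
    print(model.predict(pd.DataFrame({'ad_spend': [20.0, 35.0]})))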
Chapter 8, Scheduling Scripts to Run Automatically
This chapter covers how to schedule your scripts to run automatically on a routine basis on both Windows and macOS. Until this chapter, we ran the scripts manually on the command line. Running a script manually on the command line is convenient when you're debugging the script or running it on an ad hoc basis. However, it can be a nuisance if your script needs to run on a routine basis (e.g., daily, weekly, monthly, or quarterly), or if you need to run lots of scripts on a routine basis. On Windows, you create scheduled tasks to run scripts automatically on a routine basis. On macOS, you create cron jobs, which perform the same actions. This chapter includes several screenshots to show you how to create and run scheduled tasks and cron jobs. By scheduling your scripts to run on a routine basis, you don't ever forget to run a script and you can scale beyond what's possible when you're running scripts manually on the command line.

Chapter 9, Where to Go from Here
The final chapter covers some additional built-in and add-in Python modules and functions that are important for data processing and analysis tasks, as well as some additional data structures that will enable you to efficiently handle a variety of complex programming problems you may run into as you move beyond the topics covered in this book. Built-ins are bundled into the Python installation, so they are immediately available to you when you install Python. The built-in modules discussed in this chapter include collections, random, statistics, itertools, and operator. The built-in functions include enumerate, filter, reduce, and zip. Add-in modules don't come with the Python installation, so you have to download and install them separately. The add-in modules discussed in this chapter include NumPy, SciPy, and Scikit-Learn. We also take a look at some additional data structures that can help you store, process, or analyze your data more quickly and efficiently, such as stacks, queues, trees, and graphs.
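As a quick taste of the built-ins mentioned in Chapter 9, the sketch below uses collections.Counter and collections.defaultdict together with enumerate and zip on a few hypothetical (date, error message) pairs, loosely echoing the log-parsing application from Chapter 5. The sample data is invented for illustration only.

    from collections import Counter, defaultdict

    # Hypothetical (date, error message) pairs for illustration only.
    events = [
        ('2016-02-01', 'Too many connections'),
        ('2016-02-01', 'Out of memory'),
        ('2016-02-02', 'Too many connections'),
        ('2016-02-02', 'Too many connections'),
    ]

    # Counter tallies how often each message appears overall.
    print(Counter(message for _, message in events))

    # defaultdict groups messages by date without explicit key checks.
    by_date = defaultdict(list)
    for date, message in events:
        by_date[date].append(message)

    # zip(*...) transposes the pairs; enumerate numbers the unique dates.
    dates, messages = zip(*events)
    print(len(set(messages)), 'distinct error messages')
    for i, date in enumerate(sorted(set(dates)), start=1):
        print(i, date, len(by_date[date]))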