Python Scikit-learn: A Practical Guide to Machine Learning

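Updated 2024-07-19 · PDF, 39.87 MB

Scikit-learn is a Python machine learning library that provides a wide range of supervised and unsupervised learning algorithms, together with tools for data preprocessing, model selection, and evaluation. It is released under the open-source BSD license and is maintained and developed by community volunteers. Its detailed documentation and tutorials make it suitable for beginners and professionals alike.

The library's design goal is to offer simple and efficient tools for data analysis. It covers classification, regression, clustering, dimensionality reduction, and model selection. Since David Cournapeau started the project in 2007, scikit-learn has become an important tool for data scientists and machine learning engineers. Installation is usually straightforward and can be done with pip, Python's package manager. The tutorials then lead step by step from basic machine learning concepts to advanced applications, ranging from simple linear models to more advanced algorithms (deep learning is outside the library's scope; see section 1.2.16), and cover dataset loading and preprocessing.

The scikit-learn user guide is divided into several main parts:

1. **Welcome to scikit-learn**: how to install the library and get support, plus community resources and project history.
2. **scikit-learn tutorials**: step-by-step guides that explain basic machine learning concepts such as classification and regression and show how to apply the algorithms to real problems.
3. **User guide**: a detailed description of each module, including supervised learning (SVMs, decision trees, random forests, etc.), unsupervised learning (K-Means, DBSCAN, etc.), model selection and evaluation, dataset transformation and loading utilities, and strategies for large-scale data and performance optimization.
4. **General examples**: how to combine different scikit-learn features, for example concatenating feature extraction methods, building a Pipeline to chain several processing steps, and plotting cross-validated predictions.

With scikit-learn, users can implement feature extraction, feature selection, model training, validation, and tuning. It also integrates well with other Python libraries such as NumPy, Pandas, and Matplotlib, which makes data handling and visualization more convenient. Scikit-learn is widely used in academic research and in industry, reportedly including large companies such as Google, Amazon, and Microsoft, so it is a valuable skill for anyone working in data analysis and machine learning.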

scikit-learn user guide, Release 0.20.dev0

when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in

master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).

But in the end the real culprit is Python’s multiprocessing, which does fork without exec to reduce the overhead of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX standard, and therefore some software vendors like Apple refuse to consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods (instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However, the user should be aware that using the ‘forkserver’ method prevents joblib.Parallel from calling functions defined interactively in a shell session.

If you have custom code that uses multiprocessing directly instead of using it via joblib, you can enable the ‘forkserver’ mode globally for your program by inserting the following instructions in your main script:

import multiprocessing

# other imports, custom code, load data, define model...

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')

    # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the multiprocessing documentation.
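Not every platform offers every start method, so a script that hard-codes ‘forkserver’ may fail where it is unavailable. The following is a minimal sketch, not part of the original guide, of one way to pick a supported method using only the standard library. The helper name `pick_start_method` and the preference order are assumptions made for illustration.

```python
import multiprocessing


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method supported on this platform.

    Hypothetical helper: the name and the preference order are illustrative.
    """
    # get_all_start_methods() lists the supported methods, default first.
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    # None of the preferred methods is supported: keep the platform default.
    return available[0]


if __name__ == "__main__":
    # force=True allows the call even if a start method was already set.
    multiprocessing.set_start_method(pick_start_method(), force=True)
    print("start method:", multiprocessing.get_start_method())

    # call scikit-learn utils with n_jobs > 1 here
```

As in the snippet above, the call stays under the `if __name__ == '__main__':` guard, because ‘forkserver’ and ‘spawn’ re-import the main module in child processes.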

1.2.16 Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?

Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing. However, neither of these fits within the design constraints of scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks to achieve.

You can find more information about the addition of GPU support in the FAQ entry “Will you add GPU support?”.

1.2.17 Why is my pull request not getting any attention?

The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be subject to high use immediately and should be of the highest quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely for this reason.

1.2.18 How do I set a random_state for an entire execution?

For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudo-random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an execution’s numpy global random state to 42, one could execute the following in their script:

import numpy as np

np.random.seed(42)

However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation splitters have their random_state parameter set.
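As an illustration of that advice, here is a minimal sketch, not taken from the original guide, in which one RandomState instance drives the data generation, the cross-validation splitter and the estimator. The function name `run_experiment`, the dataset and the choice of estimator are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score


def run_experiment(seed):
    """Run a small cross-validation whose randomness comes from one seed.

    Hypothetical helper: name, dataset and estimator are illustrative.
    """
    rng = np.random.RandomState(seed)
    # Synthetic data standing in for a real dataset.
    X, y = make_classification(n_samples=200, n_features=10, random_state=rng)
    # Both the splitter and the estimator receive an explicit random_state.
    cv = KFold(n_splits=5, shuffle=True, random_state=rng)
    clf = RandomForestClassifier(n_estimators=20, random_state=rng)
    return cross_val_score(clf, X, y, cv=cv)


scores_a = run_experiment(42)

# Unrelated code that disturbs the numpy global random state.
np.random.seed(0)
np.random.rand(10)

scores_b = run_experiment(42)
print(np.array_equal(scores_a, scores_b))
```

Because nothing in `run_experiment` reads the numpy global random state, the two runs should agree even though other code reseeded the global generator in between.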

1.3 Support

There are several ways to get in touch with the developers.

1.3.1 Mailing List

• The main mailing list is scikit-learn.

• There is also a commit list, scikit-learn-commits, which receives notifications of updates to the main repository and of test failures.

1.3.2 User questions

• Some scikit-learn developers support users on Stack Overflow using the [scikit-learn] tag.

• For general theoretical or methodological Machine Learning questions, Stack Exchange is probably a more suitable venue.

In both cases please use a descriptive question in the title field (e.g. not “Please help with scikit-learn!”, as this is not a question) and put details on what you tried to achieve, what the expected results were and what you observed instead in the details field.

Code and data snippets are welcome. A minimalistic reproduction script (up to ~20 lines long) is very helpful.

Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the number and type of features (i.e. categorical or numerical) and, for supervised learning tasks, what target you are trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification, or continuous variable regression.
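The facts requested above can be collected programmatically. The sketch below is an illustration rather than part of the original guide; it uses `sklearn.utils.multiclass.type_of_target` to name the kind of target. The helper `describe_dataset` and the placeholder arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def describe_dataset(X, y):
    """Summarise sample count, feature count, feature dtype and target type.

    Hypothetical helper written for this example.
    """
    X = np.asarray(X)
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        "feature_dtype": str(X.dtype),
        "target_type": type_of_target(y),
    }


# Placeholder feature matrix standing in for real data.
X = np.zeros((6, 3))

# One example target per kind of task mentioned in the text.
targets = {
    "binary": [0, 1, 1, 0, 1, 0],
    "multiclass": [0, 1, 2, 2, 1, 0],
    "multilabel-indicator": np.array(
        [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]]
    ),
    "continuous": [0.1, 0.6, 1.7, 2.3, 3.9, 4.2],
}

for name, y in targets.items():
    print(name, describe_dataset(X, y))
```

Pasting such a summary into a question gives readers the sample count, feature count and task type at a glance.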

1.3.3 Bug tracker

If you think you’ve encountered a bug, please report it to the issue tracker:

https://github.com/scikit-learn/scikit-learn/issues

Don’t forget to include:

• steps (or better script) to reproduce,

• expected outcome,

• observed outcome or Python (or gdb) tracebacks

To help developers fix your bug faster, please link to a gist (https://gist.github.com) holding a standalone, minimalistic Python script that reproduces your bug and, optionally, a minimalistic subsample of your dataset (for instance exported as CSV files using numpy.savetxt).

Note: gists are git cloneable repositories and thus you can use git to push data files to them.
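For instance, a small subsample can be exported with numpy.savetxt roughly as follows. This is a sketch under assumptions: the random arrays stand in for a real dataset, and the file name, column names and subsample size of 50 are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np

rng = np.random.RandomState(0)

# Placeholder data standing in for the real feature matrix and target.
X = rng.rand(1000, 5)
y = rng.randint(0, 2, size=1000)

# Draw 50 rows without replacement and put the target in the last column.
idx = rng.choice(X.shape[0], size=50, replace=False)
sample = np.column_stack([X[idx], y[idx]])

# Write a CSV with a plain header line (comments="" drops the leading '# ').
path = os.path.join(tempfile.mkdtemp(), "subsample.csv")
np.savetxt(path, sample, delimiter=",",
           header="f0,f1,f2,f3,f4,target", comments="")

# Reload the file to check that it can be read back.
reloaded = np.loadtxt(path, delimiter=",", skiprows=1)
print(reloaded.shape)  # (50, 6)
```

Reloading the file before attaching it is a cheap check that whoever runs the reproduction script will be able to read the data.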

1.3.4 IRC

Some developers like to hang out on channel #scikit-learn on irc.freenode.net.

If you do not have an IRC client or are behind a firewall, this web client works fine: http://webchat.freenode.net

1.3.5 Documentation resources

This documentation is for version 0.20.dev0. Documentation for other versions can be found here:

• 0.18

• 0.17

• 0.16

• 0.15

Printable pdf documentation for all versions can be found here.

1.4 Related Projects

Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which

facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organisation also

accepts high-quality contributions of repositories conforming to this template.

Below is a list of sister-projects, extensions and domain-specific packages.

1.4.1 Interoperability and framework enhancements

These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s

estimators.

Data formats

• sklearn_pandas bridge for scikit-learn pipelines and pandas data frame with dedicated transformers.

Auto-ML

• auto_ml Automated machine learning for production and analytics, built on scikit-learn and related projects. Trains a pipeline with all the standard machine learning steps. Tuned for prediction speed and ease of transfer to production environments.

• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator

• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a machine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in replacement for a scikit-learn estimator.

Experimentation frameworks

• REP Environment for conducting data-driven research in a consistent and reproducible way

• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic interfaces.

• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning

experiments with multiple learners and large feature sets.

• Xcessiv is a notebook-like application for quick, scalable, and automated hyperparameter tuning and stacked

ensembling. Provides a framework for keeping track of model-hyperparameter combinations.

Model inspection and visualisation

• eli5 A library for debugging/inspecting machine learning models and explaining their predictions.

• mlxtend Includes model visualization utilities.

• scikit-plot A visualization library for quick and easy generation of common plots in data analysis and machine

learning.

• yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis,

model selection, evaluation, and diagnostics.

Model export for production

• sklearn-pmml Serialization of (some) scikit-learn estimators into PMML.

• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the

help of JPMML-SkLearn library.

• sklearn-porter Transpile trained scikit-learn models to C, Java, Javascript and others.

• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)

trained by sklearn. Useful for latency-sensitive production environments.

1.4.2 Other estimators and tasks

Not everything belongs in, or is mature enough for, the central scikit-learn project. The following are projects providing interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.

Structured learning

• Seqlearn Sequence classification using HMMs or structured perceptron.

• HMMLearn Implementation of hidden Markov models that was previously part of scikit-learn.

• PyStruct General conditional random fields and structured prediction.

• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.

• sklearn-crfsuite Linear-chain conditional random fields (CRFsuite wrapper with sklearn-like API).

Deep neural networks etc.

• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn-like interface.

• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally

• nolearn A number of wrappers and abstractions around existing neural network libraries

• keras Deep Learning library capable of running on top of either TensorFlow or Theano.

• lasagne A lightweight library to build and train neural networks in Theano.

Broad scope

• mlxtend Includes a number of additional estimators as well as model visualization utilities.

• sparkit-learn Scikit-learn API and functionality for PySpark’s distributed modelling.

Other regression and classification

• xgboost Optimised gradient boosted decision tree library.

• ML-Ensemble Generalized ensemble learning (stacking, blending, subsemble, deep ensembles, etc.).

• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc.).

• py-earth Multivariate adaptive regression splines

• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection

• gplearn Genetic Programming for symbolic regression tasks.

• multiisotonic Isotonic regression on multidimensional features.

Decomposition and clustering

• lda Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling to sample from the true posterior distribution. (scikit-learn’s sklearn.decomposition.LatentDirichletAllocation implementation uses variational inference to sample from a tractable approximation of a topic model’s posterior distribution.)

• Sparse Filtering Unsupervised feature learning based on sparse filtering

• kmodes k-modes clustering algorithm for categorical data, and several of its variations.

• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.

• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit hypersphere.
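For comparison with the Gibbs-sampling lda package listed above, scikit-learn’s own variational implementation can be called roughly as follows. This is a minimal sketch on a random count matrix. The matrix sizes and the number of topics are arbitrary assumptions, and the parameter name n_components is the one used in recent scikit-learn releases (older releases used a different name).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)

# Placeholder document-term count matrix: 20 documents, 30 vocabulary terms.
doc_term = rng.randint(0, 5, size=(20, 30))

# Fit a 3-topic model with variational inference.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(doc_term)

# One row per document, one column per topic.
print(doc_topic.shape)  # (20, 3)
```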

Pre-processing

• categorical-encoding A library of sklearn compatible categorical variable encoders.

• imbalanced-learn Various methods to under- and over-sample datasets.

1.4.3 Statistical learning with Python

Other packages useful for data analysis and machine learning.

• Pandas Tools for working with heterogeneous and columnar data, relational queries, time series and basic statistics.

• theano A CPU/GPU array processing framework geared towards deep learning research.

• statsmodels Estimating and analysing statistical models. More focused on statistical tests and less on prediction

than scikit-learn.

• PyMC Bayesian statistical models and fitting algorithms.

• Sacred Tool to help you configure, organize, log and reproduce experiments

• Seaborn Visualization library based on matplotlib. It provides a high-level interface for drawing attractive

statistical graphics.

• Deep Learning A curated list of deep learning software libraries.

Domain-specific packages

• scikit-image Image processing and computer vision in python.

• Natural language toolkit (nltk) Natural language processing and some machine learning.

