scikit-learn User Guide: From Installation to Hands-On Tutorials


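Scikit-learn is a widely used open-source machine learning library that aims to simplify the machine learning workflow in data science tasks. This document is the scikit-learn user guide for version 0.18.1, dated 20 December 2016. It offers tutorials and practical guidance that help users get started quickly and understand the tool in depth.

The "Welcome to scikit-learn" part first explains how to install scikit-learn, which matters most to first-time users because a correct installation is the basis for using any tool. It also lists frequently asked questions to help users resolve problems met during installation or use. The "Support" part covers the resources and support channels offered by the scikit-learn community, including the official forum, mailing lists and developer documentation, so that users can ask for help or take part in development. The "About scikit-learn" part introduces the background and goals of the library, including the team behind it, its usage scenarios and a comparison with related projects. "Who is using scikit-learn?" lists real cases from different fields to show how the library is used in scientific research, data analysis and industry. The "Release history" part records how scikit-learn has evolved: new features, performance improvements and bug fixes.

The core content starts with the "scikit-learn Tutorials", which cover five topics: 1) an introduction to machine learning that explains the basic concepts; 2) statistical learning for scientific data processing; 3) working with text data, with an emphasis on preprocessing and feature extraction; 4) choosing the right model, including model selection and evaluation methods; 5) external resources such as videos, talks and further learning material.

The "User Guide" covers the main machine learning tasks: supervised learning (such as classification and regression), unsupervised learning (clustering and dimensionality reduction), model selection and evaluation, data preprocessing (such as standardization and feature scaling), dataset loading utilities, and how to handle large-scale data and optimize computational performance. This is the main reference when working on a real project.

Finally, the "General examples" part shows how to implement specific tasks with scikit-learn, such as visualizing cross-validation, isotonic regression, combining features and pipelining (for example chaining principal component analysis and logistic regression). The document is a complete guide to scikit-learn for beginners and experienced data scientists alike.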

scikit-learn user guide, Release 0.18.1

when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).

But in the end the real culprit is Python’s multiprocessing, which does fork without exec to reduce the overhead of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX standard, and therefore some software vendors like Apple refuse to consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods (instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However, the user should be aware that using the ‘forkserver’ method prevents joblib.Parallel from calling functions defined interactively in a shell session.
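A minimal sketch of this workaround from inside a script. It assumes that joblib reads the variable once, when it is first imported, so the variable is set before scikit-learn is imported; exporting the variable in the shell before launching Python should work as well.

```python
import os

# Assumption: joblib reads JOBLIB_START_METHOD once, at import time, so the
# variable is set before scikit-learn (which bundles joblib) is imported.
os.environ["JOBLIB_START_METHOD"] = "forkserver"

# Import scikit-learn only after the variable is in place, for example:
# from sklearn.model_selection import cross_val_score

print(os.environ["JOBLIB_START_METHOD"])
```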

If you have custom code that uses multiprocessing directly instead of using it via joblib, you can enable the ‘forkserver’ mode globally for your program by inserting the following instructions in your main script:

import multiprocessing

# other imports, custom code, load data, define model...

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')

    # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the multiprocessing documentation.
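Not every platform offers every start method (‘forkserver’, for instance, is not available on Windows), so a script meant to run in several environments could check what is supported first. The sketch below relies on multiprocessing.get_all_start_methods from the standard library; the helper name pick_start_method and its preference order are illustrative assumptions, not part of scikit-learn or joblib.

```python
import multiprocessing


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method the platform supports."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    # Fall back to the platform default, which is listed first.
    return available[0]


if __name__ == '__main__':
    multiprocessing.set_start_method(pick_start_method())
    # call scikit-learn utils with n_jobs > 1 here
```

As in the script above, the start method is set once, inside the main guard, before any parallel work starts.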

1.2.16 Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?

Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing. However, neither of these fits within the design constraints of scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks to achieve.

1.2.17 Why is my pull request not getting any attention?

The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be subject to high use immediately and should be of the highest quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely for this reason.

1.2.18 How do I set a random_state for an entire execution?

For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudo-random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an execution’s numpy global random state to 42, one could execute the following in their script:


import numpy as np

np.random.seed(42)

However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation splitters have their random_state parameter set.
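A minimal sketch of that advice, assuming a synthetic dataset and a random forest as stand-ins for a real experiment. The helper run_experiment and the seed value are illustrative. Integer seeds are passed instead of a shared RandomState instance so that every call starts from the same state.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # arbitrary illustrative seed


def run_experiment(seed):
    # Every randomized component receives an explicit seed, so nothing
    # depends on the numpy global random state.
    X, y = make_classification(n_samples=200, n_features=10,
                               random_state=seed)
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv)


first = run_experiment(SEED)
np.random.seed(0)  # disturb the global state on purpose
second = run_experiment(SEED)
print(np.array_equal(first, second))
```

Because no component reads the global state, reseeding numpy between the two runs should leave the cross-validation scores unchanged.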

1.3 Support

There are several ways to get in touch with the developers.

1.3.1 Mailing List

• The main mailing list is scikit-learn.

• There is also a commit list, scikit-learn-commits, which receives notifications of updates to the main repository and of test failures.

1.3.2 User questions

• Some scikit-learn developers support users on Stack Overflow using the [scikit-learn] tag.

• For general theoretical or methodological machine learning questions, Stack Exchange is probably a more suitable venue.

In both cases please use a descriptive question in the title field (e.g. no “Please help with scikit-learn!” as this is not a question) and put details on what you tried to achieve, what the expected results were and what you observed instead in the details field.

Code and data snippets are welcome. A minimalistic reproduction script (up to ~20 lines long) is very helpful.

Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the number and type of features (i.e. categorical or numerical) and, for supervised learning tasks, what target you are trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification or continuous variable regression.

1.3.3 Bug tracker

If you think you’ve encountered a bug, please report it to the issue tracker:

https://github.com/scikit-learn/scikit-learn/issues

Don’t forget to include:

• steps (or, better, a script) to reproduce,

• expected outcome,

• observed outcome or Python (or gdb) tracebacks.

To help developers fix your bug faster, please link to a gist (https://gist.github.com) holding a standalone minimalistic Python script that reproduces your bug and optionally a minimalistic subsample of your dataset (for instance exported as CSV files using numpy.savetxt).

Note: gists are git cloneable repositories and thus you can use git to push data files to them.
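The CSV export mentioned above could look like the following sketch. The random array stands in for a real dataset, and the sample size and the file name subsample.csv are placeholders.

```python
import os
import tempfile

import numpy as np

# Stand-in for a real dataset: 1000 samples, 5 numerical features.
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)

# Keep a small random subset of rows, enough to reproduce the problem.
rows = rng.choice(X.shape[0], size=20, replace=False)
subsample = X[rows]

# Export as CSV; the file name is a placeholder.
path = os.path.join(tempfile.mkdtemp(), "subsample.csv")
np.savetxt(path, subsample, delimiter=",")

# The reproduction script can then load the file back with numpy.loadtxt.
reloaded = np.loadtxt(path, delimiter=",")
print(reloaded.shape)
```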


1.3.4 IRC

Some developers like to hang out on channel #scikit-learn on irc.freenode.net.

If you do not have an IRC client or are behind a firewall, this web client works fine: http://webchat.freenode.net

1.3.5 Documentation resources

This documentation is for version 0.18.1. Documentation for other versions can be found here:

• 0.17

• 0.16

• 0.15

Printable pdf documentation for all versions can be found here.

1.4 Related Projects

Below is a list of sister projects, extensions and domain-specific packages.

1.4.1 Interoperability and framework enhancements

These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s

estimators.

• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic interfaces.

• sklearn_pandas Bridge between scikit-learn pipelines and pandas data frames, with dedicated transformers.

• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning

experiments with multiple learners and large feature sets.

• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator

• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a machine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in replacement for a scikit-learn estimator.

• sklearn-pmml Serialization of (some) scikit-learn estimators into PMML.

• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the

help of JPMML-SkLearn library.

1.4.2 Other estimators and tasks

Not everything belongs or is mature enough for the central scikit-learn project. The following are projects providing

interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.

• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn-like interface.

• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally

• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc...).

• Seqlearn Sequence classification using HMMs or structured perceptron.


• HMMLearn Implementation of hidden Markov models that was previously part of scikit-learn.

• PyStruct General conditional random fields and structured prediction.

• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.

• py-earth Multivariate adaptive regression splines

• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)

trained by sklearn. Useful for latency-sensitive production environments.

• lda Fast implementation of Latent Dirichlet Allocation in Cython.

• Sparse Filtering Unsupervised feature learning based on sparse filtering

• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection

• gplearn Genetic Programming for symbolic regression tasks.

• nolearn A number of wrappers and abstractions around existing neural network libraries

• sparkit-learn Scikit-learn functionality and API on PySpark.

• keras Theano-based Deep Learning library.

• mlxtend Includes a number of additional estimators as well as model visualization utilities.

• kmodes k-modes clustering algorithm for categorical data, and several of its variations.

• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.

• lasagne A lightweight library to build and train neural networks in Theano.

• multiisotonic Isotonic regression on multidimensional features.

• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit hypersphere.

1.4.3 Statistical learning with Python

Other packages useful for data analysis and machine learning.

• Pandas Tools for working with heterogeneous and columnar data, relational queries, time series and basic statistics.

• theano A CPU/GPU array processing framework geared towards deep learning research.

• statsmodels Estimating and analysing statistical models. More focused on statistical tests and less on prediction

than scikit-learn.

• PyMC Bayesian statistical models and fitting algorithms.

• REP Environment for conducting data-driven research in a consistent and reproducible way

• Sacred Tool to help you configure, organize, log and reproduce experiments

• gensim A library for topic modelling, document indexing and similarity retrieval

• Seaborn Visualization library based on matplotlib. It provides a high-level interface for drawing attractive

statistical graphics.

• Deep Learning A curated list of deep learning software libraries.
