Python Scikit-learn: A Practical Guide to Machine Learning
Updated 2024-07-19
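"Scikit-learn is a Python-based machine learning library that provides a range of supervised and unsupervised learning algorithms, along with tools for data preprocessing, model selection, and evaluation. It is released under the BSD open-source license and is maintained and developed by community volunteers. The library suits both beginners and professionals, with detailed documentation and tutorials that help users get started quickly and understand machine learning techniques in depth."
Scikit-learn is a machine learning library for the Python programming language, designed to provide simple and efficient tools for data analysis. It covers classification, regression, clustering, dimensionality reduction, and model selection, which makes it applicable to a wide range of tasks. Since David Cournapeau started the project in 2007, scikit-learn has become an important tool for data scientists and machine learning engineers.
To use scikit-learn, first install the library; this is usually straightforward and can be done with pip, Python's package manager. After installation, users can work through its tutorials, which progress from basic machine learning concepts to advanced applications. The tutorials range from simple linear models to more complex methods (deep learning is out of scope for scikit-learn, as the FAQ excerpt below notes) and also cover dataset loading and preprocessing, so that the data is ready for modelling.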
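The scikit-learn user guide is divided into several main parts:
1. **Welcome to scikit-learn**: explains how to install the library and get support, and includes community resources and the project history.
2. **scikit-learn Tutorial**: provides several step-by-step guides that introduce basic machine learning concepts, such as classification and regression, and show how to apply these algorithms to practical problems.
3. **User Guide**: describes the library's modules in detail, including supervised learning (SVMs, decision trees, random forests, and so on), unsupervised learning (K-Means, DBSCAN, and so on), model selection and evaluation, dataset transformation and loading utilities, and strategies and performance tuning for large-scale data.
4. **General examples**: shows how to combine different scikit-learn features, for example concatenating feature extraction methods, building a Pipeline to chain several processing steps, and plotting cross-validated predictions.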
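With scikit-learn, users can implement feature extraction, feature selection, model training, model validation, and hyperparameter tuning with little effort. Scikit-learn also integrates well with other Python libraries such as NumPy, Pandas, and Matplotlib, which makes data processing and visualization more convenient.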
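The workflow described above (preprocessing, feature selection, training, validation, and tuning) can be sketched as a single pipeline. This is a minimal illustration rather than an excerpt from the guide: the iris dataset, the scaler, the feature selector, the classifier, and the parameter grid are all assumptions chosen for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only as a stand-in for real data.
X, y = load_iris(return_X_y=True)

# Chain preprocessing, feature selection, and a classifier into one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune one parameter of each step; "<step>__<param>" addresses a step's parameter.
param_grid = {"select__k": [2, 3, 4], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler and selector sit inside the pipeline, they are refit on the training portion of each cross-validation fold, which avoids leaking information from the held-out fold into preprocessing.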
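Scikit-learn is widely used in academic research and is also reported to be used in industry, including at large companies such as Google, Amazon, and Microsoft. Learning scikit-learn is therefore a valuable skill for anyone who wants to work in data analysis and machine learning.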
scikit-learn user guide, Release 0.20.dev0
when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).
But in the end the real culprit is Python’s multiprocessing that does fork without exec to reduce the overhead
of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX
standard and therefore some software editors like Apple refuse to consider the lack of fork-safety in Accelerate /
vecLib as a bug.
In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods
(instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you
can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However, the user should be aware that
using the ‘forkserver’ method prevents joblib.Parallel from calling functions defined interactively in a shell session.
If you have custom code that uses multiprocessing directly instead of using it via joblib, you can enable the
‘forkserver’ mode globally for your program by inserting the following instructions in your main script:
import multiprocessing

# other imports, custom code, load data, define model...

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')

    # call scikit-learn utils with n_jobs > 1 here
You can find more details on the new start methods in the multiprocessing documentation.
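As an alternative to changing the start method globally, the standard library's multiprocessing.get_context returns a context object bound to one start method. The sketch below is an illustration under assumptions, not part of the guide: the helper names square and parallel_squares are hypothetical, and 'spawn' is used as the default because 'forkserver' is not available on every platform.

```python
import multiprocessing


def square(x):
    """Toy workload; defined at module level so child processes can import it."""
    return x * x


def parallel_squares(values, method="spawn", processes=2):
    """Map `square` over `values` in a pool that uses the given start method."""
    # get_context leaves the process-wide default start method untouched.
    ctx = multiprocessing.get_context(method)
    with ctx.Pool(processes=processes) as pool:
        return pool.map(square, values)
```

As with set_start_method, a call such as parallel_squares([1, 2, 3]) should sit under an `if __name__ == '__main__':` guard in a script, since the 'spawn' and 'forkserver' methods re-import the main module in each child process.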
1.2.16 Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning
additionally requiring GPUs for efficient computing. However, neither of these fits within the design constraints of
scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks
to achieve.
You can find more information about the addition of GPU support in the FAQ entry “Will you add GPU support?”.
1.2.17 Why is my pull request not getting any attention?
The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a
lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance
and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be
subject to high use immediately and should be of the highest quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working
on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are
busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely
because of this reason.
1.2.18 How do I set a random_state for an entire execution?
For testing and replicability, it is often important to have the entire execution controlled by a single seed for the
pseudo-random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it
relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an
execution’s numpy global random state to 42, one could execute the following in a script:
import numpy as np
np.random.seed(42)
However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure
replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation
splitters have their random_state parameter set.
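A minimal sketch of that recommendation follows. The helper name run_experiment, the synthetic dataset, and the choice of estimator and splitter are assumptions made for illustration; the point is only that one RandomState instance is created from a seed and passed explicitly to every component that accepts random_state.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score


def run_experiment(seed):
    """Run one small experiment whose randomness comes only from `seed`."""
    rng = np.random.RandomState(seed)

    # Data generation, the CV splitter, and the estimator all draw from `rng`
    # instead of the numpy global random state.
    X, y = make_classification(n_samples=200, n_features=10, random_state=rng)
    cv = KFold(n_splits=3, shuffle=True, random_state=rng)
    clf = RandomForestClassifier(n_estimators=10, random_state=rng)
    return cross_val_score(clf, X, y, cv=cv)


scores_a = run_experiment(42)
scores_b = run_experiment(42)
print(np.array_equal(scores_a, scores_b))
```

Calling np.random.seed elsewhere in the program between the two runs would not affect the comparison, because nothing in run_experiment reads the global state.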
1.3 Support
There are several ways to get in touch with the developers.
1.3.1 Mailing List
• The main mailing list is scikit-learn.
• There is also a commit list, scikit-learn-commits, which receives notifications of updates to the main repository
and of test failures.
1.3.2 User questions
• Some scikit-learn developers support users on StackOverflow using the [scikit-learn] tag.
• For general theoretical or methodological machine learning questions, Stack Exchange is probably a more suitable
venue.
In both cases please use a descriptive question in the title field (e.g. no “Please help with scikit-learn!” as this is not a
question) and put details on what you tried to achieve, what were the expected results and what you observed instead
in the details field.
Code and data snippets are welcome. A minimalistic reproduction script (up to ~20 lines long) is very helpful.
Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the
number and type of features (i.e. categorical or numerical) and, for supervised learning tasks, what target you are
trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification, or
continuous variable regression.
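The details requested above can be collected with a few lines of code before posting a question. The sketch below is one possible way to do it; the helper name describe_dataset is hypothetical, while type_of_target is scikit-learn's utility for naming the kind of target.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def describe_dataset(X, y=None):
    """Summarize sample count, feature count, feature dtype, and target type."""
    X = np.asarray(X)
    summary = {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        "feature_dtype": str(X.dtype),
    }
    if y is not None:
        # Returns a label such as 'binary', 'multiclass', or 'continuous'.
        summary["target_type"] = type_of_target(y)
    return summary


# Tiny made-up dataset, for illustration only.
X = np.array([[0.1, 1.0], [0.4, 0.0], [0.3, 1.0], [0.9, 0.0]])
y = np.array([0, 1, 0, 1])
print(describe_dataset(X, y))
```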
1.3.3 Bug tracker
If you think you’ve encountered a bug, please report it to the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues
Don’t forget to include:
• steps (or better script) to reproduce,
• expected outcome,
• observed outcome or python (or gdb) tracebacks
To help developers fix your bug faster, please link to a gist (https://gist.github.com) holding a standalone minimalistic
Python script that reproduces your bug and, optionally, a minimalistic subsample of your dataset (for instance exported
as CSV files using numpy.savetxt).
Note: gists are git cloneable repositories and thus you can use git to push datafiles to them.
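Exporting a small subsample as CSV, as suggested above, might look like the following sketch. The random array stands in for a real dataset, and the subsample size and file name are arbitrary assumptions.

```python
import os
import tempfile

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))  # stand-in for a real dataset

# Keep a small random subset of rows, sampled without replacement.
idx = rng.choice(X.shape[0], size=20, replace=False)
subsample = X[idx]

# Write the subset as comma-separated text that can be attached to a gist.
path = os.path.join(tempfile.mkdtemp(), "subsample.csv")
np.savetxt(path, subsample, delimiter=",")

# Reload to confirm that the file round-trips.
reloaded = np.loadtxt(path, delimiter=",")
print(reloaded.shape)
```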
1.3.4 IRC
Some developers like to hang out on channel #scikit-learn on irc.freenode.net.
If you do not have an IRC client or are behind a firewall this web client works fine: http://webchat.freenode.net
1.3.5 Documentation resources
This documentation is for version 0.20.dev0. Documentation for other versions can be found here:
• 0.18
• 0.17
• 0.16
• 0.15
Printable pdf documentation for all versions can be found here.
1.4 Related Projects
Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which
facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organisation also
accepts high-quality contributions of repositories conforming to this template.
Below is a list of sister-projects, extensions and domain specific packages.
1.4.1 Interoperability and framework enhancements
These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s
estimators.
Data formats
• sklearn_pandas A bridge between scikit-learn pipelines and pandas data frames, with dedicated transformers.
Auto-ML
• auto_ml Automated machine learning for production and analytics, built on scikit-learn and related projects.
Trains a pipeline with all the standard machine learning steps. Tuned for prediction speed and ease of transfer to
production environments.
• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a
machine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in
replacement for a scikit-learn estimator.
Experimentation frameworks
• REP Environment for conducting data-driven research in a consistent and reproducible way
• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic
interfaces.
• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.
• Xcessiv is a notebook-like application for quick, scalable, and automated hyperparameter tuning and stacked
ensembling. Provides a framework for keeping track of model-hyperparameter combinations.
Model inspection and visualisation
• eli5 A library for debugging/inspecting machine learning models and explaining their predictions.
• mlxtend Includes model visualization utilities.
• scikit-plot A visualization library for quick and easy generation of common plots in data analysis and machine
learning.
• yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis,
model selection, evaluation, and diagnostics.
Model export for production
• sklearn-pmml Serialization of (some) scikit-learn estimators into PMML.
• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the
help of JPMML-SkLearn library.
• sklearn-porter Transpile trained scikit-learn models to C, Java, Javascript and others.
• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)
trained by sklearn. Useful for latency-sensitive production environments.
1.4.2 Other estimators and tasks
Not everything belongs or is mature enough for the central scikit-learn project. The following are projects providing
interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.
Structured learning
• Seqlearn Sequence classification using HMMs or structured perceptron.
• HMMLearn Implementation of hidden markov models that was previously part of scikit-learn.
• PyStruct General conditional random fields and structured prediction.
• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.
• sklearn-crfsuite Linear-chain conditional random fields (CRFsuite wrapper with sklearn-like API).
Deep neural networks etc.
• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn-like interface.
• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally
• nolearn A number of wrappers and abstractions around existing neural network libraries
• keras Deep Learning library capable of running on top of either TensorFlow or Theano.
• lasagne A lightweight library to build and train neural networks in Theano.
Broad scope
• mlxtend Includes a number of additional estimators as well as model visualization utilities.
• sparkit-learn Scikit-learn API and functionality for PySpark’s distributed modelling.
Other regression and classification
• xgboost Optimised gradient boosted decision tree library.
• ML-Ensemble Generalized ensemble learning (stacking, blending, subsemble, deep ensembles, etc.).
• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc.).
• py-earth Multivariate adaptive regression splines
• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection
• gplearn Genetic Programming for symbolic regression tasks.
• multiisotonic Isotonic regression on multidimensional features.
Decomposition and clustering
• lda Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling
to sample from the true posterior distribution. (scikit-learn’s
sklearn.decomposition.LatentDirichletAllocation implementation uses variational inference to sample from a
tractable approximation of a topic model’s posterior distribution.)
• Sparse Filtering Unsupervised feature learning based on sparse-filtering
• kmodes k-modes clustering algorithm for categorical data, and several of its variations.
• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.
• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit
hypersphere.
Pre-processing
• categorical-encoding A library of sklearn compatible categorical variable encoders.
• imbalanced-learn Various methods to under- and over-sample datasets.
1.4.3 Statistical learning with Python
Other packages useful for data analysis and machine learning.
• Pandas Tools for working with heterogeneous and columnar data, relational queries, time series and basic
statistics.
• theano A CPU/GPU array processing framework geared towards deep learning research.
• statsmodels Estimating and analysing statistical models. More focused on statistical tests and less on prediction
than scikit-learn.
• PyMC Bayesian statistical models and fitting algorithms.
• Sacred Tool to help you configure, organize, log and reproduce experiments
• Seaborn Visualization library based on matplotlib. It provides a high-level interface for drawing attractive
statistical graphics.
• Deep Learning A curated list of deep learning software libraries.
Domain specific packages
• scikit-image Image processing and computer vision in python.
• Natural language toolkit (nltk) Natural language processing and some machine learning.