CHAPTER 1 ■ STEP 1 – GETTING STARTED IN PYTHON
Well-designed development environments such as IPython Notebook and
Spyder allow quick inspection of the data and make it possible to develop machine
learning models interactively.
Powerful modules such as NumPy and Pandas exist for working efficiently with numeric
data. The SciPy package makes scientific computing easy. A number of primary
machine learning algorithms have been efficiently implemented in scikit-learn (also
known as sklearn). HadooPy and PySpark provide a seamless work experience with big data
technology stacks. The Cython and Numba modules allow Python code to execute at speeds
on par with C code. Testing modules such as nose encourage high-quality code and fit
into continuous integration and automated deployment workflows.
Together, these strengths have led many machine learning engineers to embrace
Python as their language of choice to explore data, identify patterns, and build and deploy
models to the production environment. Most importantly, the business-friendly licenses
of various key Python packages encourage collaboration between businesses and the
open source community, to the benefit of both. Overall, the Python programming
ecosystem allows for quick results and happy programmers. A growing trend is for
developers to join the open source community and contribute bug fixes and new
algorithms for use by the global community, while still protecting the core IP of
the companies they work for.
Python 2.7.x or Python 3.4.x?
Python 3.4.x is the latest version and comes with nicer, more consistent functionality.
However, third-party module support for it is still very limited, and this is likely to
remain the case for at least a couple more years. All major frameworks, on the other
hand, still run on version 2.7.x and are likely to continue to do so for a significant
amount of time. It is therefore advisable to start with Python 2, because it is the most
widely used version for building machine learning systems at the time of writing.
For an in-depth analysis of the differences between Python 2 and Python 3, refer to
wiki.python.org (https://wiki.python.org/moin/Python2orPython3v), which notes that there
are benefits to each.
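To make the choice between versions a little more concrete, here is a minimal sketch of two well-known differences: print is a statement in Python 2 but a function in Python 3, and the / operator truncates integer operands in Python 2 but performs true division in Python 3. The helper names interpreter_label and portable_divide are hypothetical and used only for illustration; the code is written so that it should run unchanged on either version.

```python
import sys


def interpreter_label(version_info=sys.version_info):
    """Return a label such as 'Python 2.7' for a version tuple."""
    return "Python {0}.{1}".format(version_info[0], version_info[1])


def portable_divide(a, b):
    """Divide without truncation on both Python 2 and Python 3.

    On Python 2, 7 / 2 gives 3 for integer operands; converting one
    operand to float first gives 3.5 on either version.
    """
    return float(a) / b


if __name__ == "__main__":
    # print with a single parenthesised argument behaves the same on 2 and 3.
    print(interpreter_label())
    print(portable_divide(7, 2))
    # Floor division is explicit with //, and identical on both versions.
    print(7 // 2)
```

Writing code this way is one possible approach when the same script may need to run under both interpreters, for example while moving between conda environments.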
I recommend Anaconda (a Python distribution), which is BSD licensed and so
permits commercial use and redistribution. It includes around 270 packages,
among them the most important ones for scientific applications, data analysis, and
machine learning, such as NumPy, SciPy, Pandas, IPython, matplotlib, and scikit-learn. It
also provides conda, an excellent environment tool that lets you switch easily between
environments, even between Python 2 and 3 (if required). Anaconda is also updated quickly
when a new version of a package is released; you can run conda update
<packagename> to update it.
You can download the latest version of Anaconda from the official website at
https://www.continuum.io/downloads and follow the installation instructions.
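Once the installation has finished, a quick sanity check is to confirm that the key packages can be imported. The sketch below is one way to do that, not part of Anaconda itself: the function name missing_packages and the CORE_PACKAGES list are hypothetical, and the list assumes the usual import names (scikit-learn is imported as sklearn).

```python
import importlib

# Import names for the packages mentioned in the text
# (assumption: scikit-learn is imported as "sklearn").
CORE_PACKAGES = ["numpy", "scipy", "pandas", "IPython", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(CORE_PACKAGES)
    if missing:
        print("Not installed: {0}".format(", ".join(missing)))
    else:
        print("All core packages are available.")
```

If the script reports a missing package, it can typically be added to the active environment with conda install followed by the package name.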