Python Data Science Cookbook: Over 60 practical recipes to help you explore Python and its robust data science capabilities, by Gopi Subramanian (ISBN 9781784393663, 9781784396404, 1784393665, 1784396400)
Author(s): Subramanian, Gopi
ISBN(s): 9781784396404, 1784396400
Edition: Online edition
File Details: PDF, 7.27 MB
Year: 2015
Language: English
Python Data Science Cookbook
Table of Contents
Python Data Science Cookbook
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There’s more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Python for Data Science
Introduction
Using dictionary objects
Getting ready
How to do it…
How it works…
There’s more…
See also
Working with a dictionary of dictionaries
Getting ready
How to do it…
How it works…
See also
Working with tuples
Getting ready
How to do it…
How it works…
There’s more…
See also
Using sets
Getting ready
How to do it…
How it works…
There’s more…
Writing a list
Getting ready
How to do it…
How it works…
There’s more…
Creating a list from another list - list comprehension
Getting ready
How to do it…
How it works…
There’s more…
Using iterators
Getting ready
How to do it…
How it works…
There’s more…
Generating an iterator and a generator
Getting ready
How to do it…
How it works…
There’s more…
Using iterables
Getting ready
How to do it…
How it works…
See also
Passing a function as a variable
Getting ready
How to do it…
How it works…
Embedding functions in another function
Getting ready
How to do it…
How it works…
Passing a function as a parameter
Getting ready
How to do it…
How it works…
Returning a function
Getting ready
How to do it…
How it works…
There’s more…
Altering the function behavior with decorators
Getting ready
How to do it…
How it works…
Creating anonymous functions with lambda
Getting ready
How to do it…
How it works…
Using the map function
Getting ready
How to do it…
How it works…
There’s more…
Working with filters
Getting ready
How to do it…
How it works…
Using zip and izip
Getting ready
How to do it…
How it works…
There’s more…
See also
Processing arrays from the tabular data
Getting ready
How to do it…
How it works…
There’s more…
Preprocessing the columns
Getting ready
How to do it…
How it works…
There’s more…
Sorting lists
Getting ready
How to do it…
How it works…
There’s more…
Sorting with a key
Getting ready
How to do it…
How it works…
There’s more…
Working with itertools
Getting ready
How to do it…
How it works…
2. Python Environments
Introduction
Using NumPy libraries
Getting ready
How to do it…
How it works…
There’s more…
See also
Plotting with matplotlib
Getting ready
How to do it…
How it works…
There’s more…
Machine learning with scikit-learn
Getting ready
How to do it…
How it works…
There’s more…
See also
3. Data Analysis – Explore and Wrangle
Introduction
Analyzing univariate data graphically
Getting ready
How to do it…
How it works…
See also
Grouping the data and using dot plots
Getting ready
How to do it…
How it works…
See also
Using scatter plots for multivariate data
Getting ready
How to do it…
How it works…
See also
Using heat maps
Getting ready
How to do it…
How it works…
There’s more…
See also
Performing summary statistics and plots
Getting ready
How to do it…
How it works…
See also
Using a box-and-whisker plot
Getting ready
How to do it…
How it works…
There’s more…
Imputing the data
Getting ready
How to do it…
How it works…
There’s more…
See also
Performing random sampling
Getting ready
How to do it…
How it works…
There’s more…
Stratified sampling
Progressive sampling
Scaling the data
Getting ready
How to do it…
How it works…
There’s more…
Standardizing the data
Getting ready
How to do it…
How it works…
There’s more…
Performing tokenization
Getting ready
How to do it…
How it works…
There’s more…
See also
Removing stop words
How to do it…
How it works…
There’s more…
See also
Stemming the words
Getting ready
How to do it…
How it works…
There’s more…
See also
Performing word lemmatization
Getting ready
How to do it…
How it works…
There’s more…
See also
Representing the text as a bag of words
Getting ready
How to do it…
How it works…
There’s more…
See also
Calculating term frequencies and inverse document frequencies
Getting ready
How to do it…
How it works…
There’s more…
4. Data Analysis – Deep Dive
Introduction
Matrix Decomposition:
Extracting the principal components
Getting ready
How to do it…
How it works…
There’s more…
See also
Using Kernel PCA
Getting ready
How to do it…
How it works…
There’s more…
Extracting features using singular value decomposition
Getting ready
How to do it…
How it works…
There’s more…
Reducing the data dimension with random projection
Getting ready
How to do it…
How it works…
There’s more…
See also
Decomposing the feature matrices using non-negative matrix factorization
Getting ready
How to do it…
How it works…
There’s more…
See also
5. Data Mining – Needle in a Haystack
Introduction
Working with distance measures
Getting ready
How to do it…
How it works…
There’s more…
See also
Learning and using kernel methods
Getting ready
How to do it…
How it works…
There’s more…
See also
Clustering data using the k-means method
Getting ready
How to do it…
How it works…
There’s more…
See also
Learning vector quantization
Getting ready
How to do it…
How it works…
There’s more…
See also
Finding outliers in univariate data
Getting ready
How to do it…
How it works…
There’s more…
See also
Discovering outliers using the local outlier factor method
Getting ready
How to do it…
How it works…
There’s more…
6. Machine Learning 1
Introduction
Preparing data for model building
Getting ready
How to do it…
How it works…
There’s more…
Finding the nearest neighbors
Getting ready
How to do it…
How it works…
There’s more…
See also
Classifying documents using Naïve Bayes
Getting ready
How to do it…
How it works…
There’s more…
See also
Building decision trees to solve multiclass problems
Getting ready
How to do it…
How it works…
There’s more…
See also
7. Machine Learning 2
Introduction
Predicting real-valued numbers using regression
Getting ready
How to do it…
How it works…
There’s more…
See also
Learning regression with L2 shrinkage – ridge
Getting ready
How to do it…
How it works…
There’s more…
See also
Learning regression with L1 shrinkage – LASSO
Getting ready
How to do it…
How it works…
There’s more…
See also
Using cross-validation iterators with L1 and L2 shrinkage
Getting ready
How to do it…
How it works…
There’s more…
See also
8. Ensemble Methods
Introduction
Understanding Ensemble – Bagging Method
Getting ready…
How to do it…
How it works…
There’s more…
See also
Understanding Ensemble – Boosting Method
Getting Started…
How to do it…
How it works…
There’s more…
See also
Understanding Ensemble – Gradient Boosting
Getting Started…
How to do it…
How it works…
There’s more…
See also
9. Growing Trees
Introduction
Going from trees to Forest – Random Forest
Getting ready
How to do it…
How it works…
There’s more…
See also
Growing Extremely Randomized Trees
Getting ready…
How to do it…
How it works…
There’s more…
See also
Growing Rotational Forest
Getting ready…
How to do it…
How it works…
There’s more…
See also
10. Large-Scale Machine Learning – Online Learning
Introduction
Using perceptron as an online learning algorithm
Getting ready
How to do it…
How it works…
There’s more…
See also
Using stochastic gradient descent for regression
Getting ready
How to do it…
How it works…
There’s more…
See also
Using stochastic gradient descent for classification
Getting ready
How to do it…
How it works…
There’s more…
See also
Index
Python Data Science Cookbook
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2015
Production reference: 1041115
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-640-4
www.packtpub.com
Credits
Author
Gopi Subramanian
Reviewer
Bastiaan Sjardin
Commissioning Editor
Akram Hussain
Acquisition Editor
Nikhil Karkal
Content Development Editor
Siddhesh Salvi
Technical Editor
Danish Shaikh
Copy Editor
Tasneem Fatehi
Project Coordinator
Kranti Berde
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Nilesh Mohite
Cover Work
Nilesh Mohite
About the Author
Gopi Subramanian is a data scientist with over 15 years of experience in the field of data
mining and machine learning. During the past decade, he has designed, conceived,
developed, and led data mining, text mining, natural language processing, information
extraction and retrieval, and search systems for various domains and business verticals,
including engineering infrastructure, consumer finance, healthcare, and materials. In the
loyalty domain, he has conceived and built innovative consumer loyalty models and
designed enterprise-wide systems for personalized promotions. He has filed over ten
patent applications at the US and Indian patent offices and has several publications to his
credit. He currently lives and works in Bengaluru, India.
About the Reviewer
Bastiaan Sjardin is a data scientist and entrepreneur with a background in artificial
intelligence, mathematics, and machine learning. He has an MSc degree in cognitive
science and mathematical statistics from the University of Leiden. In the past 5 years, he
has worked on a wide range of data science projects. He is a frequent community TA at
Coursera in the social network analysis course from the University of Michigan and the
practical machine learning course from Johns Hopkins University. His programming
languages of choice are R and Python. Currently, he is the cofounder of Quandbee
(www.quandbee.com), a company specializing in machine learning applications.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as
a print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt books
and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital
book library. Here, you can search, access, and read Packt’s entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.
Preface
Today, we live in a world of connected things where so much data is generated that it is
humanly impossible to analyze all of it and make decisions. Thanks to the field of data
science, human decisions are increasingly being replaced by decisions made by computers.
Data science has penetrated deeply into our connected world, and there is a growing
demand in the market for people who not only understand data science algorithms
thoroughly but are also capable of programming them. Data science sits at the
intersection of many fields, including data mining, machine learning, and statistics, to
name a few. This puts an immense burden on data scientists at all levels, from those
aspiring to enter the field to those currently practicing in it. Treating these algorithms
as black boxes and using them in decision-making systems will lead to counterproductive
results. With so many algorithms and innumerable problems out there, a good grasp of the
underlying algorithms is required in order to choose the best one for any given problem.
Python as a programming language has evolved over the years, and today it is the number
one choice for data scientists. Its ability to act as a scripting language for quick
prototype building, its sophisticated language constructs for full-fledged software
development, and its fantastic library support for numeric computations have led to its
current popularity among data scientists and the general scientific programming
community. Python is also popular among web developers, thanks to frameworks such as
Django and Flask.
This book has been written to cater to the needs of a diverse range of data scientists,
from novices to experienced practitioners, through carefully crafted recipes that touch
upon different aspects of data science, including data exploration, data analysis and
mining, machine learning, and large-scale machine learning. Each chapter contains recipes
exploring these aspects. Sufficient math has been provided for readers to understand the
functioning of the algorithms in depth. Wherever necessary, references are provided for
curious readers. The recipes are written so that they are easy to follow and understand.
This book brings together the art of data science and the power of Python programming and
helps readers master the concepts of data science. Knowledge of Python is not
mandatory to follow this book. Non-Python programmers can refer to the first chapter,
which introduces Python data structures and functional programming concepts.
The early chapters cover the basics of data science, and the later chapters are dedicated
to advanced data science algorithms. State-of-the-art algorithms currently used in
practice by leading data scientists across industries, including ensemble methods,
random forest, and regression with regularization, are covered in detail. Some
algorithms that are popular in academia but not yet widely introduced to the
mainstream, such as rotational forest, are also covered in detail.
With many do-it-yourself books on data science on the market today, we feel that there is
a gap in covering the right mix of the mathematics behind data science algorithms and
their implementation details. This book is an attempt to fill this gap. Each recipe
provides just enough mathematical introduction to understand how the algorithm works; I
believe that readers can then take full advantage of these methods in their
applications.
A word of caution: these recipes are written with the objective of explaining data
science algorithms to the reader. They have not been stress-tested under extreme
conditions and are not production ready. Production-ready data science code has to go
through a rigorous engineering pipeline.
This book can be used both as a guide to learning data science methods and as a quick
reference. It is a self-contained book that introduces data science to a new reader with
little programming background and helps them become an expert in this trade.
What this book covers
Chapter 1, Python for Data Science, introduces Python’s built-in data structures and
functions, which are very handy for data science programming.
Chapter 2, Python Environments, introduces Python’s scientific programming and plotting
libraries, including NumPy, matplotlib, and scikit-learn.
Chapter 3, Data Analysis – Explore and Wrangle, covers data preprocessing and
transformation routines to perform exploratory data analysis tasks in order to efficiently
build data science algorithms.
Chapter 4, Data Analysis – Deep Dive, introduces the concept of dimensionality reduction
in order to tackle the curse of dimensionality in data science. Dimensionality reduction
techniques are discussed in detail, starting with simple methods and moving on to
advanced state-of-the-art ones.
Chapter 5, Data Mining – Needle in a Haystack, discusses unsupervised data mining
techniques, starting with elaborate discussions on distance measures and kernel methods
and following up with clustering and outlier detection techniques.
Chapter 6, Machine Learning 1, covers supervised data mining techniques, including
nearest neighbors, Naïve Bayes, and classification trees. In the beginning, we will lay a
heavy emphasis on data preparation for supervised learning.
Chapter 7, Machine Learning 2, introduces regression problems and follows it up with
topics on regularization including LASSO and ridge. Finally, we will discuss cross-
validation techniques as a way to choose hyperparameters for these methods.
Chapter 8, Ensemble Methods, introduces various ensemble techniques, including bagging,
boosting, and gradient boosting. This chapter shows you a powerful state-of-the-art
approach in data science where, instead of building a single model for a given
problem, an ensemble or a bag of models is built.
Chapter 9, Growing Trees, introduces some more bagging methods based on tree-based
algorithms. Due to their robustness to noise and universal applicability to a variety of
problems, they are very popular among the data science community.
Chapter 10, Large-Scale Machine Learning – Online Learning, covers large-scale machine
learning and algorithms suited to tackling such large-scale problems. This includes
algorithms that work with streaming data and data that cannot fit into memory
completely.
What you need for this book
All the recipes in this book were developed and tested on an 8 GB machine with an Intel i7
CPU running 64-bit Windows 7.
Python 2.7.5, NumPy 1.8.0, SciPy 0.13.2, Matplotlib 1.3.1, NLTK 3.0.2, and scikit-learn
0.15.2 were used to develop the recipes.
The same code should work on Linux variants and Macs with the appropriate libraries
mentioned here. Alternatively, you can create a Python virtual environment with these
library versions and run all the recipes there.
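As a rough way to see how a given environment differs from the one described above, the following sketch compares installed library versions with the versions listed in this section. It is not part of the book's code: the `BOOK_VERSIONS` table and the `compare_versions` helper are hypothetical names introduced here, and the sketch assumes Python 3.8 or later (for `importlib.metadata`), whereas the recipes themselves were developed on Python 2.7.5.

```python
from importlib import metadata

# Library versions listed in this section (the dictionary itself is an
# illustrative helper, not something the book provides).
BOOK_VERSIONS = {
    "numpy": "1.8.0",
    "scipy": "0.13.2",
    "matplotlib": "1.3.1",
    "nltk": "3.0.2",
    "scikit-learn": "0.15.2",
}


def compare_versions(expected=BOOK_VERSIONS):
    """Return {package: (version used in the book, installed version or None)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in sorted(compare_versions().items()):
        found = installed if installed is not None else "not installed"
        print(f"{name}: book used {wanted}, found {found}")
```

A mismatch does not necessarily mean a recipe will fail; it only flags where behavior may differ from what the book describes.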
Who this book is for
This book is intended for data science professionals at all levels, both students and
practitioners, from novices to experts. Different recipes in the chapters cater to the
needs of different audiences. Novice readers can spend some time getting acquainted
with data science in the first five chapters. Experts can refer to the later chapters to
understand how advanced techniques are implemented using Python. The book
covers just enough mathematics and provides the necessary references for computer
programmers who wish to understand data science. People from a non-Python background
can also use this book effectively; the first chapter introduces Python as a
programming language for data science. It will be helpful if you have some prior basic
programming experience. The book is mostly self-contained; it introduces data science
to a new reader and can help them become an expert in this trade.
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to
do it, How it works, There’s more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
Getting ready
This section tells you what to expect in the recipe, and describes how to set up any
software or any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.