Distributed Machine
Learning with PySpark
Migrating Effortlessly from Pandas
and Scikit-Learn
Abdelaziz Testas
Distributed Machine Learning with PySpark: Migrating Effortlessly from Pandas
and Scikit-Learn
Abdelaziz Testas
Fremont, CA, USA
Acknowledgments .............................................................. xvii
Introduction ................................................................. xix
Table of Contents
Scikit-Learn ................................................................. 47
PySpark ...................................................................... 49
Summary ...................................................................... 51
Chapter 13: Recommender Systems with Pandas, Surprise, and PySpark .......... 329
The Dataset .................................................................. 331
Building a Recommender System ................................................ 339
Recommender System with Surprise ............................................. 339
Recommender System with PySpark .............................................. 345
Bringing It All Together ..................................................... 350
Surprise ..................................................................... 351
PySpark ...................................................................... 352
Summary ...................................................................... 353
Chapter 15: k-Means Clustering with Pandas, Scikit-Learn, and PySpark ....... 395
The Dataset .................................................................. 396
Machine Learning with k-Means ................................................ 400
k-Means Clustering with Scikit-Learn ......................................... 400
k-Means Clustering with PySpark .............................................. 408
Bringing It All Together ..................................................... 412
Scikit-Learn ................................................................. 412
PySpark ...................................................................... 414
Summary ...................................................................... 416
Chapter 18: Deploying Models in Production with Scikit-Learn and PySpark .... 463
Steps in Model Deployment .................................................... 465
Deploying a Multilayer Perceptron (MLP) ...................................... 466
Deployment with Scikit-Learn ................................................. 466
PySpark ...................................................................... 470
Bringing It All Together ..................................................... 478
Scikit-Learn ................................................................. 479
PySpark ...................................................................... 480
Summary ...................................................................... 482
Index ........................................................................ 483
About the Author
Abdelaziz Testas, PhD, is a data scientist with over a
decade of experience in data analysis and machine learning,
specializing in the use of standard Python libraries and Spark
distributed computing. He holds a PhD in Economics from
the University of Leeds and a master’s degree in Finance
from the University of Glasgow. He has completed several
certificates in computer science and data science.
For the last ten years, the author worked for Nielsen
in Fremont, California, as a lead data scientist, focusing
on improving the company's audience measurement
by planning, initiating, and executing end-to-end data
science projects and methodology work. He drove advanced solutions into Nielsen's
digital ad and content rating products by leveraging subject matter expertise in media
measurement and data science. The author is passionate about helping others improve
their machine learning skills and workflows and is excited to share his knowledge and
experience with a wider audience through this book.
About the Technical Reviewer
Bharath Kumar Bolla has over 12 years of experience and
is a senior data scientist at Salesforce, Hyderabad. Bharath
obtained an MS in Data Science from the University of
Arizona, USA. He also holds a master's degree in Life Sciences from
Mississippi State University, USA. Bharath worked as a
research scientist for around seven years at the University
of Georgia, Emory University, and Eurofins LLC. At Verizon,
Bharath led a team to build a “Smart Pricing” solution, and
at Happiest Minds, he worked on AI-based digital marketing
products. Along with his day-to-day responsibilities, he is a mentor and an active
researcher with more than 20 publications in conferences and journals. Bharath received
the “40 Under 40 Data Scientists 2021” award from Analytics India Magazine for his
accomplishments.
Acknowledgments
I would like to express my gratitude to all those who have directly or indirectly
contributed to the creation and publication of this book. First and foremost, I am deeply
thankful to my family for their unwavering support and encouragement throughout this
journey.
I would like to acknowledge the invaluable assistance of Apress in bringing this
book to fruition. Specifically, I wish to extend my heartfelt thanks to Celestin John, the
Acquisitions Editor for AI and Machine Learning, for his guidance in shaping the main
themes of this work and providing continuous feedback during the initial stages of the
book. I am also grateful to Nirmal Selvaraj for his dedication and support as my main
point of contact throughout the development cycle of the book. Additionally, I would
like to express my appreciation to Laura Berendson for her advisory role and invaluable
contributions as our Development Editor.
Lastly, I extend my gratitude to the technical reviewer Bharath Kumar Bolla who
diligently reviewed the manuscript and provided valuable suggestions for improvement.
Your meticulousness and expertise have significantly enhanced the quality and clarity of
the final product.
To all those who have contributed, whether mentioned individually or not, your
contributions have played an integral part in making this book a reality. Thank you for
your support, feedback, and commitment to this project.
Introduction
In recent years, the amount of data generated and collected by companies and
organizations has grown exponentially. As a result, data scientists have been pushed
to process and analyze large amounts of data, and traditional single-node computing
tools such as Pandas and Scikit-Learn have become inadequate. In response, many data
scientists have turned to distributed computing frameworks such as Apache Spark, with
its Python-based interface, PySpark.
PySpark has several advantages over single-node computing, including the ability
to handle large volumes of data and the potential for significantly faster data processing
times. Furthermore, because PySpark is built on top of Spark, a widely used distributed
computing framework, it also offers a broader set of tools for data processing and
machine learning.
While transitioning from Pandas and Scikit-Learn to PySpark may seem daunting,
the transition can be relatively straightforward. Pandas/Scikit-Learn and PySpark offer
similar APIs, which means that many data scientists can easily transition from one to
the other.
In this context, this book will explore the benefits of using PySpark over traditional
single-node computing tools and provide guidance for data scientists who are
considering transitioning to PySpark.
In this book, we aim to provide a comprehensive overview of the main machine
learning algorithms with a particular focus on regression and classification. These are
fundamental techniques that form the backbone of many practical applications of
machine learning. We will cover popular methods such as linear and logistic regression,
decision trees, random forests, gradient-boosted trees, support vector machines, Naive
Bayes, and neural networks. We will also discuss how these algorithms can be applied
to real-world problems, such as predicting house prices and the likelihood of diabetes,
classifying handwritten digits and the species of an Iris flower, and predicting
whether a tumor is benign or malignant. Whether you are a beginner or an experienced
practitioner, this book is designed to help you understand the core concepts of machine
learning and develop the skills needed to apply these methods in practice.
This book spans 18 chapters and covers multiple topics. The first two chapters
examine why migration from Pandas and Scikit-Learn to PySpark can be a seamless
process, and address the challenges of selecting an algorithm. Chapters 3–6 build, train,
and evaluate some popular regression models, namely, multiple linear regression,
decision trees, random forests, and gradient-boosted trees, and use them to deal
with some real-world tasks such as predicting house prices. Chapters 7–12 deal with
classification issues by building, training, and evaluating widely used algorithms such
as logistic regression, decision trees, random forests, support vector machines, Naive
Bayes, and neural networks. In Chapters 13–15, we examine three additional types of
algorithms, namely, recommender systems, natural language processing, and clustering
with k-means. In the final three chapters, we deal with hyperparameter tuning, pipelines,
and deploying models into production.
CHAPTER 1
An Easy Transition
One of the key factors in making the transition from Pandas and Scikit-Learn to PySpark
relatively easy is the similarity in functionality. This similarity will become evident after
reading this chapter and executing the code described herein.
One of the easiest ways to test the code is by signing up for an online Databricks
Community Edition account and creating a workspace. Databricks provides detailed
documentation on how to create a cluster, upload data, and create a notebook.
Additionally, Spark can also be installed locally through the pip install pyspark
command.
Another option is Google Colab. PySpark is typically preinstalled on Colab; if it is
not, it can be installed using the !pip install pyspark command in a Colab notebook.
This command installs PySpark and its dependencies in the Colab
environment. While both provide Jupyter-like notebooks, Colab runs on a single-machine
instance, whereas Databricks provides multi-node clusters for parallel processing. This
makes Databricks better suited for handling larger
datasets and more complex computational tasks in a collaborative team environment.
Although Pandas and Scikit-Learn are tools primarily designed for small data
processing and analysis, and PySpark is a big data processing framework, PySpark
offers functionality similar to Pandas and Scikit-Learn. This includes DataFrame
operations and machine learning algorithms. The presence of these familiar
functionalities in PySpark facilitates a smoother transition for data scientists accustomed
to working with Pandas and Scikit-Learn.
In this chapter, we examine in greater depth the factors that contribute to the ease
of transition from these small data tools (Pandas and Scikit-Learn) to PySpark. More
specifically, we focus on PySpark and Pandas integration and the similarity in syntax
between PySpark, on the one hand, and Pandas and Scikit-Learn, on the other.
© Abdelaziz Testas 2023
A. Testas, Distributed Machine Learning with PySpark, https://doi.org/10.1007/978-1-4842-9751-3_1
[In]: pyspark_df.show()
[Out]:
Country River
Notice that before converting the Pandas DataFrame to the PySpark DataFrame
using the PySpark createDataFrame() method in step 3, we needed to create a Spark
Session named spark. There are two lines of code to achieve this. In the first line
(from pyspark.sql import SparkSession), we imported the SparkSession class from the
pyspark.sql module. In the second line, we created a new instance of the Spark Session
named spark (spark = SparkSession.builder.appName("BigRivers").getOrCreate()). In this
line, SparkSession.builder returns a builder object for configuring the session;
appName("BigRivers") sets the name under which the application appears in the Spark UI;
and getOrCreate() returns the existing Spark Session if one is already running, or
creates a new one otherwise.
[In]: print(pandas_df)
[Out]:
Country River
Notice that Pandas displays an index column in the output, providing a unique
identifier for each row, whereas PySpark does not explicitly show an index column in its
DataFrame output.
As we have seen in this section, it is easy to toggle between Pandas and PySpark,
thanks to the close integration between the two libraries.
Similarity in Syntax
Another key factor that contributes to the smooth transition from small data tools
(Pandas and Scikit-Learn) to big data with PySpark is the familiarity of syntax. PySpark
shares a similar syntax with Pandas and Scikit-Learn in many cases. For example, square
brackets ([]) can be used on Databricks or Google Colab to select columns directly from
a PySpark DataFrame, just like in Pandas. Similarly, PySpark provides a DataFrame
API that resembles Pandas, and Spark MLlib (Machine Learning Library) includes
implementations of various machine learning algorithms found in Scikit-Learn.
In this section, we present examples of how PySpark code is similar to Pandas and
Scikit-Learn, facilitating an easy transition for data scientists between these tools.