Advanced Analytics with PySpark
Patterns for Learning from Data at Scale Using Python and Spark
Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills
Advanced Analytics with PySpark
The amount of data being generated today is staggering—and growing. Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark’s Python API, and other best practices in Spark programming.

Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance. This updated edition also covers image processing and the Spark NLP library.

If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

• Familiarize yourself with Spark’s programming model and ecosystem
• Learn general approaches in data science
• Examine complete implementations that analyze large public datasets
• Discover which machine learning tools make sense for particular problems
• Explore code that can be adapted to many uses

Akash Tandon is cofounder and CTO of Looppanel. Previously, he worked as a senior data engineer at Atlan.

Sandy Ryza leads development of the Dagster project and is a committer on Apache Spark.

Uri Laserson is founder and CTO of Patch Biosciences. Previously, he worked on big data and genomics at Cloudera.

Sean Owen, a principal solutions architect focusing on machine learning and data science at Databricks, is an Apache Spark committer and PMC member.

Josh Wills is a software engineer at WeaveGrid and the former head of data engineering at Slack.
Advanced Analytics with PySpark
by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Copyright © 2022 Akash Tandon. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional
sales department: 800-998-9938 or [email protected].
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with PySpark, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your
own risk. If any code samples or other technology this work contains or describes is subject to open
source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-098-10365-1
[LSI]
Table of Contents
Preface vii
Preparing the Data 41
Building a First Model 44
Spot Checking Recommendations 48
Evaluating Recommendation Quality 49
Computing AUC 51
Hyperparameter Selection 52
Making Recommendations 55
Where to Go from Here 56
TF-IDF 119
Computing the TF-IDFs 120
Creating Our LDA Model 121
Where to Go from Here 124
10. Image Similarity Detection with Deep Learning and PySpark LSH 179
PyTorch 180
Installation 180
Preparing the Data 181
Resizing Images Using PyTorch 181
Deep Learning Model for Vector Representation of Images 182
Image Embeddings 183
Import Image Embeddings into PySpark 185
Image Similarity Search Using PySpark LSH 186
Nearest Neighbor Search 187
Where to Go from Here 190
Index 205
Preface
Apache Spark’s long lineage of predecessors, from MPI (message passing interface)
to MapReduce, made it possible to write programs that take advantage of massive
resources while abstracting away the nitty-gritty details of distributed systems. As
much as data processing needs have motivated the development of these frameworks,
in a way the field of big data has become so related to them that its scope is defined
by what these frameworks can handle. Spark’s original promise was to take this a little
further—to make writing distributed programs feel like writing regular programs.
The rise in Spark’s popularity coincided with that of the Python data (PyData) ecosystem. So it makes sense that Spark’s Python API—PySpark—has grown significantly in popularity over the last few years. Although the PyData ecosystem has recently sprouted distributed programming options of its own, Apache Spark remains one of the most popular choices for working with large datasets across industries and domains. Thanks to recent efforts to integrate PySpark with the other PyData tools, learning the framework can help you significantly boost your productivity as a data science practitioner.
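As a small illustration of that interoperability, consider a minimal sketch of moving data back and forth between pandas and PySpark; the data and names here are invented for illustration:

import pandas as pd
from pyspark.sql import SparkSession

# Start a local Spark session; the application name is arbitrary
spark = SparkSession.builder.appName("pydata-interop").getOrCreate()

# A small, made-up pandas DataFrame
pdf = pd.DataFrame({"user_id": [1, 2, 3], "age": [23, 31, 27]})

# Hand it to Spark as a distributed DataFrame...
df = spark.createDataFrame(pdf)
df.filter(df.age > 25).show()

# ...and bring results back into the PyData world as pandas
result = df.toPandas()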
We think that the best way to teach data science is by example. To that end, we have
put together a book of applications, trying to touch on the interactions between the
most common algorithms, datasets, and design patterns in large-scale analytics. This
book isn’t meant to be read cover to cover: page to a chapter that looks like something
you’re trying to accomplish, or that simply ignites your interest, and start there.
The ecosystem changes, combined with Spark’s latest major release, make this edition
a timely one. Unlike previous editions of Advanced Analytics with Spark, which chose
Scala, we will use Python. We’ll cover best practices and integrate with the wider
Python data science ecosystem when appropriate. All chapters have been updated
to use the latest PySpark API. Two new chapters have been added and multiple
chapters have undergone major rewrites. We will not cover Spark’s streaming and
graph libraries. With Spark in a new era of maturity and stability, we hope that these
changes will preserve the book as a useful resource on analytics for years to come.
Understanding utilization of New York cabs
We compute average taxi waiting time as a function of location by performing temporal and geospatial analysis (see Chapter 7).
Reducing risk for an investment portfolio
We estimate financial risk for an investment portfolio using Monte Carlo simulation (see Chapter 9); a toy sketch of the idea follows this list.
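To give a flavor of that second task, here is a toy Monte Carlo sketch in PySpark. The normal return distribution, the trial counts, and every name below are invented for illustration; Chapter 9 builds something far more realistic:

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Simulate hypothetical daily portfolio returns in parallel, one batch per seed
def simulate(seed, n=10_000):
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0005, scale=0.01, size=n).tolist()

returns = sc.parallelize(range(100)).flatMap(simulate)  # 1,000,000 trials in total

# 5% value at risk: the return at the 5th percentile of the simulated outcomes
worst = returns.takeOrdered(int(returns.count() * 0.05))
print("5% VaR:", worst[-1])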
When possible, we attempt not just to provide a “solution,” but to demonstrate the full data science workflow, with all of its iterations, dead ends, and restarts. This book will be useful for getting more comfortable with Python, Spark, machine learning, and data analysis. However, these are in service of a larger goal, and we hope that most of all this book will teach you how to approach tasks like those described earlier. Each chapter, in about 20 pages, will try to get as close as possible to demonstrating how to build one piece of these data applications.
Our unique network of experts and innovators share their knowledge and expertise
through books, articles, and our online learning platform. O’Reilly’s online learning
platform gives you on-demand access to live training courses, in-depth learning
paths, interactive coding environments, and a vast collection of text and video from
O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher.
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at https://oreil.ly/adv-analytics-pyspark.
Email [email protected] to comment or ask technical questions about this
book.
For news and information about our books and courses, visit https://oreilly.com.
Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia
Acknowledgments
It goes without saying that you wouldn’t be reading this book if it were not for the
existence of Apache Spark and MLlib. We all owe thanks to the team that has built
and open sourced it and the hundreds of contributors who have added to it.
We would like to thank everyone who spent a great deal of time reviewing the
content of the previous editions of the book with expert eyes: Michael Bernico, Adam
Breindel, Ian Buss, Parviz Deyhim, Jeremy Freeman, Chris Fregly, Debashish Ghosh,
Juliet Hougland, Jonathan Keebler, Nisha Muktewar, Frank Nothaft, Nick Pentreath,
Kostas Sakellis, Tom White, Marcelo Vanzin, and Juliet Hougland again. Thanks all!
We owe you one. This has greatly improved the structure and quality of the result.
Sandy also would like to thank Jordan Pinkus and Richard Wang for helping with
some of the theory behind the risk chapter.
Thanks to Jeff Bleiel and O’Reilly for the experience and great support in getting this
book published and into your hands.
CHAPTER 1
Analyzing Big Data
When people say that we live in an age of big data they mean that we have tools for
collecting, storing, and processing information at a scale previously unheard of. The
following tasks simply could not have been accomplished 10 or 15 years ago:
• Build a model to detect credit card fraud using thousands of features and billions
of transactions
• Intelligently recommend millions of products to millions of users
• Estimate financial risk through simulations of portfolios that include millions of
instruments
• Easily manipulate genomic data from thousands of people to detect genetic
associations with disease
• Assess agricultural land use and crop yield for improved policymaking by periodically processing millions of satellite images
Sitting behind these capabilities is an ecosystem of open source software that can leverage clusters of servers to process massive amounts of data. The release of Apache Hadoop in 2006 led to widespread adoption of distributed computing.
The big data ecosystem and tooling have evolved at a rapid pace since then. The past
five years have also seen the introduction and adoption of many open source machine
learning (ML) and deep learning libraries. These tools aim to leverage vast amounts
of data that we now collect and store.
But just as a chisel and a block of stone do not make a statue, there is a gap between
having access to these tools and all this data and doing something useful with it.
Often, “doing something useful” means placing a schema over tabular data and using
SQL to answer questions like “Of the gazillion users who made it to the third page
in our registration process, how many are over 25?” The field of how to architect
data storage and organize information (data warehouses, data lakes, etc.) to make
answering such questions easy is a rich one, but we will mostly avoid its intricacies in
this book.
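To make that concrete, answering such a question with a schema and SQL might look like the following PySpark sketch; the file, table, and column names are all hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table of registration events (assumed columns: user_id, age, reg_page)
users = spark.read.parquet("users.parquet")
users.createOrReplaceTempView("users")

# Of the users who made it to the third page, how many are over 25?
spark.sql("""
    SELECT COUNT(*) AS over_25
    FROM users
    WHERE reg_page >= 3 AND age > 25
""").show()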
Sometimes, “doing something useful” takes a little extra work. SQL still may be core to the approach, but to work around idiosyncrasies in the data or perform complex analysis, we need a more flexible programming paradigm with richer functionality in areas like machine learning and statistics. This is where data science comes in, and that’s what we are going to talk about in this book.
In this chapter, we’ll start by introducing big data as a concept and discuss some of
the challenges that arise when working with large datasets. We will then introduce
Apache Spark, an open source framework for distributed computing, and its key
components. Our focus will be on PySpark, Spark’s Python API, and how it fits within
a wider ecosystem. This will be followed by a discussion of the changes brought by
Spark 3.0, the framework’s first major release in four years. We will finish with a brief
note about how PySpark addresses challenges of data science and why it is a great
addition to your skillset.
Previous editions of this book used Spark’s Scala API for code examples. We decided
to use PySpark instead because of Python’s popularity in the data science community
and an increased focus by the core Spark team to better support the language. By the
end of this chapter, you will ideally appreciate this decision.
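If you would like a taste before reading on, the following is roughly the smallest possible PySpark program, assuming a local installation (for example, via pip install pyspark); it is a minimal sketch, not a pattern from later chapters:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session running locally on all available cores
spark = SparkSession.builder.master("local[*]").appName("hello").getOrCreate()

# A one-column DataFrame of the numbers 0..999,999, computed in parallel
df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").first()[0])  # 499999500000

spark.stop()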