
Dask For Parallel Computing Cheat Sheet

Dask is a Python library for parallel computing that allows users to scale Python programs beyond a single machine. It provides parallel versions of NumPy, Pandas, and other tools. Dask can run on a single machine or on a cluster. It supports reading and writing data from common file formats like CSV, HDF5, and Parquet. Users write normal Python and NumPy code and Dask handles the parallel execution and scheduling.
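As a rough illustration of that workflow (the file pattern 'data-*.csv' and the 'amount' column below are invented placeholders), a Dask dataframe computation looks like ordinary Pandas code plus a final compute() call:

import dask.dataframe as dd

# Lazily point Dask at many CSV files at once
# ('data-*.csv' and the 'amount' column are invented placeholders).
df = dd.read_csv('data-*.csv')

# Pandas-style operations build a task graph; nothing runs yet.
total = df.amount.sum()

# compute() triggers the parallel execution and returns a plain number.
print(total.compute())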


DASK FOR PARALLEL COMPUTING CHEAT SHEET

See full Dask documentation at: http://dask.pydata.org/


These instructions use the conda environment manager. Get yours at http://bit.ly/getconda

DASK QUICK INSTALL


Install Dask with conda conda install dask
Install Dask with pip pip install dask[complete]
DASK COLLECTIONS EASY TO USE BIG DATA COLLECTIONS
DASK DATAFRAMES PARALLEL PANDAS DATAFRAMES FOR LARGE DATA
Import import dask.dataframe as dd
Read CSV data df = dd.read_csv('my-data.*.csv')
Read Parquet data df = dd.read_parquet('my-data.parquet')
Filter and manipulate data with Pandas syntax df['z'] = df.x + df.y
Standard groupby aggregations, joins, etc. result = df.groupby(df.z).y.mean()
Compute result as a Pandas dataframe out = result.compute()
Or store to CSV, Parquet, or other formats result.to_frame().to_parquet('my-output.parquet')
EXAMPLE df = dd.read_csv('filenames.*.csv')
df.groupby(df.timestamp.dt.day)\
  .value.mean().compute()
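A self-contained variant that needs no CSV files on disk, using dd.from_pandas with made-up data, might look like this:

import pandas as pd
import dask.dataframe as dd

# Small in-memory Pandas dataframe (purely illustrative data).
pdf = pd.DataFrame({'z': [0, 0, 1, 1], 'y': [1.0, 2.0, 3.0, 4.0]})

# Split it into two partitions so Dask can work on it in parallel.
df = dd.from_pandas(pdf, npartitions=2)

# Same groupby/aggregation pattern as above; compute() returns a Pandas Series.
out = df.groupby(df.z).y.mean().compute()
print(out)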
DASK ARRAYS PARALLEL NUMPY ARRAYS FOR LARGE DATA
Import import dask.array as da
Create from any array-like object, including HDF5, NetCDF, or other on-disk formats:
import h5py
dataset = h5py.File('my-data.hdf5')['/group/dataset']
x = da.from_array(dataset, chunks=(1000, 1000))
Alternatively, generate an array from a random distribution:
x = da.random.uniform(size=(10000, 10000), chunks=(100, 100))
Perform operations with NumPy syntax y = x.dot(x.T - 1) - x.mean(axis=0)
Compute result as a NumPy array result = y.compute()
Or store to HDF5, NetCDF, or other on-disk formats:
out = f.create_dataset(...)
x.store(out)
EXAMPLE with h5py.File('my-data.hdf5') as f:
    x = da.from_array(f['/path'], chunks=(1000, 1000))
    x -= x.mean(axis=0)
    out = f.create_dataset(...)
    x.store(out)
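If no HDF5 file is handy, a self-contained sketch with a randomly generated array (the sizes and chunk shapes here are arbitrary) shows the same pattern:

import dask.array as da

# 4000 x 4000 array of random values, split into 1000 x 1000 chunks.
x = da.random.uniform(size=(4000, 4000), chunks=(1000, 1000))

# NumPy-style expressions build a lazy task graph.
y = (x - x.mean(axis=0)).sum(axis=1)

# compute() executes the graph in parallel and returns a NumPy array.
result = y.compute()
print(result.shape)   # (4000,)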
DASK BAGS PARALLEL LISTS FOR UNSTRUCTURED DATA
Import import dask.bag as db
Create Dask Bag from a sequence b = db.from_sequence(seq, npartitions=2)
Or read from text formats b = db.read_text('my-data.*.json')
Map and filter results import json
records = (b.map(json.loads)
            .filter(lambda d: d["name"] == "Alice"))
Compute aggregations like mean, count, sum records.pluck('key-name').mean().compute()
Or store results back to text formats records.to_textfiles('output.*.json')
EXAMPLE (db.read_text('s3://bucket/my-data.*.json')
           .map(json.loads)
           .filter(lambda d: d["name"] == "Alice")
           .to_textfiles('s3://bucket/output.*.json'))
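A small self-contained sketch (the record fields and values are invented) that exercises the same Bag operations without any files:

import dask.bag as db

# Bag built from an in-memory sequence of dict records (invented data).
records = db.from_sequence(
    [{'name': 'Alice', 'amount': 10},
     {'name': 'Bob', 'amount': 20},
     {'name': 'Alice', 'amount': 30}],
    npartitions=2)

# Filter, project a field, and aggregate, then compute the concrete result.
alice_total = (records.filter(lambda d: d['name'] == 'Alice')
                      .pluck('amount')
                      .sum()
                      .compute())
print(alice_total)   # 40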

DASK COLLECTIONS (CONTINUED)
ADVANCED
Read from distributed file systems or cloud storage df = dd.read_parquet('s3://bucket/myfile.parquet')
Prepend prefixes like hdfs://, s3://, or gcs:// to paths b = db.read_text('hdfs:///path/to/my-data.*.json')
Persist lazy computations in memory df = df.persist()
Compute multiple outputs at once dask.compute(x.min(), x.max())
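For instance, persisting an intermediate collection and then computing several statistics in one pass (the array sizes here are arbitrary) could look like:

import dask
import dask.array as da

# A lazy intermediate result we expect to reuse several times.
x = da.random.uniform(size=(2000, 2000), chunks=(500, 500))
x = x.persist()   # start computing and keep the chunks in memory

# One dask.compute call evaluates both outputs in a single pass
# instead of traversing the data twice.
lo, hi = dask.compute(x.min(), x.max())
print(lo, hi)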
CUSTOM COMPUTATIONS FOR CUSTOM CODE AND COMPLEX ALGORITHMS
DASK DELAYED LAZY PARALLELISM FOR CUSTOM CODE
Import import dask
Wrap custom functions with the @dask.delayed annotation:
@dask.delayed
def load(filename):
    ...

@dask.delayed
def process(data):
    ...
Delayed functions operate lazily, producing a task graph rather than executing immediately. Functions can also be wrapped directly, without the decorator:
load = dask.delayed(load)
process = dask.delayed(process)
Passing delayed results to other delayed functions creates dependencies between tasks. Call functions in normal code:
data = [load(fn) for fn in filenames]
results = [process(d) for d in data]
Compute results to execute in parallel dask.compute(results)
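A complete toy example of the same pattern, with trivial stand-in functions (inc and add are placeholders for real loading and processing steps):

import dask

@dask.delayed
def inc(x):
    # Stand-in for an expensive "load" step.
    return x + 1

@dask.delayed
def add(x, y):
    # Stand-in for a "process" step that depends on two earlier results.
    return x + y

# Calling delayed functions only builds a task graph; nothing runs yet.
a = inc(1)
b = inc(2)
total = add(a, b)

# Execute the graph in parallel and collect the result.
print(total.compute())   # 5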
CONCURRENT.FUTURES ASYNCHRONOUS REAL-TIME PARALLELISM
Import from dask.distributed import Client, as_completed
Start local Dask Client client = Client()
Submit individual task asynchronously future = client.submit(func, *args, **kwargs)
Block and gather individual result result = future.result()
Process results as they arrive for future in as_completed(futures):
...
EXAMPLE L = [client.submit(read, fn) for fn in filenames]
L = [client.submit(process, future) for future in L]
future = client.submit(sum, L)
result = future.result()
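A runnable sketch of the same pattern with a placeholder function; Client(processes=False) keeps everything in one process so the snippet works without any cluster setup:

from dask.distributed import Client, as_completed

def square(x):
    # Placeholder for a real task.
    return x * x

if __name__ == '__main__':
    # Threaded local client: no separate scheduler or workers needed.
    client = Client(processes=False)

    # Submit tasks asynchronously; each call returns a Future immediately.
    futures = [client.submit(square, i) for i in range(10)]

    # Handle results in completion order rather than submission order.
    total = 0
    for future in as_completed(futures):
        total += future.result()
    print(total)   # 285

    client.close()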
SET UP CLUSTER HOW TO LAUNCH ON A CLUSTER
MANUALLY
Start scheduler on one machine $ dask-scheduler
Scheduler started at SCHEDULER_ADDRESS:8786
Start workers on other machines host1$ dask-worker SCHEDULER_ADDRESS:8786
Provide address of the running scheduler host2$ dask-worker SCHEDULER_ADDRESS:8786
Start Client from Python process from dask.distributed import Client
client = Client('SCHEDULER_ADDRESS:8786')
ON A SINGLE MACHINE
Call Client() with no arguments for easy setup on a single host client = Client()
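Client() also accepts the same keyword arguments as a local cluster, so the single-machine setup can be tuned explicitly. A minimal sketch (the worker and thread counts are arbitrary examples, not recommendations):

from dask.distributed import Client

if __name__ == '__main__':
    # Local cluster with 4 worker processes of 1 thread each
    # (the counts are arbitrary examples, not recommendations).
    client = Client(n_workers=4, threads_per_worker=1)
    print(client)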
CLOUD DEPLOYMENT
See dask-kubernetes project for Google Cloud pip install dask-kubernetes
See dask-ec2 project for Amazon EC2 pip install dask-ec2

MORE RESOURCES
User Documentation dask.pydata.org
Technical documentation for distributed scheduler distributed.readthedocs.org
Report a bug github.com/dask/dask/issues

anaconda.com · [email protected] · 512-776-1066


