Spark Commands

These notes cover common Spark functions for manipulating and analyzing large datasets, including map, filter, reduce, groupBy, join, and union, with an example for each.


1. Create a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .getOrCreate()
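
The session is the entry point for all of the reads below; when the application is done, it can be shut down to release resources:

spark.stop()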

2. Read data from a file into a DataFrame:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
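
Note that inferSchema=True makes an extra pass over the file to guess column types. For large files a schema can be supplied explicitly instead; a minimal sketch, with hypothetical column names:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", IntegerType(), True),
])
df = spark.read.csv("file.csv", header=True, schema=schema)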

3. View DataFrame schema:

df.printSchema()

4. Show first n rows of DataFrame:

df.show(n=5)

5. Select specific columns from the DataFrame:

df.select("column1", "column2")

6. Filter DataFrame:

df.filter(df["column"] > 10)


DataFrame Functions

1. spark.read.csv(): Reads a CSV file into a DataFrame.
Example:
df = spark.read.csv("file.csv", header=True, inferSchema=True)

2. df.show(): Displays the contents of the DataFrame.
Example:
df.show()

3. df.printSchema(): Prints the schema of the DataFrame.
Example:
df.printSchema()

4. df.select(): Selects specific columns from the DataFrame.
Example:
df.select("column1", "column2")

5. df.filter(): Filters rows based on a condition.
Example:
df.filter(df["column1"] > 10)

6. df.groupBy(): Groups the DataFrame using the specified columns.
Example:
df.groupBy("column1").count()

7. df.join(): Joins two DataFrames (see the combined sketch after this list).
Example:
df1.join(df2, df1["key"] == df2["key"], "inner")

Spark Functions (RDD API)

Spark is a distributed computing framework that is widely used for big data processing and analytics. It offers various functions and capabilities to manipulate and analyze large datasets efficiently. Here are some common Spark functions; two short runnable sketches follow the list:

1. map(func): Applies a function to each element of an RDD (Resilient Distributed Dataset) and returns a new RDD with the results.
2. flatMap(func): Similar to map, but each input item can be mapped to zero or more output items.
3. filter(func): Returns an RDD consisting of elements that satisfy a predicate (i.e., a condition).
4. reduce(func): Aggregate action that reduces the elements of an RDD using a specified associative binary operation.
5. groupBy(func): Groups the elements of an RDD according to a specified function.
6. sortBy(func): Sorts the elements of an RDD based on a specified function.
7. join(other): Performs an inner join between two RDDs of key-value pairs.
8. union(other): Returns the union of two RDDs.
9. intersection(other): Returns the intersection of two RDDs.
10. distinct(): Removes duplicate elements from an RDD.
11. collect(): Retrieves all elements of an RDD as an array.
12. take(n): Retrieves the first n elements of an RDD.
13. foreach(func): Applies a function to each element of an RDD.
14. reduceByKey(func): Similar to reduce, but performs the reduction on elements with the same key.
15. aggregateByKey(zeroValue)(seqOp, combOp): Aggregates the values of each key using the given combine functions.
16. mapPartitions(func): Similar to map, but operates on each partition of the RDD.
17. coalesce(numPartitions): Reduces the number of partitions in an RDD to a given number.
18. repartition(numPartitions): Reshuffles the data in an RDD to create a specified number of partitions.
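
A minimal runnable sketch exercising several of the core functions above (items 1-6 and 10-12) on a toy RDD; the numbers are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 4])

squares = rdd.map(lambda x: x * x)                 # map: square each element
pairs = rdd.flatMap(lambda x: [x, -x])             # flatMap: one input, two outputs
evens = rdd.filter(lambda x: x % 2 == 0)           # filter: keep even numbers
total = rdd.reduce(lambda a, b: a + b)             # reduce: 1+2+3+4+5+4 = 19
by_parity = rdd.groupBy(lambda x: x % 2)           # groupBy: key 0 = even, 1 = odd
unique_desc = rdd.distinct().sortBy(lambda x: -x)  # distinct, then sortBy descending

print(squares.collect())   # collect: [1, 4, 9, 16, 25, 16]
print(evens.take(2))       # take: first 2 even elements
print(total)               # 19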
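
A second sketch, continuing with the same sc, for the pair-RDD and partitioning functions (items 7-9 and 13-18). The keys and values are invented, and note that PySpark's aggregateByKey takes all three arguments in a single call; here it builds a per-key (sum, count) pair:

sales = sc.parallelize([("a", 3), ("b", 1), ("a", 2)])
prices = sc.parallelize([("a", 10), ("b", 20)])

totals = sales.reduceByKey(lambda a, b: a + b)   # [("a", 5), ("b", 1)]

# aggregateByKey: per-key (sum, count), usable for per-key averages
sum_count = sales.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # seqOp: fold one value into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),     # combOp: merge partial results
)

joined = totals.join(prices)                     # [("a", (5, 10)), ("b", (1, 20))]
both = totals.union(prices)                      # union keeps duplicates
shared = totals.keys().intersection(prices.keys())  # keys present in both RDDs

# mapPartitions gets an iterator per partition; coalesce/repartition change
# the partition count (coalesce avoids a full shuffle when shrinking)
partition_sums = sales.mapPartitions(lambda it: [sum(v for _, v in it)])
fewer = sales.coalesce(1)
more = sales.repartition(4)

totals.foreach(print)                            # foreach: runs on the executors

spark.stop()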
