Spark Commands

These notes cover common Spark functions for manipulating and analyzing large datasets, including map, filter, reduce, groupBy, join, and union, with an example for each.


1. Create a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .getOrCreate()
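
The session is the entry point for all of the reads below; when the application is done, it can be shut down to release resources:

spark.stop()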

2. Read data from a file into a DataFrame:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
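
Note that inferSchema=True makes an extra pass over the file to guess column types. For large files a schema can be supplied explicitly instead; a minimal sketch, with hypothetical column names:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", IntegerType(), True),
])
df = spark.read.csv("file.csv", header=True, schema=schema)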

3. View DataFrame schema:

df.printSchema()

4. Show first n rows of DataFrame:

df.show(n=5)

5. Select specific columns from the DataFrame:

df.select("column1", "column2")

6. Filter DataFrame:

df.filter(df["column"] > 10)


DataFrame Functions

1. spark.read.csv(): Reads a CSV file into a DataFrame.
Example:
df = spark.read.csv("file.csv", header=True, inferSchema=True)

2. df.show(): Displays the contents of the DataFrame.
Example:
df.show()

3. df.printSchema(): Prints the schema of the DataFrame.
Example:
df.printSchema()

4. df.select(): Selects specific columns from the DataFrame.
Example:
df.select("column1", "column2")

5. df.filter(): Filters rows based on a condition.
Example:
df.filter(df["column1"] > 10)

6. df.groupBy(): Groups the DataFrame using the specified columns.
Example:
df.groupBy("column1").count()

7. df.join(): Joins two DataFrames (see the combined sketch after this list).
Example:
df1.join(df2, df1["key"] == df2["key"], "inner")

Spark Functions (RDD API)

Spark is a distributed computing framework that is widely used for big data processing and analytics. It offers various functions and capabilities to manipulate and analyze large datasets efficiently. Here are some common Spark functions; two short runnable sketches follow the list:

1. map(func): Applies a function to each element of an RDD (Resilient Distributed Dataset) and returns a new RDD with the results.
2. flatMap(func): Similar to map, but each input item can be mapped to zero or more output items.
3. filter(func): Returns an RDD consisting of elements that satisfy a predicate (i.e., a condition).
4. reduce(func): Aggregate action that reduces the elements of an RDD using a specified associative binary operation.
5. groupBy(func): Groups the elements of an RDD according to a specified function.
6. sortBy(func): Sorts the elements of an RDD based on a specified function.
7. join(other): Performs an inner join between two RDDs of key-value pairs.
8. union(other): Returns the union of two RDDs.
9. intersection(other): Returns the intersection of two RDDs.
10. distinct(): Removes duplicate elements from an RDD.
11. collect(): Retrieves all elements of an RDD as an array.
12. take(n): Retrieves the first n elements of an RDD.
13. foreach(func): Applies a function to each element of an RDD.
14. reduceByKey(func): Similar to reduce, but performs the reduction on elements with the same key.
15. aggregateByKey(zeroValue)(seqOp, combOp): Aggregates the values of each key using the given combine functions.
16. mapPartitions(func): Similar to map, but operates on each partition of the RDD.
17. coalesce(numPartitions): Reduces the number of partitions in an RDD to a given number.
18. repartition(numPartitions): Reshuffles the data in an RDD to create a specified number of partitions.
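
A minimal runnable sketch exercising several of the core functions above (items 1-6 and 10-12) on a toy RDD; the numbers are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 4])

squares = rdd.map(lambda x: x * x)                 # map: square each element
pairs = rdd.flatMap(lambda x: [x, -x])             # flatMap: one input, two outputs
evens = rdd.filter(lambda x: x % 2 == 0)           # filter: keep even numbers
total = rdd.reduce(lambda a, b: a + b)             # reduce: 1+2+3+4+5+4 = 19
by_parity = rdd.groupBy(lambda x: x % 2)           # groupBy: key 0 = even, 1 = odd
unique_desc = rdd.distinct().sortBy(lambda x: -x)  # distinct, then sortBy descending

print(squares.collect())   # collect: [1, 4, 9, 16, 25, 16]
print(evens.take(2))       # take: first 2 even elements
print(total)               # 19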
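
A second sketch, continuing with the same sc, for the pair-RDD and partitioning functions (items 7-9 and 13-18). The keys and values are invented, and note that PySpark's aggregateByKey takes all three arguments in a single call; here it builds a per-key (sum, count) pair:

sales = sc.parallelize([("a", 3), ("b", 1), ("a", 2)])
prices = sc.parallelize([("a", 10), ("b", 20)])

totals = sales.reduceByKey(lambda a, b: a + b)   # [("a", 5), ("b", 1)]

# aggregateByKey: per-key (sum, count), usable for per-key averages
sum_count = sales.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # seqOp: fold one value into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),     # combOp: merge partial results
)

joined = totals.join(prices)                     # [("a", (5, 10)), ("b", (1, 20))]
both = totals.union(prices)                      # union keeps duplicates
shared = totals.keys().intersection(prices.keys())  # keys present in both RDDs

# mapPartitions gets an iterator per partition; coalesce/repartition change
# the partition count (coalesce avoids a full shuffle when shrinking)
partition_sums = sales.mapPartitions(lambda it: [sum(v for _, v in it)])
fewer = sales.coalesce(1)
more = sales.repartition(4)

totals.foreach(print)                            # foreach: runs on the executors

spark.stop()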
