The document discusses common Spark functions for manipulating and analyzing large datasets. It lists functions like map, filter, reduce, groupBy, join, union and others and provides examples of how to use each function.
Spark Commands
2. df.show(): Displays the first rows of the DataFrame (20 by default).
Example: df.show()
3. df.printSchema(): Prints the schema of the DataFrame (column names, types, and nullability).
Example: df.printSchema()
4. df.select(): Selects specific columns and returns them as a new DataFrame.
Example: df.select("column1", "column2")
5. df.filter(): Keeps only the rows that satisfy a condition.
Example: df.filter(df["column1"] > 10)
6. df.groupBy(): Groups the DataFrame by the specified columns so that an aggregation such as count() can be applied.
Example: df.groupBy("column1").count()
7. df.join(): Joins two DataFrames on a join condition, using the given join type.
Example: df1.join(df2, df1["key"] == df2["key"], "inner")
Spark is a distributed computing framework that's widely used for big data processing and analytics. It offers various functions and capabilities to manipulate and analyze large datasets efficiently. Here are some common Spark functions:
1. map(func): Applies a function to each element of an RDD (Resilient Distributed Dataset) and returns a new RDD with the results.
2. flatMap(func): Similar to map, but each input item can be mapped to zero or more output items.
3. filter(func): Returns a new RDD containing only the elements that satisfy a predicate (i.e., a condition).
4. reduce(func): Action that aggregates the elements of an RDD using a commutative and associative binary operation and returns the result to the driver.
5. groupBy(func): Groups the elements of an RDD by the key that the specified function returns.
6. sortBy(func): Sorts the elements of an RDD by the key that the specified function returns.
7. join(other): Performs an inner join between two RDDs of key-value pairs, matching on key.
8. union(other): Returns the union of two RDDs; duplicates are kept.
9. intersection(other): Returns the elements present in both RDDs, without duplicates.
10. distinct(): Removes duplicate elements from an RDD.
11. collect(): Retrieves all elements of an RDD to the driver as an array (a list in PySpark); use it only when the result fits in driver memory.
12. take(n): Retrieves the first n elements of an RDD.
13. foreach(func): Applies a function to each element of an RDD for its side effects; it returns nothing.
14. reduceByKey(func): Similar to reduce, but merges the values of each key in an RDD of key-value pairs and returns a new RDD.
15. aggregateByKey(zeroValue)(seqOp, combOp): Aggregates the values of each key, starting from zeroValue, using seqOp to fold values into an accumulator within a partition and combOp to merge accumulators across partitions.
16. mapPartitions(func): Similar to map, but func is called once per partition with an iterator over that partition's elements.
17. coalesce(numPartitions): Reduces the number of partitions in an RDD to a given number, avoiding a full shuffle by default.
18. repartition(numPartitions): Reshuffles the data in an RDD to create a specified number of partitions; it can increase or decrease the partition count.
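Several of the RDD functions listed above are usually chained together. The following is a minimal PySpark sketch, not a definitive implementation: the helper names (split_words, seq_op, comb_op, word_counts, mean_by_key, main), the sample data, and the local[2] master are assumptions made for illustration. Note that PySpark takes all three arguments in a single call, aggregateByKey(zeroValue, seqFunc, combFunc), rather than the curried form shown in item 15.

```python
# Pure helper functions handed to the RDD operations below.

def split_words(line):
    """flatMap helper: one input line yields zero or more lowercase words."""
    return line.lower().split()


def seq_op(acc, value):
    """aggregateByKey seqFunc: fold one value into a (sum, count) accumulator."""
    total, count = acc
    return (total + value, count + 1)


def comb_op(left, right):
    """aggregateByKey combFunc: merge (sum, count) accumulators from two partitions."""
    return (left[0] + right[0], left[1] + right[1])


def word_counts(sc, lines):
    """Chain flatMap -> map -> reduceByKey -> sortBy -> collect on a SparkContext."""
    return (
        sc.parallelize(lines)
        .flatMap(split_words)                  # lines -> words
        .map(lambda word: (word, 1))           # words -> (word, 1) pairs
        .reduceByKey(lambda a, b: a + b)       # sum the 1s per word
        .sortBy(lambda kv: (-kv[1], kv[0]))    # highest count first, then by word
        .collect()                             # bring the small result to the driver
    )


def mean_by_key(sc, pairs):
    """Compute the per-key mean with aggregateByKey, then mapValues."""
    return (
        sc.parallelize(pairs)
        .aggregateByKey((0, 0), seq_op, comb_op)       # (sum, count) per key
        .mapValues(lambda acc: acc[0] / acc[1])        # sum / count
        .collectAsMap()
    )


def main():
    # Imported here so the pure helpers above can be exercised without Spark.
    from pyspark.sql import SparkSession

    # Assumption: a local two-thread master is enough for a demonstration.
    spark = SparkSession.builder.master("local[2]").appName("rdd-sketch").getOrCreate()
    sc = spark.sparkContext
    print(word_counts(sc, ["to be or not to be", "to do"]))
    print(mean_by_key(sc, [("a", 1), ("a", 3), ("b", 10)]))
    spark.stop()
```

Calling main() on a machine with PySpark installed runs both pipelines locally. Keeping seq_op and comb_op as named functions makes the two stages of aggregateByKey explicit: the first runs on each value inside a partition, the second merges the partial results.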