Databricks & PySpark learning day-10

Uploaded by

suresh
Copyright
© All Rights Reserved
Master PySpark: From Zero to Big Data Hero!!

Split Function in DataFrame

Let's create a PySpark DataFrame of employee data with the columns EmployeeID, Name,
Department, and Skills.
We'll then apply split, explode, and other related PySpark functions to this data, with a
note explaining each operation.

Sample Data Creation for Employee Data


from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, size, array_contains, col

# Get the active SparkSession, or create one (Databricks notebooks already
# provide `spark`; the app name here is arbitrary)
spark = SparkSession.builder.appName("EmployeeSkills").getOrCreate()

# Sample employee data
data = [
    (1, "Alice", "HR", "Communication Management"),
    (2, "Bob", "IT", "Programming Networking"),
    (3, "Charlie", "Finance", "Accounting Analysis"),
    (4, "David", "HR", "Recruiting Communication"),
    (5, "Eve", "IT", "Cloud DevOps")
]

# Define the column names (column types are inferred from the data)
columns = ["EmployeeID", "Name", "Department", "Skills"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Display the original DataFrame
df.show(truncate=False)

Follow me on LinkedIn – Shivakiran kotur


Notes with Examples
1. Split the "Skills" column:
We will split the Skills column into an array of individual skills, using a space as the delimiter.

Note: This splits the Skills column into an array of skills based on the space separator. The
alias("Skills_Array") gives the resulting array a meaningful name.

2. Select the first skill from the "Skills_Array":


You can select specific elements from an array using index notation. In this case, we’ll select
the first skill from the Skills_Array.

Note: The array index starts from 0, so Skills_Array[0] gives the first skill for each employee.



3. Calculate the size of the "Skills_Array":
We can calculate how many skills each employee has by using the size() function.

Note: The size() function returns the number of elements (skills) in the Skills_Array.

4. Check if the array contains a specific skill:


We can check if a particular skill (e.g., "Cloud") is present in the employee's skillset using
the array_contains() function.

Note: This returns a boolean indicating whether the array contains the specified skill,
"Cloud", for each employee.



5. Use the explode function to transform array elements into individual rows:
The explode() function can be used to flatten the array into individual rows, where each skill
becomes a separate row for the employee.

Note: The explode() function takes an array column and creates a new row for each
element of the array. Here, each employee will have multiple rows, one for each skill.

Summary of Key Functions:


• split(): Splits a column's string value into an array using a delimiter pattern, which
Spark interprets as a regular expression (in this case, a single space).
• explode(): Converts an array column into multiple rows, one for each element in the
array.
• size(): Returns the number of elements in an array.
• array_contains(): Checks if a specific value exists in the array.
• selectExpr(): Allows you to use SQL expressions (like array[0]) to select array
elements.
