Databricks & PySpark learning day-10
Let's create a PySpark DataFrame of employee data with the columns EmployeeID, Name,
Department, and Skills.
I'll demonstrate the split, explode, and other relevant PySpark array functions on this
data, with a note for each operation.
# Sample employee data (illustrative values) and column names
data = [(1, "Alice", "IT", "Python Spark Cloud"), (2, "Bob", "HR", "Excel Communication")]
columns = ["EmployeeID", "Name", "Department", "Skills"]
# Create DataFrame (on Databricks, the `spark` session already exists)
df = spark.createDataFrame(data, columns)
Note: The split() function breaks the Skills column into an array of skills on the space
separator. The alias("Skills_Array") gives the resulting array column a meaningful name.
Note: The array index starts from 0, so Skills_Array[0] gives the first skill for each employee.
Note: The size() function returns the number of elements (skills) in the Skills_Array.
Note: The array_contains() function returns a boolean indicating whether Skills_Array
contains the specified skill, "Cloud", for each employee.
Note: The explode() function takes an array column and creates a new row for each
element of the array. Here, each employee will have multiple rows, one for each skill.