Informative Questions
1. What is PySpark, and how does it differ from traditional Spark?
2. Explain the concept of Resilient Distributed Datasets (RDDs) in PySpark.
3. How do DataFrames in PySpark differ from RDDs?
4. What are some common transformations and actions available in PySpark? (A short sketch contrasting the two follows this list.)
5. Describe how PySpark handles partitioning and shuffling of data.
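For question 4, a minimal sketch contrasting lazy transformations with eager actions; the sample data, app name, and column names are illustrative assumptions, not taken from this document:
from pyspark.sql import SparkSession

# App name and sample rows are made up for the example.
spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Cathy", 29)], ["Name", "Age"])

# Transformations are lazy: they only extend the logical plan, nothing runs yet.
adults = df.filter(df.Age > 30)                         # transformation
renamed = adults.withColumnRenamed("Age", "AgeYears")   # transformation

# Actions trigger execution of the accumulated plan.
print(renamed.count())  # action: runs the job and returns a number
renamed.show()          # action: runs the job and prints the rows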
Scenario-Based Questions
1. You need to process streaming data from Kafka using PySpark Streaming. How would you set this up? (See the sketch after this list.)
2. Imagine you have to join two large DataFrames that do not fit into memory; what strategies would you employ?
3. How would you optimize a slow-running PySpark job that processes large datasets?
4. You need to perform aggregations on a dataset that has missing values; how would you handle this in PySpark?
5. If you encounter skewed data during processing, what techniques can you use to mitigate its effects?
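For scenario 1, one way to wire this up is with Structured Streaming's Kafka source; the broker address, topic name, checkpoint path, and app name below are placeholder assumptions, and the console sink is only for illustration:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Read from Kafka as a streaming DataFrame (requires the spark-sql-kafka
# connector on the classpath; broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers key/value as binary, so cast the payload to a string.
messages = raw.select(col("value").cast("string").alias("message"))

# Start the query; a checkpoint location gives fault tolerance on restart.
query = (messages.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()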
Coding Questions
1. Write PySpark code to create a DataFrame from a list of tuples and show its content:
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
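All of the snippets in this section assume an active SparkSession already bound to the name spark (as it is in the pyspark shell or a notebook); in a standalone script it could be created roughly like this, with an arbitrary placeholder app name:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; "pyspark-practice" is a made-up app name.
spark = SparkSession.builder.appName("pyspark-practice").getOrCreate()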
2. Implement code to filter rows from a DataFrame based on a condition (e.g., Age > 30):
filtered_df = df.filter(df.Age > 30)
filtered_df.show()
3. Write code to group data by a column and calculate the average of another column (e.g., average age by name):
avg_age_df = df.groupBy("Name").agg({"Age": "avg"})
avg_age_df.show()
4. Create a DataFrame from an external JSON file and display its schema and content:
json_df = spark.read.json("data.json")
json_df.printSchema()
json_df.show()
5. Write code to perform an inner join between two DataFrames and show the result:
df1 = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["Name", "ID"])
df2 = spark.createDataFrame([(1, "HR"), (2, "Finance")], ["ID", "Department"])
joined_df = df1.join(df2, "ID", "inner")
joined_df.show()
6. Implement code to write a DataFrame to Parquet format and read it back into another DataFrame:
df.write.parquet("output.parquet")
parquet_df = spark.read.parquet("output.parquet")
parquet_df.show()
7. Create a new column in an existing DataFrame by applying a transformation on another column (e.g., double the age):
df_with_new_col = df.withColumn("Double_Age", df.Age * 2)
df_with_new_col.show()
8. Write code to handle missing values in a DataFrame by filling them with default values (e.g., fill null ages with 0):
filled_df = df.fillna({"Age": 0})
filled_df.show()
9. Implement code to calculate the total number of records in a DataFrame using an action (e.g., count):
total_count = df.count()
print(f"Total records: {total_count}")
10. Write PySpark code to create and use a temporary view for SQL queries on DataFrames:
df.createOrReplaceTempView("people")
sql_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
sql_result.show()