Apache Spark - Practices
A) Centralized control
B) Scalability
A) Network latency
B) Data locality
C) Single-threaded processing
5. Which of the following are key features of Apache Spark? (Select all that apply)
A) In-memory processing
B) Lazy evaluation
D) Strict consistency
6. Which of the following storage formats can Spark read? (Select all that apply)
A) Parquet
B) JSON
C) CSV
D) XML
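For reference: Parquet, JSON, and CSV are readable with Spark's built-in readers, while XML typically needs an add-on package such as spark-xml. A minimal PySpark sketch, assuming files exist at the given paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-practice").getOrCreate()

    df_parquet = spark.read.parquet("data.parquet")                      # columnar, schema embedded
    df_json = spark.read.json("data.json")                               # schema inferred from records
    df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)   # header row + type inference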
A) By using replication
9. Which of the following components are part of the Spark ecosystem? (Select all that apply)
A) Spark SQL
B) Hadoop HDFS
C) Spark Streaming
D) Apache Kafka
D) A collection of RDDs.
11. Which of the following methods can be used to persist data in Spark? (Select all that apply)
A) cache()
B) persist()
C) saveAsTextFile()
D) store()
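cache() and persist() keep a dataset around for reuse (persist() also takes an explicit storage level), and saveAsTextFile() writes an RDD out to storage; there is no store() method. A sketch reusing the spark session above:

    from pyspark import StorageLevel

    df = spark.range(1000)
    df.persist(StorageLevel.DISK_ONLY)   # cache() is shorthand for the default level
    df.count()                           # persistence takes effect on the first action
    df.rdd.saveAsTextFile("out_text")    # saveAsTextFile lives on the RDD API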
13. How can you handle missing values in a dataset in Spark?
A) Using fillna()
B) Using dropna()
C) Ignoring them
14. Which of the following is a valid way to create an RDD from a collection?
A) spark.createRDD(collection)
B) spark.parallelize(collection)
C) spark.makeRDD(collection)
D) spark.newRDD(collection)
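The intended answer is B, with one caveat: parallelize() lives on the SparkContext, so in PySpark it is reached through spark.sparkContext:

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])   # distribute a local collection
    print(rdd.collect())                                    # [1, 2, 3, 4, 5]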
D) To filter datasets.
18. What type of join does Spark perform by default when joining two DataFrames?
A) Inner join
B) Left join
C) Right join
19. How can you optimize performance in a Spark application? (Select all that apply)
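Common levers include caching results that are reused and right-sizing partitions; a small sketch:

    big = spark.range(10_000_000).repartition(8)   # right-size partitions for the cluster
    big.cache()                                    # reuse the result across several actions
    big.count()                                    # first action materializes the cache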
1. Which Spark abstraction organizes data into named columns, like a table in a relational database?
A) RDD
B) DataFrame
C) Dataset
D) Table
2. Which of the following methods can be used to create a DataFrame from an existing RDD?
A) createDataFrame()
B) toDF()
C) fromRDD()
D) loadDataFrame()
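Both createDataFrame() (A) and toDF() (B) work in PySpark; a sketch:

    rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
    people = spark.createDataFrame(rdd, ["name", "age"])   # option A, on the session
    people2 = rdd.toDF(["name", "age"])                    # option B, on the RDD itself
    people.show()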
4. How can you display the first 10 rows of a DataFrame?
A) df.show(10)
B) df.head(10)
C) df.first(10)
D) df.display(10)
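show(10) prints the rows, while head(10) returns them for programmatic use; reusing the people DataFrame from the previous sketch:

    people.show(10)          # prints up to 10 rows as a formatted table
    rows = people.head(10)   # returns a list of Row objects instead of printing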
5. Which method is used to rename a column in a DataFrame?
A) renameColumn()
B) withColumnRenamed("oldName", "newName")
C) changeColumnName()
D) setColumnName()
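withColumnRenamed() is the only one of these that exists:

    people = people.withColumnRenamed("age", "years")   # returns a new DataFrame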
6. How can you filter rows in a DataFrame?
A) df.filter(condition)
B) df.where(condition)
C) df.select(condition)
D) Both A and B
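filter() and where() are aliases, so D is the answer; both accept a Column expression or a SQL string:

    adults = people.filter(people["years"] >= 18)   # Column-expression form
    adults2 = people.where("years >= 18")           # equivalent SQL-string form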
7. Which of the following functions can be used to aggregate data in a DataFrame? (Select all that apply)
A) count()
B) sum()
C) avg()
D) concat()
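count(), sum(), and avg() aggregate; concat() is a string function. A sketch using agg():

    from pyspark.sql.functions import avg, count, sum as sum_

    people.agg(count("*"), sum_("years"), avg("years")).show()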
D) It drops a column.
9. How can you convert a DataFrame to an RDD?
A) df.toRDD()
B) df.rdd
C) df.asRDD()
D) df.convertToRDD()
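The answer is B: rdd is a property, not a method, and each row comes back as a Row object:

    rdd = people.rdd                                # property access, no parentheses
    print(rdd.map(lambda row: row.name).collect())  # each element is a Row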
10. Which method can be used to read a CSV file into a DataFrame?
A) spark.read.csv("file.csv")
B) spark.loadCSV("file.csv")
C) spark.read.load("file.csv")
D) spark.importCSV("file.csv")
11. How can you print the schema of a DataFrame?
A) df.printSchema()
B) df.showSchema()
C) df.schema()
D) df.displaySchema()
14. Which of the following methods can be used to drop a column from a DataFrame?
A) drop("columnName")
B) remove("columnName")
C) delete("columnName")
D) exclude("columnName")
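Only drop() exists; like most DataFrame operations it is non-destructive:

    trimmed = people.drop("years")   # returns a new DataFrame; people is unchanged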
D) It aggregates data.
16. How can you group data in a DataFrame and perform an aggregation?
A) df.groupBy("column").agg(sum("value"))
B) df.aggregate("column", sum("value"))
C) df.group("column").sum("value")
D) df.groupBy("column").aggregate(sum("value"))
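Option A is the idiomatic form; a sketch with a small hypothetical sales DataFrame:

    from pyspark.sql.functions import sum as sum_

    sales = spark.createDataFrame([("a", 10), ("a", 5), ("b", 7)], ["key", "value"])
    sales.groupBy("key").agg(sum_("value").alias("total")).show()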
17. Which of the following can be used to handle missing values in a DataFrame? (Select all that apply)
A) fillna(value)
B) dropna()
C) replaceNulls(value)
D) ignoreNulls()
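fillna() and dropna() are the real APIs; the other two do not exist:

    filled = sales.fillna(0, subset=["value"])   # replace nulls with a default
    dropped = sales.dropna(how="any")            # drop rows containing any null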
D) To aggregate data.
19. How can you write a DataFrame to a Parquet file?
A) df.write.parquet("output.parquet")
B) df.saveAsParquet("output.parquet")
C) df.writeToParquet("output.parquet")
D) df.saveParquet("output.parquet")
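Writers hang off df.write (option A):

    sales.write.mode("overwrite").parquet("output.parquet")   # mode() controls overwrite behavior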
20. Which of the following statements about Spark DataFrames are true? (Select all that apply)
22. Which method allows you to change the data type of a column in a DataFrame?
A) cast("newType")
B) changeType("newType")
C) convertType("newType")
D) modifyType("newType")
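cast() (A) is called on a column, not on the DataFrame itself:

    from pyspark.sql.functions import col

    sales = sales.withColumn("value", col("value").cast("double"))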
24. How do you perform an inner join between two DataFrames on a key column?
B) df1.innerJoin(df2, "key")
C) df1.join(df2, "key")
D) df1.joinInner(df2, "key")
25. Which function would you use to find the minimum value of a column in Spark DataFrames?
A) MIN_VALUE()
B) min()
C) lowest()
D) minimum()
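min() (B) is the aggregate to use; paired here with max() for contrast:

    from pyspark.sql.functions import max as max_, min as min_

    sales.select(min_("value"), max_("value")).show()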
26. How do you create a temporary view from a DataFrame for SQL queries?
A) df.createTempView("view_name")
B) df.registerTempTable("view_name")
C) df.createView("view_name")
D) df.createGlobalTempView("view_name")
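createTempView() (A) is the direct answer; createGlobalTempView() (D) also creates a view, but one shared across sessions, and registerTempTable() (B) is the deprecated pre-2.0 spelling. In practice the OrReplace variants are common because they can be re-run safely:

    sales.createOrReplaceTempView("sales")       # session-scoped; safe to re-run
    sales.createGlobalTempView("sales_global")   # queried as global_temp.sales_global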
D) It aggregates data.
28. Which of the following is not a valid way to read data into a Spark DataFrame?
A) spark.read.json("file.json")
B) spark.read.csv("file.csv")
C) spark.read.load("file.txt")
D) spark.read.textFile("file.txt")
29. How can you apply a user-defined function (UDF) to a column in a DataFrame?
A) df.apply(udf, "column")
B) df.withColumn("new_column", udf(df["column"]))
C) df.transform(udf, "column")
D) df.udf("column")
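Option B is the standard pattern: wrap the Python function with udf(), declaring the return type, then apply it inside withColumn():

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType

    double_it = udf(lambda x: x * 2.0, DoubleType())             # declare the return type
    sales = sales.withColumn("doubled", double_it(col("value")))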
30. What does the coalesce() method do when applied to a DataFrame?
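coalesce(n) reduces the number of partitions without a full shuffle, unlike repartition(n), which redistributes data evenly at the cost of a shuffle:

    fewer = sales.coalesce(1)             # merge partitions without a full shuffle
    print(fewer.rdd.getNumPartitions())   # 1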
1. Which data abstraction does Spark SQL primarily work with?
A) DataFrame
B) RDD
C) Dataset
D) Table
2. Which of the following formats can Spark SQL read natively? (Select all that apply)
A) Parquet
B) JSON
C) CSV
D) XML
3. How can you register a DataFrame as a temporary view for SQL queries? (Select all that apply)
A) df.createOrReplaceTempView("view_name")
B) df.registerTempTable("view_name")
C) df.createGlobalTempView("view_name")
D) df.createView("view_name")
4. What is the default behavior of Spark SQL when performing a join operation?
A) Inner join
5. When using Spark SQL, what is the purpose of the explain() method?
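explain() prints the query plan that Spark will execute, which is useful for spotting shuffles and confirming pushdown:

    spark.range(10).filter("id > 5").explain()   # prints the physical plan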
6. Which of the following functions can be used to aggregate data in Spark SQL? (Select all that apply)
A) COUNT(*)
B) SUM()
C) AVG()
D) CONCAT()
A) To filter rows
D) To join tables
9. Which of the following statements about DataFrames are true? (Select all that apply)
C) df1.leftJoin(df2, "key")
12. Which SQL function would you use to concatenate two strings in Spark SQL?
A) CONCATENATE()
B) JOIN()
C) CONCAT()
D) MERGE()
13. What is a common use case for window functions in Spark SQL?
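A classic use case is ranking rows within a group, such as top salaries per department; a sketch with hypothetical data:

    from pyspark.sql import Window
    from pyspark.sql.functions import row_number

    emp = spark.createDataFrame(
        [("eng", "alice", 120), ("eng", "bob", 100), ("hr", "carol", 90)],
        ["dept", "name", "salary"])
    w = Window.partitionBy("dept").orderBy(emp["salary"].desc())
    emp.withColumn("rank", row_number().over(w)).show()   # rank within each department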
14. In Spark SQL, how can you change the column names of a DataFrame?
15. Which of the following clauses is used to filter records in a Spark SQL query?
A) WHERE
B) HAVING
C) FILTER
D) SELECT
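WHERE (A) filters individual rows; HAVING filters groups after aggregation. A sketch against a view built from the emp DataFrame above:

    emp.createOrReplaceTempView("emp")
    spark.sql("SELECT name FROM emp WHERE salary > 95").show()   # row-level filter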
D) It aggregates data.
18. How can you convert a DataFrame to an RDD?
A) df.toRDD()
B) df.rdd
C) df.asRDD()
D) df.convertToRDD()
19. How can you handle schema evolution when reading Parquet files in Spark SQL?
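One common answer is Parquet schema merging, which unions the schemas of all part files; a sketch assuming a hypothetical events/ directory written over time with evolving schemas:

    merged = spark.read.option("mergeSchema", "true").parquet("events/")   # hypothetical path
    merged.printSchema()   # union of all columns seen across part files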
20. What is the result of executing SELECT * FROM table WHERE column IS NULL?
21. Which of the following can be used to perform string manipulation in Spark SQL? (Select all that apply)
A) UPPER()
B) LOWER()
C) TRIM()
D) SPLIT()
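UPPER, LOWER, TRIM, and SPLIT are all built in; reusing the emp view registered above:

    spark.sql("SELECT UPPER(name), TRIM(name), SPLIT(name, 'l') AS parts FROM emp").show()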
C) To sort records
D) To group records
23. How can you optimize query performance in Spark SQL? (Select all that apply)
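Typical levers are caching tables that are queried repeatedly and broadcasting small join inputs; a sketch reusing the sales and emp data (the join keys here are purely illustrative):

    from pyspark.sql.functions import broadcast

    spark.sql("CACHE TABLE emp")                                       # keep a reused table in memory
    joined = sales.join(broadcast(emp), sales["key"] == emp["dept"])   # ship the small side to executors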
24. Which function would you use to find the maximum value of a column in Spark SQL?
A) MAX_VALUE()
B) MAX()
C) HIGHEST()
D) TOP()
D) It sorts values.
D) To aggregate results
27. Which of the following statements about DataFrames and Datasets are true? (Select all that apply)
29. Which command would you use to drop a table in Spark SQL?
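DROP TABLE is the command; adding IF EXISTS keeps it from failing when the table is absent. The table name below is a placeholder:

    spark.sql("DROP TABLE IF EXISTS my_table")   # IF EXISTS avoids an error when absent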
3. Which of the following is the best practice for handling large datasets in Spark?
5. What is the best practice for writing data back to storage in Spark?
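A common pattern is columnar output, partitioned by a key that queries filter on, with an explicit save mode; a sketch with hypothetical data:

    df = spark.createDataFrame([("2024-01-01", 10), ("2024-01-02", 7)], ["date", "value"])
    (df.write.mode("overwrite")    # explicit save mode avoids accidental failures
       .partitionBy("date")       # readers can prune whole partitions
       .parquet("warehouse/out"))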
6. When using Spark SQL, what is the best practice for query optimization?
7. What should you do to avoid memory issues when processing large datasets?
11. Which of the following is the best practice when dealing with DataFrame operations?
12. What should you do before running a Spark job on a production cluster?
13. When working with Spark Streaming, what is the best practice for managing stateful operations?
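The usual guidance is to enable checkpointing and bound state with watermarks; a runnable sketch using the built-in rate source as a stand-in for a real stream:

    from pyspark.sql.functions import col, window

    events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
              .withColumnRenamed("timestamp", "event_time")   # rate source emits timestamp, value
              .withColumn("user_id", col("value") % 5))
    counts = (events
              .withWatermark("event_time", "10 minutes")      # lets old state expire
              .groupBy(window("event_time", "5 minutes"), "user_id")
              .count())
    query = (counts.writeStream
             .outputMode("update")
             .option("checkpointLocation", "chk/")            # durable progress and state
             .format("console")
             .start())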
14. How can you efficiently read data from external sources in Spark?
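Selecting only the columns you need and filtering early lets Spark push work down into sources like Parquet and JDBC (column pruning and predicate pushdown); a sketch reading the output written earlier:

    from pyspark.sql.functions import col

    slim = (spark.read.parquet("warehouse/out")     # written in the earlier sketch
            .filter(col("date") == "2024-01-01")    # predicate pushed down to Parquet
            .select("value"))                       # column pruning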
18. How can you handle schema evolution when reading data from sources like Parquet?