Mock Interview Questions - Please add yours
PySpark Theory
Spark Architecture
Resilient Distributed Dataset (RDD) vs DataFrame vs Dataset
What is Lazy Evaluation?
AQE (Adaptive Query Execution)
Checkpointing in spark
Lineage Graph vs DAG
Spark job Optimization techniques
Caching and Persistence (Memory and Disk) with diff levels
Catalyst Optimizer
Joins and Join Strategies (Broadcast Join, Sort Merge Join)
Spark Submit
DAG Visualization
Serialization (Kryo vs. Java Serialization)
how to debug a failed PySpark job
PySpark Coding
Word count program using RDD and using DF
Explicit schema definition -StructType()...
Spark-Submit command
nth highest salary using window functions
SparkSession creation
DataFrame from a CSV file with corrupted records
How to perform a groupBy
Handle missing values with example
rename columns in a DataFrame
PySpark Scenarios
There are two tables - Sales and Products (with similar columns inside). Find the total and average sales per product
How can you handle skewed data
How to retry for failed jobs
How can you remove null records
Duplicate records - as Bad records
Python (Theory)
Memory management in python
Monkey Patching
OOPs
Exception Handling with all blocks
Tuple vs Set vs Frozenset
Some dictionary methods
All data types
Python (Coding)
List flattening
Eg. list1 = [12, 3, ['string', 2, 3, 4, 5], (456,)]
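A minimal recursive flattening sketch for nested lists/tuples like the example above:

```python
def flatten(items):
    """Recursively flatten nested lists/tuples into one flat list."""
    flat = []
    for item in items:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))  # recurse into nested containers
        else:
            flat.append(item)
    return flat

list1 = [12, 3, ['string', 2, 3, 4, 5], (456,)]
print(flatten(list1))  # [12, 3, 'string', 2, 3, 4, 5, 456]
```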
List [10, 2, 3, 4, 5, 6, 7], op > 70 (maximum product of two elements)
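One way to get the maximum pairwise product; the answer is either the two largest values or (with negatives) the two smallest:

```python
def max_pair_product(nums):
    """Largest product of any two distinct elements."""
    a = sorted(nums)
    # either the two largest, or the two most-negative values
    return max(a[-1] * a[-2], a[0] * a[1])

print(max_pair_product([10, 2, 3, 4, 5, 6, 7]))  # 70 (10 * 7)
```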
List comprehension examples
Dict comprehension with examples
Decorators
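A minimal decorator sketch (the `log_calls` name and call-recording behaviour are just one illustration of the pattern):

```python
import functools

def log_calls(func):
    """Decorator that records the arguments of each call to the wrapped function."""
    calls = []
    @functools.wraps(func)  # preserves func.__name__, docstring, etc.
    def wrapper(*args, **kwargs):
        calls.append(args)
        return func(*args, **kwargs)
    wrapper.calls = calls
    return wrapper

@log_calls
def add(x, y):
    return x + y

add(1, 2)
add(3, 4)
print(add.calls)     # [(1, 2), (3, 4)]
print(add.__name__)  # 'add' - kept intact by functools.wraps
```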
Generators
Example of inheritance
Try except block with finally and its execution
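A small sketch showing the execution order of all the blocks - `else` runs only when no exception occurred, `finally` runs either way:

```python
def divide(a, b):
    """Demonstrates try / except / else / finally execution order."""
    order = []
    try:
        order.append('try')
        result = a / b
    except ZeroDivisionError:
        order.append('except')
        result = None
    else:
        order.append('else')     # only when the try block raised nothing
    finally:
        order.append('finally')  # always runs, exception or not
    return result, order

print(divide(10, 2))  # (5.0, ['try', 'else', 'finally'])
print(divide(10, 0))  # (None, ['try', 'except', 'finally'])
```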
String reversing
Fibonacci
Prime number
Even odd
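A compact pass over these classics - string reversal, Fibonacci (written as a generator), a prime check, and even/odd - as one sketch:

```python
import itertools

def fibonacci():
    """Infinite Fibonacci sequence as a generator."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def is_prime(n):
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

print('spark'[::-1])                                      # 'kraps' - slice-based reversal
print(list(itertools.islice(fibonacci(), 8)))             # [0, 1, 1, 2, 3, 5, 8, 13]
print([n for n in range(20) if is_prime(n)])              # [2, 3, 5, 7, 11, 13, 17, 19]
print(['even' if n % 2 == 0 else 'odd' for n in (4, 7)])  # ['even', 'odd']
```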
Duplicate removal without changing order
ip > (4, 5, 6, 2, 7, 9, 4, 3, 2, 4, 2)  op > (4, 5, 6, 2, 7, 9, 3)
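One idiomatic way to drop duplicates while keeping first-seen order, relying on dicts preserving insertion order (Python 3.7+):

```python
def dedupe(seq):
    """Remove duplicates, keeping the first occurrence of each value in order."""
    return tuple(dict.fromkeys(seq))

ip = (4, 5, 6, 2, 7, 9, 4, 3, 2, 4, 2)
print(dedupe(ip))  # (4, 5, 6, 2, 7, 9, 3)
```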
Project explanation
ODC discussion questions
1) 2 lists:
l1 = ['rohit', 'dhoni', 'sachin']
l2 = [10, 20, 45]
output = [('rohit', 10), ('dhoni', 20), ('sachin', 45)]
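Pairing the two lists element-by-element is a one-liner with `zip`:

```python
l1 = ['rohit', 'dhoni', 'sachin']
l2 = [10, 20, 45]
output = list(zip(l1, l2))  # pairs up elements positionally
print(output)  # [('rohit', 10), ('dhoni', 20), ('sachin', 45)]
```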
2) list1 = ['mohini', 'jojo', 'gaurav', 'ajay']
find the count of elements having the same length
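Counting elements that share the same length can be done with `collections.Counter` keyed on `len`:

```python
from collections import Counter

list1 = ['mohini', 'jojo', 'gaurav', 'ajay']
length_counts = Counter(len(word) for word in list1)  # length -> how many words
print(dict(length_counts))  # {6: 2, 4: 2}
```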
3)pivot syntax
4) separate first name, second name, third name using regex
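One way to split a three-part name with a regex. The sample value and pattern are assumptions (the exact input format was not given), and the interview may instead expect SQL regexp functions or PySpark's `regexp_extract`:

```python
import re

full_name = 'john michael doe'  # hypothetical sample input
# three word groups separated by whitespace
first, second, third = re.match(r'(\w+)\s+(\w+)\s+(\w+)', full_name).groups()
print(first, second, third)  # john michael doe
```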
5)pyspark args and pyspark variable
6)cdc
7)database deadlock
8) SELECT department, AVG(salary) FROM [Link]
GROUP BY department
HAVING AVG(salary) > (SELECT AVG(salary) FROM [Link])

SELECT [Link], AVG([Link]) FROM [Link] A
INNER JOIN [Link] B ON ([Link] = [Link])
GROUP BY [Link]
HAVING AVG([Link]) > AVG([Link])

Which of the above queries is faster?
9) how to implement normalization in practical?(in sql and spark)
10)which schema did we use in airline ,how do we choose a schema for a project?
11) If we have already cached a DataFrame, how do we know that it is cached?
Ans: df.is_cached for a DataFrame, and [Link].is_cached for a table
12)spark submit using 3 ways
13)where is spark submit command saved?
14) are cache and persist actions?
15) is Oracle SQL OLTP or OLAP?
16) We have a sales table with a transaction column where transactions are recorded hourly. We have to fetch the time from that column
17)by default which join happens in spark?
18) garbage collector in Spark - how does it impact application execution?
19 ) types of connectors
20)diff between jdbc and odbc?
21)how to read excel file in spark
22)datawarehouse architecture
23) how to read json,csv in python
24)how do we know if the data is skewed
25) how are tasks discarded in speculative execution (practically)
26) scd and scd2 codes
27)how to check the size of the data in the partition
28) sql,python architecture
29) diff between sql,python,spark
30) self scenarios,like decorator,generator
31)why do we select the cores between 3-5 in spark ?
32) python optimization techniques
33) what is deadlock
34)what is GIL?
35) __init__ and __new__ - diff between them
36) diff between self and __init__
37) how to read a text file in Spark other than via RDD?
38) memory management in sql
39) why do we store bad data
40) recursive function in SQL
41) scd2 use case
42) database deadlock
43) dynamic schema change in sql
44) Given a login table, get the user ids who have logged in for 3 consecutive days:

user_id  login_date
1        2024-11-29
1        2024-11-30
1        2024-12-01
2        2024-11-29
2        2024-11-30
2        2024-12-01
3        2024-11-29
3        2024-11-30
3        2024-12-02

Output:
user_id
1
2
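In SQL or PySpark this is usually solved with window functions (e.g. LAG, or row_number-minus-date grouping); as a plain-Python sketch of the underlying logic, using sample data shaped like the question's table:

```python
from datetime import date, timedelta
from collections import defaultdict

# sample rows: (user_id, login_date)
logins = [
    (1, date(2024, 11, 29)), (1, date(2024, 11, 30)), (1, date(2024, 12, 1)),
    (2, date(2024, 11, 29)), (2, date(2024, 11, 30)), (2, date(2024, 12, 1)),
    (3, date(2024, 11, 29)), (3, date(2024, 11, 30)), (3, date(2024, 12, 2)),
]

def users_with_streak(rows, days=3):
    """Return user ids that logged in on `days` consecutive calendar days."""
    by_user = defaultdict(set)
    for user_id, d in rows:
        by_user[user_id].add(d)
    result = []
    for user_id, dates in by_user.items():
        # a streak starts at d if the following days-1 dates all exist
        if any(all(d + timedelta(i) in dates for i in range(days)) for d in dates):
            result.append(user_id)
    return sorted(result)

print(users_with_streak(logins))  # [1, 2]
```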
45)500gb data – how will we optimize?
46)parquet file and avro – format ( how is data stored)
47)five algorithms of data compression
48) Mask a bank account number using 2 approaches, in SQL and Spark
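The question asks for SQL and Spark approaches; here is just the masking logic itself (keep the last four digits - the account format is an assumption) in plain Python:

```python
def mask_account(acct, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    return '*' * (len(acct) - visible) + acct[-visible:]

print(mask_account('123456789012'))  # '********9012'
```

In Spark the same idea can be expressed with string functions such as `substring` and `lpad`, or a regex replacement; in SQL likewise.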
49)how to explain spark architecture using spark submit
50) File compressing types
51) if we have a 5-column table with data, and the next day's incremental load changes the data and has 10 columns, how will we handle the data
52) if we have two 100 GB tables, how will you use a broadcast join on one of them
53)python execution plan
54)PEP -8
55) what are .py and .pyc files
56)monkey patching
57)stack and heap memory in python
58)procedure bottleneck
59)how will you handle OOM in executor and driver?
60)deadlock in python,sql,spark
61)database logging
62)how will you maintain python code
63) monotonically increasing id in pyspark
64)map partition in spark
65)naming convention in sql
66)sql memory management
67)GIL
68) synchronous and asynchronous
69) what is the diff between a stored procedure and a trigger
70) name mangling in python
71) itertools
72)lineage graph
73)mapreduce
74)accumulator
75)file system - hdfs
76)sliding window option
77)temp view , global view
78)vertices and edges