DA Interview Preparatory Guide

This document provides guidelines for preparing for interviews in data analysis, particularly in the banking sector, emphasizing the importance of clear introductions, project explanations, and technical skills in Python, PySpark, and SQL. It covers common interview questions, domain-specific knowledge such as wholesale banking and regulatory frameworks like Basel-3 and IFRS, as well as data architecture principles and best practices. Additionally, it outlines the significance of customer digital journeys and anti-money laundering measures in banking operations.

Uploaded by Shivani

General Guidelines (Most Important)

1. Introduction: Be clear and concise. Include relevant work experience, technologies,
and key projects. Mention wholesale banking experience if applicable.
2. Projects: Prepare to explain at least one project end-to-end, focusing on technical,
functional, and analytical aspects. Highlight your role and the business impact. Expect
many cross-questions (up to 60% of your interview time), so pick only projects you are
most comfortable with. Prepare a write-up covering:
1. Your introduction
2. Project objective – choose the DA project you are most comfortable with; prefer
the financial domain (banking/insurance/investment, etc.) if possible
3. Team's role
4. Your role – clearly define how you drove it
5. Impact – before vs. after, problem vs. solution
3. Technical and Functional Depth: Focus on Python, PySpark, SQL, and banking use
cases.

Section 1: Common Questions


1. Introduction and Experience

1. Can you introduce yourself and summarize your professional experience?


o Answer: Highlight your years of experience, key skills (Python, PySpark, SQL),
domain knowledge (banking if applicable), and major projects. Example: "I
have 5 years of experience as a Data Analyst, specializing in data
management and transformation using Python and PySpark. I have worked on
banking projects, including customer engagement analytics and compliance
solutions like Basel-3."
2. Describe a project where you worked end-to-end. What was your role, and what
challenges did you face?
o Answer: Choose a project relevant to the role. Example: "I worked on a
project to build a 360° customer view across banking products. I was
responsible for ETL processes, modelling the data in PySpark, and optimizing
SQL queries. Challenges included integrating multiple systems and ensuring
data accuracy, which I resolved by implementing automated validation
scripts."
3. Can you explain a business use case from your experience and how you resolved
it?
o Answer: Example: "In a customer engagement analytics project, I identified
user churn patterns using SQL and Python. I proposed actionable insights to
improve customer retention by 15%."

2. Project Explanation

1. Choose a project and explain it end-to-end:


o Problem Statement: "The client required a global view of customer data
across systems."
o Technology Stack Used: PySpark, Power BI.
o Processes: "Extracted data from disparate systems using PySpark,
transformed it into a unified schema, and loaded it into a global dashboard
using Power BI."
o Results: "Delivered a solution that improved data accessibility and reduced
reporting time by 30%."
2. How did you ensure data quality and accuracy in your project?
o Answer: "By implementing validation scripts in Python and setting up
automated checks during ETL processes."
3. How did you manage stakeholder expectations and communicate progress?
o Answer: "I held recurring meetings (daily/weekly as required) and presented
progress and blockers on a regular basis."

3. Functional Skills

1. Explain how you approached analysing and resolving discrepancies in data.


o Answer: "By using SQL to identify mismatches and writing Python scripts for
further analysis. For example, I resolved a data inconsistency issue in
transaction records by cross-referencing multiple data sources."
2. How do you model data for banking use cases?
o Answer: "I model data based on business needs. For instance, in a loan
securitization project, I created normalized schemas to integrate loan,
customer, and credit history data."
3. Describe your experience with Basel-3 or similar regulatory frameworks.
o Answer: "I worked on Basel-3 compliance by developing ETL pipelines to
generate regulatory reports with accurate risk-weighted asset calculations."

Section 2: Domain Questions


Wholesale Banking

Wholesale banking refers to banking services offered to large-scale clients, such as
corporations, financial institutions, government agencies, and non-profit organizations.
These services include financing for large projects, trade finance, risk management, treasury
services, and structured financial products. Unlike retail banking, which focuses on
individuals and small businesses, wholesale banking deals with high-value transactions and
complex client requirements. Examples include cash management, mergers and acquisitions
advisory, and facilitating international trade. Technological advancements like blockchain
are also shaping this sector.

• Key Features:
o Focus on large-scale clients.
o Includes services like fund management, risk hedging, and large-scale
financing.
o Supports economic growth by funding infrastructure projects and enabling
global trade.

1. What is Wholesale Banking, and how does it differ from Retail Banking?
o Wholesale banking refers to banking services provided to corporations,
institutions, and high-net-worth individuals, focusing on services like loans,
credit, asset management, and treasury management. Retail banking deals
with individual customers for savings accounts, personal loans, and similar
products.
2. What are the main products and services offered in Wholesale Banking?
o Corporate loans, trade finance, treasury management, foreign exchange
services, mergers & acquisitions advisory, and syndicated loans.
3. How do risk management practices differ in Wholesale Banking?
o Wholesale banking risks include credit risk, market risk, and operational risk,
requiring stringent due diligence, creditworthiness analysis, and compliance
with regulatory frameworks like Basel-3.

IFRS

IFRS is a set of global accounting standards that aim to make financial reporting consistent,
transparent, and comparable across international boundaries. Used in over 140 countries, it
helps businesses prepare financial statements that investors and regulators can trust.

• Key Topics:
o Standardizes reporting for revenues, leases, and financial instruments.
o Aligns financial disclosures for comparability.

https://www.investopedia.com/terms/i/ifrs.asp
Basel-3

Basel III is a regulatory framework established by the Basel Committee on Banking
Supervision (BCBS) to strengthen the regulation, supervision, and risk management of banks
globally. It was introduced after the 2008 financial crisis to enhance banking sector
resilience.

• Key Components:
o Capital Requirements: Increases minimum Tier 1 and Tier 2 capital ratios.
o Leverage Ratio: Limits excessive on- and off-balance sheet leverage.
o Liquidity Requirements: Introduces Liquidity Coverage Ratio (LCR) and Net
Stable Funding Ratio (NSFR) to manage short- and long-term liquidity risks.

https://www.investopedia.com/terms/b/basel_accord.asp

1. What are the primary objectives of Basel-3 regulations?


o Enhance the banking sector's ability to absorb shocks, improve risk
management, and strengthen transparency by introducing capital adequacy
requirements, leverage ratios, and liquidity standards.
2. What is the significance of the Capital Adequacy Ratio (CAR) in Basel-3?
o It ensures that a bank has enough capital to absorb potential losses while
meeting its obligations. Basel-3 mandates a minimum CAR of 10.5%, including
buffers.
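To make the ratio concrete, here is a minimal arithmetic sketch in Python; the capital and risk-weighted-asset figures are invented for illustration only, not taken from any real bank:

```python
# Capital Adequacy Ratio (CAR) = (Tier 1 + Tier 2 capital) / risk-weighted assets.
# All figures below are hypothetical (e.g., in millions).
tier1 = 60.0
tier2 = 25.0
risk_weighted_assets = 700.0

car = (tier1 + tier2) / risk_weighted_assets
print(f"CAR = {car:.1%}")

# Basel-3 minimum CAR including the capital conservation buffer is 10.5%.
meets_basel3 = car >= 0.105
print("Meets Basel-3 minimum:", meets_basel3)
```

With these toy numbers the bank's CAR is about 12.1%, so it clears the 10.5% threshold.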
3. What are the key differences between Basel-2 and Basel-3?
o Basel-3 introduced stricter capital requirements, liquidity coverage ratios,
and a leverage ratio to address deficiencies exposed by the 2008 financial
crisis.

Data Assets (Party, Accounts, Products, Revenue)

1. How are "Party," "Accounts," "Products," and "Revenue" interrelated in banking
data management?
o Party represents customers or entities; Accounts are financial records linked
to a Party; Products are the services offered to Parties; and Revenue is
generated through these Accounts and Products.
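The relationships above can be sketched with a few toy pandas tables; every table and column name here is hypothetical, not a real banking schema:

```python
import pandas as pd

# Hypothetical tables; real banking data models are far richer.
party = pd.DataFrame({"PartyID": [1, 2], "Name": ["Acme Corp", "Globex"]})
accounts = pd.DataFrame({"AccountID": [10, 11, 12],
                         "PartyID": [1, 1, 2],
                         "ProductID": [100, 101, 100]})
products = pd.DataFrame({"ProductID": [100, 101],
                         "Product": ["Term Loan", "Trade Finance"]})
revenue = pd.DataFrame({"AccountID": [10, 11, 12],
                        "Revenue": [5000, 3000, 7000]})

# Revenue rolls up through Accounts and Products to the Party level.
view = (accounts.merge(party, on="PartyID")
                .merge(products, on="ProductID")
                .merge(revenue, on="AccountID"))
per_party = view.groupby("Name")["Revenue"].sum()
print(per_party)
```

The final group-by is exactly the "single view of Party" idea: one row per Party with revenue aggregated across all its accounts and products.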
2. Why is a single view of Party critical for banking operations?
o It helps consolidate data across multiple products and accounts, providing
insights into customer behavior, enabling better risk management, and
enhancing cross-sell and upsell opportunities.
3. What are the challenges in managing data assets in banks?
o Data silos, inconsistent data formats, compliance requirements, and ensuring
data privacy and security.
4. What is the facility domain in banking?
o In banking, the facility domain refers to the area focused on managing and
administering various types of credit facilities or financial accommodations
extended to customers by financial institutions. These facilities are
agreements between the bank and the borrower, outlining the terms under
which the bank will provide funds or services to meet the borrower's financial
needs.

Key Aspects of Facility Domain in Banking

1. Types of Facilities
o Fund-Based Facilities: Direct financial assistance in the form of funds.
1. Loans: Term loans, working capital loans, project financing.
2. Overdraft (OD): A line of credit allowing customers to withdraw more
than the balance in their account.
3. Cash Credit (CC): A short-term credit facility for working capital
requirements.
4. Discounting of Bills: Purchasing receivables before they are due.
o Non-Fund-Based Facilities: Indirect financial assistance without immediate
disbursement of funds.
1. Bank Guarantees: Assurances provided by the bank to a third party
on behalf of a customer.
2. Letter of Credit (LC): A financial guarantee for trade finance.
2. Core Functions
o Credit Appraisal: Assessment of a borrower’s creditworthiness, including
financial analysis, collateral valuation, and risk evaluation.
o Sanctioning and Disbursement: Approval of the facility terms and release of
funds or issuance of guarantees/LCs.
o Facility Management: Monitoring the utilization of the facility, compliance
with agreed terms, and periodic reviews.
o Recovery and Monitoring: Ensuring timely repayments and addressing
overdue or non-performing facilities.
3. Technology in Facility Management
o Loan Origination Systems (LOS): Automating the credit approval process.
o Loan Management Systems (LMS): Tracking loan performance,
disbursements, repayments, and delinquencies.
o Credit Risk Management Tools: Evaluating and mitigating risks associated
with facilities.
4. Parties Involved
o Customer/Party: The borrower or entity seeking the facility.
o Relationship Manager: A bank representative handling the client’s account.
o Credit Team: Responsible for appraising and approving the facility.
o Risk and Compliance: Overseeing adherence to regulatory and policy
requirements.
5. Regulations and Compliance
o Credit facilities must comply with guidelines from regulatory bodies such as
the Reserve Bank of India (RBI), Federal Reserve, or European Central Bank
(ECB).
o Banks must adhere to Basel III norms for managing credit risks and capital
adequacy.
6. Examples of Use Cases
o A corporate client might secure a cash credit facility for short-term working
capital needs.
o A small business may request an overdraft facility to handle seasonal cash
flow fluctuations.
o A construction firm could obtain a project financing facility to fund large-
scale infrastructure projects.

Anti-Money Laundering (AML)

AML involves laws, regulations, and procedures designed to prevent criminals from
disguising illegally obtained funds as legitimate income. It includes customer due diligence,
transaction monitoring, and reporting suspicious activities to regulatory bodies.

• Core Components:
o KYC (Know Your Customer) processes.
o Monitoring high-risk accounts.
o Reporting to Financial Intelligence Units (FIUs).

1. What are the key components of an AML framework in banks?


o Know Your Customer (KYC), transaction monitoring, suspicious activity
reporting (SAR), sanctions screening, and risk-based customer segmentation.
2. How does technology enhance AML efforts?
o AI and machine learning are used for anomaly detection, pattern recognition,
and real-time transaction monitoring. Blockchain can improve transparency
and traceability.
3. What are some global regulations governing AML?
o USA Patriot Act, EU’s AML directives, and FATF (Financial Action Task Force)
recommendations.

Customer Digital Journey

This refers to the digital transformation of customer interactions with banks, encompassing
online account opening, mobile banking, digital payments, and AI-driven chatbots. The focus
is on enhancing customer experience through convenience, personalization, and security.

• Trends:
o Integration of mobile apps and internet banking.
o Use of AI for predictive analytics and chat support.
o Seamless onboarding and transactions.

1. What are the stages of a customer’s digital journey in banking?


o Awareness, consideration, onboarding, engagement, servicing, and loyalty.
2. How can banks enhance the customer digital journey?
o By leveraging AI for personalized experiences, offering seamless multi-
channel integration, and adopting predictive analytics for proactive service.
3. What role does mobile banking play in the digital journey?
o It serves as the primary touchpoint for most customers, enabling easy access
to banking services and fostering engagement through features like chatbots,
notifications, and self-service tools.

Regulatory Reporting

In the UK, the Prudential Regulation Authority (PRA) oversees regulatory reporting to ensure
financial stability. Banks are required to submit detailed reports covering capital adequacy,
liquidity, and risk exposures.

• Key Reports:
o Common Reporting (COREP) for capital adequacy.
o Financial Reporting (FINREP) for financial statements.

https://www.bankofengland.co.uk/prudential-regulation/authorisations/which-firms-does-
the-pra-regulate
1. What is the importance of regulatory reporting in banking?
o It ensures compliance with financial regulations, helps monitor systemic risks,
and promotes transparency in financial institutions.
2. What are the challenges banks face in regulatory reporting?
o Data accuracy, integration across systems, keeping up with changing
regulations, and the high cost of compliance.

Section 3: Data Architecture Overview

Data architecture provides the structural framework for managing an organization's data
assets, ensuring they are organized, stored, accessed, and utilized effectively. Here’s a
breakdown of its key concepts along with learning resources and diagrams to enhance
understanding:

Key Components of Data Architecture

1. Data Sources: Origin points for data (e.g., databases, APIs, IoT devices).
2. Data Storage: Includes databases, data warehouses, and data lakes for structured
and unstructured data storage.
3. Data Pipelines: Facilitate data movement and processing between systems.
4. Data Governance: Defines policies, roles, and procedures to ensure data quality,
privacy, and security.
5. Real-time Analytics: Enables on-the-fly data analysis for quick decision-making.
6. Cloud Computing: Often used for scalable storage and computational resources.

Best Practices

• Eliminate Silos: Ensure data is accessible across departments to foster collaboration.


• Document the Architecture: Clearly define how data is collected, processed, and
used.
• Implement Robust Governance: Maintain data accuracy and security through
policies.
• Design for Scalability: Use modular approaches to adapt to business growth.
Tools for Visualization and Implementation

• Diagramming Tools: Lucidchart, Diagrams.net, and Gliffy are commonly used to
create architecture diagrams.
• Orchestration Tools: Apache Airflow and Prefect for managing data workflows.
• ETL/ELT Tools: Fivetran, Databricks, and dbt for transforming and integrating data
pipelines.
Example Data Architecture Diagrams

Diagrams can range from high-level blueprints of enterprise data flow to detailed data
pipeline layouts showing ETL/ELT processes. Many resources like Monte Carlo Data and The
Knowledge Academy provide examples and explanations of layered data architectures.
Interview Questions on Data Architecture


General Knowledge

1. What is the purpose of data architecture in an organization?


o Explain its role in organizing, integrating, and making data accessible for
business needs.
2. What are the differences between a data warehouse and a data lake?
o Data warehouse: Structured data for reporting and analytics.
o Data lake: Raw, unstructured, or semi-structured data for broader analytics.
3. Can you explain the concept of data lineage and why it is important?
o Describes data's origins, movement, and transformations to ensure trust and
compliance.

Technical Design

4. How do you design a scalable data architecture for a rapidly growing company?
o Focus on modularity, cloud-based storage, and integration with analytics
tools.
5. What factors do you consider when selecting between on-premise and cloud-based
data storage?
o Factors: Cost, scalability, latency, security, and compliance.
6. How would you design a real-time data pipeline for streaming analytics?
o Mention tools like Kafka, Spark Streaming, or AWS Kinesis.

Data Integration and Governance

7. What challenges arise in data integration, and how do you address them?
o Challenges: Schema mismatches, duplicate data, latency.
o Solutions: Data mapping, real-time ETL, quality checks.
8. What is your approach to ensuring data compliance with GDPR or other
regulations?
o Discuss data masking, consent management, and auditing.
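As one concrete illustration of data masking, a hypothetical sketch that one-way hashes a PII column before it leaves the source system (the salt and sample emails are invented):

```python
import hashlib

def mask(value: str, salt: str = "demo-salt") -> str:
    # One-way hash: analysts can still join on the masked key
    # without ever seeing the raw identifier. Salt is illustrative only;
    # production systems would manage salts/keys in a secrets store.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

customers = ["alice@example.com", "bob@example.com"]
masked = [mask(c) for c in customers]
print(masked)
```

Because the hash is deterministic, the same customer always maps to the same masked key, which preserves joinability across datasets while removing the raw identifier.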

Tools and Practices

9. What are some data architecture frameworks you have worked with?
o Mention tools like Apache Hadoop, Snowflake, or AWS Redshift.
10. Can you explain how you manage metadata in a large organization?
o Mention tools like Alation or Collibra for cataloguing and searchability.
Problem-Solving Scenarios

11. How would you migrate data from a legacy system to a modern data warehouse?
o Include steps like mapping schemas, cleaning data, and testing.
12. Your data warehouse is experiencing performance bottlenecks. How do you
diagnose and resolve the issue?
o Methods: Query optimization, indexing, partitioning, scaling hardware.

Section 4: Technical Questions


1. Python
Learning Link - https://www.youtube.com/watch?v=gtjxAH8uaP0

1. Explain the difference between Python lists, tuples, and dictionaries with
examples.
o Answer: "Lists are mutable, tuples are immutable, and dictionaries store key-
value pairs. Example: [1, 2, 3], (1, 2, 3), { 'key': 'value' }."
2. What libraries have you used for data manipulation and analysis (e.g., Pandas,
NumPy)? Provide examples.
o Answer: "I used Pandas for data cleaning and NumPy for numerical
operations. Example: Calculated moving averages using NumPy arrays."
3. How do you use Pandas for data analysis? Provide examples.
o Answer: "Using Pandas DataFrames to filter, aggregate, and visualize data.
Example: df.groupby('Category').sum() for aggregating revenue."
4. What are Python data types? How do you use them?
o Answer: "Python has basic data types like int, float, string, list, and advanced
types like sets and dictionaries. Example: A dictionary for mapping user IDs to
names: {101: 'Alice', 102: 'Bob'}."
5. How do you handle missing data in Python?
o Answer: "Using Pandas functions like fillna() for imputing or dropna() for
removing missing values."
6. How do you implement custom sorting in Python?
o Answer: "Using the sorted() function with a lambda function as the key.
Example: sorted(data, key=lambda x: x['age'])."
7. How do you merge DataFrames in Pandas?
o Answer: "Using pd.merge() for SQL-like joins. Example: pd.merge(df1, df2, on='key',
how='inner')."
8. Explain the concept of Python comprehensions.
o Answer: "List comprehensions are concise ways to create lists. Example: [x**2
for x in range(5)]."
9. Transformation and Filtering
o Question: You are given a CSV file containing customer information with
columns: CustomerID, Name, Age, City, and PurchaseAmount. Some rows
have missing values in the Age and PurchaseAmount columns. Task: Write a
Python script to:
o Fill missing Age values with the mean age.
o Replace missing PurchaseAmount values with 0.
o Add a new column AgeGroup where:
- Age < 18 is labeled as "Minor".
- 18 <= Age <= 60 is labeled as "Adult".
- Age > 60 is labeled as "Senior".
o Answer:

    import pandas as pd

    # Example data
    data = {
        "CustomerID": [1, 2, 3],
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, None, 70],
        "City": ["NY", "LA", "SF"],
        "PurchaseAmount": [100, None, 200],
    }
    df = pd.DataFrame(data)

    # Fill missing values (plain assignment avoids pandas' chained-inplace pitfalls)
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["PurchaseAmount"] = df["PurchaseAmount"].fillna(0)

    # Add AgeGroup column
    def categorize_age(age):
        if age < 18:
            return "Minor"
        elif age <= 60:
            return "Adult"
        else:
            return "Senior"

    df["AgeGroup"] = df["Age"].apply(categorize_age)
    print(df)
10. Nested Dictionary Flattening
o Question: Transform the following nested dictionary:

    data = {
        "user1": {"name": "Alice", "age": 25, "city": "New York"},
        "user2": {"name": "Bob", "age": 30, "city": "Los Angeles"},
    }

into a flat list of dictionaries:

    [
        {"user": "user1", "name": "Alice", "age": 25, "city": "New York"},
        {"user": "user2", "name": "Bob", "age": 30, "city": "Los Angeles"},
    ]

o Answer:

    data = {
        "user1": {"name": "Alice", "age": 25, "city": "New York"},
        "user2": {"name": "Bob", "age": 30, "city": "Los Angeles"},
    }

    flat_list = [{"user": user, **info} for user, info in data.items()]
    print(flat_list)
11. Pivot Table Creation
o Question: Transform the following data into a pivot table:

    data = {
        "Category": ["A", "B", "A", "B", "C"],
        "SubCategory": ["X", "X", "Y", "Y", "X"],
        "Value": [100, 200, 150, 50, 300],
    }

o Answer:

    import pandas as pd

    data = {
        "Category": ["A", "B", "A", "B", "C"],
        "SubCategory": ["X", "X", "Y", "Y", "X"],
        "Value": [100, 200, 150, 50, 300],
    }
    df = pd.DataFrame(data)

    pivot_table = df.pivot_table(
        values="Value", index="Category", columns="SubCategory",
        aggfunc="sum", fill_value=0
    )
    print(pivot_table)
12. Grouping and Aggregation
o Question: Group by Item and calculate the total sales:

    data = {
        "Store": ["A", "A", "B", "B", "C"],
        "Item": ["Apple", "Banana", "Apple", "Banana", "Apple"],
        "Sales": [30, 50, 20, 40, 60],
    }

o Answer:

    import pandas as pd

    data = {
        "Store": ["A", "A", "B", "B", "C"],
        "Item": ["Apple", "Banana", "Apple", "Banana", "Apple"],
        "Sales": [30, 50, 20, 40, 60],
    }
    df = pd.DataFrame(data)

    total_sales = df.groupby("Item")["Sales"].sum().reset_index()
    print(total_sales)
13. Data Transformation
o Question: Add a column CumulativeSales:

    data = {
        "Date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "Sales": [100, 200, 150],
    }

o Answer:

    import pandas as pd

    data = {
        "Date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "Sales": [100, 200, 150],
    }
    df = pd.DataFrame(data)

    df["CumulativeSales"] = df["Sales"].cumsum()
    print(df)
14. String Parsing and Transformation
o Question: Parse the following product codes into a structured list:

    codes = ["A-001-2024", "B-002-2024", "C-003-2023"]

o Answer:

    codes = ["A-001-2024", "B-002-2024", "C-003-2023"]

    parsed_codes = [
        {"Category": code.split("-")[0], "Code": code.split("-")[1], "Year": code.split("-")[2]}
        for code in codes
    ]
    print(parsed_codes)
15. Data Filtering and Sorting
o Question: Filter out failed students and sort by scores:

    data = {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Score": [85, 92, 70, 60],
        "Passed": [True, True, False, False],
    }

o Answer:

    import pandas as pd

    data = {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Score": [85, 92, 70, 60],
        "Passed": [True, True, False, False],
    }
    df = pd.DataFrame(data)

    filtered_sorted = df[df["Passed"]].sort_values(by="Score", ascending=False)
    print(filtered_sorted)
16. Merging and Data Enrichment
o Question: Merge two DataFrames and calculate Tax:

    employees = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]}
    salaries = {"ID": [1, 2, 3], "Salary": [50000, 60000, 55000]}

o Answer:

    import pandas as pd

    employees = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]}
    salaries = {"ID": [1, 2, 3], "Salary": [50000, 60000, 55000]}

    df_employees = pd.DataFrame(employees)
    df_salaries = pd.DataFrame(salaries)

    merged_df = pd.merge(df_employees, df_salaries, on="ID")
    merged_df["Tax"] = merged_df["Salary"] * 0.10
    print(merged_df)

2. PySpark
Learning Link - https://www.youtube.com/watch?v=_C8kWso4ne4

1. What is the difference between RDD, DataFrame, and Dataset in PySpark?


o Answer: "RDD is low-level and unstructured, DataFrame is high-level with
schema, and Dataset combines both with strong typing."
2. How do you handle large datasets in PySpark?
o Answer: "By using partitioning and caching strategies."
3. Explain partitioning in PySpark. How do you optimize partitioning?
o Answer: "Partitioning splits data for parallelism. Optimized by choosing
columns with high cardinality."
4. How do you use withColumn() in PySpark?
o Answer: "To add or modify a column. Example: df.withColumn('new_col', df['col1']
+ df['col2'])."
5. How do you implement joins in PySpark?
o Answer: "Using join() on DataFrames. Example: df1.join(df2, df1.key == df2.key,
'inner')."
6. Explain the role of SparkContext in PySpark.
o Answer: "It initializes the cluster and manages resources for Spark
applications."
7. How do you write data to external storage in PySpark?
o Answer: "Using write.format('parquet').save('path') or similar functions."
8. Explain how PySpark handles schema inference.
o Answer: "Schema is inferred automatically or specified explicitly using
StructType."
9. How do you handle out-of-memory errors in Spark?
o Answer: "By adjusting executor memory or increasing partition counts."
10. What is the difference between cache() and persist() in PySpark?
o Answer: "cache() stores in memory; persist() supports different storage levels."
11. Explain the use of PySpark’s groupBy and reduceByKey.
o Answer: "groupBy groups data, reduceByKey applies functions like sum
efficiently."
12. How do you use PySpark UDFs?
o Answer: "Create custom functions and apply them using udf(). Example:
spark.udf.register('my_udf', lambda x: x*2)."
13. Explain PySpark window functions with examples.
o Answer: "Used for ranking and aggregations. Example:
rank().over(Window.partitionBy('dept').orderBy('salary'))."
14. Ranking Salespersons by Sales within a Region
o Question: You have a dataset with columns Region, Salesperson, and
SalesAmount. Write a PySpark script to rank salespersons within each region
based on their sales amount.
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, rank

    window_spec = Window.partitionBy("Region").orderBy(col("SalesAmount").desc())

    df.withColumn("Rank", rank().over(window_spec)).show()
15. Cumulative Sum of Sales for Each Product
o Question: Calculate the cumulative sales (CumulativeSales) for each Product
over time (OrderDate).
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import sum

    window_spec = Window.partitionBy("Product").orderBy("OrderDate")

    df.withColumn("CumulativeSales", sum("Sales").over(window_spec)).show()
16. Percentile (NTILE) Calculation
o Question: Divide employees into quartiles based on their salaries within each
department.
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import ntile

    window_spec = Window.partitionBy("Department").orderBy("Salary")

    df.withColumn("Quartile", ntile(4).over(window_spec)).show()
17. Difference Between Current and Previous Sales
o Question: For each product, calculate the sales difference compared to the
previous day.
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag

    window_spec = Window.partitionBy("Product").orderBy("SalesDate")

    df.withColumn("PreviousSales", lag("SalesAmount").over(window_spec)) \
      .withColumn("SalesDifference", col("SalesAmount") - col("PreviousSales")) \
      .show()
18. Find the Top-N Products per Category
o Question: Identify the top 3 products by revenue within each category.
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, row_number

    window_spec = Window.partitionBy("Category").orderBy(col("Revenue").desc())

    df.withColumn("RowNum", row_number().over(window_spec)) \
      .filter(col("RowNum") <= 3) \
      .show()
19. Average Salary and Rank for Employees
o Question: For each department, calculate the average salary and rank
employees by their salary.
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, avg, rank

    window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())

    df.withColumn("AverageSalary", avg("Salary").over(Window.partitionBy("Department"))) \
      .withColumn("Rank", rank().over(window_spec)) \
      .show()
20. Identify Consecutive Absences
o Question: You have a dataset with EmployeeID, Date, and Status columns.
Calculate how many consecutive days an employee has been marked as
"Absent."
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag, when, count, sum

    window_spec = Window.partitionBy("EmployeeID").orderBy("Date")

    # A new group starts whenever the status changes from the previous day.
    df.withColumn("PreviousStatus", lag("Status").over(window_spec)) \
      .withColumn("NewGroup", when(col("Status") != col("PreviousStatus"), 1).otherwise(0)) \
      .withColumn("GroupID", sum("NewGroup").over(window_spec)) \
      .filter(col("Status") == "Absent") \
      .groupBy("EmployeeID", "GroupID") \
      .agg(count("Date").alias("ConsecutiveAbsences")) \
      .show()
21. Find First and Last Purchase Date
o Question: For each customer, find their first and last purchase date.
o Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import first, last

    # The frame must span the whole partition: with the default frame
    # (unbounded preceding to current row), last() would just return the
    # current row instead of the last purchase.
    window_spec = Window.partitionBy("CustomerID").orderBy("PurchaseDate") \
        .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    df.withColumn("FirstPurchase", first("PurchaseDate").over(window_spec)) \
      .withColumn("LastPurchase", last("PurchaseDate").over(window_spec)) \
      .show()
22. Running Total and Rank by Category
    Question: Compute the running total of sales and rank products within each category.
    Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, sum, rank

    window_spec = Window.partitionBy("Category").orderBy(col("Sales").desc())

    df.withColumn("RunningTotal", sum("Sales").over(window_spec)) \
      .withColumn("Rank", rank().over(window_spec)) \
      .show()
23. Fill Missing Values Using Last Observed Value
    Question: Fill missing sales values (SalesAmount) with the last observed value for each product.
    Answer:

    from pyspark.sql import Window
    from pyspark.sql.functions import last

    # Frame from the start of the partition up to the current row, so
    # last(..., ignorenulls=True) returns the most recent non-null value.
    window_spec = Window.partitionBy("Product").orderBy("SalesDate") \
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    df.withColumn("FilledSalesAmount",
                  last("SalesAmount", ignorenulls=True).over(window_spec)).show()
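This window-based fill is a forward fill. The underlying idea can be sketched in plain Python on hypothetical data: walk the ordered values and carry the last non-null one forward.

```python
# Hypothetical ordered sales values with gaps (None = missing).
sales = [None, 10, None, None, 25, None]

filled, last_seen = [], None
for v in sales:
    if v is not None:
        last_seen = v       # remember the most recent observed value
    filled.append(last_seen)  # leading Nones stay None (nothing observed yet)

print(filled)  # [None, 10, 10, 10, 25, 25]
```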
3. SQL
Subquery & window functions -
https://www.youtube.com/watch?v=dQ7l9k7A_nY&pp=ygUjc3FsIHN1YnF1ZXJ5LCBqb2luLCB3aW5kb3cgZnVuY3Rpb24%3D
Interview questions -
https://www.youtube.com/playlist?list=PLqM7alHXFySGweLxxAdBDK1CcDEgF-Kwx

1. Write a query to find the second-highest salary in a table.
    Answer: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
2. How do you perform joins in SQL? Explain inner, left, and outer joins with examples.
    Answer: "An inner join returns only the rows that match in both tables; a left join returns all rows from the left table, with NULLs where the right table has no match; a full outer join returns all rows from both tables, with NULLs on whichever side has no match."
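The join behaviors described above can be verified with a small in-memory SQLite session (table names and data here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER);
CREATE TABLE departments (id INTEGER, name TEXT);
INSERT INTO employees VALUES (1, 'Ann', 10), (2, 'Bob', 20), (3, 'Cy', NULL);
INSERT INTO departments VALUES (10, 'Sales'), (20, 'IT');
""")

inner = conn.execute(
    "SELECT e.name, d.name FROM employees e JOIN departments d ON e.dept_id = d.id"
).fetchall()
left = conn.execute(
    "SELECT e.name, d.name FROM employees e LEFT JOIN departments d ON e.dept_id = d.id"
).fetchall()

print(len(inner))  # 2 -- Cy has no department, so the inner join drops the row
print(len(left))   # 3 -- the left join keeps Cy, with NULL for the department
```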
3. What is the difference between WHERE and HAVING clauses?
o Answer: "WHERE filters rows before aggregation, HAVING filters after."
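A quick runnable illustration of the difference, using an in-memory SQLite database with hypothetical data: WHERE prunes individual rows before GROUP BY runs, while HAVING prunes whole groups after aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('N', 100), ('N', 50), ('S', 20), ('S', 10), ('E', 300);
""")

# WHERE drops the small rows first (S disappears entirely),
# then HAVING keeps only groups whose total exceeds 100.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE amount > 30 "
    "GROUP BY region HAVING SUM(amount) > 100 ORDER BY region"
).fetchall()
print(rows)  # [('E', 300), ('N', 150)]
```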
4. Explain window functions in SQL. Provide examples.
    Answer: "Window functions compute values over a set of related rows without collapsing them, e.g. rankings and cumulative sums. Example: ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) numbers employees by salary within each department."
5. How do you optimize SQL queries?
    Answer: "By indexing filter and join columns, avoiding SELECT *, and reading the query execution plan to spot full scans."
6. Write a query to calculate the cumulative sum of sales per month.
    Answer: SELECT month, SUM(sales) OVER (ORDER BY month) AS cumulative_sales FROM sales_data;
7. How do you identify duplicate rows in a table?
    Answer: SELECT col1, col2, COUNT(*) FROM table GROUP BY col1, col2 HAVING COUNT(*) > 1;
8. What is a common table expression (CTE), and how do you use it?
    Answer: "A CTE provides a named temporary result set. Example: WITH cte AS (SELECT * FROM table) SELECT * FROM cte WHERE condition;"
9. How do you write a recursive query in SQL?
    Answer: "Using a recursive CTE: an anchor query plus a recursive member that references the CTE itself. Example: WITH RECURSIVE cte(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM cte WHERE n < 10) SELECT * FROM cte;"
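SQLite (via Python's built-in sqlite3 module) supports recursive CTEs, so the pattern can be tried directly; this sketch generates the numbers 1 through 5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE cte(n) AS (
        SELECT 1                              -- anchor member
        UNION ALL
        SELECT n + 1 FROM cte WHERE n < 5     -- recursive member references cte
    )
    SELECT n FROM cte
""").fetchall()

nums = [r[0] for r in rows]
print(nums)  # [1, 2, 3, 4, 5]
```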
10. How do you handle NULL values in SQL?
    Answer: "Test for them with IS NULL / IS NOT NULL, and replace them with COALESCE()."
11. What are windowing functions? Explain rank and dense_rank.
    Answer: "RANK() leaves gaps after ties, DENSE_RANK() doesn't. Example: ranking employees by salary within departments."
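The gap behavior can be illustrated without a database; a small pure-Python sketch over hypothetical salaries sorted in descending order:

```python
# Hypothetical salaries, already sorted descending; note the tie at 400.
salaries = [500, 400, 400, 300]

rank, dense = [], []
for s in salaries:
    # RANK: 1-based position of the first row with this value -> gaps after ties
    rank.append(salaries.index(s) + 1)
    # DENSE_RANK: number of distinct larger values, plus one -> no gaps
    dense.append(len({v for v in salaries if v > s}) + 1)

print(rank)   # [1, 2, 2, 4] -- rank 3 is skipped after the tie
print(dense)  # [1, 2, 2, 3] -- no gap
```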
12. How do you calculate the percentage of total in SQL?
o Answer: SELECT col, SUM(col) * 100.0 / SUM(SUM(col)) OVER () FROM table GROUP BY col;
13. What is the difference between UNION and UNION ALL?
o Answer: "UNION removes duplicates, UNION ALL doesn’t."
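A one-line check in SQLite makes the difference concrete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNION deduplicates the result; UNION ALL keeps every row.
union = conn.execute("SELECT 1 UNION SELECT 1 UNION SELECT 2").fetchall()
union_all = conn.execute("SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 2").fetchall()

print(len(union))      # 2 -- the duplicate 1 is removed
print(len(union_all))  # 3 -- duplicates are kept
```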
14. How do you create indexes in SQL?
o Answer: CREATE INDEX idx_name ON table(col1, col2);
15. How do you write a query to find top-N records in each group?
    Answer: SELECT * FROM (SELECT col, ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY col DESC) AS rnk FROM table) t WHERE rnk <= N;
16. Explain CROSS JOIN with examples.
    Answer: "Returns the Cartesian product. Example: pairing all employees with all departments."
17. Using Subqueries to Filter Data
    Question: Find the employees with a salary higher than the average salary of their department.
    Answer:

    SELECT EmployeeID, Name, Salary, DepartmentID
    FROM Employees
    WHERE Salary > (
        SELECT AVG(Salary)
        FROM Employees AS Sub
        WHERE Sub.DepartmentID = Employees.DepartmentID
    );
18. Aggregation with GROUP BY and HAVING
    Question: Find departments where the total salary expense exceeds $100,000.
    Answer:

    SELECT DepartmentID, SUM(Salary) AS TotalSalary
    FROM Employees
    GROUP BY DepartmentID
    HAVING SUM(Salary) > 100000;
    Window Function for Running Totals
    Question: Calculate the cumulative sales for each salesperson by order date.
    Answer:

    SELECT SalespersonID, OrderDate, Amount,
           SUM(Amount) OVER (PARTITION BY SalespersonID ORDER BY OrderDate) AS CumulativeSales
    FROM Sales;
19. Ranking with ROW_NUMBER
    Question: Find the top 3 highest-paid employees in each department.
    Answer:

    WITH RankedSalaries AS (
        SELECT EmployeeID, Name, DepartmentID, Salary,
               ROW_NUMBER() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS RowNum
        FROM Employees
    )
    SELECT EmployeeID, Name, DepartmentID, Salary
    FROM RankedSalaries
    WHERE RowNum <= 3;
20. Aggregation with Window Functions
    Question: Show each employee's salary along with the average and maximum salary of their department.
    Answer:

    SELECT EmployeeID, Name, Salary, DepartmentID,
           AVG(Salary) OVER (PARTITION BY DepartmentID) AS AvgSalary,
           MAX(Salary) OVER (PARTITION BY DepartmentID) AS MaxSalary
    FROM Employees;
21. Using RANK for Identifying Duplicates
    Question: Identify duplicate products based on name and category but keep only one instance.
    Answer:

    WITH RankedProducts AS (
        SELECT ProductID, Name, CategoryID,
               RANK() OVER (PARTITION BY Name, CategoryID ORDER BY ProductID) AS Rank
        FROM Products
    )
    SELECT ProductID, Name, CategoryID
    FROM RankedProducts
    WHERE Rank = 1;
22. Subqueries with EXISTS
    Question: List all customers who have placed at least one order.
    Answer:

    SELECT CustomerID, Name
    FROM Customers
    WHERE EXISTS (
        SELECT 1
        FROM Orders
        WHERE Orders.CustomerID = Customers.CustomerID
    );
23. Percentile Calculation Using NTILE
    Question: Divide employees' salaries into quartiles within each department.
    Answer:

    SELECT EmployeeID, Name, DepartmentID, Salary,
           NTILE(4) OVER (PARTITION BY DepartmentID ORDER BY Salary) AS Quartile
    FROM Employees;
24. Using LAG to Calculate Differences
    Question: Find the sales difference for each product compared to the previous month.
    Answer:

    SELECT ProductID, SalesDate, SalesAmount,
           LAG(SalesAmount) OVER (PARTITION BY ProductID ORDER BY SalesDate) AS PreviousSales,
           SalesAmount - LAG(SalesAmount) OVER (PARTITION BY ProductID ORDER BY SalesDate) AS SalesDifference
    FROM ProductSales;
25. Combining Window Functions with Aggregation
    Question: For each department, find the employee with the highest salary, but show all employees with their relative rank.
    Answer:

    WITH DepartmentRanked AS (
        SELECT EmployeeID, Name, DepartmentID, Salary,
               RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS SalaryRank
        FROM Employees
    )
    SELECT EmployeeID, Name, DepartmentID, Salary, SalaryRank
    FROM DepartmentRanked;
    -- The highest-paid employee(s) per department are the rows with SalaryRank = 1.