Interview Prep

The document outlines projects at JPMorgan Chase and Expandtree Infotech focused on building ETL pipelines for financial and customer behavior data. It details the processes of extracting, transforming, and loading data using tools like Teradata and Hadoop, along with challenges faced and solutions implemented. Additionally, it provides an overview of Teradata's architecture, utilities, and optimization techniques, as well as a comparison with Hadoop's framework for handling large datasets.

JPMorgan Chase: Worked on a project focused on building and optimizing ETL pipelines for a financial data warehouse that supported risk and compliance reporting. The goal was to process daily transaction data from various internal systems, transform it to meet compliance rules, and make it available for dashboards and audits.

Pipeline Breakdown:

I was working with large volumes of financial data. I used Teradata tools like BTEQ and TPT to extract data from different internal systems. After that, I wrote complex SQL to clean, join, and transform the data based on business rules. Finally, I loaded everything into Teradata data marts with partitioning and indexing to support reporting and performance.

Extract: Used BTEQ scripts and Teradata Parallel Transporter (TPT) to extract large volumes of transactional and customer data from multiple systems and flat files.

Transform: Used complex SQL in Teradata to clean, join, and apply business logic. Tasks included identifying high-value transactions, standardizing formats, and removing duplicates (a SQL sketch follows the Load step).

Load: Loaded the transformed data into Teradata data marts using MultiLoad and TPT.

Added indexing and partitioning for performance and supported daily refresh jobs.
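
A minimal sketch of the kind of transformation SQL used in this step, assuming hypothetical staging and mart tables and columns:

-- Keep the latest record per transaction and flag high-value transactions
INSERT INTO cmpl_mart.daily_txn
SELECT txn_id,
       cust_id,
       txn_amt,
       CASE WHEN txn_amt >= 10000 THEN 'Y' ELSE 'N' END AS high_value_flag
FROM   stg.raw_txn
QUALIFY ROW_NUMBER() OVER (PARTITION BY txn_id ORDER BY load_ts DESC) = 1;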

Tools Used: Teradata SQL, BTEQ, TPT, FastLoad, JIRA, Python, Agile

Challenges Faced:

Query performance issues due to large volumes – solved with indexing, proper joins, and partitioning.​

Data mismatches between systems – handled with validation logic and error handling scripts.​

Expandtree Infotech: Built big data pipelines to integrate customer behavior data into a central analytics platform using Hadoop and Spark.

Project Overview: Processed structured and unstructured data from multiple sources to generate reports on customer engagement and product insights.

This setup was more focused on big data. We pulled data from databases and streaming sources like Kafka using tools like Sqoop and Python scripts. Then we used PySpark and HiveQL to transform and clean the data, doing things like joining tables, removing duplicates, and applying business logic. Once the data was ready, we loaded it into Hive tables and HDFS so analysts could run reports or use it in dashboards.

Pipeline Breakdown:

Extract: Ingested data from Oracle and SQL Server using Sqoop, and streamed log files using Kafka into HDFS.

Transform: Used PySpark and HiveQL to clean the data, perform joins, and apply business logic (e.g., tagging users based on activity; a PySpark sketch follows the Load step).

Automated data enrichment and type conversions using Python scripts and Spark UDFs.

Load: Loaded processed data into Hive tables and HBase, and exported Parquet files into HDFS.

Made this data available to BI tools like Tableau.
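
A rough PySpark sketch of the user-tagging logic mentioned in the Transform step (the events_df DataFrame and event_count column are hypothetical):

from pyspark.sql import functions as F

# Tag users by activity level based on an aggregated event count.
tagged_df = events_df.withColumn(
    "activity_tag",
    F.when(F.col("event_count") >= 100, "active")
     .when(F.col("event_count") >= 10, "casual")
     .otherwise("dormant")
)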

Tools Used: PySpark, Hive, Sqoop, HDFS, Kafka, MapReduce, Python

Challenges Faced:

Out-of-memory errors in Spark jobs – resolved by tuning Spark configurations and optimizing
transformations.​

Schema drift in streaming data – handled using dynamic schema validation and fallback logic.​

1. Teradata


What is Teradata and where is it used?​
Teradata is a relational database management system (RDBMS) designed for large-scale data
warehousing and analytics. It is widely used for handling large volumes of structured data across
industries like retail, banking, and telecommunications.​

How is Teradata different from other RDBMS?​


Unlike traditional RDBMS, Teradata uses a massively parallel processing (MPP) architecture, allowing it
to process large amounts of data across multiple nodes, improving scalability and performance.​

2. Parallel Processing in Teradata

Teradata uses an MPP (Massively Parallel Processing) architecture in which each AMP processes a
portion of the data. Data is distributed across Access Module Processors (AMPs), allowing queries to
run in parallel and complete faster.

How does parallelism work in Teradata?​


Teradata splits large datasets across multiple AMPs. When a query is executed, the system sends tasks
to each AMP to work on their portion of the data simultaneously, significantly improving performance.​

What is the role of AMPs in parallel processing?​


AMPs handle the data processing independently and work in parallel, allowing faster execution of
queries by distributing the workload evenly across nodes.​
3. Order of Execution in Teradata


The order of execution of SQL components in Teradata generally follows the logical query plan:

FROM: Determine tables involved​

WHERE: Filter rows​

GROUP BY: Aggregate data​

HAVING: Filter groups​

SELECT: Choose columns​

ORDER BY: Sort results

What is the typical SQL execution order in Teradata?​


The typical order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. However,
this may vary based on optimizer decisions.​

Can optimizer change the query execution flow?​


Yes, Teradata’s optimizer may rearrange the order for efficiency to reduce resource consumption or
improve query performance.​
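
For example, in the following illustrative query, the clauses are evaluated roughly in the numbered order shown in the comments (table and columns are hypothetical):

SELECT   dept_id, SUM(salary) AS total_pay      -- 5. SELECT
FROM     employees                              -- 1. FROM
WHERE    hire_date >= DATE '2020-01-01'         -- 2. WHERE
GROUP BY dept_id                                -- 3. GROUP BY
HAVING   SUM(salary) > 100000                   -- 4. HAVING
ORDER BY total_pay DESC;                        -- 6. ORDER BY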

4. Most Commonly Used Commands in Teradata

Explanation:​
Some of the most used commands in Teradata include:

SELECT: To retrieve data from a table.​

INSERT/UPDATE/DELETE: Used to modify data.​

COLLECT STATISTICS: Collects statistics on tables or columns, which helps the optimizer choose
efficient query plans.​

EXPLAIN: Displays the query execution plan.

What is the purpose of the EXPLAIN command?​


The EXPLAIN command shows the execution plan for a query, detailing how the optimizer will access
data (e.g., whether it will use indexes, a full table scan, etc.).​
Why do we use COLLECT STATISTICS?​
COLLECT STATISTICS is used to gather data distribution information on tables and columns, helping
the optimizer make better decisions on query execution plans.​
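
For example, prefixing a query with EXPLAIN returns the optimizer's plan instead of executing the query (the sales table is hypothetical):

EXPLAIN
SELECT cust_id, SUM(txn_amt)
FROM   sales
GROUP BY cust_id;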

5. Utilities in Teradata: Teradata provides several utilities for managing data loads and transformations:

FastLoad: Loads data into empty tables quickly.​

MultiLoad: Supports batch inserts, updates, and deletes.​

TPump: Used for near real-time insert operations.​

TPT (Teradata Parallel Transporter): A unified tool that combines the functionality of FastLoad,
MultiLoad, and TPump.​

What is the difference between FastLoad and MultiLoad?​


FastLoad is used for loading large volumes of data into empty tables, while MultiLoad supports inserts,
updates, and deletes for tables with existing data.​

When would you use TPump over other utilities?​


TPump is suitable when you need to load data incrementally and in near real-time, making it ideal for
environments requiring frequent, smaller updates.​
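
A rough FastLoad script sketch for loading a delimited file into an empty staging table (the TDPID, credentials, table name, and file name are all placeholders):

LOGON tdpid/etl_user,password;
DATABASE stg;
BEGIN LOADING stg.raw_txn ERRORFILES stg.raw_txn_err1, stg.raw_txn_err2;
SET RECORD VARTEXT "|";
DEFINE txn_id (VARCHAR(20)), cust_id (VARCHAR(20)), txn_amt (VARCHAR(20))
FILE = txn_feed.txt;
INSERT INTO stg.raw_txn VALUES (:txn_id, :cust_id, :txn_amt);
END LOADING;
LOGOFF;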

6. Push Down Predicate in Teradata: Pushdown predicates involve applying filtering conditions at the
source (or at an earlier step) during data extraction, which reduces the volume of data processed and
improves performance by pushing the filtering work to the ETL source or target system.

What is pushdown predicate and why is it useful?​


A pushdown predicate moves filter conditions closer to the data source to minimize the data volume
transferred, optimizing the entire ETL process.​

How does Teradata handle filter conditions during ETL?​


Teradata pushes down filter conditions to the source system during extraction, thereby improving
performance by reducing the amount of data being moved and processed.​
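
For example, instead of extracting a full table and filtering downstream, the filter is included in the extraction query itself (table and column names are illustrative):

SELECT txn_id, cust_id, txn_amt
FROM   src_system.transactions
WHERE  txn_date = CURRENT_DATE - 1;   -- only yesterday's rows are moved downstream
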
7. Collecting Statistics


Collecting statistics involves gathering information about data distribution and table structure (e.g., the
number of rows, column values) to help the optimizer make informed decisions when planning query
execution.

Why is collecting statistics important in Teradata?​


Collecting statistics provides the optimizer with data distribution information, enabling it to choose the
most efficient query execution plan.​

How often should you collect stats?​


Statistics should be collected periodically, especially after significant data changes or table modifications,
to ensure the optimizer has up-to-date information for decision-making.​
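
Typical statements, assuming a hypothetical sales table with primary index txn_id:

COLLECT STATISTICS ON sales COLUMN (cust_id);   -- single-column statistics
COLLECT STATISTICS ON sales INDEX (txn_id);     -- statistics on the primary index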

8. Optimizer in Teradata


The optimizer in Teradata uses a cost-based approach to determine the most efficient way to execute
queries. It evaluates different strategies, such as the use of indexes, data distribution methods, and join
types, based on available statistics.

How does the Teradata optimizer work?​


The optimizer analyzes the query and determines the most efficient execution plan based on available
statistics, access paths, and cost estimation models.​

What factors influence the optimizer’s decision?​


Factors include available statistics, table structure, indexing, data distribution, and the query's complexity.​

9. Architecture of Teradata


Teradata architecture is built around MPP (Massively Parallel Processing). The key components are:

Parsing Engine (PE): Parses SQL queries, compiles execution plans, and communicates with the AMPs.​

Access Module Processors (AMPs): Responsible for data storage, retrieval, and parallel processing.​

BYNET: A high-speed network connecting PEs to AMPs for communication and data transfer.​

Disaster Recovery: Built-in features for data backup and failover support.​

Explain Teradata architecture.​


Teradata has a shared-nothing MPP architecture where the Parsing Engine (PE) distributes tasks across
multiple Access Module Processors (AMPs), connected through a high-speed network (BYNET).​
What is the role of BYNET in Teradata?​
BYNET is the communication layer that connects Parsing Engines (PE) to Access Module Processors
(AMPs) to allow data transfer and query execution.

10. Other Basic Concepts

Primary Index (PI): A hash function determines how data is distributed across AMPs. It ensures uniform
data distribution, reducing data skew.​

Secondary Index (SI): Improves query performance by providing alternative access paths to data.​

Skewness: Occurs when data distribution across AMPs is uneven, which can lead to inefficiency due to
some AMPs being overloaded.​

What is skewness and how do you handle it?​


Skewness refers to uneven data distribution, which can lead to performance degradation. To handle it,
you might redistribute the data, optimize the primary index, or review data loading practices.​

Difference between Primary and Secondary Index?​


A Primary Index determines how data is distributed across AMPs, while a Secondary Index offers
additional access paths for queries that do not involve the Primary Index.​
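
A small DDL sketch showing both index types (table and column names are hypothetical):

CREATE TABLE sales
(
  txn_id   INTEGER,
  cust_id  INTEGER,
  txn_amt  DECIMAL(12,2)
)
UNIQUE PRIMARY INDEX (txn_id);       -- drives row distribution across AMPs

CREATE INDEX (cust_id) ON sales;     -- secondary index: alternative access path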

BTEQ (Basic Teradata Query) in Teradata

Explanation:​
BTEQ is a command-line utility for interacting with Teradata databases. It allows users to run SQL
queries, generate reports, and export data to and from Teradata. It can be used for both interactive and
batch processing, making it suitable for administrative tasks, data migration, and reporting.

Key Features:

SQL Execution: You can run SQL queries interactively or in batch mode.​

Reporting: BTEQ allows you to format the output of queries for reporting purposes, making it easier to
generate structured reports from Teradata.​

Data Export/Import: It supports importing data into Teradata tables and exporting data to external files in
various formats (e.g., CSV, tab-delimited).​

Error Handling: BTEQ provides error handling mechanisms, allowing users to capture errors and handle
them programmatically.​

Common Commands in BTEQ:


.RUN FILE: Executes a script containing BTEQ commands or SQL queries.​

.EXPORT: Exports data from a query result to a file.​

.IMPORT: Imports data from a file into Teradata tables.​

.QUIT: Ends the BTEQ session.​

.SET ERRORLEVEL: Controls error handling levels for the session.​

.IF: Conditional logic within BTEQ scripts, allowing dynamic query execution based on error levels or
other conditions.

Interactive Mode:​

bteq
.logon <hostname>/<username>,<password>
SELECT * FROM my_table;
.logoff
.quit
Batch Mode (Running a Script): Save SQL queries in a file (query.txt), then run it through ​
bteq < query.txt
Export Data:​

.EXPORT FILE=mydata.txt
SELECT * FROM my_table;
.EXPORT RESET
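
Error Handling example (a small sketch for a batch script; the table names are placeholders):

.LOGON <hostname>/<username>,<password>
INSERT INTO mydb.target_table
SELECT * FROM mydb.staging_table;
.IF ERRORCODE <> 0 THEN .QUIT 8
.LOGOFF
.QUIT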

What is BTEQ in Teradata?​


BTEQ is a command-line utility in Teradata that allows users to execute SQL queries, manage data, and
create reports in both interactive and batch modes.​

What are some common use cases of BTEQ?​


BTEQ is commonly used for running SQL queries, exporting/importing data, and generating reports in
Teradata databases.​

How do you export data using BTEQ?​


Data can be exported using the .EXPORT command, followed by the query you want to execute. The
data is saved to a file that can be specified in the command.​

What does the .QUIT command do in BTEQ?​


The .QUIT command ends the BTEQ session, logging you out of the Teradata database and closing the
utility.
1. Hadoop Overview:

Hadoop is an open-source framework for storing and processing large datasets in a distributed
computing environment. It works well with structured, semi-structured, and unstructured data.​

Key Components:​

HDFS (Hadoop Distributed File System): A distributed file system designed to store large files across
machines.​

MapReduce: A programming model to process large datasets in parallel across a distributed cluster.​

2. Hadoop Architecture (Master-Slave):

Master Node:​

NameNode: Manages the file system metadata and ensures fault tolerance.​

ResourceManager: Manages resources and schedules tasks for processing.

Slave Node:​

DataNode: Stores data in the HDFS.​

NodeManager: Manages resources on individual slave nodes.​

3. Data Processing in Hadoop (File System):

HDFS stores data in blocks, typically 128 MB or 256 MB, replicated across nodes for fault tolerance.​

Hadoop divides data processing tasks into smaller units, which are handled by multiple nodes
(MapReduce jobs).​

4. File Types in Hadoop:

Text File: Plain text storage in HDFS.​


Sequence File: A binary format suitable for storing data in key-value pairs.​

Avro: A binary format that is schema-based, offering compression and splitting.​

Parquet: A columnar format optimized for queries.​

ORC: Optimized Row Columnar format, used primarily in Hive.​

5. Hadoop Commands:

HDFS Commands:​

hadoop fs -ls /: List files in HDFS.​

hadoop fs -copyFromLocal /local/path /hdfs/path: Copy data from local to HDFS.​

hadoop fs -get /hdfs/path /local/path: Copy data from HDFS to local.​

MapReduce Commands:​

hadoop jar <jar-file> <main-class> <input> <output>: Running a MapReduce job.​

6. Hadoop Algorithms:

Hadoop works by dividing the data into small chunks, which are processed in parallel (Map) and then
aggregated (Reduce). Data is distributed across various nodes, ensuring scalability and efficiency.​

7. Drawbacks of Hadoop:

Complexity: Setting up and managing Hadoop clusters can be complex.

Latency: High latency due to disk-based processing, especially for small datasets.

Data consistency: HDFS follows a write-once, read-many model and does not support in-place updates, which may not suit all applications.​

Not suitable for real-time processing: Hadoop is batch-based and doesn’t perform well for real-time
data streaming.​

8. Hadoop Utilities:
Hive: Data warehouse for managing large datasets.

Pig: High-level platform for creating MapReduce programs.

HBase: NoSQL database that runs on HDFS.

Oozie: Workflow scheduler for Hadoop jobs.

Sqoop: For importing and exporting data from relational databases.​

9. Order of Execution (Basic Pipeline):

Data is loaded into HDFS.

Map phase divides the data into key-value pairs.

Shuffle and Sort phase: Data is sorted based on keys.

Reduce phase aggregates results (a Python sketch of the map and reduce steps follows this list).

Output is written back to HDFS.​
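
As an illustration of the Map and Reduce phases above, here is a Hadoop Streaming style word-count sketch in Python (script names are hypothetical; such jobs are typically submitted with the hadoop-streaming jar using the -mapper, -reducer, -input, and -output options):

# mapper.py - emits one (word, 1) pair per input word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives grouped and sorted by key after the shuffle/sort phase
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")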

10. Bucketing vs Partitioning in Hadoop:

Partitioning: Dividing data into subdirectories, based on a column value (e.g., date).​

Bucketing: Dividing data into a fixed number of files (buckets), usually for more efficient query
performance (a HiveQL sketch of both approaches follows below).​
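
A HiveQL sketch of both approaches (table and column names are hypothetical):

CREATE TABLE events_partitioned (user_id INT, action STRING)
PARTITIONED BY (event_date STRING);

CREATE TABLE events_bucketed (user_id INT, action STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;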

11. External Tables:

External Tables in Hadoop (commonly used in Hive) allow the data to reside outside Hive's warehouse
directory; Hive manages only the table metadata, not the data itself.​
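
A HiveQL sketch, assuming the data already sits in an HDFS directory (names are hypothetical):

CREATE EXTERNAL TABLE web_logs (user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/web_logs';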

12. Hadoop vs Teradata:

Teradata: A traditional data warehousing solution, optimized for SQL querying and high-performance
analytics on structured data.​

Hadoop: A distributed system designed for large-scale data storage and batch processing. Hadoop is
more suited for handling unstructured data and massive scale.​

13. Basic Syntax & Queries (Hive):

Hive Syntax:​
CREATE TABLE table_name (column1 datatype, column2 datatype, ...) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';

SELECT * FROM table_name WHERE condition;

INSERT INTO TABLE table_name VALUES (value1, value2, ...);

14. Stages in Hadoop:

Ingestion: Import data from various sources.

Storage: Data is stored in HDFS.

Processing: Data is processed using MapReduce or other frameworks like Spark.

Analysis: Data analysis is performed via Hive, Pig, or HBase.

Export: Data is exported to external systems.

1. What is Hadoop and how does it work?

Answer: Hadoop is an open-source framework for storing and processing large datasets in a distributed
computing environment. It consists of two primary components:​

HDFS (Hadoop Distributed File System): A distributed file system that stores data
across multiple machines.​

MapReduce: A parallel programming model for processing data in a distributed manner.


Data is divided into blocks and stored across a cluster of machines, and MapReduce
processes the data by dividing tasks into smaller units and distributing them across the
cluster.​

2. What are the core components of Hadoop?

HDFS: Distributed file system for storage.​

MapReduce: Programming model for data processing.​

YARN: Resource management layer that manages resources across the cluster.​

Hive: SQL-like query language for querying large datasets.​

Pig: High-level scripting language for creating MapReduce programs.​


HBase: NoSQL database that runs on HDFS.​

Zookeeper: Coordination service for distributed applications.

3. What is the role of NameNode and DataNode in Hadoop?​

NameNode: It is the master node that stores metadata such as file locations and
directory structure. It does not store the actual data but keeps track of where it is stored
across the cluster.​

DataNode: These are the worker nodes that store the actual data in blocks. They serve
read and write requests for data.​

4. What is the difference between ResourceManager and NodeManager in YARN?​

ResourceManager: It is the master daemon in YARN that manages resources and schedules tasks
across the cluster. It decides where tasks will run and allocates resources.​

NodeManager: It runs on each node in the cluster, managing resources (CPU, memory)
and monitoring the health of that node. It reports the status to the ResourceManager.​

5. How does Hadoop process data in a distributed manner?

Answer: Hadoop processes data in a distributed manner using the MapReduce framework. The data is
split into blocks and distributed across the cluster. In the Map phase, the data is processed in parallel,
and in the Reduce phase, the results are aggregated. This parallelism allows Hadoop to efficiently
handle large datasets.​

6. What is the difference between Avro, Parquet, and ORC file formats in Hadoop?​

Avro: A row-based format that is schema-based and ideal for serialization of data. It
supports compression and splitting, making it suitable for use with MapReduce.​

Parquet: A columnar storage format that is optimized for analytic queries. It is efficient for
reading specific columns and supports compression, making it ideal for large-scale data
analytics.​

ORC: A columnar storage format optimized for Hive. It offers high compression and better
query performance, especially for complex queries.​
7. Why is SequenceFile used in Hadoop?

Answer: SequenceFile is a binary format used to store data as key-value pairs. It is particularly efficient
for use in MapReduce jobs where the data needs to be read and written in sequence. It supports high
compression and is a good choice for storing large datasets.​

8. How do you list files in HDFS?



hadoop fs -ls /path/to/directory

9. How do you copy a file from the local filesystem to HDFS?



hadoop fs -copyFromLocal /local/path /hdfs/path

10. How do you copy a file from HDFS to the local filesystem?

hadoop fs -get /hdfs/path /local/path

11. What is the difference between partitioning and bucketing in Hive?​

Partitioning: It divides the data into directories based on the column value (e.g., partitioning by date).
This helps with more efficient querying by narrowing down the search space.​

Bucketing: It divides data into a fixed number of files (buckets) based on a column's hash value. It is
used for optimization in certain types of queries, especially when working with joins.

12. What is an external table in Hive?

Answer: An external table in Hive allows data to reside outside the Hive system, usually in HDFS or
other storage systems. Unlike internal tables, external tables do not manage the data itself; Hive only
manages the metadata. The data remains outside Hive, so deleting the table does not delete the data.​

13. How does Hadoop differ from Teradata in terms of data storage and processing?

Answer:​

Hadoop: It is a distributed system for storing and processing large datasets, often in a batch processing
style. It is scalable and ideal for handling unstructured and semi-structured data.​

Teradata: A data warehousing solution optimized for OLAP (Online Analytical Processing) and SQL
queries on structured data. It is more suited for high-performance analytics on structured, relational
datasets.​
14. How do you create a table in Hive?
Answer: You can create a table in Hive using the following syntax:​

CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

15. Can you explain the syntax of a SELECT statement in Hive?


Answer: The basic syntax for a SELECT statement in Hive is:​

SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column
HAVING condition
ORDER BY column;

16. What are the stages of a MapReduce job execution?

Answer:​

Input Splitting: Input data is divided into chunks (splits).

Map phase: Each chunk is processed by the Map function, creating key-value pairs.

Shuffle and Sort phase: Data is shuffled and sorted by keys before being passed to the Reducer.

Reduce phase: The results from the Map phase are aggregated in the Reduce phase.

Output: The final results are written to HDFS.​

17. What are the key differences between Hadoop and Teradata?

Hadoop: It is a distributed system for storage and processing of large-scale datasets, suitable for
unstructured data, batch processing, and scalability.

Teradata: It is a traditional data warehousing solution optimized for structured data and high-performance
OLAP workloads using SQL queries.
PySpark Interview Notes: Complete Guide

PySpark is the Python API for Apache Spark, an open-source, distributed computing system that
provides fast and general-purpose cluster-computing capabilities. PySpark enables parallel processing
and big data analytics using Python programming.

Core Concepts:

●​ Distributed Computing: PySpark divides data into partitions and processes them in parallel
across a cluster.​

●​ Lazy Evaluation: Transformations in Spark are not executed immediately but are recorded in a
DAG and executed only when an action is called (see the short example after this list).​

●​ Fault Tolerance: Achieved using RDD lineage information and recomputation.​
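
A short illustration of lazy evaluation, assuming an existing SparkSession named spark and a local data.csv, as in the examples later in these notes:

df = spark.read.csv("data.csv", header=True, inferSchema=True)
filtered = df.filter(df["age"] > 25)   # transformation: only recorded in the DAG
filtered.count()                       # action: triggers the actual computation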

Frameworks within PySpark:

1.​ Spark Core - The foundation for all components, handles memory management, fault recovery,
task scheduling.​

2.​ Spark SQL - Allows querying of structured/semi-structured data using SQL.​

3.​ MLlib - Machine Learning library with algorithms like classification, regression, clustering, and
collaborative filtering.​

4.​ GraphX - For graph-parallel computations.​

5.​ Spark Streaming - For real-time stream processing.​

What is PySpark?

A: PySpark is a Python interface for Apache Spark. It provides the ability to work with RDDs and
DataFrames, and supports SQL, streaming, machine learning, and graph processing.​

What are the main modules of Spark?​

A: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.​

Data Types in PySpark


PySpark has its own data type system under pyspark.sql.types. These are used when defining
schemas manually.

●​ Primitive Types: IntegerType, StringType, BooleanType, FloatType, DoubleType,


LongType, DateType, TimestampType​

●​ Complex Types: ArrayType, MapType, StructType​


Example: Defining Schema Manually
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("city", StringType(), True)
])

PySpark Architecture
Apache Spark follows a master-slave architecture consisting of:

●​ Driver Program: Initiates the SparkContext. Translates user code into DAG (Directed Acyclic
Graph).​

●​ Cluster Manager: Allocates resources. Examples: YARN, Mesos, Kubernetes, Standalone.​

●​ Executors: JVM processes that execute tasks on worker nodes.​

●​ Tasks: Basic unit of execution that operates on a single data partition.​

Lifecycle of a Spark Job:

1.​ User submits code (transformations and actions).​

2.​ Driver creates DAG of stages.​

3.​ DAG scheduler breaks down DAG into stages and tasks.​

4.​ Cluster Manager assigns executors.​

5.​ Tasks are scheduled and executed.​

6.​ Results are sent back to the driver.​

SparkSession Initialization:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()

DataFrame Operations:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
df.printSchema()
df.columns

Filtering and Grouping:


df.filter(df["age"] > 25).show()
df.groupBy("department").count().show()
df.orderBy("salary", ascending=False).show()
Reading Files in PySpark
PySpark supports reading and writing data in multiple formats:

# CSV
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)

# JSON
df_json = spark.read.json("data.json")

# Parquet
df_parquet = spark.read.parquet("data.parquet")

# ORC
df_orc = spark.read.orc("data.orc")

# Avro (requires avro package)


df_avro = spark.read.format("avro").load("data.avro")

Using SQL in Spark


Spark SQL allows you to run SQL queries on DataFrames. First, register a DataFrame as a temporary
view:

df.createOrReplaceTempView("people")
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()

Temporary Tables
Temporary View: Exists only for the current SparkSession.​

Global Temporary View: Lives across multiple SparkSessions.​

df.createOrReplaceTempView("local_temp")
df.createGlobalTempView("global_temp")
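
Note that a global temporary view is resolved against the global_temp database, so with the names above it would be queried as:

spark.sql("SELECT * FROM global_temp.global_temp").show()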

Performance Tuning in PySpark:

Performance tuning means improving the performance of Spark and PySpark applications by adjusting
and optimizing system resources and execution behavior.

Cache/Persist: Used for reusing a DataFrame across multiple actions.


df.cache()
df.persist()
Broadcast Joins:

from pyspark.sql.functions import broadcast


joined = df1.join(broadcast(df2), "id")

Partition Tuning:

df.repartition(4)
df.coalesce(2)

Avoid Wide Transformations: Prefer reduceByKey over groupByKey where possible, since it reduces the amount of data shuffled.​

Skewing and Parallel Processing


●​ Data Skew occurs when one key has a significantly larger number of records.​

●​ Solutions:​

○​ Use salting to distribute skewed keys (see the sketch below).​

○​ Broadcast small tables.​

○​ Use custom partitioner if necessary.​

Parallelism is achieved through multiple executors working on partitions.
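
A rough sketch of the salting approach, assuming a skewed DataFrame named facts, a small DataFrame named dims, a join column named key, and an existing SparkSession named spark (all names hypothetical):

from pyspark.sql import functions as F

NUM_SALTS = 8

# Spread the hot keys on the large side by appending a random salt value.
facts_salted = facts.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the small side once per salt value so every salted key finds a match.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
dims_salted = dims.crossJoin(salts)

joined = facts_salted.join(dims_salted, ["key", "salt"]).drop("salt")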

PySpark vs Teradata vs Hadoop


Feature    | PySpark              | Hadoop MapReduce | Teradata
Language   | Python, Scala, Java  | Java             | SQL
Execution  | In-memory DAG        | Disk I/O         | SQL engine
Use Case   | ETL, ML, Streaming   | Batch jobs       | Warehousing
Speed      | Fast                 | Slower           | Fast for SQL
Cost       | Open source          | Open source      | Expensive licensing


Q: Why is PySpark faster than Hadoop?​

A: PySpark uses in-memory processing, DAG optimization, and better fault tolerance, unlike Hadoop
which is disk-based and slower for iterative tasks.​

Unique Values
df.select("column_name").distinct().show()
df.dropDuplicates(["column_name"]).show()

Finding Files in a Directory


import os

for root, dirs, files in os.walk("/path"):
    for file in files:
        print(os.path.join(root, file))

SQL Order of Execution


1.​ FROM/JOIN: Load data and perform joins.​

2.​ WHERE: Apply row-level filters.​

3.​ GROUP BY: Aggregate rows.​

4.​ HAVING: Filter on aggregate results.​

5.​ SELECT: Return desired columns.​

6.​ ORDER BY: Sort results.​

7.​ LIMIT: Limit number of rows.​

Sample PySpark ETL Pipeline


from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session


spark = SparkSession.builder \
.appName("ETL Pipeline") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

# Step 2: Extract - Load CSV file into DataFrame


df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Step 3: Transform - Drop nulls and filter rows


df_cleaned = df.dropna().filter(df["age"] > 18)
df_transformed = df_cleaned.withColumn("salary_in_k", df_cleaned["salary"] / 1000)

# Step 4: Load - Write to Parquet


output_path = "hdfs:///user/output/data"
df_transformed.write.mode("overwrite").parquet(output_path)
