Interview Prep

The document outlines projects at JPMorgan Chase and Expandtree Infotech focused on building ETL pipelines for financial and customer behavior data. It details the processes of extracting, transforming, and loading data using tools like Teradata and Hadoop, along with challenges faced and solutions implemented. Additionally, it provides an overview of Teradata's architecture, utilities, and optimization techniques, as well as a comparison with Hadoop's framework for handling large datasets.

JPMorgan Chase: Worked on a project focused on building and optimizing ETL pipelines for a financial data warehouse that supported risk and compliance reporting. The goal was to process daily transaction data from various internal systems, transform it to meet compliance rules, and make it available for dashboards and audits.

Pipeline Breakdown:

I was working with large volumes of financial data. I used Teradata tools like BTEQ and TPT to extract data from different internal systems. After that, I wrote complex SQL to clean, join, and transform the data based on business rules. Finally, I loaded everything into Teradata data marts with partitioning and indexing to support reporting and performance.

Extract: Used BTEQ scripts and Teradata Parallel Transporter (TPT) to extract large volumes of transactional and customer data from multiple systems and flat files.

Transform: Used complex SQL in Teradata to clean, join, and apply business logic. Tasks included identifying high-value transactions, standardizing formats, and removing duplicates (a SQL sketch follows the Load step).

Load: Loaded the transformed data into Teradata data marts using MultiLoad and TPT.

Added indexing and partitioning for performance and supported daily refresh jobs.
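
A minimal sketch of the kind of transformation SQL used in this step, assuming hypothetical staging and mart tables and columns:

-- Keep the latest record per transaction and flag high-value transactions
INSERT INTO cmpl_mart.daily_txn
SELECT txn_id,
       cust_id,
       txn_amt,
       CASE WHEN txn_amt >= 10000 THEN 'Y' ELSE 'N' END AS high_value_flag
FROM   stg.raw_txn
QUALIFY ROW_NUMBER() OVER (PARTITION BY txn_id ORDER BY load_ts DESC) = 1;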

Tools Used: Teradata SQL, BTEQ, TPT, FastLoad, JIRA, Python, Agile

Challenges Faced:

Query performance issues due to large volumes – solved with indexing, proper joins, and partitioning.​

Data mismatches between systems – handled with validation logic and error handling scripts.​

Expandtree Infotech: Built big data pipelines to integrate customer behavior data into a central analytics platform using Hadoop and Spark.

Project Overview: Processed structured and unstructured data from multiple sources to generate reports on customer engagement and product insights.

This setup was more focused on big data. We pulled data from databases and streaming sources like Kafka using tools like Sqoop and Python scripts. Then we used PySpark and HiveQL to transform and clean the data, doing things like joining tables, removing duplicates, and applying business logic. Once the data was ready, we loaded it into Hive tables and HDFS so analysts could run reports or use it in dashboards.

Pipeline Breakdown:

Extract: Ingested data from Oracle and SQL Server using Sqoop, and streamed log files using Kafka into HDFS.

Transform: Used PySpark and HiveQL to clean the data, perform joins, and apply business logic (e.g., tagging users based on activity; a PySpark sketch follows the Load step).

Automated data enrichment and type conversions using Python scripts and Spark UDFs.

Load: Loaded processed data into Hive tables and HBase, and exported Parquet files into HDFS.

Made this data available to BI tools like Tableau.
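
A rough PySpark sketch of the user-tagging logic mentioned in the Transform step (the events_df DataFrame and event_count column are hypothetical):

from pyspark.sql import functions as F

# Tag users by activity level based on an aggregated event count.
tagged_df = events_df.withColumn(
    "activity_tag",
    F.when(F.col("event_count") >= 100, "active")
     .when(F.col("event_count") >= 10, "casual")
     .otherwise("dormant")
)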

Tools Used: PySpark, Hive, Sqoop, HDFS, Kafka, MapReduce, Python

Challenges Faced:

Out-of-memory errors in Spark jobs – resolved by tuning Spark configurations and optimizing
transformations.​

Schema drift in streaming data – handled using dynamic schema validation and fallback logic.​

1. Teradata


What is Teradata and where is it used?​
Teradata is a relational database management system (RDBMS) designed for large-scale data
warehousing and analytics. It is widely used for handling large volumes of structured data across
industries like retail, banking, and telecommunications.​

How is Teradata different from other RDBMS?​


Unlike traditional RDBMS, Teradata uses a massively parallel processing (MPP) architecture, allowing it
to process large amounts of data across multiple nodes, improving scalability and performance.​

2. Parallel Processing in Teradata

Teradata uses an MPP (Massively Parallel Processing) architecture in which each AMP processes a
portion of the data. Data is distributed across Access Module Processors (AMPs), allowing queries to
run in parallel and complete faster.

How does parallelism work in Teradata?​


Teradata splits large datasets across multiple AMPs. When a query is executed, the system sends tasks
to each AMP to work on their portion of the data simultaneously, significantly improving performance.​

What is the role of AMPs in parallel processing?​


AMPs handle the data processing independently and work in parallel, allowing faster execution of
queries by distributing the workload evenly across nodes.​
3. Order of Execution in Teradata


The order of execution of SQL components in Teradata generally follows the logical query plan:

FROM: Determine tables involved​

WHERE: Filter rows​

GROUP BY: Aggregate data​

HAVING: Filter groups​

SELECT: Choose columns​

ORDER BY: Sort results

What is the typical SQL execution order in Teradata?​


The typical order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. However,
this may vary based on optimizer decisions.​

Can optimizer change the query execution flow?​


Yes, Teradata’s optimizer may rearrange the order for efficiency to reduce resource consumption or
improve query performance.​
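
For example, in the following illustrative query, the clauses are evaluated roughly in the numbered order shown in the comments (table and columns are hypothetical):

SELECT   dept_id, SUM(salary) AS total_pay      -- 5. SELECT
FROM     employees                              -- 1. FROM
WHERE    hire_date >= DATE '2020-01-01'         -- 2. WHERE
GROUP BY dept_id                                -- 3. GROUP BY
HAVING   SUM(salary) > 100000                   -- 4. HAVING
ORDER BY total_pay DESC;                        -- 6. ORDER BY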

4. Most Commonly Used Commands in Teradata

Explanation:​
Some of the most used commands in Teradata include:

SELECT: To retrieve data from a table.​

INSERT/UPDATE/DELETE: Used to modify data.​

COLLECT STATISTICS: Collects statistics on tables or columns, which helps the optimizer choose
efficient query plans.​

EXPLAIN: Displays the query execution plan.

What is the purpose of the EXPLAIN command?​


The EXPLAIN command shows the execution plan for a query, detailing how the optimizer will access
data (e.g., whether it will use indexes, a full table scan, etc.).​
Why do we use COLLECT STATISTICS?​
COLLECT STATISTICS is used to gather data distribution information on tables and columns, helping
the optimizer make better decisions on query execution plans.​
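
For example, prefixing a query with EXPLAIN returns the optimizer's plan instead of executing the query (the sales table is hypothetical):

EXPLAIN
SELECT cust_id, SUM(txn_amt)
FROM   sales
GROUP BY cust_id;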

5. Utilities in Teradata: Teradata provides several utilities for managing data loads and transformations:

FastLoad: Loads data into empty tables quickly.​

MultiLoad: Supports batch inserts, updates, and deletes.​

TPump: Used for near real-time insert operations.​

TPT (Teradata Parallel Transporter): A unified tool that combines the functionality of FastLoad,
MultiLoad, and TPump.​

What is the difference between FastLoad and MultiLoad?​


FastLoad is used for loading large volumes of data into empty tables, while MultiLoad supports inserts,
updates, and deletes for tables with existing data.​

When would you use TPump over other utilities?​


TPump is suitable when you need to load data incrementally and in near real-time, making it ideal for
environments requiring frequent, smaller updates.​
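
A rough FastLoad script sketch for loading a delimited file into an empty staging table (the TDPID, credentials, table name, and file name are all placeholders):

LOGON tdpid/etl_user,password;
DATABASE stg;
BEGIN LOADING stg.raw_txn ERRORFILES stg.raw_txn_err1, stg.raw_txn_err2;
SET RECORD VARTEXT "|";
DEFINE txn_id (VARCHAR(20)), cust_id (VARCHAR(20)), txn_amt (VARCHAR(20))
FILE = txn_feed.txt;
INSERT INTO stg.raw_txn VALUES (:txn_id, :cust_id, :txn_amt);
END LOADING;
LOGOFF;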

6. Push Down Predicate in Teradata: Pushdown predicates involve applying filtering conditions at the
source (or at an earlier step) during data extraction, which reduces the volume of data processed and
improves performance by pushing the filtering work to the ETL source or target system.

What is pushdown predicate and why is it useful?​


A pushdown predicate moves filter conditions closer to the data source to minimize the data volume
transferred, optimizing the entire ETL process.​

How does Teradata handle filter conditions during ETL?​


Teradata pushes down filter conditions to the source system during extraction, thereby improving
performance by reducing the amount of data being moved and processed.​
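
For example, instead of extracting a full table and filtering downstream, the filter is included in the extraction query itself (table and column names are illustrative):

SELECT txn_id, cust_id, txn_amt
FROM   src_system.transactions
WHERE  txn_date = CURRENT_DATE - 1;   -- only yesterday's rows are moved downstream
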
7. Collecting Statistics


Collecting statistics involves gathering information about data distribution and table structure (e.g., the
number of rows, column values) to help the optimizer make informed decisions when planning query
execution.

Why is collecting statistics important in Teradata?​


Collecting statistics provides the optimizer with data distribution information, enabling it to choose the
most efficient query execution plan.​

How often should you collect stats?​


Statistics should be collected periodically, especially after significant data changes or table modifications,
to ensure the optimizer has up-to-date information for decision-making.​
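
Typical statements, assuming a hypothetical sales table with primary index txn_id:

COLLECT STATISTICS ON sales COLUMN (cust_id);   -- single-column statistics
COLLECT STATISTICS ON sales INDEX (txn_id);     -- statistics on the primary index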

8. Optimizer in Teradata


The optimizer in Teradata uses a cost-based approach to determine the most efficient way to execute
queries. It evaluates different strategies, such as the use of indexes, data distribution methods, and join
types, based on available statistics.

How does the Teradata optimizer work?​


The optimizer analyzes the query and determines the most efficient execution plan based on available
statistics, access paths, and cost estimation models.​

What factors influence the optimizer’s decision?​


Factors include available statistics, table structure, indexing, data distribution, and the query's complexity.​

9. Architecture of Teradata


Teradata architecture is built around MPP (Massively Parallel Processing). The key components are:

Parsing Engine (PE): Parses SQL queries, compiles execution plans, and communicates with the AMPs.​

Access Module Processors (AMPs): Responsible for data storage, retrieval, and parallel processing.​

BYNET: A high-speed network connecting PEs to AMPs for communication and data transfer.​

Disaster Recovery: Built-in features for data backup and failover support.​

Explain Teradata architecture.​


Teradata has a shared-nothing MPP architecture where the Parsing Engine (PE) distributes tasks across
multiple Access Module Processors (AMPs), connected through a high-speed network (BYNET).​
What is the role of BYNET in Teradata?​
BYNET is the communication layer that connects Parsing Engines (PE) to Access Module Processors
(AMPs) to allow data transfer and query execution.

10. Other Basic Concepts

Primary Index (PI): A hash function determines how data is distributed across AMPs. It ensures uniform
data distribution, reducing data skew.​

Secondary Index (SI): Improves query performance by providing alternative access paths to data.​

Skewness: Occurs when data distribution across AMPs is uneven, which can lead to inefficiency due to
some AMPs being overloaded.​

What is skewness and how do you handle it?​


Skewness refers to uneven data distribution, which can lead to performance degradation. To handle it,
you might redistribute the data, optimize the primary index, or review data loading practices.​

Difference between Primary and Secondary Index?​


A Primary Index determines how data is distributed across AMPs, while a Secondary Index offers
additional access paths for queries that do not involve the Primary Index.​
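
A small DDL sketch showing both index types (table and column names are hypothetical):

CREATE TABLE sales
(
  txn_id   INTEGER,
  cust_id  INTEGER,
  txn_amt  DECIMAL(12,2)
)
UNIQUE PRIMARY INDEX (txn_id);       -- drives row distribution across AMPs

CREATE INDEX (cust_id) ON sales;     -- secondary index: alternative access path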

BTEQ (Basic Teradata Query) in Teradata

Explanation:​
BTEQ is a command-line utility for interacting with Teradata databases. It allows users to run SQL
queries, generate reports, and export data to and from Teradata. It can be used for both interactive and
batch processing, making it suitable for administrative tasks, data migration, and reporting.

Key Features:

SQL Execution: You can run SQL queries interactively or in batch mode.​

Reporting: BTEQ allows you to format the output of queries for reporting purposes, making it easier to
generate structured reports from Teradata.​

Data Export/Import: It supports importing data into Teradata tables and exporting data to external files in
various formats (e.g., CSV, tab-delimited).​

Error Handling: BTEQ provides error handling mechanisms, allowing users to capture errors and handle
them programmatically.​

Common Commands in BTEQ:


.RUN FILE: Executes a script containing BTEQ commands or SQL queries.​

.EXPORT: Exports data from a query result to a file.​

.IMPORT: Imports data from a file into Teradata tables.​

.QUIT: Ends the BTEQ session.​

.SET ERRORLEVEL: Controls error handling levels for the session.​

.IF: Conditional logic within BTEQ scripts, allowing dynamic query execution based on error levels or
other conditions.

Interactive Mode:​

bteq
.logon <hostname>/<username>,<password>
SELECT * FROM my_table;
.logoff
.quit
Batch Mode (Running a Script): Save SQL queries in a file (query.txt), then run it through ​
bteq < query.txt
Export Data:​

.EXPORT FILE=mydata.txt
SELECT * FROM my_table;
.EXPORT RESET
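
Error Handling example (a small sketch for a batch script; the table names are placeholders):

.LOGON <hostname>/<username>,<password>
INSERT INTO mydb.target_table
SELECT * FROM mydb.staging_table;
.IF ERRORCODE <> 0 THEN .QUIT 8
.LOGOFF
.QUIT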

What is BTEQ in Teradata?​


BTEQ is a command-line utility in Teradata that allows users to execute SQL queries, manage data, and
create reports in both interactive and batch modes.​

What are some common use cases of BTEQ?​


BTEQ is commonly used for running SQL queries, exporting/importing data, and generating reports in
Teradata databases.​

How do you export data using BTEQ?​


Data can be exported using the .EXPORT command, followed by the query you want to execute. The
data is saved to a file that can be specified in the command.​

What does the .QUIT command do in BTEQ?​


The .QUIT command ends the BTEQ session, logging you out of the Teradata database and closing the
utility.
1. Hadoop Overview:

Hadoop is an open-source framework for storing and processing large datasets in a distributed
computing environment. It works well with structured, semi-structured, and unstructured data.​

Key Components:​

HDFS (Hadoop Distributed File System): A distributed file system designed to store large files across
machines.​

MapReduce: A programming model to process large datasets in parallel across a distributed cluster.​

2. Hadoop Architecture (Master-Slave):

Master Node:​

NameNode: Manages the file system metadata and ensures fault tolerance.​

ResourceManager: Manages resources and schedules tasks for processing.

Slave Node:​

DataNode: Stores data in the HDFS.​

NodeManager: Manages resources on individual slave nodes.​

3. Data Processing in Hadoop (File System):

HDFS stores data in blocks, typically 128 MB or 256 MB, replicated across nodes for fault tolerance.​

Hadoop divides data processing tasks into smaller units, which are handled by multiple nodes
(MapReduce jobs).​

4. File Types in Hadoop:

Text File: Plain text storage in HDFS.​


Sequence File: A binary format suitable for storing data in key-value pairs.​

Avro: A binary format that is schema-based, offering compression and splitting.​

Parquet: A columnar format optimized for queries.​

ORC: Optimized Row Columnar format, used primarily in Hive.​

5. Hadoop Commands:

HDFS Commands:​

hadoop fs -ls /: List files in HDFS.​

hadoop fs -copyFromLocal /local/path /hdfs/path: Copy data from local to HDFS.​

hadoop fs -get /hdfs/path /local/path: Copy data from HDFS to local.​

MapReduce Commands:​

hadoop jar <jar-file> <main-class> <input> <output>: Running a MapReduce job.​

6. Hadoop Algorithms:

Hadoop works by dividing the data into small chunks, which are processed in parallel (Map) and then
aggregated (Reduce). Data is distributed across various nodes, ensuring scalability and efficiency.​

7. Drawbacks of Hadoop:

Complexity: Setting up and managing Hadoop clusters can be complex.

Latency: High latency due to disk-based processing, especially for small datasets.

Data consistency: HDFS follows a write-once, read-many model and does not support in-place updates, which may not suit all applications.​

Not suitable for real-time processing: Hadoop is batch-based and doesn’t perform well for real-time
data streaming.​

8. Hadoop Utilities:
Hive: Data warehouse for managing large datasets.

Pig: High-level platform for creating MapReduce programs.

HBase: NoSQL database that runs on HDFS.

Oozie: Workflow scheduler for Hadoop jobs.

Sqoop: For importing and exporting data from relational databases.​

9. Order of Execution (Basic Pipeline):

Data is loaded into HDFS.

Map phase divides the data into key-value pairs.

Shuffle and Sort phase: Data is sorted based on keys.

Reduce phase aggregates results (a Python sketch of the map and reduce steps follows this list).

Output is written back to HDFS.​
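
As an illustration of the Map and Reduce phases above, here is a Hadoop Streaming style word-count sketch in Python (script names are hypothetical; such jobs are typically submitted with the hadoop-streaming jar using the -mapper, -reducer, -input, and -output options):

# mapper.py - emits one (word, 1) pair per input word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives grouped and sorted by key after the shuffle/sort phase
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")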

10. Bucketing vs Partitioning in Hadoop:

Partitioning: Dividing data into subdirectories, based on a column value (e.g., date).​

Bucketing: Dividing data into a fixed number of files (buckets), usually for more efficient query
performance (a HiveQL sketch of both approaches follows below).​
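
A HiveQL sketch of both approaches (table and column names are hypothetical):

CREATE TABLE events_partitioned (user_id INT, action STRING)
PARTITIONED BY (event_date STRING);

CREATE TABLE events_bucketed (user_id INT, action STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;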

11. External Tables:

External Tables in Hadoop (commonly used in Hive) allow the data to reside outside Hive's warehouse
directory; Hive manages only the table metadata, not the data itself.​
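
A HiveQL sketch, assuming the data already sits in an HDFS directory (names are hypothetical):

CREATE EXTERNAL TABLE web_logs (user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/web_logs';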

12. Hadoop vs Teradata:

Teradata: A traditional data warehousing solution, optimized for SQL querying and high-performance
analytics on structured data.​

Hadoop: A distributed system designed for large-scale data storage and batch processing. Hadoop is
more suited for handling unstructured data and massive scale.​

13. Basic Syntax & Queries (Hive):

Hive Syntax:​
CREATE TABLE table_name (column1 datatype, column2 datatype, ...) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';

SELECT * FROM table_name WHERE condition;

INSERT INTO TABLE table_name VALUES (value1, value2, ...);

14. Stages in Hadoop:

Ingestion: Import data from various sources.

Storage: Data is stored in HDFS.

Processing: Data is processed using MapReduce or other frameworks like Spark.

Analysis: Data analysis is performed via Hive, Pig, or HBase.

Export: Data is exported to external systems.

1. What is Hadoop and how does it work?

Answer: Hadoop is an open-source framework for storing and processing large datasets in a distributed
computing environment. It consists of two primary components:​

HDFS (Hadoop Distributed File System): A distributed file system that stores data
across multiple machines.​

MapReduce: A parallel programming model for processing data in a distributed manner.


Data is divided into blocks and stored across a cluster of machines, and MapReduce
processes the data by dividing tasks into smaller units and distributing them across the
cluster.​

2. What are the core components of Hadoop?

HDFS: Distributed file system for storage.​

MapReduce: Programming model for data processing.​

YARN: Resource management layer that manages resources across the cluster.​

Hive: SQL-like query language for querying large datasets.​

Pig: High-level scripting language for creating MapReduce programs.​


HBase: NoSQL database that runs on HDFS.​

Zookeeper: Coordination service for distributed applications.

3. What is the role of NameNode and DataNode in Hadoop?​

NameNode: It is the master node that stores metadata such as file locations and
directory structure. It does not store the actual data but keeps track of where it is stored
across the cluster.​

DataNode: These are the worker nodes that store the actual data in blocks. They serve
read and write requests for data.​

4. What is the difference between ResourceManager and NodeManager in YARN?​

ResourceManager: It is the master daemon in YARN that manages resources and schedules tasks
across the cluster. It decides where tasks will run and allocates resources.​

NodeManager: It runs on each node in the cluster, managing resources (CPU, memory)
and monitoring the health of that node. It reports the status to the ResourceManager.​

5. How does Hadoop process data in a distributed manner?

Answer: Hadoop processes data in a distributed manner using the MapReduce framework. The data is
split into blocks and distributed across the cluster. In the Map phase, the data is processed in parallel,
and in the Reduce phase, the results are aggregated. This parallelism allows Hadoop to efficiently
handle large datasets.​

6. What is the difference between Avro, Parquet, and ORC file formats in Hadoop?​

Avro: A row-based format that is schema-based and ideal for serialization of data. It
supports compression and splitting, making it suitable for use with MapReduce.​

Parquet: A columnar storage format that is optimized for analytic queries. It is efficient for
reading specific columns and supports compression, making it ideal for large-scale data
analytics.​

ORC: A columnar storage format optimized for Hive. It offers high compression and better
query performance, especially for complex queries.​
7. Why is SequenceFile used in Hadoop?

Answer: SequenceFile is a binary format used to store data as key-value pairs. It is particularly efficient
for use in MapReduce jobs where the data needs to be read and written in sequence. It supports high
compression and is a good choice for storing large datasets.​

8. How do you list files in HDFS?



hadoop fs -ls /path/to/directory

9. How do you copy a file from the local filesystem to HDFS?



hadoop fs -copyFromLocal /local/path /hdfs/path

10. How do you copy a file from HDFS to the local filesystem?

hadoop fs -get /hdfs/path /local/path

11. What is the difference between partitioning and bucketing in Hive?​

Partitioning: It divides the data into directories based on the column value (e.g., partitioning by date).
This helps with more efficient querying by narrowing down the search space.​

Bucketing: It divides data into a fixed number of files (buckets) based on a column's hash value. It is
used for optimization in certain types of queries, especially when working with joins.

12. What is an external table in Hive?

Answer: An external table in Hive allows data to reside outside the Hive system, usually in HDFS or
other storage systems. Unlike internal tables, external tables do not manage the data itself; Hive only
manages the metadata. The data remains outside Hive, so deleting the table does not delete the data.​

13. How does Hadoop differ from Teradata in terms of data storage and processing?

Answer:​

Hadoop: It is a distributed system for storing and processing large datasets, often in a batch processing
style. It is scalable and ideal for handling unstructured and semi-structured data.​

Teradata: A data warehousing solution optimized for OLAP (Online Analytical Processing) and SQL
queries on structured data. It is more suited for high-performance analytics on structured, relational
datasets.​
14. How do you create a table in Hive?
Answer: You can create a table in Hive using the following syntax:​

CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

15. Can you explain the syntax of a SELECT statement in Hive?


Answer: The basic syntax for a SELECT statement in Hive is:​

SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column
HAVING condition
ORDER BY column;

16. What are the stages of a MapReduce job execution?

Answer:​

Input Splitting: Input data is divided into chunks (splits).

Map phase: Each chunk is processed by the Map function, creating key-value pairs.

Shuffle and Sort phase: Data is shuffled and sorted by keys before being passed to the Reducer.

Reduce phase: The results from the Map phase are aggregated in the Reduce phase.

Output: The final results are written to HDFS.​

17. What are the key differences between Hadoop and Teradata?

Hadoop: It is a distributed system for storage and processing of large-scale datasets, suitable for
unstructured data, batch processing, and scalability.

Teradata: It is a traditional data warehousing solution optimized for structured data and high-performance
OLAP workloads using SQL queries.
PySpark Interview Notes: Complete Guide

PySpark is the Python API for Apache Spark, an open-source, distributed computing system that
provides fast and general-purpose cluster-computing capabilities. PySpark enables parallel processing
and big data analytics using Python programming.

Core Concepts:

●​ Distributed Computing: PySpark divides data into partitions and processes them in parallel
across a cluster.​

●​ Lazy Evaluation: Transformations in Spark are not executed immediately but are recorded in a
DAG and executed only when an action is called (see the short example after this list).​

●​ Fault Tolerance: Achieved using RDD lineage information and recomputation.​
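
A short illustration of lazy evaluation, assuming an existing SparkSession named spark and a local data.csv, as in the examples later in these notes:

df = spark.read.csv("data.csv", header=True, inferSchema=True)
filtered = df.filter(df["age"] > 25)   # transformation: only recorded in the DAG
filtered.count()                       # action: triggers the actual computation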

Frameworks within PySpark:

1.​ Spark Core - The foundation for all components, handles memory management, fault recovery,
task scheduling.​

2.​ Spark SQL - Allows querying of structured/semi-structured data using SQL.​

3.​ MLlib - Machine Learning library with algorithms like classification, regression, clustering, and
collaborative filtering.​

4.​ GraphX - For graph-parallel computations.​

5.​ Spark Streaming - For real-time stream processing.​

What is PySpark?

A: PySpark is a Python interface for Apache Spark. It provides the ability to work with RDDs and
DataFrames, and supports SQL, streaming, machine learning, and graph processing.​

What are the main modules of Spark?​

A: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.​

Data Types in PySpark


PySpark has its own data type system under pyspark.sql.types. These are used when defining
schemas manually.

●​ Primitive Types: IntegerType, StringType, BooleanType, FloatType, DoubleType,


LongType, DateType, TimestampType​

●​ Complex Types: ArrayType, MapType, StructType​


Example: Defining Schema Manually
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("city", StringType(), True)
])

PySpark Architecture
Apache Spark follows a master-slave architecture consisting of:

●​ Driver Program: Initiates the SparkContext. Translates user code into DAG (Directed Acyclic
Graph).​

●​ Cluster Manager: Allocates resources. Examples: YARN, Mesos, Kubernetes, Standalone.​

●​ Executors: JVM processes that execute tasks on worker nodes.​

●​ Tasks: Basic unit of execution that operates on a single data partition.​

Lifecycle of a Spark Job:

1.​ User submits code (transformations and actions).​

2.​ Driver creates DAG of stages.​

3.​ DAG scheduler breaks down DAG into stages and tasks.​

4.​ Cluster Manager assigns executors.​

5.​ Tasks are scheduled and executed.​

6.​ Results are sent back to the driver.​

SparkSession Initialization:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()

DataFrame Operations:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
df.printSchema()
df.columns

Filtering and Grouping:


df.filter(df["age"] > 25).show()
df.groupBy("department").count().show()
df.orderBy("salary", ascending=False).show()
Reading Files in PySpark
PySpark supports reading and writing data in multiple formats:

# CSV
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)

# JSON
df_json = spark.read.json("data.json")

# Parquet
df_parquet = spark.read.parquet("data.parquet")

# ORC
df_orc = spark.read.orc("data.orc")

# Avro (requires avro package)


df_avro = spark.read.format("avro").load("data.avro")

Using SQL in Spark


Spark SQL allows you to run SQL queries on DataFrames. First, register a DataFrame as a temporary
view:

df.createOrReplaceTempView("people")
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()

Temporary Tables
Temporary View: Exists only for the current SparkSession.​

Global Temporary View: Lives across multiple SparkSessions.​

df.createOrReplaceTempView("local_temp")
df.createGlobalTempView("global_temp")
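
Note that a global temporary view is resolved against the global_temp database, so with the names above it would be queried as:

spark.sql("SELECT * FROM global_temp.global_temp").show()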

Performance Tuning in PySpark:

Performance tuning means improving the performance of Spark and PySpark applications by adjusting
and optimizing system resources and execution behavior.

Cache/Persist: Used for reusing a DataFrame across multiple actions.


df.cache()
df.persist()
Broadcast Joins:

from pyspark.sql.functions import broadcast


joined = df1.join(broadcast(df2), "id")

Partition Tuning:

df.repartition(4)
df.coalesce(2)

Avoid Wide Transformations: Prefer reduceByKey over groupByKey where possible, since it reduces the amount of data shuffled.​

Skewing and Parallel Processing


●​ Data Skew occurs when one key has a significantly larger number of records.​

●​ Solutions:​

○​ Use salting to distribute skewed keys (see the sketch below).​

○​ Broadcast small tables.​

○​ Use custom partitioner if necessary.​

Parallelism is achieved through multiple executors working on partitions.
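
A rough sketch of the salting approach, assuming a skewed DataFrame named facts, a small DataFrame named dims, a join column named key, and an existing SparkSession named spark (all names hypothetical):

from pyspark.sql import functions as F

NUM_SALTS = 8

# Spread the hot keys on the large side by appending a random salt value.
facts_salted = facts.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the small side once per salt value so every salted key finds a match.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
dims_salted = dims.crossJoin(salts)

joined = facts_salted.join(dims_salted, ["key", "salt"]).drop("salt")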

PySpark vs Teradata vs Hadoop


Feature    | PySpark              | Hadoop MapReduce | Teradata
Language   | Python, Scala, Java  | Java             | SQL
Execution  | In-memory DAG        | Disk I/O         | SQL engine
Use Case   | ETL, ML, Streaming   | Batch jobs       | Warehousing
Speed      | Fast                 | Slower           | Fast for SQL
Cost       | Open source          | Open source      | Expensive licensing


Q: Why is PySpark faster than Hadoop?​

A: PySpark uses in-memory processing, DAG optimization, and better fault tolerance, unlike Hadoop
which is disk-based and slower for iterative tasks.​

Unique Values
df.select("column_name").distinct().show()
df.dropDuplicates(["column_name"]).show()

Finding Files in a Directory


import os

for root, dirs, files in os.walk("/path"):
    for file in files:
        print(os.path.join(root, file))

SQL Order of Execution


1.​ FROM/JOIN: Load data and perform joins.​

2.​ WHERE: Apply row-level filters.​

3.​ GROUP BY: Aggregate rows.​

4.​ HAVING: Filter on aggregate results.​

5.​ SELECT: Return desired columns.​

6.​ ORDER BY: Sort results.​

7.​ LIMIT: Limit number of rows.​

Sample PySpark ETL Pipeline


from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session


spark = SparkSession.builder \
.appName("ETL Pipeline") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

# Step 2: Extract - Load CSV file into DataFrame


df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Step 3: Transform - Drop nulls and filter rows


df_cleaned = df.dropna().filter(df["age"] > 18)
df_transformed = df_cleaned.withColumn("salary_in_k", df_cleaned["salary"] / 1000)

# Step 4: Load - Write to Parquet


output_path = "hdfs:///user/output/data"
df_transformed.write.mode("overwrite").parquet(output_path)
