Interview Prep
Project Overview: Built ETL pipelines in Teradata to process daily transaction data from various internal systems, transform it to meet compliance rules, and make it available for dashboards and audits.
I was working with large volumes of financial data. I used Teradata tools like BTEQ and TPT to extract data from different internal systems, then wrote complex SQL to clean, join, and transform the data based on business rules. Finally, I loaded everything into Teradata data marts with partitioning and indexing to support reporting and performance.
Pipeline Breakdown:
Extract: Used BTEQ scripts and Teradata Parallel Transporter (TPT) to extract large volumes of transactional and customer data from multiple systems and flat files.
Transform: Used complex SQL in Teradata to clean, join, and apply business logic. Tasks included identifying high-value transactions, standardizing formats, and removing duplicates (see the SQL sketch after this list).
Load: Loaded the transformed data into Teradata data marts using MultiLoad and TPT. Added indexing and partitioning for performance and supported daily refresh jobs.
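A minimal Teradata SQL sketch of the transform step (illustrative only: the table and column names, such as stg_transactions/dm_transactions, and the 10,000 high-value threshold are assumptions, not the actual business rules):
-- Deduplicate staged transactions, standardize the date, and tag high-value rows
INSERT INTO dm_transactions (txn_id, account_id, txn_date, amount, is_high_value)
SELECT txn_id,
       account_id,
       CAST(txn_ts AS DATE) AS txn_date,
       amount,
       CASE WHEN amount >= 10000 THEN 'Y' ELSE 'N' END AS is_high_value
FROM stg_transactions
QUALIFY ROW_NUMBER() OVER (PARTITION BY txn_id ORDER BY load_ts DESC) = 1;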
Challenges Faced:
Query performance issues due to large volumes – solved with indexing, proper joins, and partitioning.
Data mismatches between systems – handled with validation logic and error handling scripts.
Expandtree Infotech: Built big data pipelines to integrate customer behavior data into a central analytics platform using Hadoop and Spark.
Project Overview: Processed structured and unstructured data from multiple sources to generate reports on customer engagement and product insights.
The setup was more focused on big data. We pulled data from databases and streaming sources like Kafka using tools like Sqoop and Python scripts. Then we used PySpark and HiveQL to transform and clean the data, doing things like joining tables, removing duplicates, and applying business logic. Once the data was ready, we loaded it into Hive tables and HDFS so analysts could run reports or use it in dashboards.
Pipeline Breakdown:
Extract: Ingested data from Oracle and SQL Server using Sqoop, and streamed log files into HDFS using Kafka.
Transform: Used PySpark and HiveQL to clean the data, perform joins, and apply business logic (e.g., tagging users based on activity; see the HiveQL sketch after this list). Automated data enrichment and type conversions using Python scripts and Spark UDFs.
Load: Loaded processed data into Hive tables and HBase, and exported Parquet files to HDFS.
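A short HiveQL sketch of the activity-tagging transform (illustrative only: the user_events table, the 30-day window, and the thresholds are assumptions, not the actual business rules):
-- Tag each user by recent activity level and write to a reporting table
INSERT OVERWRITE TABLE user_engagement
SELECT user_id,
       COUNT(*) AS events_30d,
       CASE WHEN COUNT(*) >= 100 THEN 'active'
            WHEN COUNT(*) >= 10  THEN 'casual'
            ELSE 'dormant' END AS activity_tag
FROM user_events
WHERE event_date >= DATE_SUB(CURRENT_DATE, 30)
GROUP BY user_id;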
Challenges Faced:
Out-of-memory errors in Spark jobs – resolved by tuning Spark configurations and optimizing
transformations.
Schema drift in streaming data – handled using dynamic schema validation and fallback logic.
1. Teradata
What is Teradata and where is it used?
Teradata is a relational database management system (RDBMS) designed for large-scale data
warehousing and analytics. It is widely used for handling large volumes of structured data across
industries like retail, banking, and telecommunications.
Teradata uses an MPP (Massively Parallel Processing) architecture where each AMP processes a portion of the data. Data is distributed across Access Module Processors (AMPs), allowing queries and data loads to run in parallel.
The order of execution of SQL components in Teradata generally follows the logical query plan:
FROM (including joins) → WHERE → GROUP BY → HAVING → ordered analytic (window) functions → QUALIFY → SELECT list → ORDER BY.
Some of the most used commands in Teradata include:
COLLECT STATISTICS: Collects statistics on tables or columns, which helps the optimizer choose
efficient query plans.
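For example (table and column names are illustrative):
COLLECT STATISTICS ON sales_fact COLUMN txn_date;
COLLECT STATISTICS ON sales_fact INDEX (customer_id);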
5. Utilities in Teradata: Teradata provides several utilities for managing data loads and transformations:
TPT (Teradata Parallel Transporter): A unified tool that combines the functionality of FastLoad,
MultiLoad, and TPump.
6. Push Down Predicate in Teradata: Pushdown predicates involve applying filtering conditions at the source (or at an earlier step) during data extraction, which reduces the volume of data processed and improves performance by pushing the filtering work to the ETL source or target system.
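For instance, filtering during extraction instead of after loading (the source table and columns are illustrative):
-- The predicate is pushed down to the source query, so only the previous day's rows are extracted
SELECT txn_id, account_id, amount
FROM src_transactions
WHERE txn_date = CURRENT_DATE - 1;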
7. Collect Statistics in Teradata
Collecting statistics involves gathering information about data distribution and table structure (e.g., the number of rows, column values) to help the optimizer make informed decisions when planning query execution.
8. Optimizer in Teradata
The optimizer in Teradata uses a cost-based approach to determine the most efficient way to execute
queries. It evaluates different strategies, such as the use of indexes, data distribution methods, and join
types, based on available statistics.
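You can inspect the plan the optimizer chooses with EXPLAIN; a quick illustration (my_table and the filter column are placeholders):
EXPLAIN SELECT * FROM my_table WHERE account_id = 100;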
9. Architecture of Teradata
Teradata architecture is built around MPP (Massively Parallel Processing). The key components are:
Parsing Engine (PE): Parses SQL queries, compiles execution plans, and communicates with the AMPs.
Access Module Processors (AMP): Responsible for data storage, retrieval, and parallel processing.
BYNET: A high-speed network connecting PEs to AMPs for communication and data transfer.
Disaster Recovery: Built-in features for data backup and failover support.
Primary Index (PI): A hash function determines how data is distributed across AMPs. It ensures uniform
data distribution, reducing data skew.
Secondary Index (SI): Improves query performance by providing alternative access paths to data.
Skewness: Occurs when data distribution across AMPs is uneven, which can lead to inefficiency due to
some AMPs being overloaded.
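A small illustration of defining a primary and a secondary index (table and column names are hypothetical):
CREATE TABLE customer_txn (
    txn_id      INTEGER,
    customer_id INTEGER,
    txn_date    DATE,
    amount      DECIMAL(18,2)
)
PRIMARY INDEX (txn_id);                      -- rows are hash-distributed across AMPs by txn_id

CREATE INDEX (customer_id) ON customer_txn;  -- secondary index: an alternate access path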
BTEQ (Basic Teradata Query)
BTEQ is a command-line utility for interacting with Teradata databases. It allows users to run SQL
queries, generate reports, and export data to and from Teradata. It can be used for both interactive and
batch processing, making it suitable for administrative tasks, data migration, and reporting.
Key Features:
SQL Execution: You can run SQL queries interactively or in batch mode.
Reporting: BTEQ allows you to format the output of queries for reporting purposes, making it easier to
generate structured reports from Teradata.
Data Export/Import: It supports importing data into Teradata tables and exporting data to external files in
various formats (e.g., CSV, tab-delimited).
Error Handling: BTEQ provides error handling mechanisms, allowing users to capture errors and handle
them programmatically.
.IF: Conditional logic within BTEQ scripts, allowing dynamic query execution based on error levels or
other conditions.
Interactive Mode:
bteq
.logon <hostname>/<username>,<password>
SELECT * FROM my_table;
.logoff
.quit
Batch Mode (Running a Script): Save SQL queries in a file (query.txt), then run it through
bteq < query.txt
Export Data:
.EXPORT FILE=mydata.txt
SELECT * FROM my_table;
.EXPORT RESET
Hadoop
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It works well with structured, semi-structured, and unstructured data.
Key Components:
HDFS (Hadoop Distributed File System): A distributed file system designed to store large files across
machines.
MapReduce: A programming model to process large datasets in parallel across a distributed cluster.
Master Node:
NameNode: Manages the file system metadata and tracks where data blocks are stored.
Slave Node:
DataNode: Stores the actual data blocks and serves read and write requests.
HDFS stores data in blocks, typically 128 MB or 256 MB, replicated across nodes for fault tolerance.
Hadoop divides data processing tasks into smaller units, which are handled by multiple nodes
(MapReduce jobs).
5. Hadoop Commands:
HDFS Commands: hadoop fs -ls, hadoop fs -put, hadoop fs -get, hadoop fs -mkdir, hadoop fs -rm (list, upload, download, create, and delete files in HDFS).
MapReduce Commands: hadoop jar <jar-file> <main-class> <input> <output> submits a job; mapred job -list and mapred job -kill <job-id> monitor or stop running jobs.
6. Hadoop Algorithms:
Hadoop works by dividing the data into small chunks, which are processed in parallel (Map) and then
aggregated (Reduce). Data is distributed across various nodes, ensuring scalability and efficiency.
7. Drawbacks of Hadoop:
Latency: High latency due to disk-based processing, especially for small datasets.
Limited update support: HDFS follows a write-once, read-many model; files cannot be updated in place, which may not suit all applications.
Not suitable for real-time processing: Hadoop is batch-based and doesn’t perform well for real-time
data streaming.
8. Hadoop Utilities:
Hive: A SQL-like data warehouse layer for managing and querying large datasets in HDFS.
Partitioning: Dividing data into subdirectories based on a column value (e.g., date).
Bucketing: Dividing data into a fixed number of files (buckets) based on a column's hash, usually for more efficient query performance (see the HiveQL sketch after this list).
External Tables in Hadoop (commonly used in Hive) allow data to reside outside the system, while still
being managed by Hive.
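A brief HiveQL sketch of a partitioned and bucketed table (table, columns, and bucket count are illustrative):
CREATE TABLE page_views (
    user_id STRING,
    url     STRING
)
PARTITIONED BY (view_date STRING)         -- one HDFS subdirectory per date value
CLUSTERED BY (user_id) INTO 32 BUCKETS    -- fixed number of bucket files per partition
STORED AS ORC;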
Teradata vs Hadoop:
Teradata: A traditional data warehousing solution, optimized for SQL querying and high-performance analytics on structured data.
Hadoop: A distributed system designed for large-scale data storage and batch processing. Hadoop is
more suited for handling unstructured data and massive scale.
Hive Syntax:
CREATE TABLE table_name (column1 datatype, column2 datatype, ...) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
Hadoop Interview Questions
1. What is Hadoop?
Answer: Hadoop is an open-source framework designed for distributed storage and processing of large
datasets. It is highly scalable and fault-tolerant. Hadoop consists of two main components: HDFS
(Hadoop Distributed File System) for storage and MapReduce for processing. Data is stored in HDFS in
blocks, and MapReduce processes the data in parallel across a cluster of machines.
HDFS: A distributed file system that stores data in blocks across multiple nodes.
YARN: The resource management layer that manages and schedules resources across the cluster.
2. What is the difference between NameNode and DataNode?
Answer:
NameNode: It is the master node in Hadoop that manages the file system metadata,
such as the locations of data blocks, the directory structure, and file permissions. It
doesn’t store the actual data but tracks where the data is stored in HDFS.
DataNode: These are the slave nodes in the Hadoop cluster that store the actual data
blocks. Each DataNode is responsible for serving read and write requests from clients.
3. What is the difference between ResourceManager and NodeManager?
Answer:
ResourceManager: It is the master daemon in YARN that manages the cluster resources
and schedules jobs. It makes decisions about where to run tasks and allocates resources
based on availability.
NodeManager: It runs on each slave node and manages resources (CPU, memory) on
that node. It monitors the health of the node and reports to the ResourceManager.
5. How does Hadoop process data?
Answer: Hadoop processes data using the MapReduce programming model. In the Map phase, data is
split into smaller chunks (blocks) and distributed across various nodes in the cluster. Each node
processes its chunk of data in parallel, and then the results are shuffled and sorted. In the Reduce phase,
data is aggregated and combined to produce the final result.
6. What is the difference between Avro, Parquet, and ORC file formats in Hadoop?
Answer:
Avro: A row-based format suitable for serializing data with a schema. It supports efficient
data storage and is good for data transfer and integration.
Parquet: A columnar format optimized for analytical querying. It allows for efficient
column-based compression and is ideal for complex queries and large-scale data
analytics.
ORC: Similar to Parquet, ORC is also a columnar storage format optimized for Hive. It
provides high compression and is more efficient in terms of both storage and query
performance compared to Avro.
7. Why is SequenceFile used in Hadoop?
Answer: SequenceFile is a binary format in Hadoop that stores data in key-value pairs. It is used for
storing large amounts of data efficiently, especially in cases where multiple MapReduce jobs read and
write data in sequence. It allows for better performance and compression compared to text files.
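In Hive, the storage format for a table is chosen with STORED AS; a quick illustration of the formats above (the tables and columns are hypothetical):
CREATE TABLE events_avro    (id INT, payload STRING) STORED AS AVRO;
CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;
CREATE TABLE events_orc     (id INT, payload STRING) STORED AS ORC;
CREATE TABLE events_seq     (id INT, payload STRING) STORED AS SEQUENCEFILE;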
8. How do you list files in HDFS?
Answer: You can list files in HDFS using the command:
hadoop fs -ls /path/to/directory
10. How do you copy a file from HDFS to the local filesystem?
Answer: hadoop fs -get /hdfs/path /local/path
11. What is the difference between partitioning and bucketing in Hive?
Answer:
Partitioning: It divides the data into directories based on the column value (e.g., partitioning by date).
This helps with more efficient querying by narrowing down the search space.
Bucketing: It divides data into a fixed number of files (buckets) based on a column's hash value. It is
used for optimization in certain types of queries, especially when working with joins.
12. What is an external table in Hive?
Answer: An external table in Hive allows data to reside outside the Hive system, usually in HDFS or
other storage systems. Unlike internal tables, external tables do not manage the data itself; Hive only
manages the metadata. The data remains outside Hive, so deleting the table does not delete the data.
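A short HiveQL illustration of an external table (the path and column names are hypothetical):
CREATE EXTERNAL TABLE web_logs (
    user_id STRING,
    url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/web_logs';   -- data stays in HDFS; DROP TABLE removes only the metadata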
13. How does Hadoop differ from Teradata in terms of data storage and processing?
Answer:
Hadoop: It is a distributed system for storing and processing large datasets, often in a batch processing
style. It is scalable and ideal for handling unstructured and semi-structured data.
Teradata: A data warehousing solution optimized for OLAP (Online Analytical Processing) and SQL queries on structured data. It is better suited for high-performance analytics on structured, well-modeled data than for unstructured data at massive scale.
14. How do you create a table in Hive?
Answer: You can create a table in Hive using the following syntax:
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
What are the phases of a MapReduce job?
Answer:
Map phase: Each chunk is processed by the Map function, creating key-value pairs.
Shuffle and Sort phase: Data is shuffled and sorted by keys before being passed to the Reducer.
Reduce phase: The results from the Map phase are aggregated in the Reduce phase.
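As a concrete illustration, a simple HiveQL aggregation (assuming a hypothetical words table with one word per row) compiles to exactly these phases when Hive runs on MapReduce: map tasks emit (word, 1) pairs, the shuffle groups them by key, and reduce tasks sum each group:
SELECT word, COUNT(*) AS occurrences
FROM words
GROUP BY word;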
17. What are the key differences between Hadoop and Teradata?
Hadoop: It is a distributed system for storage and processing of large-scale datasets, suitable for
unstructured data, batch processing, and scalability.
Teradata: It is a traditional data warehousing solution optimized for structured data and high-performance
OLAP workloads using SQL queries.
PySpark Interview Notes: Complete Guide
PySpark is the Python API for Apache Spark, an open-source, distributed computing system that
provides fast and general-purpose cluster-computing capabilities. PySpark enables parallel processing
and big data analytics using Python programming.
Core Concepts:
● Distributed Computing: PySpark divides data into partitions and processes them in parallel
across a cluster.
● Lazy Evaluation: Transformations in Spark are not executed immediately but are recorded in a
DAG and executed only when an action is called.
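A minimal PySpark sketch of lazy evaluation (the data and column names are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 25)], ["name", "age"])
older = df.filter(F.col("age") > 30)   # transformation: only recorded in the DAG, nothing runs yet
print(older.count())                   # action: triggers the actual computation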
1. Spark Core - The foundation for all components; handles memory management, fault recovery, and task scheduling.
2. Spark SQL - Module for working with structured data using DataFrames and SQL queries.
3. MLlib - Machine Learning library with algorithms like classification, regression, clustering, and collaborative filtering.
4. Spark Streaming - Enables scalable, fault-tolerant processing of live data streams.
What is PySpark?
A: PySpark is a Python interface for Apache Spark. It provides the ability to work with RDDs and
DataFrames, and supports SQL, streaming, machine learning, and graph processing.
Defining an explicit schema:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])
PySpark Architecture
Apache Spark follows a master-slave architecture consisting of:
● Driver Program: Initiates the SparkContext/SparkSession and translates user code into a DAG (Directed Acyclic Graph).
● Cluster Manager: Allocates cluster resources (e.g., YARN, Standalone, Kubernetes).
● Executors: Worker processes that run tasks and cache data.
Job execution flow:
1. The driver creates the SparkSession and builds the DAG from the user code.
2. The cluster manager launches executors for the application.
3. The DAG scheduler breaks down the DAG into stages and tasks.
4. Tasks are scheduled on executors, and results are returned to the driver.
SparkSession Initialization:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
DataFrame Operations:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
df.printSchema()
df.columns
# CSV
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)
# JSON
df_json = spark.read.json("data.json")
# Parquet
df_parquet = spark.read.parquet("data.parquet")
# ORC
df_orc = spark.read.orc("data.orc")
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()
Temporary Tables
Temporary View: Exists only for the current SparkSession.
df.createOrReplaceTempView("local_temp")
Global Temporary View: Shared across SparkSessions in the same application and accessed through the global_temp database.
df.createGlobalTempView("global_temp")
Performance Tuning
Performance tuning improves Spark and PySpark applications by adjusting and optimizing system resources and job configuration.
Partition Tuning:
df.repartition(4)   # redistributes data into 4 partitions; involves a full shuffle
df.coalesce(2)      # reduces to 2 partitions without a full shuffle
Q: How is PySpark different from Hadoop MapReduce?
A: PySpark uses in-memory processing, DAG optimization, and better fault tolerance, unlike Hadoop MapReduce, which is disk-based and slower for iterative tasks.
Unique Values
df.select("column_name").distinct().show()    # distinct values of a single column
df.dropDuplicates(["column_name"]).show()     # whole rows, de-duplicated on that column