
AWS Data Engineering Interview Questions

Deepa Vasanthkumar

Contents

- AWS General Topics
- Amazon S3
- AWS Virtual Private Cloud (VPC)
- Amazon EC2 (Elastic Compute Cloud)
- Amazon EMR (Elastic MapReduce)
- AWS Glue
- Glue & EMR Comparison
- Glue Data Catalog
- Glue Streaming ETL
- AWS Lambda
- Amazon Kinesis
- Amazon Athena
- Amazon Redshift
- Amazon RDS for PostgreSQL
- Amazon DynamoDB
- AWS Step Functions
- AWS Identity and Access Management (IAM)
- Amazon Simple Notification Service (SNS)
- Amazon Simple Queue Service (SQS)

AWS General Topics


AWS (Amazon Web Services) is a comprehensive, evolving cloud computing platform provided by Amazon. It offers a mix of infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS). AWS gives organizations tools such as compute power, database storage, and content delivery services.

Here are some common AWS interview questions along with detailed answers to help
you prepare for your interview.

Q. Explain the difference between Amazon S3 and EBS.

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

Amazon Elastic Block Store (EBS) provides block-level storage volumes for use with EC2 instances. EBS volumes are highly available and reliable storage volumes that can be attached to any running instance in the same Availability Zone. EBS is particularly suited for applications that require a database, file system, or access to raw block-level storage.

Q. What is IAM in AWS?


IAM (Identity and Access Management) is a web service that helps you securely control access to AWS resources. You use IAM to control who is authenticated (signed in) and authorized (has permissions) to use resources. IAM provides the ability to control users, security credentials such as access keys, and permissions that control which AWS resources users and applications can access.
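
For example, permissions are typically granted by attaching policies to IAM users or roles. A minimal boto3 sketch of attaching an AWS-managed policy to a role; the role name "etl-job-role" is a hypothetical placeholder, not from the original text:

```python
import boto3

iam = boto3.client("iam")

# Attach the AWS-managed read-only S3 policy to an existing role.
# "etl-job-role" is a hypothetical role name used only for illustration.
iam.attach_role_policy(
    RoleName="etl-job-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# List the managed policies now attached to the role.
for policy in iam.list_attached_role_policies(RoleName="etl-job-role")["AttachedPolicies"]:
    print(policy["PolicyName"], policy["PolicyArn"])
```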

Q. Describe a VPC and its components.

A VPC (Virtual Private Cloud) allows you to provision a logically isolated section of the
AWS Cloud where you can launch AWS resources in a virtual network that you define.
The main components of VPC are:

- Subnets: A range of IP addresses in your VPC.

- Route Tables: Determine where network traffic is directed.

- Internet Gateways: Connect your VPC to the internet.

- NAT Gateways: Allow instances in private subnets to connect to the internet or other AWS services while preventing the internet from initiating a connection with those instances.

- Security Groups: Act as a virtual firewall for your instances to control inbound and outbound traffic.

- Network ACLs: A layer of security for your VPC that acts as a firewall for controlling traffic in and out of one or more subnets.
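
A minimal boto3 sketch of wiring a few of these components together; the CIDR ranges and IDs below are illustrative placeholders, not values from the original text:

```python
import boto3

ec2 = boto3.client("ec2")

# Create a VPC and a public subnet inside it.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

# Attach an internet gateway so the subnet can reach the internet.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# Route 0.0.0.0/0 through the internet gateway and associate the route table with the subnet.
rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)
```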


Q. How do you secure data at rest on AWS?


Securing data at rest on AWS can be achieved by several methods:

- Encryption: Using AWS services such as Amazon S3, EBS, and RDS, which support encryption at rest. AWS offers integrated tools like AWS KMS (Key Management Service) and AWS CloudHSM to manage and rotate encryption keys (see the sketch after this list).

- Access Control: Implementing fine-grained access control using IAM roles and policies
to ensure only authorized users and systems can access your data.

- Network Security: Utilizing VPCs and associated security features such as security
groups and network ACLs to isolate resources and control network access.
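
As a concrete example of the encryption point above, here is a minimal boto3 sketch that writes an object with server-side encryption using a KMS key; the bucket name, key alias, and object key are hypothetical placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted at rest with a customer-managed KMS key (SSE-KMS).
s3.put_object(
    Bucket="example-data-bucket",           # hypothetical bucket name
    Key="reports/2024/sales.csv",
    Body=b"order_id,amount\n1,99.50\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-data-key",   # hypothetical KMS key alias
)

# Confirm the encryption setting on the stored object.
head = s3.head_object(Bucket="example-data-bucket", Key="reports/2024/sales.csv")
print(head["ServerSideEncryption"])  # -> "aws:kms"
```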

Q. What are some strategies to reduce costs in AWS?

- Reserved Instances: Purchase Reserved Instances for services like Amazon EC2 and Amazon RDS to save up to 75% compared to on-demand pricing.

- Auto Scaling: Use Auto Scaling to automatically adjust the amount of computational resources based on the server load, thus reducing costs by minimizing idle resources.

- Right Sizing: Regularly review and adjust your configurations to ensure you are using the optimal resources for your workloads.

- Delete Unattached EBS Volumes: Regularly delete unattached EBS volumes, as you are billed for the storage that you provision.

- Use Cost Explorer: AWS Cost Explorer helps you visualize and manage your AWS spending over time.

Amazon S3

Amazon S3 (Simple Storage Service) is a key service in AWS, offering object storage
through a web service interface. Here are some common interview questions about
Amazon S3, accompanied by detailed answers to help you prepare for a technical
interview.

Amazon S3 is an object storage service offered by Amazon Web Services that provides scalable, high-speed, web-based storage for data backup, archival, and analytics. Unlike file storage, which organizes data into a directory hierarchy, S3 stores data as objects within buckets. Each object is identified by a unique key and can contain up to 5 TB of data.

Q. Can you explain the difference between S3 and EBS?

- Amazon S3 is object storage, good for storing vast amounts of data in a non-structured format. Data is accessible from anywhere and is suited to static files, backups, and big data analytics.

- Amazon EBS (Elastic Block Store) provides block-level storage volumes for persistent data storage for use with EC2 instances. EBS is suitable for applications that require a database, file system, or access to raw block-level storage.


Q. What are S3 Buckets?


An S3 bucket is a container for storing objects in Amazon S3. Each object is stored in a bucket and retrieved via a unique, developer-assigned key. Buckets serve as the basic container in which data is stored in S3, and every object must be contained in a bucket.

Q. Describe the durability and availability of Amazon S3.

Amazon S3 provides high durability and availability. S3 is designed to deliver 99.999999999% (11 9's) durability of objects over a given year, meaning data loss is extremely unlikely. Furthermore, S3 Standard is designed for 99.99% availability, while S3 Standard-IA (Infrequent Access) and One Zone-IA provide slightly lower availability at a reduced cost.

Q. What are some of the storage classes available in S3?

S3 offers several storage classes:

- S3 Standard: For frequently accessed data; offers high durability, availability, and performance.

- S3 Intelligent-Tiering: Moves data automatically between access tiers when access patterns change.

- S3 Standard-IA: For data that is less frequently accessed but requires rapid access when needed.

- S3 One Zone-IA: Similar to Standard-IA, but data is stored in a single Availability Zone.

- S3 Glacier and S3 Glacier Deep Archive: For archiving data, with retrieval times ranging from minutes to hours.

Q. What is S3 Versioning?

S3 Versioning is a feature that allows you to keep multiple versions of an object in the
same bucket. This is used to preserve, retrieve, and restore every version of every object
stored in your S3 bucket. It helps protect from unintended overwrites and deletions.
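
Versioning is enabled per bucket. A minimal boto3 sketch; the bucket name and object key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning for a bucket; new writes to existing keys create new versions.
s3.put_bucket_versioning(
    Bucket="example-data-bucket",  # hypothetical bucket name
    VersioningConfiguration={"Status": "Enabled"},
)

# List all versions of a single object key.
versions = s3.list_object_versions(Bucket="example-data-bucket", Prefix="reports/2024/sales.csv")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["IsLatest"], v["LastModified"])
```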

Q. How does S3 Encryption work?

S3 provides two means of encryption:

- Server-Side Encryption (SSE): Amazon handles the encryption, decryption, and key management. You can choose among three key management options: SSE-S3 (uses keys managed by S3), SSE-KMS (uses AWS Key Management Service), and SSE-C (where you manage the encryption keys).

- Client-Side Encryption: Your data is encrypted on the client side before uploading it to S3.
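
For server-side encryption, a common pattern is to set a default encryption configuration on the bucket so every new object is encrypted automatically. A minimal boto3 sketch; the bucket name and key alias are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Require SSE-KMS by default for all new objects written to the bucket.
s3.put_bucket_encryption(
    Bucket="example-data-bucket",  # hypothetical bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-data-key",  # hypothetical key alias
                },
                "BucketKeyEnabled": True,  # reduce KMS request costs
            }
        ]
    },
)
```
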
Q. Explain the S3 Lifecycle Policies.

Lifecycle policies in S3 are used to manage your objects automatically as they transition to different stages of their lifecycle. Policies can be used to transition objects to less expensive storage classes or schedule objects for automatic deletion after certain periods, helping reduce costs by managing data according to its lifecycle.
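
A minimal boto3 sketch of a lifecycle rule that tiers objects down and eventually expires them; the bucket name, prefix, and day counts are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Transition "logs/" objects to cheaper storage over time, then delete them after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```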

Q. What is the difference between S3 and S3 Glacier?

S3 is used for general, high-speed storage accessible at any time, suitable for a wide
range of applications including websites, mobile apps, and backups. S3 Glacier is a
lower-cost storage service optimized for data archiving and long-term backup, where
retrieval times of several minutes to hours are acceptable.

AWS Virtual Private Cloud (VPC)

AWS Virtual Private Cloud (VPC) is a crucial component of AWS that allows you to provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define.

Q. What is a VPC?

An Amazon Virtual Private Cloud (VPC) is a service that allows you to launch AWS resources into a virtual network that you've defined. This virtual network closely resembles a traditional network that you'd operate in your own data center, with the benefits of using scalable infrastructure on AWS. It enables you to control your virtual networking environment, including selection of your IP address range, creation of subnets, and configuration of route tables and network gateways.

Q. What are subnets?


In AWS, a subnet is a range of IP addresses in your VPC. You can launch AWS resources into a specified subnet. Use a public subnet for resources that need to be connected to the internet, and a private subnet for resources that won't be connected to the internet. Each subnet must reside entirely within one Availability Zone and cannot span zones.

Q. Explain the difference between a security group and a network access control list
(NACL).

- Security Group: Acts as a virtual firewall for your EC2 instances to control inbound and
outbound traffic. Security groups operate at the instance level, they support allow rules
only, and are stateful—responses to allowed inbound traffic are allowed to flow
outbound regardless of outbound rules.

- Network Access Control List (NACL): Acts as a firewall for controlling traffic in and out of one or more subnets. NACLs operate at the subnet level, support both allow and deny rules, and are stateless: responses to allowed inbound traffic are subject to the rules for outbound traffic (and vice versa).
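
A minimal boto3 sketch that adds an inbound rule to a security group; the group ID is a placeholder. Note that security groups have no equivalent "deny" rule, which is exactly the contrast with NACLs described above:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow inbound HTTPS from anywhere on a security group.
# Security groups are stateful, so the response traffic is allowed automatically.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS from anywhere"}],
        }
    ],
)
```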

Q. What is an Internet Gateway, and why is it used?

An Internet Gateway (IGW) is a VPC component that allows communication between instances in your VPC and the internet. It provides a target in your VPC route tables for internet-routable traffic, and performs network address translation for instances that have been assigned public IPv4 addresses.

Q. How do you securely connect your on-premises network to a VPC?
You can securely connect your on-premises network to your Amazon VPC using one of the following options:


- AWS Direct Connect: Establishes a dedicated private connection from a remote network to your VPC. Direct Connect is often used to reduce network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections.

- VPN Connection: Utilizes the public internet to make a secure and encrypted connection between your premises and your VPC. This is accomplished by creating a VPN gateway in your AWS environment, which provides a highly available and secure connection.

Q. What is a NAT Gateway, and why would you use it?

A Network Address Translation (NAT) Gateway enables instances in a private subnet to connect to the internet or other AWS services, while preventing the internet from initiating a connection with those instances. This is particularly useful for updating software in private subnets, or enabling secure outbound internet access for processing tasks.

Q. Describe how routing works in a VPC.
Each VPC has a routing table that contains a set of rules, called routes, that are used to determine where network traffic from your VPC is directed. Each subnet in your VPC must be associated with a route table; the table controls the routing for the subnet. The default route table includes a local route for communication within the VPC, and additional routes can be added that direct traffic to specific destinations, such as an internet gateway, NAT gateway, VPN connection, or AWS Direct Connect.

Q. What are VPC Endpoints?

VPC endpoints enable you to privately connect your VPC to supported AWS services and VPC endpoint services powered by PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC do not require public IP addresses to communicate with resources in the service. Traffic between your VPC and the other service does not leave the Amazon network.
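
A minimal boto3 sketch that creates a gateway endpoint so instances in the VPC can reach S3 without traversing the internet; the VPC ID, route table ID, and region are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3: adds routes for S3 to the listed route tables.
resp = ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",  # S3 in the same region
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],   # hypothetical route table ID
)
print(resp["VpcEndpoint"]["VpcEndpointId"])
```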

Amazon EC2 (Elastic Compute Cloud)

Amazon EC2 (Elastic Compute Cloud) is a central part of Amazon's cloud computing platform, AWS. It allows users to rent virtual computers on which they run their own computer applications.

Q. What is Amazon EC2?

Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. EC2's simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon's proven computing environment.

Q. Can you explain the different types of instances available in EC2?

EC2 provides a variety of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity and give you the flexibility to choose the appropriate mix of resources for your applications. The main categories include:

- General Purpose: Balanced CPU, memory, and networking, suitable for a variety of
applications.

- Compute Optimized: Ideal for compute-bound applications that benefit from high-performance processors.

- Memory Optimized: Designed to deliver fast performance for workloads that process
large data sets in memory.

- Storage Optimized: Optimized for workloads that require high, sequential read and write access to very large data sets on local storage.

- Accelerated Computing: Use hardware accelerators, or co-processors, to perform functions such as floating-point number calculations, graphics processing, or data pattern matching more efficiently than is possible in software running on CPUs.

Q. What are Amazon Machine Images (AMIs)?

An Amazon Machine Image (AMI) provides the information required to launch an instance, which is a virtual server in the cloud. You can think of an AMI as a template of the operating system (OS), the application server, and the applications that configure the EC2 instance. AMIs are region-specific and can include one or more EBS snapshots, instance store-backed volumes, permissions to specify which AWS accounts can use the AMI, and a block device mapping.

Q. How do you secure an EC2 instance?


Securing an EC2 instance involves several steps:

- Security Groups: Control inbound and outbound traffic to instances. You should
configure security groups to allow the minimum necessary traffic for your instances.

- Key Pairs: Used for secure SSH access to your instances. Always keep your private
keys secure.

- IAM Roles: Use IAM roles to securely give applications that run on your EC2 instances permission to make AWS API requests.

- Network ACLs: Act as a firewall for associated subnets, controlling both inbound and outbound traffic at the subnet level.

- Patch Management: Regularly apply security patches to your instance's operating system.

- Encryption: Use AWS services like EBS encryption to protect data at rest.

Q. Explain Elastic Load Balancing.

Elastic Load Balancing automatically distributes incoming application traffic across multiple targets, such as EC2 instances, containers, and IP addresses. It can handle the varying load of your application traffic in a single Availability Zone or across multiple Availability Zones. Elastic Load Balancing offers three types of load balancers that all feature the high availability, automatic scaling, and robust security necessary to distribute application loads in a fault-tolerant way.


Q. What are Spot Instances?

Spot Instances allow you to request spare Amazon EC2 computing capacity for up to
90% off the On-Demand price. Spot Instances are recommended for applications that
have flexible start and end times, applications that are only feasible at very low compute
prices, and users with urgent computing needs for large amounts of additional capacity.

Q. Describe how you can vertically scale an EC2 instance.

Vertical scaling refers to increasing the size of an EC2 instance. For example, if you start with a smaller instance type and realize that your application needs more CPU, memory, or IO, you can stop your instance and change its instance type to a more powerful one, then start it again. This is a simple way to scale for applications that are not designed to scale out across multiple instances.
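
A minimal boto3 sketch of that stop, resize, and start sequence; the instance ID and target instance type are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical instance ID

# Stop the instance and wait until it is fully stopped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Change the instance type (vertical scaling), then start it again.
ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "m5.2xlarge"})
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```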

Amazon EMR (Elastic MapReduce)

Amazon EMR (Elastic MapReduce) is a cloud-native big data platform, allowing businesses to process vast amounts of data quickly and cost-effectively across resizable clusters of Amazon EC2 instances. Here are some common interview questions related to Amazon EMR, along with detailed answers, that might be asked in a technical interview:

Q. What is Amazon EMR?

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze large datasets. The service manages the provisioning, configuration, and tuning of the cloud infrastructure so that users can focus on processing their data. EMR is designed to be cost-efficient and to reduce the complexity of hardware provisioning, cluster setup, configuration, and tuning of big data frameworks.

Q. How does EMR handle data processing?

EMR helps in processing large amounts of data quickly by distributing the data across a resizable cluster of Amazon EC2 instances. It uses popular data processing frameworks like Hadoop, Spark, HBase, Presto, and Hive. Users can write their data processing jobs in various languages like Python, Scala, or Java. EMR takes care of distributing the code and data to the instances, running the jobs, and storing the results in an Amazon S3 bucket or passing them on to other AWS services for further processing.

Q. What are the main components of an EMR cluster?

An EMR cluster typically consists of the following components:

- Master node: Manages the distribution of data and tasks to the other nodes in the cluster. There is only one master node.

- Core nodes: These nodes run tasks and store data in the Hadoop Distributed File System (HDFS) across your cluster.

- Task nodes: Optional nodes that only process data and do not store data in HDFS. They are used to enhance processing power.
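
A minimal boto3 sketch that launches a transient cluster with these node roles and one Spark step; the names, node counts, S3 paths, and EMR release label are illustrative placeholders:

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="nightly-spark-etl",                # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",               # example release label
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-logs-bucket/emr/",  # hypothetical log location
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-code-bucket/jobs/etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```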


pa
ee
D

Q. What file systems are supported by Amazon EMR?

EMR supports several file systems:

- HDFS (Hadoop Distributed File System): The default, distributed file system used by
Hadoop components.

- EMR File System (EMRFS): An implementation of the Hadoop file system interface that allows Hadoop to interact directly with data stored in Amazon S3. EMRFS is used for scenarios where you want the data to be stored in S3 for durability, cost-effectiveness, or easy scalability.

- Local File System: The disk file system of the EC2 instances in the cluster.

Q. How does Amazon EMR integrate with other AWS services?

Amazon EMR integrates with several AWS services to enhance its capabilities:
- Amazon S3: For durable, cost-effective storage of big data processing results or as a data lake.

- Amazon RDS and Amazon DynamoDB: For direct database queries within EMR jobs.

- AWS Data Pipeline: For workflow management of data between various AWS compute
and storage services.

- Amazon CloudWatch: For monitoring cluster performance and running automated actions based on specific conditions.

- AWS Identity and Access Management (IAM): For securing access to EMR resources.

Q. What are the cost optimization strategies for EMR?

To optimize costs in EMR, consider the following strategies:

- Use Spot Instances: Utilize Spot Instances for task nodes where applicable, as they can
reduce the cost significantly.

- Right-size the cluster: Choose the right number and type of nodes based on the workload to avoid overprovisioning.

- Shut down idle clusters: Terminate clusters when not in use, or use auto-scaling to
scale down resources automatically.
- Use Reserved Instances: For long-running clusters, Reserved Instances provide a significant discount over On-Demand pricing.

Q. Explain how security is managed in EMR.

Security in Amazon EMR is managed through:

- IAM Roles: To control access to AWS resources.

- Security Groups: Act as virtual firewalls to control inbound and outbound traffic to the EC2 instances.

- Encryption: At rest using Amazon S3 with EMRFS, or in transit using TLS across nodes.

- Kerberos Authentication: Provides a mechanism for authentication of users accessing
the cluster.

AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. It is a serverless data integration service for discovering, preparing, and combining data for analytics, machine learning, and application development. AWS Glue provides both visual and code-based interfaces to make data integration simple. It automatically discovers and catalogs metadata about your data stores into a central catalog that makes the data readily searchable and queryable by services like Amazon Athena and Amazon Redshift Spectrum.


Q. Can you explain the components of AWS Glue?


AWS Glue consists of several key components:


- Data Catalog: A central repository to store structural and operational metadata for all
your data assets.

- Crawler: A tool that connects to your data source, progresses through a prioritized list
of classifiers to determine the schema for your data, and then creates metadata tables
in your data catalog.

- ETL Jobs: Scripts generated automatically by AWS Glue or written by you to transform,
flatten, and enrich data in various formats across data stores.

- Triggers: Conditions that you specify for starting ETL jobs and crawlers. Triggers can be time-based or event-based.

- Development Endpoint: An environment to develop and test your ETL scripts
interactively.
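
A minimal boto3 sketch that creates and runs a crawler against an S3 path, populating the Data Catalog; the role ARN, database name, and S3 path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes table metadata to the Data Catalog.
glue.create_crawler(
    Name="sales-data-crawler",                              # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
    DatabaseName="sales_db",                                # Data Catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://example-data-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",                           # optional: run daily at 02:00 UTC
)

# Run it once on demand.
glue.start_crawler(Name="sales-data-crawler")
```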
an
Q. How does AWS Glue handle schema evolution?

AWS Glue can handle schema evolution. When schema changes are detected, AWS Glue updates the schema in its Data Catalog and automatically revises ETL jobs accordingly. This means that when your data structures change, Glue can adapt to accommodate these changes without manual intervention, ensuring that downstream processes are not disrupted.

Q. What are AWS Glue Crawlers and what do they do?

AWS Glue Crawlers are used to populate the AWS Glue Data Catalog with metadata tables, generated based on the schema of the data source they assess. Crawlers can connect to a source or target data store, classify the data format, infer schemas, and create metadata tables in the Data Catalog. They can be scheduled to run periodically, ensuring that the Data Catalog is kept up to date with changes in the data store.

Q. What is the AWS Glue Data Catalog?

The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. It is fully managed and serves as a persistent metadata store for all your ETL and data discovery activities across all your data silos. It integrates with services like Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, providing a unified view of all your data assets.

Q. Explain the types of triggers in AWS Glue.

Triggers in AWS Glue can be classified into three types:

- Schedule-based Triggers: These triggers start jobs at specific times using a cron-like syntax.

- Event-based Triggers: These are activated by specific events, such as the successful
completion of another job.

- On-demand Triggers: These are manual triggers that start jobs whenever they are activated by the user.


Q. What is a Job Bookmark in AWS Glue?

Job bookmarks in AWS Glue help manage state information and prevent the reprocessing of old data during subsequent runs of an ETL job. When enabled, Glue keeps track of data that has already been processed, which means that only new and changed data is processed in subsequent runs. This feature is particularly useful when dealing with incremental loads.
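
Job bookmarks are switched on per job run through job arguments. A minimal boto3 sketch; the job name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# Start a job run with bookmarks enabled so previously processed data is skipped.
run = glue.start_job_run(
    JobName="sales-incremental-etl",  # hypothetical Glue job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
print(run["JobRunId"])

# Other valid values are "job-bookmark-disable" and "job-bookmark-pause".
```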

Q. How does AWS Glue integrate with other AWS services?

AWS Glue integrates with various AWS services to enhance its data integration
capabilities:

- Amazon S3: Used as a data source and target for ETL jobs.

- AWS Lambda: Can trigger ETL jobs in response to events.

- Amazon Redshift: Can directly run transformation jobs and output the results into
Redshift.
- Amazon Athena: Uses the Data Catalog as a central schema repository.

- Amazon RDS and Amazon DynamoDB: Can be sources or targets for ETL jobs.

Glue & EMR Comparison



AWS Glue is a managed ETL service focused on simplifying data preparation and
integration tasks, while Amazon EMR is a fully managed big data platform designed for
running distributed data processing frameworks at scale. AWS Glue is a fully managed
extract, transform, and load (ETL) service designed to automate data preparation tasks.
Amazon EMR is a fully managed big data platform that provides frameworks like
Apache Hadoop, Apache Spark, and Presto for processing large datasets.

AWS Glue is primarily used for data integration, cataloging, and ETL tasks. It helps
automate the process of discovering, preparing, and combining data for analytics.
Amazon EMR is used for processing and analyzing large datasets using distributed computing frameworks. It's ideal for running big data processing tasks like batch processing, data transformation, and machine learning.

AWS Glue provides a serverless architecture where you define ETL jobs using a visual interface or code. Glue manages the underlying infrastructure, including provisioning resources and scaling based on demand. Amazon EMR is a managed cluster platform where you provision and configure clusters of EC2 instances to run big data frameworks. You have more control over the cluster configuration and can customize the environment to meet specific requirements.

AWS Glue provides a simpler and more streamlined approach to ETL tasks, suitable for users who prefer a managed service and don't require fine-grained control over the underlying infrastructure. Amazon EMR offers more flexibility and customization options, allowing users to choose from a wide range of big data frameworks, adjust cluster configurations, and install custom software packages.


Glue Data Catalog


The AWS Glue Data Catalog is a centralized metadata repository that stores structural and operational metadata for all your data assets. It provides a unified view of your data, making it easy to discover, catalog, and query data across different data stores and AWS services.

Q. What are the key features of AWS Glue Data Catalog?

Key features of AWS Glue Data Catalog include:

- Metadata Storage: Stores metadata such as table definitions, partitions, schemas, and
statistics.

- Data Discovery: Allows users to search and explore data assets using a simple interface.

- Schema Evolution: Supports schema evolution to handle changes in data structures
over time.
- Integration: Integrates seamlessly with other AWS services like Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR.

- Custom Metadata: Allows users to define custom metadata attributes and tags for
their data assets.

Q. How does AWS Glue Data Catalog compare to traditional metadata management
solutions?

Compared to traditional metadata management solutions, AWS Glue Data Catalog offers several advantages:

- Fully Managed: AWS Glue Data Catalog is a fully managed service, eliminating the need
for infrastructure management.

- Scalability: It scales automatically to handle large volumes of metadata.

- Integration with AWS Services: It seamlessly integrates with other AWS services,
simplifying data integration and analysis workflows.

- Serverless Architecture: It follows a serverless architecture, enabling users to focus on data management tasks rather than infrastructure maintenance.

Q. What are the benefits of using AWS Glue Data Catalog for data governance?

AWS Glue Data Catalog provides several benefits for data governance:
- Centralized Metadata Repository: It serves as a centralized repository for storing metadata, making it easier to manage and govern data assets.


- Data Lineage: It tracks the lineage of data assets, providing insights into data
movement and transformation processes.

- Access Control: It supports fine-grained access control, allowing administrators to control who can access and modify metadata.

- Data Quality Monitoring: It enables data quality monitoring by capturing statistics and
metrics about data assets.

Q. How can you integrate AWS Glue Data Catalog with other AWS services?

AWS Glue Data Catalog integrates seamlessly with other AWS services through its APIs
and native integrations. For example:

- It can be used as a metadata repository for Amazon Athena, enabling users to run SQL queries directly against data stored in Amazon S3.

- It can be integrated with Amazon Redshift Spectrum to query data in Amazon S3 using
Redshift.
- It can be used with AWS Glue ETL jobs to transform and load data into various data stores.

Q. What is the significance of custom metadata in AWS Glue Data Catalog?

Custom metadata in AWS Glue Data Catalog allows users to define additional attributes and tags for their data assets. This can include business metadata, such as data ownership, data lineage, and data classification, as well as technical metadata, such as data format, compression type, and encryption settings. Custom metadata enhances data governance, data lineage tracking, and data discovery capabilities.

Glue Streaming ETL

AWS Glue Streaming ETL is a feature of AWS Glue that enables real-time data processing and analytics by continuously ingesting and transforming streaming data. It allows you to build scalable, serverless streaming data pipelines for processing data in motion.

Q. How does AWS Glue Streaming ETL differ from batch ETL?

AWS Glue Streaming ETL processes data in real time as it arrives, while batch ETL processes data in discrete batches. With streaming ETL, data is processed as soon as it becomes available, enabling low-latency processing and real-time analytics. Batch ETL, on the other hand, processes data in fixed-size batches, which may introduce additional latency.

Q. What are some use cases for AWS Glue Streaming ETL?

Some use cases for AWS Glue Streaming ETL include:


- Real-time analytics and monitoring


- Fraud detection and anomaly detection


- Real-time recommendation engines

- Clickstream analysis and user behavior tracking

- IoT data processing and analysis

Q. How does AWS Glue Streaming ETL handle fault tolerance and scalability?

AWS Glue Streaming ETL automatically handles fault tolerance and scalability by distributing data processing across multiple worker nodes. It monitors the health and performance of worker nodes and automatically scales the resources up or down based on the workload. In case of failures, it retries failed tasks and ensures that data processing continues without interruption.

Q. Can you explain the architecture of AWS Glue Streaming ETL?

The architecture of AWS Glue Streaming ETL consists of the following components:
- Streaming Sources: Data streams from sources such as Amazon Kinesis Data Streams, Apache Kafka, or Amazon Managed Streaming for Apache Kafka (Amazon MSK).

- AWS Glue Streaming ETL Job: A Glue job that ingests streaming data, applies
transformations, and outputs the results to a destination.
- Worker Nodes: EC2 instances provisioned by AWS Glue to process data in parallel.

- Data Destination: The target data store where transformed data is stored, such as
Amazon S3, Amazon Redshift, or Amazon DynamoDB.

Q. How does AWS Glue Streaming ETL integrate with other AWS services?

AWS Glue Streaming ETL integrates seamlessly with other AWS services to build
end-to-end streaming data pipelines. For example:

- It can ingest data from Amazon Kinesis Data Streams or Apache Kafka.

- It can output transformed data to Amazon S3, Amazon Redshift, or Amazon
DynamoDB.

- It can trigger AWS Lambda functions or AWS Step Functions based on specific events or conditions.

Q. What are some best practices for using AWS Glue Streaming ETL?

Some best practices for using AWS Glue Streaming ETL include:

- Designing fault-tolerant and scalable data pipelines.


- Optimizing data partitioning and compression for efficient processing.

- Monitoring job performance and resource utilization.

- Using schema validation and data validation to ensure data quality.

- Leveraging encryption and access controls to secure sensitive data.

Q. How does AWS Glue Streaming ETL handle late-arriving data?

AWS Glue Streaming ETL provides built-in support for handling late-arriving data. It maintains a watermark to track the progress of data ingestion and processing. If data arrives late, it is processed with a timestamp that falls within the watermark window. This ensures that late-arriving data is correctly processed and integrated into the output.

AWS Lambda

AWS Lambda is a serverless computing service provided by Amazon Web Services (AWS) that allows you to run code without provisioning or managing servers. You can upload your code to Lambda and AWS takes care of automatically scaling and managing the underlying infrastructure required to run your code in response to events.

Q. How does AWS Lambda work?

AWS Lambda works by executing code in response to events triggered by other AWS services or custom events. You upload your code to Lambda and define the event sources (e.g., Amazon S3, Amazon DynamoDB, Amazon Kinesis) that trigger the execution of your code. When an event occurs, Lambda automatically provisions the necessary compute resources, runs your code, and then scales down to zero when the code has finished executing.

Q. What are the benefits of using AWS Lambda?

Some benefits of using AWS Lambda include:

- No server management: Lambda automatically scales and manages the underlying
infrastructure for you.

- Cost-effective: You only pay for the compute time used by your code.

- Scalable: Lambda can scale automatically to handle high volumes of requests.

- Fully managed: AWS takes care of patching, monitoring, and maintaining the
infrastructure.
- Integrated with AWS services: Lambda integrates seamlessly with other AWS services, enabling you to build serverless architectures easily.


Q. What languages are supported by AWS Lambda?


AWS Lambda supports several programming languages, including:


- Node.js

- Python

- Java

- Go

- .NET Core (C#)

Q. What is a Lambda function?

A Lambda function is the code that you upload to AWS Lambda to be executed in response to events. It contains the code logic that defines what happens when the function is invoked.
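
A minimal sketch of a Python Lambda handler that reacts to an S3 event; the bucket and key fields follow the standard S3 event structure, and the processing itself is a placeholder:

```python
import json


def lambda_handler(event, context):
    """Triggered by an S3 event; logs which objects were created."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")  # placeholder for real processing

    return {"statusCode": 200, "body": json.dumps({"processed": len(event.get("Records", []))})}
```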

Q. How do you trigger a Lambda function?
Lambda functions can be triggered by various event sources, including:

- Changes in data stored in Amazon S3


- Updates to records in Amazon DynamoDB


- Messages published to Amazon SNS topics

- Events generated by AWS services such as AWS CloudWatch Events or AWS IoT

Q. Can you explain the concept of cold starts in AWS Lambda?

Cold start is the term used to describe the latency that occurs the first time a Lambda function is invoked or when it hasn't been invoked for a while. During a cold start, AWS Lambda needs to provision the required compute resources to run the function, which can result in increased latency. Subsequent invocations of the function reuse the existing resources, resulting in lower latency (known as warm starts).

Q. How can you optimize the performance of AWS Lambda functions?

Some ways to optimize the performance of Lambda functions include:

- Reducing the size of the deployment package

- Minimizing initialization code and dependencies
- Using provisioned concurrency to keep functions warm

- Implementing efficient error handling and retries


- Tuning memory allocation and function timeouts


Q. What are the limitations of AWS Lambda?

Limitations of AWS Lambda include:

- Maximum execution duration (up to 15 minutes per invocation)

- Maximum memory allocation (up to 10 GB)

- Maximum deployment package size (up to 250 MB unzipped)

- Concurrency limits (default is 1000 concurrent executions per region)
- Statelessness (Lambda functions are stateless and have no persistent storage)

Q. How can you monitor and debug AWS Lambda functions?


You can monitor and debug Lambda functions using Amazon CloudWatch Logs, which capture log output from your function. You can also use CloudWatch metrics to monitor function performance and errors.

Amazon Kinesis

Amazon Kinesis is a platform on AWS for real-time streaming data ingestion, processing, and analysis. It enables you to collect, process, and analyze large streams of data in real time from various sources such as website clickstreams, IoT devices, logs, and social media feeds.

Q. What are the main components of Amazon Kinesis?

Amazon Kinesis consists of the following main components:

- Kinesis Data Streams: Allows you to build custom applications that process or analyze streaming data.

- Kinesis Data Firehose: Automatically loads streaming data into AWS data stores and
analytics services for near real-time analysis.
- Kinesis Data Analytics: Allows you to process and analyze streaming data using SQL or Apache Flink.
ee

Q. What is a Kinesis Data Stream?


A Kinesis Data Stream is a scalable and durable real-time data streaming service that
enables you to continuously collect and process large streams of data records in
real-time. Data records in a stream are distributed across multiple shards, allowing for
parallel processing and high throughput.
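
A minimal boto3 sketch of a producer writing records to a stream; the partition key determines which shard each record lands on. The stream name and payload are placeholders:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Write a clickstream event; records with the same partition key go to the same shard.
event = {"user_id": "u-123", "page": "/checkout", "ts": "2024-05-01T12:00:00Z"}
resp = kinesis.put_record(
    StreamName="clickstream-events",       # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
print(resp["ShardId"], resp["SequenceNumber"])
```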

Q. What is a Kinesis Data Firehose?

Kinesis Data Firehose is a fully managed service that automatically loads streaming data into AWS data stores and analytics services. It can capture, transform, and load streaming data into destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk.

Q. What is a Kinesis Data Analytics application?

A Kinesis Data Analytics application is a real-time analytics application that allows you to process and analyze streaming data using SQL or Apache Flink. You can use Kinesis Data Analytics to run SQL queries, aggregate data, detect anomalies, and perform real-time analytics on streaming data.

Q. How does Amazon Kinesis ensure scalability and fault tolerance?

Amazon Kinesis ensures scalability and fault tolerance by automatically distributing data records across multiple shards in a data stream. Each shard can handle a specific throughput of data records, and Kinesis dynamically scales the number of shards based on the incoming data rate. Additionally, Kinesis replicates data records across multiple Availability Zones to ensure durability and fault tolerance.


Q. What are the use cases for Amazon Kinesis?


Some common use cases for Amazon Kinesis include:

- Real-time analytics and dashboarding

- Clickstream analysis and user behavior tracking

- Log and event data ingestion and analysis

- Internet of Things (IoT) data processing and analytics

- Fraud detection and anomaly detection

Q. How does Kinesis Data Firehose differ from Kinesis Data Streams?

Kinesis Data Streams is a low-level service that allows you to build custom applications for processing and analyzing streaming data. It provides full control over data processing and allows for custom data transformations. On the other hand, Kinesis Data Firehose is a fully managed service that simplifies the process of loading streaming data into AWS data stores and analytics services. It automates the data ingestion process and supports out-of-the-box integrations with various AWS services.

Q. How can you monitor Amazon Kinesis?

You can monitor Amazon Kinesis using Amazon CloudWatch, which provides metrics, logs, and alarms for Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics. CloudWatch allows you to monitor data ingestion rates, throughput, latency, and error rates for your Kinesis applications.

Q. What are the best practices for designing Amazon Kinesis applications?

Some best practices for designing Amazon Kinesis applications include:

- Properly sizing shards to handle the expected data throughput.

- Implementing retries and error handling mechanisms to handle transient failures.

- Using fine-grained partition keys to evenly distribute data across shards.

- Monitoring application performance and scaling resources based on workload
patterns.

- Encrypting data in transit and at rest to ensure data security and compliance.

Amazon Athena

Amazon Athena is an interactive query service provided by AWS that allows you to analyze data in Amazon S3 using standard SQL queries. It enables you to query data stored in S3 without the need for infrastructure provisioning or data movement.

Q. How does Amazon Athena work?


Amazon Athena works by executing SQL queries against data stored in Amazon S3. It uses a serverless architecture, which means that you don't need to provision or manage any infrastructure. Athena uses Presto, an open-source distributed SQL query engine, under the hood to execute SQL queries in parallel across multiple data files stored in S3.
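
A minimal boto3 sketch that submits a query and polls for its completion; the database, table, and results bucket are placeholders:

```python
import time

import boto3

athena = boto3.client("athena")

# Submit a query; results are written to the S3 output location as CSV.
start = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},                      # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical bucket
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes, then report its final state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
print(status["State"], status.get("StateChangeReason", ""))
```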

Q. What types of data formats does Amazon Athena support?

Amazon Athena supports various data formats, including:

- Apache Parquet

- Apache ORC

- Apache Avro

- JSON

- CSV

- TSV
Q. What are the main use cases for Amazon Athena?

Some common use cases for Amazon Athena include:

- Ad-hoc querying and analysis of data stored in Amazon S3


- Log analysis and troubleshooting


- Data exploration and visualization


- Business intelligence (BI) and reporting

- ETL (Extract, Transform, Load) data processing

Q. How does Amazon Athena handle data partitioning?

Amazon Athena supports data partitioning, which can significantly improve query performance and reduce costs. Data partitioning involves organizing data in Amazon S3 into directories based on one or more partition keys. Athena automatically prunes partitions based on query predicates, allowing it to scan only the relevant partitions when executing queries.

Q. What are the benefits of using Amazon Athena?

Benefits of using Amazon Athena include:

- Serverless architecture: No need to provision or manage any infrastructure.
- Pay-per-query pricing: You only pay for the queries you run, with no upfront costs or commitments.

- Scalability: Athena can automatically scale to handle large volumes of data and
concurrent queries.
D

- Integration: It integrates seamlessly with other AWS services, such as Amazon S3,
AWS Glue, and AWS Identity and Access Management (IAM).

Q. Can you explain the difference between Amazon Athena and Amazon Redshift?

Amazon Athena and Amazon Redshift are both data query services provided by AWS,
but they have different use cases and characteristics. Amazon Athena is ideal for
ad-hoc querying and analysis of data stored in Amazon S3 using SQL queries, while
Amazon Redshift is a fully managed data warehousing service optimized for analytics
workloads, with support for large-scale data storage, data loading, and complex queries.

Q. How can you optimize query performance in Amazon Athena?

Ways to optimize query performance in Amazon Athena include:

- Partitioning data based on query patterns

- Using columnar data formats like Parquet or ORC
- Reducing the number of columns scanned by projecting only the necessary columns

- Filtering data early in the query using WHERE clauses


- Using the appropriate data types and data compression techniques


Q. How does Amazon Athena handle security and access control?

Amazon Athena integrates with AWS Identity and Access Management (IAM) to control access to data and resources. You can use IAM policies to grant or deny access to specific Athena actions and resources based on users, groups, or roles. Athena also supports encryption of data at rest and in transit for enhanced security.

Q. What are some limitations of Amazon Athena?

Some limitations of Amazon Athena include:

- No support for updates or deletes: Athena is read-only and does not support modifying
data.

- Limited data types: Athena supports a subset of SQL data types and functions.

- Performance variability: Query performance can be impacted by factors such as data volume, complexity of queries, and data partitioning strategies.

Amazon Redshift
as
Amazon Redshift is a fully managed data warehousing service provided by AWS. It allows you to analyze large datasets using standard SQL queries and provides high-performance, scalable data warehousing capabilities.

Q. How does Amazon Redshift differ from traditional relational databases?

Amazon Redshift differs from traditional relational databases in several ways, including:


- Columnar storage: Redshift stores data in a columnar format, which improves query
performance and reduces storage requirements.

- Massively parallel processing (MPP): Redshift distributes data and query processing
across multiple nodes, enabling high concurrency and scalability.

- Optimized for analytics: Redshift is optimized for running complex analytics queries on
large datasets, whereas traditional relational databases are more suited for
transactional workloads.

Q. What are the main components of Amazon Redshift?

The main components of Amazon Redshift include:

- Cluster: A Redshift cluster consists of one or more compute nodes, each containing
CPU, memory, and storage.

- Leader node: Manages client connections and query execution plans.
- Compute nodes: Store and process data in parallel.

- Node slices: Each compute node is divided into slices, which represent a portion of the
node's CPU and storage.

Q. What types of data sources can you load into Amazon Redshift?

You can load data into Amazon Redshift from various sources, including:


- Amazon S3: Load data from files stored in Amazon S3 using the COPY command.

- Amazon DynamoDB: Use the COPY command to load data from DynamoDB tables.

- Amazon EMR: Use the COPY command to load data directly from an EMR cluster's Hadoop Distributed File System (HDFS).
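
As an illustration of the S3 option, a minimal sketch using the Redshift Data API to run a COPY command; the cluster, database, user, table, S3 path, and IAM role are placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

# COPY loads files from S3 into a Redshift table in parallel across the cluster's slices.
copy_sql = """
    COPY analytics.sales
    FROM 's3://example-data-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="example-redshift-cluster",  # hypothetical cluster name
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print(resp["Id"])  # statement ID; use describe_statement(Id=...) to check status
```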

Q. How does Amazon Redshift handle data compression and distribution?

Amazon Redshift uses data compression and distribution techniques to improve query performance and reduce storage requirements. It supports multiple compression encodings and automatically selects the most appropriate encoding based on the data type and distribution. Redshift also distributes data across slices based on a distribution style (e.g., key distribution, even distribution) to optimize query execution.

Q. What are some best practices for optimizing query performance in Amazon
Redshift?

Some best practices for optimizing query performance in Amazon Redshift include:
- Choosing the appropriate distribution style for tables based on query patterns.

- Analyzing query execution plans and using sort and distribution keys to minimize data movement.

- Using column compression and encoding to reduce storage requirements.


- Regularly vacuuming and analyzing tables to reclaim disk space and update statistics.

- Monitoring system performance using Amazon CloudWatch metrics and query monitoring features.

Q. How does Amazon Redshift handle concurrency and scalability?

Amazon Redshift uses a shared-nothing architecture and massively parallel processing (MPP) to handle concurrency and scalability. It can scale horizontally by adding additional compute nodes to a cluster and supports high levels of concurrency by parallelizing query execution across nodes and slices.

Q. What is the difference between dense compute and dense storage node types in Amazon Redshift?

Dense compute and dense storage are two types of node configurations available in
Amazon Redshift:

- Dense compute: Optimized for compute-intensive workloads and provides high-performance query processing capabilities.


- Dense storage: Optimized for storage-intensive workloads and provides high storage capacity at a lower cost per terabyte.

Q. How does Amazon Redshift handle backups and data durability?

Amazon Redshift automatically takes incremental backups of your cluster and stores them in Amazon S3. You can also manually trigger snapshots to create point-in-time backups. Redshift uses replication and checksums to ensure data durability and automatically replaces failed drives or nodes to maintain high availability.

Q. How does Amazon Redshift integrate with other AWS services?

Amazon Redshift integrates with other AWS services to enhance its capabilities,
including:

- Amazon S3: Load data into Redshift from S3 and export query results to S3.

- AWS Glue: Catalog metadata and orchestrate ETL workflows for Redshift.

- Amazon EMR: Analyze data stored in Redshift using EMR clusters running Apache
Spark or other big data frameworks.
- AWS Lambda: Trigger Lambda functions in response to events in Redshift, such as data loading or query completion.


Amazon RDS for PostgreSQL

Amazon RDS for PostgreSQL is a managed relational database service provided by AWS. It allows you to deploy, operate, and scale PostgreSQL databases in the cloud.

Q. How does Amazon RDS for PostgreSQL differ from Amazon Redshift?

- Unlike Amazon Redshift, which is optimized for analytics (OLAP) workloads, Amazon RDS for PostgreSQL is a general-purpose relational database service. It is best suited to OLTP (Online Transaction Processing) workloads and light analytical queries, while large-scale OLAP and data warehousing workloads are better served by Redshift.

Q. What are the key features of Amazon RDS for PostgreSQL?

- Key features of Amazon RDS for PostgreSQL include automated backups and snapshots, high availability with Multi-AZ deployments, read replicas for read scaling, and support for various PostgreSQL extensions.
Q. How does data replication work in Amazon RDS for PostgreSQL?

- Amazon RDS for PostgreSQL supports both synchronous and asynchronous replication. Multi-AZ deployments use synchronous replication to a standby instance in a different Availability Zone for high availability, while read replicas use asynchronous streaming replication for read scaling.
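
A hedged sketch of creating a read replica with boto3; the instance identifiers and instance class are assumptions:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Asynchronous read replica for read scaling; the source keeps serving writes.
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="orders-db-replica-1",  # placeholder replica name
        SourceDBInstanceIdentifier="orders-db",      # placeholder source instance
        DBInstanceClass="db.r6g.large",
    )
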



Q. What are some best practices for optimizing performance in Amazon RDS for
PostgreSQL?

- Best practices for optimizing performance in Amazon RDS for PostgreSQL include
choosing the appropriate instance type and storage configuration, tuning PostgreSQL
parameters, and monitoring database performance using CloudWatch metrics.
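
As one example of parameter tuning, RDS PostgreSQL parameters are changed through a DB parameter group rather than by editing postgresql.conf directly. A hedged sketch; the group name and values are assumptions, not recommendations:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.modify_db_parameter_group(
        DBParameterGroupName="orders-db-pg15",  # placeholder parameter group
        Parameters=[
            {
                "ParameterName": "work_mem",
                "ParameterValue": "65536",       # KB; sizing is workload-dependent
                "ApplyMethod": "immediate",      # dynamic parameter
            },
            {
                "ParameterName": "shared_buffers",
                "ParameterValue": "1048576",     # 8 KB pages; static parameter
                "ApplyMethod": "pending-reboot",
            },
        ],
    )
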

Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. It offers seamless scalability, high availability, and low-latency performance for applications requiring single-digit millisecond response times.

Q. What are the key features of Amazon DynamoDB?

Key features of Amazon DynamoDB include:
- Fully managed: AWS handles administrative tasks such as hardware provisioning, setup, configuration, monitoring, and scaling.

- Scalable: DynamoDB automatically scales to accommodate growing workloads by partitioning data across multiple servers.

- Performance: Delivers consistent single-digit millisecond latency for read and write
operations.

- Flexible data model: Supports both document and key-value data models with
JSON-like syntax.

- Built-in security: Offers encryption at rest and in transit, fine-grained access control,
and integration with AWS Identity and Access Management (IAM).

Q. What are the different types of primary keys supported by DynamoDB?

DynamoDB supports two types of primary keys:

- Partition key (or hash key): A simple primary key composed of a single attribute. DynamoDB uses the partition key's value to determine the partition in which the item is stored.

- Composite primary key (or hash-and-range key): A primary key composed of two attributes, a partition key and a sort key. DynamoDB uses both the partition key and sort key values to determine the partition and the sort order of items within it (see the sketch after this list).
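
A minimal sketch of defining a composite primary key with boto3; the table and attribute names are hypothetical:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Composite primary key: partition key (HASH) plus sort key (RANGE).
    dynamodb.create_table(
        TableName="Orders",
        AttributeDefinitions=[
            {"AttributeName": "customer_id", "AttributeType": "S"},
            {"AttributeName": "order_date", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "customer_id", "KeyType": "HASH"},  # partition key
            {"AttributeName": "order_date", "KeyType": "RANGE"},  # sort key
        ],
        BillingMode="PAY_PER_REQUEST",
    )
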

Q. How does DynamoDB ensure scalability and high availability?



DynamoDB ensures scalability and high availability through partitioning and replication:

- Partitioning: DynamoDB automatically partitions data across multiple servers based on the partition key. This allows DynamoDB to handle large volumes of traffic by distributing the workload evenly across servers.

- Replication: DynamoDB replicates data across multiple Availability Zones within a Region to ensure high availability and fault tolerance. Each write is synchronously replicated across multiple Availability Zones.

Q. What is the difference between provisioned throughput and on-demand capacity modes in DynamoDB?

- Provisioned throughput: In this mode, you specify the read and write capacity units
(RCUs and WCUs) required for your DynamoDB table upfront. You are billed based on
the provisioned capacity, regardless of the actual usage.

- On-demand capacity mode: In this mode, DynamoDB automatically scales read and write capacity based on your actual usage. You pay per request for the capacity consumed (see the sketch after this list).
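
A hedged sketch contrasting the two modes; the table name and capacity numbers are illustrative only:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Provisioned mode: capacity is declared up front in RCUs/WCUs.
    dynamodb.create_table(
        TableName="SessionStore",
        AttributeDefinitions=[{"AttributeName": "session_id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "session_id", "KeyType": "HASH"}],
        BillingMode="PROVISIONED",
        ProvisionedThroughput={"ReadCapacityUnits": 50, "WriteCapacityUnits": 25},
    )

    # Later, the same table can be switched to on-demand (pay per request).
    dynamodb.update_table(TableName="SessionStore", BillingMode="PAY_PER_REQUEST")
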

Q. How does DynamoDB handle consistency?
DynamoDB offers two types of consistency models:

- Eventual consistency: In this model, DynamoDB prioritizes availability and read throughput. All copies of the data become consistent, typically within a second.

- Strong consistency: In this model, DynamoDB provides immediate consistency for read operations. It guarantees that a read returns the most up-to-date data (see the read sketch below).
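
Consistency is chosen per read request; a small sketch reusing the hypothetical Orders table from above:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    key = {"customer_id": {"S": "C-1001"}, "order_date": {"S": "2024-06-01"}}

    # Default: eventually consistent read (cheaper, may be slightly stale).
    dynamodb.get_item(TableName="Orders", Key=key)

    # Strongly consistent read: returns the latest committed value, at twice the RCU cost.
    dynamodb.get_item(TableName="Orders", Key=key, ConsistentRead=True)
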

Q. What are DynamoDB streams?

DynamoDB Streams capture a time-ordered sequence of item-level modifications in a DynamoDB table. Each stream record represents a data modification event, such as an insert, update, or delete operation. DynamoDB Streams can be used to trigger AWS Lambda functions, replicate data across tables, and implement cross-region replication.
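
Streams are enabled per table; a hedged sketch that turns them on and prints the stream ARN that a Lambda trigger would be wired to:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    resp = dynamodb.update_table(
        TableName="Orders",  # hypothetical table
        StreamSpecification={
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",  # record before/after images
        },
    )

    stream_arn = resp["TableDescription"]["LatestStreamArn"]
    print(stream_arn)  # e.g. pass to lambda.create_event_source_mapping(...)
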

Q. How can you monitor and manage DynamoDB?

DynamoDB provides several tools for monitoring and managing tables, including:

- AWS CloudWatch: Monitors and logs metrics such as read/write capacity utilization, error rates, and throttling events.

- AWS DynamoDB Console: Provides a graphical user interface for managing tables,
monitoring performance, and configuring settings.
- AWS Command Line Interface (CLI) and SDKs: Allow programmatic access to DynamoDB for automation, scripting, and integration with other AWS services.

Q. What are some best practices for designing DynamoDB tables?



Some best practices for designing DynamoDB tables include:



- Choose the right partition key to evenly distribute data and avoid hot partitions.

- Use sparse indexes to reduce storage costs and improve query performance.

- Use the right data types to minimize storage space and optimize query performance.

- Implement DynamoDB streams for data change capture and real-time processing.

- Leverage global and local secondary indexes (and, where needed, precomputed or denormalized items) to support different access patterns.

Q. How does DynamoDB encryption work?

DynamoDB provides encryption at rest to protect data stored in tables, and it is enabled by default. It uses AWS Key Management Service (KMS) to manage encryption keys: all data stored in the table, including backups and replicas, is encrypted using an AWS owned key, an AWS managed key, or a customer managed key, depending on the option you choose.

AWS Step functions



AWS Step Functions is a fully managed serverless orchestration service provided by AWS. It allows you to coordinate and orchestrate multiple AWS services and Lambda functions into serverless workflows that you can define and visualize graphically.

Q. How does AWS Step Functions work?

AWS Step Functions works by defining state machines that describe the workflow logic in the JSON-based Amazon States Language (ASL). Each state in the state machine represents a specific task or action to be executed. Step Functions manages the execution and coordination of tasks, handles retries and errors, and provides visibility into the state of each workflow run.
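
A minimal, hedged sketch: a two-state Amazon States Language definition registered with boto3. The role ARN, Lambda ARN, and names are placeholders.

    import json
    import boto3

    sfn = boto3.client("stepfunctions", region_name="us-east-1")

    # ASL definition: invoke a Lambda task, then end successfully.
    definition = {
        "StartAt": "TransformRecord",
        "States": {
            "TransformRecord": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-record",
                "Next": "Done",
            },
            "Done": {"Type": "Succeed"},
        },
    }

    sfn.create_state_machine(
        name="record-pipeline",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",  # placeholder
    )
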

Q. What are the main components of AWS Step Functions?

The main components of AWS Step Functions include:

- State machine: Represents the workflow logic defined using Amazon States Language.

- States: Individual tasks or actions within the state machine.

- Execution: A single run of a state machine.

- Input and output: Data passed between states during execution.

- Execution history: A log of state transitions and events during the execution of a state machine.

Q. What are the benefits of using AWS Step Functions?

Some benefits of using AWS Step Functions include:



- Simplicity: Allows you to build and visualize complex workflows using a graphical
interface.

- Scalability: Automatically scales to handle high volumes of workflow executions.

- Reliability: Handles retries, error handling, and state management automatically.

- Integration: Integrates seamlessly with other AWS services, including Lambda, ECS,
SNS, SQS, and more.

- Visibility: Provides detailed execution logs and monitoring metrics for workflows.

Q. How can you trigger an AWS Step Functions workflow?

AWS Step Functions workflows can be triggered in several ways, including:

- Direct invocation: You can start a state machine execution directly using the Step Functions API (see the sketch after this list).

- Event-based invocation: You can trigger a state machine execution in response to events from other AWS services, such as S3 events, SNS notifications, or Amazon EventBridge (formerly CloudWatch Events) rules.

- Scheduled invocation: You can schedule state machine executions to run at specified intervals using EventBridge scheduled rules.
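
A hedged sketch of direct invocation, reusing the hypothetical state machine registered earlier:

    import json
    import boto3

    sfn = boto3.client("stepfunctions", region_name="us-east-1")

    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:record-pipeline",
        name="run-2024-06-01-001",                # optional, unique per state machine
        input=json.dumps({"record_id": "R-42"}),  # payload passed to the first state
    )
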
pa

Q. What are the different types of states supported by AWS Step Functions?

AWS Step Functions supports several types of states, including:



- Task states: Represent a single unit of work to be performed, such as invoking a Lambda function or calling an AWS service API.

- Choice states: Define conditional branching logic based on the result of a previous
state.

- Parallel states: Enable concurrent execution of multiple branches of execution.

- Wait states: Pause the execution of the state machine for a specified duration or until a specified timestamp.

- Pass states: Simply pass input to output without performing any work.

Q. How does error handling work in AWS Step Functions?

AWS Step Functions provides built-in error handling through Retry and Catch configuration on states. Retry policies on an individual state (for example, a Task) control how transient errors are retried, including the retry interval, backoff rate, and maximum attempts. Catch clauses define fallback transitions so that specific error types can be routed to dedicated error-handling states.
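
A hedged Amazon States Language fragment (expressed here as a Python dict, with hypothetical names) showing Retry and Catch on a Task state:

    # Retry throttling/timeout errors with exponential backoff, then fall back
    # to a cleanup state if the task still fails.
    transform_state = {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-record",
        "Retry": [
            {
                "ErrorEquals": ["Lambda.TooManyRequestsException", "States.Timeout"],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 5,
            }
        ],
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],  # anything not handled above
                "ResultPath": "$.error",        # keep the error alongside the input
                "Next": "HandleFailure",
            }
        ],
        "Next": "Done",
    }
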
V

Q. Can you integrate AWS Step Functions with other AWS services?

Yes, AWS Step Functions integrates seamlessly with other AWS services, including:

- AWS Lambda: Invoke Lambda functions as tasks within Step Functions workflows.

- Amazon ECS: Coordinate container-based workflows using ECS tasks.

- Amazon SNS and Amazon SQS: Send notifications and messages between states in a
state machine.

- AWS Glue: Orchestrate ETL workflows using Glue jobs within Step Functions.

Q. How does AWS Step Functions handle state transitions and retries?

AWS Step Functions manages state transitions and retries automatically based on the defined workflow logic. If a state fails or encounters an error, Step Functions retries the state according to its configured Retry policy. If the maximum number of attempts is exceeded, Step Functions follows the matching Catch clause defined in the state machine, or fails the execution if no error handler matches.

Q. What are some use cases for AWS Step Functions?

Some common use cases for AWS Step Functions include:

- Workflow automation: Orchestrating multi-step business processes and workflows.



- Microservices orchestration: Coordinating interactions between microservices in distributed systems.

- Data processing pipelines: Managing complex ETL (Extract, Transform, Load) workflows for data processing and analysis.

- Application workflows: Implementing stateful workflows for application logic and business processes.

AWS Identity and Access Management (IAM)

AWS Identity and Access Management (IAM) is a web service provided by AWS that enables you to securely control access to AWS services and resources. IAM allows you to manage users, groups, roles, and permissions to securely delegate access to AWS resources.

Q. What are the main components of AWS IAM?
as
The main components of AWS IAM include:

- Users: Represent individuals or entities that interact with AWS services and resources.

- Groups: Collections of users with similar permissions, making it easier to manage access.

- Roles: Define a set of permissions that can be assumed by users or AWS services.

- Policies: Documents that define permissions and are attached to users, groups, or
roles.

- Access keys: Used for programmatic access to AWS services through APIs.

Q. What is an IAM policy?

An IAM policy is a document that defines permissions for actions and resources in AWS. Policies are written in JSON format and can be attached to IAM identities (users, groups, roles) or to resources to specify who has access to what. Policies can grant or deny permissions, and they are evaluated when an IAM identity attempts to access a resource.
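
A hedged sketch of a small identity-based policy, created and attached with boto3; the bucket, policy, and group names are placeholders:

    import json
    import boto3

    iam = boto3.client("iam")

    # Read-only access to one S3 bucket -- an example of least privilege.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::example-analytics-bucket",
                    "arn:aws:s3:::example-analytics-bucket/*",
                ],
            }
        ],
    }

    policy = iam.create_policy(
        PolicyName="AnalyticsBucketReadOnly",
        PolicyDocument=json.dumps(policy_document),
    )

    # Attach the managed policy to a group so all members inherit it.
    iam.attach_group_policy(
        GroupName="data-analysts",  # placeholder group
        PolicyArn=policy["Policy"]["Arn"],
    )
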

Q. What is the principle of least privilege?

The principle of least privilege is a security best practice that states that users, groups, and roles should only be granted the minimum level of access necessary to perform their tasks. By following this principle, you reduce the risk of unauthorized access and potential security vulnerabilities.

Q. How do you grant access to an AWS resource using IAM?

To grant access to an AWS resource using IAM, you create an IAM policy that specifies the desired permissions and attach the policy to the IAM identity (user, group, or role) that needs access to the resource. The policy defines which actions are allowed or denied and which resources the actions can be performed on.

Q. What is IAM role chaining?

IAM role chaining is the process of assuming multiple IAM roles in sequence, typically to access resources across multiple AWS accounts. It involves granting one role permission to assume another role, so a session that starts with the first role can temporarily obtain the permissions of the second role.

Q. How do you secure access to IAM resources?

To secure access to IAM resources, you can implement the following best practices:

- Enable multi-factor authentication (MFA) for IAM users with elevated privileges.

- Regularly review and audit IAM policies and permissions to ensure they align with security requirements.

- Use IAM roles and temporary security credentials instead of long-term access keys
whenever possible.
V

- Enable AWS CloudTrail to monitor and log API activity for IAM actions.

- Apply IAM password policies to enforce strong password requirements for IAM users.

Q. What is IAM Federation?

IAM Federation is the process of linking your existing identity management system with AWS IAM so that users can access AWS resources using their existing corporate credentials. This can be achieved through integration with identity providers (IdPs) such as Active Directory, LDAP, or SAML-based identity providers.

Q. How do you rotate IAM access keys?

To rotate IAM access keys, you can follow these steps (a scripted version is sketched after the list):

1. Generate a new access key for the IAM user.

2. Update any applications or scripts that use the old access key with the new access key.

3. Test the updated applications or scripts to ensure they work correctly with the new access key.

4. Deactivate and then delete the old access key from the IAM user once you confirm that the new access key is working properly.
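
A hedged sketch of the same rotation with boto3; the user name and old key id are placeholders:

    import boto3

    iam = boto3.client("iam")
    user = "etl-service-user"          # placeholder IAM user
    old_key_id = "AKIAOLDKEYEXAMPLE"   # placeholder existing key id

    # 1. Create the replacement key (a user may hold at most two keys at once).
    new_key = iam.create_access_key(UserName=user)["AccessKey"]
    print("New key id:", new_key["AccessKeyId"])

    # 2-3. Roll the new credentials out to applications and verify them out of band.

    # 4. Disable the old key first, then delete it once everything still works.
    iam.update_access_key(UserName=user, AccessKeyId=old_key_id, Status="Inactive")
    iam.delete_access_key(UserName=user, AccessKeyId=old_key_id)
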

Q. What is IAM policy evaluation logic?

IAM policy evaluation follows a deny-by-default model: all requests are implicitly denied unless explicitly allowed by an applicable policy. When a request is made to AWS, IAM evaluates all policies attached to the identity and the resource involved in the request. If any policy explicitly denies the requested action, the request is denied regardless of any allows. Otherwise, the request is allowed only if at least one policy explicitly allows the action; if no policy mentions it, the implicit deny applies.



Amazon Simple Notification Service (SNS)



Amazon Simple Notification Service (SNS) is a fully managed messaging service provided by AWS. It enables you to send messages or notifications to distributed systems, mobile devices, or other AWS services in a highly reliable and scalable manner.

Q. How does Amazon SNS work?

Amazon SNS works by allowing you to create topics to which subscribers can subscribe. When a message is published to a topic, Amazon SNS delivers copies of the message to each subscribed endpoint, such as HTTP/S, email, SMS, Lambda, SQS, or mobile push notifications.
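
A minimal publish/subscribe sketch with boto3; the topic name, queue ARN, and email address are placeholders:

    import boto3

    sns = boto3.client("sns", region_name="us-east-1")

    topic_arn = sns.create_topic(Name="order-events")["TopicArn"]

    # Fan out to two kinds of endpoints.
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="ops@example.com")
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint="arn:aws:sqs:us-east-1:123456789012:order-events-queue",  # placeholder
    )

    # Every confirmed subscriber receives a copy of this message.
    sns.publish(TopicArn=topic_arn, Subject="OrderCreated", Message='{"order_id": "O-123"}')
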

Q. What are the main components of Amazon SNS?

The main components of Amazon SNS include:

- Topics: Logical channels for publishing messages.

- Subscriptions: Endpoints that receive messages published to a topic.

- Messages: The content sent from publishers to subscribers via topics.
- Publishers: Entities that send messages to SNS topics.

- Subscribers: Entities that receive messages published to SNS topics.



Q. What are the different delivery protocols supported by Amazon SNS?



Amazon SNS supports various delivery protocols, including:



- HTTP/S

- Email

- SMS

- Mobile push notifications (Apple Push Notification Service, Firebase Cloud Messaging (formerly Google Cloud Messaging), etc.)

- Amazon SQS

- AWS Lambda

Q. How does message filtering work in Amazon SNS?

Message filtering in Amazon SNS allows subscribers to receive only the messages that are of interest to them. You can set filter policies on subscriptions to specify which messages are delivered, based on message attributes or the message body.
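
A hedged sketch of attribute-based filtering; the topic and subscription ARNs and the attribute name are placeholders:

    import json
    import boto3

    sns = boto3.client("sns", region_name="us-east-1")

    subscription_arn = (
        "arn:aws:sns:us-east-1:123456789012:order-events:"
        "11112222-3333-4444-5555-666677778888"  # placeholder subscription
    )

    # Deliver only messages whose "event_type" attribute is "ORDER_CANCELLED".
    sns.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName="FilterPolicy",
        AttributeValue=json.dumps({"event_type": ["ORDER_CANCELLED"]}),
    )

    # Publishers set the attribute the policy matches on.
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",
        Message='{"order_id": "O-123"}',
        MessageAttributes={
            "event_type": {"DataType": "String", "StringValue": "ORDER_CANCELLED"}
        },
    )
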
V

Q. What are some use cases for Amazon SNS?

Some common use cases for Amazon SNS include:

- Push notifications to mobile devices or web browsers.



- Event-driven communication between microservices in distributed systems.

- Sending alerts and notifications for system monitoring and alarms.

- Decoupling communication between application components using publish/subscribe messaging.

Q. How does message ordering work in Amazon SNS?

Standard SNS topics do not guarantee the order in which messages are delivered. If ordering matters, you can use SNS FIFO topics (typically paired with SQS FIFO queues), which preserve ordering within a message group, or include sequencing information in message attributes so subscribers can reorder messages themselves.

Q. What are message attributes in Amazon SNS?

Message attributes are key-value pairs associated with messages published to Amazon SNS topics. They allow you to attach additional metadata or context to messages and enable advanced message filtering and processing based on attribute values (as in the filter-policy sketch above).
as
Q. How does message delivery retry and error handling work in Amazon SNS?

Amazon SNS automatically retries message delivery for transient failures, such as network errors or endpoint throttling. If delivery attempts fail repeatedly, Amazon SNS can move the message to a dead-letter queue (DLQ) attached to the subscription, or record the delivery failure so it can be investigated.

Q. How can you secure access to Amazon SNS?

You can secure access to Amazon SNS by using AWS Identity and Access Management (IAM) and topic policies to control permissions, encrypting messages in transit using HTTPS, and enabling server-side encryption for messages stored in SNS topics.

Amazon Simple Queue Service (SQS)

Amazon Simple Queue Service (SQS) is a fully managed message queuing service provided by AWS. It enables you to decouple and scale microservices, distributed systems, and serverless applications by reliably passing messages between components.

Q. How does Amazon SQS work?

Amazon SQS works by allowing you to create message queues, where messages are temporarily stored until they are processed by a consumer. Producers send messages to the queue, and consumers retrieve messages from the queue to process them asynchronously.
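
A minimal producer/consumer sketch with boto3; the queue name is a placeholder:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.create_queue(QueueName="order-tasks")["QueueUrl"]

    # Producer side.
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": "O-123"}')

    # Consumer side: long polling, then explicit delete once processing succeeds.
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        print("processing", msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
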

Q. What are the main components of Amazon SQS?

The main components of Amazon SQS include:
- Message: The content sent from producers to consumers via queues.

- Queue: A container that holds messages until they are processed by consumers.

- Producer: Entities that send messages to SQS queues.



- Consumer: Entities that retrieve messages from SQS queues for processing.

Q. What are the types of queues supported by Amazon SQS?

Amazon SQS supports two types of queues:

- Standard queues: Provide best-effort ordering and at-least-once delivery of messages.
They offer nearly unlimited throughput and can scale horizontally to handle high
volumes of messages.

- FIFO (First-In-First-Out) queues: Guarantee exactly-once processing and preserve the order in which messages are sent within a message group. They are suitable for applications that require strict message ordering and deduplication (see the sketch after this list).

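
A hedged sketch of a FIFO queue; names are placeholders, and note the required .fifo suffix:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")

    queue_url = sqs.create_queue(
        QueueName="payments.fifo",                 # FIFO queue names must end in .fifo
        Attributes={
            "FifoQueue": "true",
            "ContentBasedDeduplication": "false",  # explicit dedup ids are supplied below
        },
    )["QueueUrl"]

    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody='{"payment_id": "P-789"}',
        MessageGroupId="customer-C-1001",          # ordering is preserved per group
        MessageDeduplicationId="P-789-attempt-1",  # duplicates within 5 minutes are dropped
    )
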
Q. How does message visibility timeout work in Amazon SQS?

The message visibility timeout in Amazon SQS determines how long a message remains invisible to other consumers after it has been retrieved by a consumer for processing. If processing exceeds the visibility timeout, the message becomes visible again in the queue and can be retried by another consumer.



Q. What is the purpose of dead-letter queues (DLQs) in Amazon SQS?

Dead-letter queues (DLQs) in Amazon SQS are used to capture and store messages that
cannot be processed successfully after a certain number of retries. DLQs enable you to
isolate and investigate failed messages separately from the main processing flow.
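
A hedged sketch wiring a DLQ to a source queue via a redrive policy; the queue names and URL are placeholders:

    import json
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")

    dlq_url = sqs.create_queue(QueueName="order-tasks-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # After 5 failed receives, SQS moves the message to the DLQ.
    sqs.set_queue_attributes(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/order-tasks",  # placeholder
        Attributes={
            "RedrivePolicy": json.dumps(
                {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
            )
        },
    )
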

Q. How does message deduplication work in Amazon SQS FIFO queues?

Message deduplication in Amazon SQS FIFO queues prevents duplicate messages from being sent or processed. When a message with a specific message deduplication ID is sent to a FIFO queue, SQS checks whether a message with the same ID has already been accepted within the deduplication interval (five minutes). If a matching message is found, the new message is treated as a duplicate and is not delivered again.
V

Q. What are some use cases for Amazon SQS?



Some common use cases for Amazon SQS include:



- Decoupling components in distributed systems to improve scalability and reliability.



- Implementing message-driven architectures for event-driven processing and asynchronous communication.

- Offloading time-consuming or background tasks from synchronous applications to asynchronous queues.

- Handling bursts of traffic and load spikes in web applications and microservices.

Q. How does Amazon SQS handle message delivery retries and error handling?

If a consumer fails to process and delete a message, the message becomes visible again once its visibility timeout expires, so it can be retried by the same or another consumer. Once a message has been received more times than the redrive policy's maxReceiveCount allows, SQS moves it to the configured dead-letter queue (DLQ) so it can be isolated and investigated.

Q. How can you secure access to Amazon SQS?

You can secure access to Amazon SQS by using AWS Identity and Access Management (IAM) to control user permissions, encrypting messages in transit using HTTPS (and at rest with server-side encryption), and restricting access to queues using resource-based queue policies.