ADF - Realtime Problem Statement

The document outlines the architecture of Spark, detailing the roles of the Driver Program, Cluster Manager, Worker Node, and Executor in managing and executing applications on a cluster. It also presents a system design for an automated stock market data analysis tool that retrieves, sorts, and merges stock data with relevant news, culminating in a graphical dashboard for users. Additionally, it discusses data orchestration requirements and capabilities within a data factory environment, emphasizing automation and monitoring of pipeline executions.

Uploaded by

Abhishek Dutta

Spark Architecture

Architecture
Driver Program

The Driver Program is the process that runs the application's main() function and
creates the SparkContext object.
The SparkContext coordinates a Spark application, which runs as an independent set
of processes on a cluster.

To run on a cluster, the SparkContext connects to one of several types of cluster
manager and then performs the following tasks:

• It acquires executors on nodes in the cluster.
• It sends the application code to the executors. The application code is defined by the JAR or Python files passed to the SparkContext.
• Finally, it sends tasks to the executors to run.
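The driver/executor split described above can be illustrated with a toy model. This is only an analogy written with the Python standard library, not Spark's API: the thread pool stands in for executors acquired from a cluster manager, and the names run_job, task, num_partitions and num_executors are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_partitions, num_executors):
    """Toy driver: partition the data, run one task per partition, combine results."""
    # Driver side: split the input into partitions (round-robin slices).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]

    def task(partition):
        # Executor side: each task processes one partition independently.
        return sum(x * x for x in partition)

    # The pool is an assumption standing in for the acquired executors.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))

    # Driver side: combine the per-task results into the final answer.
    return sum(partial_results)


total = run_job(list(range(10)), num_partitions=3, num_executors=2)
print(total)
```

In a real Spark application the partitioning, task scheduling and result collection are handled by the SparkContext and the cluster manager rather than by user code.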


Cluster Manager
• The role of the cluster manager is to allocate resources
across applications. Spark can run on several types of
cluster manager.
• Supported cluster managers include Hadoop YARN,
Apache Mesos and the Standalone Scheduler.
• The Standalone Scheduler is Spark's own built-in
cluster manager, which makes it possible to install Spark
on an empty set of machines.
Worker Node
• The worker node is a slave node of the cluster.
• Its role is to run the application code in the cluster.


Executor
• An executor is a process launched for an application on
a worker node.
• It runs tasks and keeps data in memory or on disk
across them.
• It reads data from and writes data to external sources.
• Every application has its own executors.
Architecture Design for Stock Market Data
Problem Statement
On a daily basis, customers should receive the top 50 stocks that are in the news, along with their volumes and history, so that they can analyze them and make wise decisions before investing.

We want an automated system that does the following:

• Gets data from the US stock market daily, in large volume
• Identifies and extracts the top 50 stocks from the large-volume dataset
• Combines the top 50 stocks with relevant news, to help understand stock rates on a historical basis
• Maintains the data in the proposed system
• Generates graphical representations (a dashboard) to display the above data for the current date or any past date
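High Level DFD

[Figure: high-level data flow diagram. Stock Data Source APIs and Stock News feed the Integration & Report System, which provides a Graphical Dashboard to the Stock Market Customer.]

Level-1 DFD

[Figure: level-1 data flow diagram with three processes.
1. Pull Data for Current Date: reads from the Stock Data Source APIs and outputs unsorted raw stock data.
2. Sorting & Take Top(N): takes the unsorted raw stock data and outputs sorted Top(N) stock data.
3. Merge the sorted data with relevant news: combines the sorted Top(N) stock data with raw stock news data from Stock News.
The result, today's stock data with relevant news, goes to Dashboard Reporting.]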
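The sort, take Top(N) and merge steps of the Level-1 DFD could be sketched as follows. This is a minimal illustration under stated assumptions: the source does not say how the top stocks are ranked, so ranking by traded volume is an assumption, and the field names (symbol, volume, close, headline), the function names and the sample records are all hypothetical.

```python
from operator import itemgetter


def top_n_by_volume(stocks, n):
    """Step 2: sort raw stock rows by volume (assumed criterion) and keep the top n."""
    return sorted(stocks, key=itemgetter("volume"), reverse=True)[:n]


def merge_with_news(top_stocks, news_items):
    """Step 3: attach the matching news headlines to each top stock."""
    news_by_symbol = {}
    for item in news_items:
        news_by_symbol.setdefault(item["symbol"], []).append(item["headline"])
    # Stocks without any news get an empty list rather than being dropped.
    return [
        {**stock, "news": news_by_symbol.get(stock["symbol"], [])}
        for stock in top_stocks
    ]


# Hypothetical sample data standing in for the stock and news API responses.
raw_stocks = [
    {"symbol": "AAA", "volume": 1200, "close": 10.5},
    {"symbol": "BBB", "volume": 5400, "close": 22.0},
    {"symbol": "CCC", "volume": 3100, "close": 7.25},
]
raw_news = [
    {"symbol": "BBB", "headline": "BBB example headline"},
    {"symbol": "AAA", "headline": "AAA example headline"},
]

report = merge_with_news(top_n_by_volume(raw_stocks, 2), raw_news)
for row in report:
    print(row["symbol"], row["volume"], row["news"])
```

In the proposed system these steps would run inside the data integration pipeline; with n set to 50 the same logic yields the daily top 50 list.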
ER Diagram
System Architecture

[Figure: Source (Stock API, source data) → Data Factory Orchestration (transformed data) → Sink (structured data).]

Technical Architecture

[Figure: Source (Stock API, Source Linked Service, Source Dataset) → Pipeline (Data Flow Transformations) → Sink (Sink CSV Dataset, Sink Linked Service, structured data) → Generate Reports.]
Triggers
• Schedule Trigger
• Tumbling Window Trigger
• Event Trigger

Schedule Trigger
• Runs on a calendar/clock
• Supports periodic and specific times
• Trigger to Pipeline is many to many
• Can only be scheduled to start at a future time

Tumbling Window Trigger
• Runs at periodic intervals
• Windows are fixed-size and non-overlapping
• Can be scheduled for past windows/slices
• Trigger to Pipeline is one to one

Event Trigger
• Runs in response to events
• Events can be the creation or deletion of Blobs/Files
• Trigger to Pipeline is many to many
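The tumbling window behaviour described above (fixed-size, non-overlapping windows that may also cover past dates) can be illustrated with a small sketch. It only models how the windows are laid out; it is not Data Factory's trigger API, and the function name tumbling_windows and the example dates are hypothetical.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Yield (window_start, window_end) pairs of fixed size that tile [start, end)."""
    if size <= timedelta(0):
        raise ValueError("window size must be positive")
    window_start = start
    # Each window begins exactly where the previous one ended, so none overlap.
    while window_start + size <= end:
        yield window_start, window_start + size
        window_start += size


# The start time can lie in the past, giving one window per past slice (backfill).
windows = list(
    tumbling_windows(datetime(2024, 1, 1), datetime(2024, 1, 4), timedelta(days=1))
)
for window_start, window_end in windows:
    print(window_start.date(), "to", window_end.date())
```

Each window would correspond to one pipeline run that processes only the data belonging to that slice.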
Data Flows Features
• Code-free data transformations
• Executed on Data Factory managed Databricks Spark clusters
• Benefit from Data Factory scheduling and monitoring capabilities

Data Flows Types

Data Flows Limitations
• Only available in some regions
https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview#available-regions
• Limited set of connectors available
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-source#supported-sources
• Not suitable for very complex logic
Data Orchestration

Requirements
• Pipeline executions are fully automated
• Pipelines run at regular intervals or when an event occurs
• Activities run only once their upstream dependencies have been satisfied
• Execution progress and issues are easy to monitor

Data Factory Capability
• Dependency between activities inside a pipeline
• Dependency between pipelines within a parent pipeline
• Dependency between triggers (tumbling window triggers only)
• Custom-made solution
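The requirement that an activity runs only after its upstream dependencies are satisfied can be illustrated with a topological ordering. This is a minimal standard-library sketch (Python 3.9+ for graphlib), not how Data Factory itself is implemented; the activity names mirror the Level-1 DFD steps and, like run_in_dependency_order, are hypothetical.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(dependencies, actions):
    """Run each activity only after all of its upstream activities have run."""
    executed = []
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        actions[name]()
        executed.append(name)
    return executed


# Maps each activity to the set of activities it depends on (assumed names).
dependencies = {
    "sort_top_n": {"pull_stock_data"},
    "merge_news": {"sort_top_n", "pull_news"},
    "generate_report": {"merge_news"},
}

log = []
activity_names = [
    "pull_stock_data", "pull_news", "sort_top_n", "merge_news", "generate_report",
]
# Each placeholder action just records that it ran.
actions = {name: (lambda n=name: log.append(n)) for name in activity_names}

order = run_in_dependency_order(dependencies, actions)
print(order)
```

A cycle in the dependency map raises graphlib.CycleError, which corresponds to an invalid pipeline definition.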
