S - Hadoop Ecosystem
1. hadoop
1.1. wtf is hadoop
- wtf is hadoop: Apache Hadoop is a collection of open-source software utilities that facilitates using a
network of many computers to solve problems involving massive amounts of data and computation. It
provides a software framework for distributed storage and processing of big data using the MapReduce
programming model.
1.2. HDFS
Purpose: collects data from data ingestion tools and distributes it across multiple nodes, managing them in a cluster
> splits/partitions/shards files into blocks (128 MB by default in current Hadoop; 64 MB in older versions), creates replicas of the blocks, and stores them on different machines/nodes
Terminology
rack: the collection of about 40 to 50 data nodes using the same network switch
node:
primary/name node: this node regulates file access for clients and maintains, manages, and assigns tasks to the data nodes; it keeps the metadata of the files in RAM for quick access
> the name node has a rack awareness mechanism: it reads/writes data to/from the rack closest to the client, reducing network traffic; besides that, it also keeps replicas in different racks
Question: what is the difference between a cluster and a rack? > a cluster can contain many racks
secondary/data node: these nodes are the actual workers in the HDFS system and take instructions from the primary node
replication factor: the number of times each block is copied
cluster: one cluster has one name node and many data nodes
uses YARN as the resource manager across the cluster
HDFS uses a command line interface to interact with Hadoop
key features:
does not require expensive hardware
supports storing large amounts of data
replication
fault tolerant
scalable
portable
write once, read many: HDFS does not allow editing written files, but new data can be appended to them
read process: the client sends a request to the name node, the name node verifies the client, then provides the client with the locations of the required data, and the client reads the file from those data nodes
write process: the client sends a request to the name node, the name node verifies the client, then checks whether the file already exists; if it does, an error is returned
1.3. mapreduce
purpose: Java-based; processes data by splitting it into smaller units and assigning them to different nodes to process in parallel; largely replaced by `Spark` today
Can be coded in many programming languages like Java, C++, Python, Ruby and R
Consists of a Map task and a Reduce task
Map: processes data into key-value pairs, then sorts and organizes them further
Reduce: aggregates and computes a set of results and produces the final output
MapReduce keeps track of its tasks by creating unique keys
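A minimal word-count sketch in the Hadoop Streaming style (my choice of mechanism, since Streaming is the usual way to plug Python scripts into the Map and Reduce tasks; mapper.py and reducer.py are hypothetical file names, each reading stdin and writing tab-separated key/value lines):
# mapper.py -- Map task: emit "<word>\t1" for every word read from stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Reduce task: input arrives sorted by key, so counts for the same word are adjacent
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")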
1.4. YARN
Yet Another Resource Negotiator: the resource manager that comes with Hadoop; prepares Hadoop for batch, stream, interactive, and graph processing. It is the default resource manager for big data apps and is slowly being replaced by `kubernetes` in some deployments
1.5. Hive
Hive is data warehouse software within Hadoop that is designed for reading, writing and
managing tabular-type datasets and data analysis:
HiveQL is nearly the same as SQL
supports data cleaning and filtering
suited for static data analysis (SQL databases can work with static/dynamic data)
designed for write-once, read-many workloads
does not require schema verification when loading data
supports partitioning (dividing a table by column values, EX: date, city); see the sketch below
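A rough sketch of HiveQL-style partitioning, run here through Spark's Hive support because the later notes use PySpark (table and column names are made up; assumes a reachable Hive metastore, or Spark's embedded one):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-sketch").enableHiveSupport().getOrCreate()

# Partitioned table: rows are physically divided by the value of the partition column (here a date string)
spark.sql("""
    CREATE TABLE IF NOT EXISTS page_views (url STRING, user_id INT)
    PARTITIONED BY (view_date STRING)
""")

# Queries that filter on view_date only scan the matching partitions
spark.sql("SELECT url, COUNT(*) AS hits FROM page_views WHERE view_date = '2024-01-01' GROUP BY url").show()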
1.6. Hbase
Column-oriented non-relational database management system runs on top of HDFS
Provides a fault-tolerant way of storing sparse datasets
Works well with real-time data and random read and write access to Big Data
used for write-heavy apps
requires predefining the table schema and specifying column families (a column family groups related columns, similar to a document containing objects)
new columns can be added to a column family at any time
compared with HDFS: HDFS is write-once/read-many file storage with no random access, while HBase adds low-latency random read/write access on top of it
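A hedged sketch using the third-party happybase client (my choice, not mentioned in the notes; it talks to HBase through the Thrift server, and the host/port below are hypothetical) to show the "column family declared up front, columns added freely" idea:
import happybase  # third-party Python HBase client (pip install happybase)

connection = happybase.Connection("localhost", port=9090)  # hypothetical Thrift server address

# Column families must be predefined; the columns inside each family are free-form
connection.create_table("users", {"profile": dict(), "activity": dict()})

table = connection.table("users")
table.put(b"user-001", {b"profile:name": b"Alice",          # new columns can be added at any time
                        b"activity:last_login": b"2024-01-01"})
print(table.row(b"user-001"))  # random read of a single row by key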
1.7. Common tool in The Hadoop ecosystem
The Hadoop ecosystem is made up of components that support one another, other apps build on
hadoop framework:
flume / sqoop : ingest data and transfer it to the storage components
Flume
Collects, aggregates, and transfers big data
Has a simple and flexible architecture based on streaming data flows
Uses a simple extensible data model that allows for online analytic
application
Sqoop
- Designed to transfer data between relational database systems and
Hadoop
- Accesses the database to understand the schema of the data
- Generates a MapReduce application to import or export the data
HDFS and Hbase as storage component
HBase
A non-relational database that runs on top of HDFS
Provides real time wrangling on data (process of cleaning, structuring, and
transforming raw data into a format suitable for analysis.)
Stores data as indexes to allow for random and faster access to data
Cassandra: A scalable, NoSQL database designed to have no single point of failure
pig / hive / mapreduce: process and analyzed on stored data, the processing is done by
parallel computing.
Pig
- Analyzes large amounts of data
- Operates on the client side of a cluster
- A procedural data flow language
Hive
- Used for creating reports
- Operates on the server side of a cluster
- A declarative programming language (lets users express which data they want rather than how to compute it)
impala / hue : tools used to access analyzed/refined data:
impala: allows non-technical users to search and access the data in Hadoop.
hue (Hadoop user experience):
Allows you to upload, browse, and query data
Runs Pig jobs and workflow
Provides editors for several SQL query languages like Hive and MySQL
HDP (Hortonworks Data Platform): provides tools to download and set up big data apps
`hadoop common` : collection of common utilities and libraries that support other Hadoop
modules
1.8. disadvantage of hadoop
hadoop is bad when:
processing transactions (random access)
work cannot be parallelized
there are dependencies in the data (EX: record 1 depends on record 2)
processing lots of small files
intensive calculations with little data
1.9. fix the missing JAVA_HOME error when installing hadoop
- set the JAVA_HOME environment variable (typically in ~/.bashrc or etc/hadoop/hadoop-env.sh):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
2. Spark
2.1. wtf is spark and basic properties
- Spark is an open source in-memory (=in-RAM) application framework for distributed data processing
and iterative analysis (series of repeated operations or calculations are performed on a dataset) on
massive data volumes
is written predominantly in Scala and runs on Java virtual machines (JVMs)
comprehensive, unified framework enabling programming flexibility with easy-to-use Python, Scala, and Java APIs
2.2. compare to mapreduce
mapreduce involves lots of disk I/O, while spark takes advantage of RAM (so-called in-memory processing)
2.3. Distributed computing
definition: a group/cluster of computers working together to appear as one system to the end user
what's the difference between distributed and parallel computing: in parallel computing, processors access shared memory, while in distributed computing each processor accesses its own private/distributed memory
(meaningless) benefits:
Scalability and modular growth: easily add more nodes to the system
Fault tolerance and redundancy: one node may go down but the system can continue running
2.4. Functional Programming
Programming that focuses on functions, emphasizing "what" to compute instead of "how" to compute it
EX: compared to procedural programming:
functional: define f(x) = x + 1 and apply f to each item to get the result array
procedural: f(arr) = loop over arr and return a new arr with each item + 1, spelling out the steps
main benefit: parallelization (see the sketch below)
Spark parallelizes computations using the lambda calculus.
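A tiny Python illustration of the contrast (the number list is made up): the functional version declares what to do with each item, which leaves Spark-style engines free to run the per-item work in parallel; the procedural version spells out how to build the result step by step.
nums = [1, 2, 3, 4]

# functional: define f(x) = x + 1 and apply it to every item; no loop bookkeeping
incremented = list(map(lambda x: x + 1, nums))

# procedural: build the new array explicitly, one step at a time
incremented_proc = []
for x in nums:
    incremented_proc.append(x + 1)

print(incremented, incremented_proc)  # [2, 3, 4, 5] [2, 3, 4, 5]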
2.5. Parallel programming
is the simultaneous use of multiple compute resources to solve a computational problem, by breaking the problem down into discrete parts that can be solved concurrently
2.6. Spark workflow
Communication occurs between the driver and the executors. The driver contains the Spark jobs that the application needs to run and splits the jobs into tasks submitted to the executors. The driver receives the task results when the executors complete the tasks.
2.7. Spark components
data storage: any Hadoop-compatible data source is acceptable (EX: HDFS)
High-level programming APIs in Scala, Python, and Java
the cluster management framework which handles the distributed computing aspects of Spark
(EX: YARN, Kubernetes)
2.8. Base engine in spark: spark core
is:
fault tolerant
performs large-scale parallel and distributed data processing by managing memory and task scheduling
defines RDDs and other data types
parallelizes a distributed collection of elements across the cluster.
2.9. Spark RDDs
> resilient distributed datasets (RDDs)
are the fundamental data structure in Apache Spark
fault tolerant: lost data partitions can be recovered after node failures (thanks to the lineage recorded in the DAG and the fact that the data is immutable)
hence can be:
partitioned across the cluster's nodes
capable of receiving parallel operations
immutable (data cannot be changed once created)
provide a low-level API for distributed data processing; they lack the optimizations that higher-level abstractions like DataFrames can offer
RDDs support multiple file formats as input data
2.10. > RDD transformations and actions, DAG
- RDD transformations: transform an RDD to create a new RDD
are "lazy" because the result is only computed when it is evaluated by an action
EX:
map(func) # like map() over a python list; func is the function that transforms each item
filter(func) # filter
distinct()
flatMap(func) # similar to map, but maps each input item to zero or more output items; func should return a Seq (i.e., a sequence/list)
> RDD actions: return a value to the driver program (EX: the python program) after running a computation on the dataset
EX:
reduce(func) # aggregates all RDD elements; func takes 2 args and returns one, and must be commutative and associative so it can be computed correctly in parallel
take(n) # returns the first n items
collect() # returns all elements as an arr (more detail: collects all results from the partitions into the driver program)
takeOrdered(n[, key=func])
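A minimal PySpark sketch of the transformations and actions above (the numbers are made up; assumes the `spark` session that the PySpark shell provides, or one created as in the later driver-program sketch):
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# transformations: lazy, nothing is computed yet
evens_doubled = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

# actions: trigger the computation and return values to the driver
print(evens_doubled.collect())          # [4, 8]
print(rdd.reduce(lambda a, b: a + b))   # 15
print(rdd.take(3))                      # [1, 2, 3]

# the lineage Spark recorded for this RDD (the DAG described below)
print(evens_doubled.toDebugString().decode())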
> Directed Acyclic Graph (DAG)
Graphical data structure with edges and vertices
In the Apache Spark DAG, the vertices represent RDDs and the edges represent operations such as transformations or actions
2.11. Spark Dataframe
Conceptually equivalent to a table in a relational database or a data frame in R/Python (tabular
structure with named columns), but with richer optimizations
higher-level abstraction built on top of RDDs
> more user-friendly and optimized API for distributed data processing compared to RDDs.
> Under the hood, operations on DataFrames are translated into a series of transformations on
RDDs.
immutable distributed collection of data (like RDDs)
supports numerous data formats as input (like RDDs)
scalable
each column must have a consistent data type across all rows
a data frame has row indexes (row labels) and column names (column labels); indexes can be numbers, dates, or strings/tuples (note: row indexes apply to pandas/R data frames; Spark DataFrames do not expose them)
User-Defined Functions (UDFs) operate one-row-at-a-time, and thus suffer from high
serialization and invocation overhead.
> serialization overhead: Serialization is the process of converting an object or data
structure into a format that can be easily stored or transmitted. In distributed computing
frameworks when data is moved between nodes in a cluster or when it's stored, it needs
to be serialized before transmission and deserialized upon reception. This is because data
in memory on one node might not be in the same programming language or format as on
another node. Serialization helps ensure that data can be effectively transmitted and
reconstructed.
> Invocation overhead refers to the additional time and resources required to initiate or
call a function or operation. invoking a function written in Python can have more
overhead compared to invoking a function written in a language like Java or Scala.
calling and executing these functions one row at a time incurs a significant amount of
additional processing time and resources
To address these performance issues, a common practice is to define UDFs in languages like Java or Scala, which are known for their performance, and then invoke those UDFs from Python.
> Note: nowadays there are also tools that avoid much of this overhead directly from Python (e.g., vectorized pandas UDFs built on Apache Arrow, as sketched below).
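A short sketch tying these points together (the column values are made up; assumes the PySpark `spark` session and the pyarrow package, which vectorized pandas UDFs rely on): the pandas UDF receives whole columns as pandas Series, avoiding the per-row serialization and invocation overhead of a plain Python UDF.
import pandas as pd
from pyspark.sql.functions import pandas_udf

df = spark.createDataFrame([("Alice", 30), ("Bob", 45)], ["name", "age"])

@pandas_udf("long")
def age_plus_one(age: pd.Series) -> pd.Series:
    # runs on a whole batch (pandas Series) at once, not one row at a time
    return age + 1

df.select("name", age_plus_one("age").alias("age_next_year")).show()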
2.12. Spark SQL
module for structured data processing
provides a SQL interface and tools for working with structured data in Spark that can
operate on both RDDs and DataFrames API.
many actions and transformations can be performed directly on DataFrames without
explicitly using Spark SQL
valuable when you need to execute complex SQL queries, join data from different
sources, or work with structured data in a more declarative manner (Instead of providing
step-by-step instructions or explicitly detailing the procedure, you declare the desired
outcome or result, and the underlying system is responsible for figuring out the most
efficient way to achieve it.).
cost-based optimizer (the goal of the optimizer is to find an execution plan that minimizes the overall cost of executing the query, where cost is typically measured in time or resources) (handled automatically by the engine, so you usually don't need to worry about it)
support numerous data format as input
Creating a table view in Spark SQL is required to run SQL queries programmatically on a DataFrame (see the sketch below):
local scope: the view exists only within the current Spark session
global scope: the view exists within the whole Spark application and is shareable across different Spark sessions
scalable, fault-tolerant
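A minimal sketch of the two view scopes (reusing the hypothetical `df` from the DataFrame sketch above):
df.createOrReplaceTempView("people")            # local: visible only in this Spark session
spark.sql("SELECT name FROM people WHERE age > 40").show()

df.createGlobalTempView("people_global")        # global: visible to all sessions of this application
spark.sql("SELECT COUNT(*) FROM global_temp.people_global").show()  # global views live in the global_temp database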
2.13. Spark Dataset
Datasets are:
the newest Spark data abstraction; an extension of DataFrames that provides a type-safe (strongly typed JVM objects), object-oriented programming interface
> strongly typed language: a language in which the type of a variable is explicitly defined and enforced at compile time or runtime, and conversions between different types often require explicit casting; strong typing provides additional safety and helps prevent certain kinds of runtime errors
> support converting JVM objects to tabular representations
> extend the DataFrame API with type-safe and object-oriented capabilities
> improved memory usage and caching (because Spark understands the data structure of the
datasets and optimizes the layout within memory.)
immutable (like DataFrames)
compute faster than RDDs, largely thanks to the Catalyst and Tungsten optimizations below
offer the additional query optimization enabled by Catalyst and Tungsten (detail in the next section)
support high-level aggregate operations (EX: sum, min, max, join, group by)
example code:
val ds = Seq("Alpha", "Beta", "Gamma").toDS() // create a Dataset from a sequence
val ds = spark.read.text("/text_folder/file.txt").as[String]
case class Customer(name: String, id: Int, phone: Double)
val ds_cust = spark.read.json("/customer.json").as[Customer]
2.14. Spark optimization: Catalyst and Tungsten
- Spark SQL optimization aims to reduce
Query time
Memory consumption
- Catalyst Optimizer
query optimization framework that is responsible for optimizing the logical and physical plans
of Spark SQL queries.
operates on the logical plan, which represents the abstract syntax tree of the user's query, and
transforms it into an optimized physical plan, which specifies the actual steps that Spark should
take to execute the query.
uses both rule-based and cost-based optimization
rule-based: the optimizer follows predefined rules, for example:
EX: is the table indexed?
EX: does the query contain only the required columns?
> Catalyst applies a set of optimization rules to transform and improve the query plan. These rules cover aspects such as predicate pushdown, constant folding, and projection pruning.
> cost-based: measures and calculates cost based on the time and memory that the query consumes, and selects the query path/plan that results in the lowest time and memory consumption.
example of cost-based vs rule-based: you are driving a car; how do you optimize the cost?
a car owner can optimize the vehicle's performance with the right oil, tires, fuel, and so on (rule-based optimizer)
plan a travel route to save both time and wear and tear on the car (cost-based optimizer)
based on functional programming constructs in Scala
allowing developers to add custom optimization rules or introduce new features without
modifying the core Spark codebase
core phases of optimization in Catalyst: Analysis -> Logical Optimization -> Physical Planning -> Code Generation
Logical Optimization is the rule-based step of Spark SQL, applying rules such as constant folding, predicate pushdown, and projection pruning
Physical Planning is the cost-based step: it generates candidate physical plans and chooses the cheapest one
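A quick way to see Catalyst's output for yourself (reusing the hypothetical `df` from the DataFrame sketch); explain(True) prints the parsed, analyzed, and optimized logical plans plus the chosen physical plan:
filtered = df.filter(df["age"] > 21).select("name")
filtered.explain(True)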
Driver Program:
a single process that creates work for the cluster.
The driver runs the application's user code, creates work (jobs split into tasks: computations that can be performed in parallel) and sends it to the cluster.
The Spark Context (see the sketch below)
starts when the application launches and must be created in the driver before DataFrames or RDDs
one SparkContext per application
divides jobs into tasks that can be performed in parallel and executed on the cluster
Tasks from a given Job operate on different data subsets called Partitions
NOTE: It is not advised to perform a collection to the driver on a large data set as it could
easily use up the driver process memory. If the data set is large, apply a reduction before
collection
Can be run on a cluster node or another machine as a client to the cluster
> deploy mode:
Client Mode - the application submitter (EX: the user's machine/terminal) launches the driver process outside the cluster
Cluster Mode - the driver process is launched inside the cluster
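As referenced above, a minimal sketch of creating the SparkSession (and through it the single SparkContext) in the driver; the app name is made up, and `local[*]` stands in for whatever master the cluster manager / spark-submit would normally supply:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-app")       # hypothetical application name
    .master("local[*]")      # local mode here; on a real cluster this is set via spark-submit / the cluster manager
    .getOrCreate()
)
sc = spark.sparkContext      # the one SparkContext for this application
print(sc.defaultParallelism) # how many tasks Spark will run in parallel by default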
An executor (one in each worker node)
is a process running multiple threads to perform work concurrently and independently for
the cluster
A Worker node is a node that can launch executor processes to run tasks
A Spark Executor utilizes a set portion of local resources (memory and compute cores), running one task per available core. Each executor manages its data caching as dictated by the driver. In general, increasing the number of executors and available cores increases the cluster's parallelism.
When a task finishes, the executor puts the results in a new RDD partition or transfers
them back to the driver.
A Stage and shuffle:
When a task requires data from other partitions, Spark must perform a "shuffle" (the job is separated into 2 or more stages and the data is repartitioned across the cluster nodes).
stage is a set of tasks within a job that can be completed on the current local data
partition. Subsequent tasks in later stages must wait for that stage to be completed
before beginning execution, creating a dependency from one stage to the next.
shuffle:
costly: requires data serialization plus disk and network I/O to transfer data between dataset partitions
cluster manager:
Communicates with a particular cluster to acquire resources for the running application
runs as a service outside the application
While an application is running, the Spark Context creates tasks and communicates to the cluster manager what resources are needed. Then the cluster manager reserves executor cores and memory resources. Once the resources are reserved, tasks can be transferred to the executor processes to run
types of cluster managers
spark standalone:
installed with Spark by default; best for setting up a simple cluster
Spark Standalone is specifically designed to run Spark and is often the fastest way to set up and get running
core components:
> one or more workers: they start an executor process with one or more reserved cores
> one master: can be run on any cluster node; it connects workers to the cluster and keeps track of them
YARN
created by the Hadoop project, supports many other frameworks
has dependencies, making it more complex to deploy than Spark Standalone
Mesos
Kubernetes
open-source system for running containerized applications (software applications that have been packaged and isolated with all of their dependencies, libraries, and runtime environments into containers)
more portable, with better automation; simplifies dependency management and scaling the cluster as needed
spark local mode: runs a simple Spark app on a single machine without the need to set up a cluster (useful for toy/test apps), ...
Spark Shell simplifies working with data by automatically initializing the SparkContext and
SparkSession variables and providing Spark API access.
df.filter(df["age"] > 21).show() # filter() and where() functions are used for the same purpose.
spark.sql("SELECT age, name FROM people WHERE age > 21").show()
df.groupBy("age").count().show() # some other aggregate functions: max, min, countDistinct, avg, ...
spark.sql("SELECT age, COUNT(age) as count FROM people GROUP BY age").show()
# sdf is assumed to be a DataFrame of car data with columns cyl and wt; show() returns None, so no assignment is needed
sdf.groupby(['cyl'])\
    .agg({"wt": "count"})\
    .sort("count(wt)", ascending=False)\
    .show(5)
2.15. UDFs
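A hedged sketch of a plain (row-at-a-time) Python UDF, the kind whose overhead is discussed in the DataFrame section above; it reuses the hypothetical `df` and `people` view from the earlier sketches:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def age_group(age):
    # executed once per row, hence the serialization/invocation overhead noted earlier
    return "adult" if age >= 18 else "minor"

df.withColumn("group", age_group("age")).show()

# the same logic registered for use inside Spark SQL queries
spark.udf.register("age_group_sql", lambda age: "adult" if age >= 18 else "minor", StringType())
spark.sql("SELECT name, age_group_sql(age) AS grp FROM people").show()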
3. Kubernetes
is a tool to run containerized applications on a cluster
open source
portable: runs the same way on any machine
scalable, flexible
4. Example usage
4.1. install hadoop and count words from a single txt file with mapreduce: link
4.2. deploy a hadoop cluster using a pre-built docker compose and view it using a GUI: link