S - Hadoop Ecosystem

Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers using simple programming models. Its core components are HDFS for storage, MapReduce for processing, and YARN for resource management. HDFS stores data across the cluster and provides redundancy, while MapReduce provides a programming model for distributed processing of large datasets. Spark is a related framework for large-scale data processing that improves on MapReduce by using an in-memory approach.

Hadoop

1. hadoop
1.1. wtf is hadoop
- wtf is hadoop: Apache Hadoop is a collection of open-source software utilities that facilitates using a
network of many computers to solve problems involving massive amounts of data and computation. It
provides a software framework for distributed storage and processing of big data using the MapReduce
programming model.
1.2. HDFS
 Purpose: collect data from the data ingestion tools, distribute it across multiple nodes, and manage those nodes in a cluster
> splits/partitions/shards files into blocks (128 MB by default in recent Hadoop versions), creates replicas of the blocks, and stores them on different machines/nodes
 Terminology
 rack: a collection of about 40 to 50 data nodes that share the same network switch
 node:
 primary/name node: regulates file access for clients; maintains, manages, and assigns tasks to the data nodes; keeps the metadata of the files in RAM for quick access
> the name node has a rack-awareness mechanism: it chooses to read/write data to the node rack closest to the client, which reduces network traffic; besides that, it also keeps replicas in different racks
 Question: what is the difference between a cluster and a rack? → a cluster can contain many racks
 secondary/data node: these nodes are the actual workers in the HDFS system and take instructions from the primary node
 replication factor: the number of times each block is copied
 cluster: one cluster has one name node and many data nodes
 uses YARN as the resource manager across the cluster
 HDFS provides a command line interface for interacting with Hadoop (see the example below)
 key features:
 does not require expensive hardware
 supports storing very large amounts of data
 replication
 fault tolerant
 scalable
 portable
 write once, read many: HDFS does not allow editing a written file, but new data can be appended to it
 read process: the client sends a request to the name node, the name node verifies the client and then provides it with the locations of the required blocks, then the client reads the file from those data nodes
 write process: the client sends a request to the name node, the name node verifies the client and checks whether the file already exists; if it already exists, an error is returned, otherwise the client is given data nodes to write the blocks to
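A few typical HDFS shell commands as a concrete example of the command line interface mentioned above (the paths and replication factor are made up for illustration):

hdfs dfs -mkdir -p /data/raw                   # create a directory in HDFS
hdfs dfs -put local_file.txt /data/raw/        # copy a local file into HDFS (it is split into blocks and replicated)
hdfs dfs -ls /data/raw                         # list the directory
hdfs dfs -cat /data/raw/local_file.txt         # read the file back
hdfs dfs -setrep 3 /data/raw/local_file.txt    # change the replication factor of the file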
1.3. mapreduce
 purpose: Java-based; processes data by splitting it into smaller units and assigning them to different nodes to process in parallel; largely superseded by `Spark`
 can be coded in many programming languages such as Java, C++, Python, Ruby, and R
 consists of a Map task and a Reduce task (see the word-count sketch below)
 Map:
 processes data into key-value pairs
 further sorts and organizes the data
 Reducer: aggregates and computes the set of results and produces the final output
 MapReduce keeps track of its tasks by creating unique keys
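As a concrete illustration, a minimal word-count sketch in the Hadoop Streaming style; the file names mapper.py and reducer.py are made up for illustration, and with Hadoop Streaming they would be passed to the streaming jar, which pipes HDFS data through them:

# mapper.py: read lines from stdin and emit one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: input arrives grouped/sorted by key, so counts can be accumulated per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, n = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")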
1.4. YARN
 Yet Another Resource Negotiator: the resource manager that ships with Hadoop; prepares Hadoop for batch, stream, interactive, and graph processing; it is the default resource manager for big data apps and is slowly being replaced by `kubernetes`
1.5. Hive
 Hive is data warehouse software within Hadoop that is designed for reading, writing, and managing tabular datasets and for data analysis:
 HiveQL is nearly the same as SQL
 supports data cleaning and filtering
 suited for static data analysis (SQL can work with static/dynamic data)
 designed for write once, read many
 does not require verifying the schema when loading data
 supports partitioning (dividing a table by column values, EX: date, city); see the sketch below
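A minimal sketch of creating a partitioned table from PySpark, assuming a SparkSession built with enableHiveSupport(); the table and column names are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive-partition-demo") \
    .enableHiveSupport() \
    .getOrCreate()

# each distinct (dt, city) pair gets its own partition directory on disk,
# so queries filtering on dt/city only scan the matching partitions
spark.sql("""
    CREATE TABLE IF NOT EXISTS taxi_trips (trip_id INT, fare DOUBLE)
    PARTITIONED BY (dt STRING, city STRING)
""")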
1.6. Hbase
 a column-oriented, non-relational database management system that runs on top of HDFS
 provides a fault-tolerant way of storing sparse datasets
 works well with real-time data and random read/write access to big data
 used for write-heavy applications
 requires predefining the table schema and specifying the column families (roughly analogous to a document containing objects)
 new columns can be added to a column family at any time
 compare with HDFS:


1.7. Common tool in The Hadoop ecosystem
 The Hadoop ecosystem is made up of components that support one another; other apps are built on the hadoop framework:
 flume / sqoop: ingest data and transfer it to the storage components
 Flume
 Collects, aggregates, and transfers big data
 Has a simple and flexible architecture based on streaming data flows
 Uses a simple extensible data model that allows for online analytic
application
 Sqoop
 designed to transfer data between relational database systems and Hadoop
 accesses the database to understand the schema of the data
 generates a MapReduce application to import or export the data
 HDFS and Hbase as storage component
 HBase
 A non-relational database that runs on top of HDFS
 Provides real-time wrangling of data (the process of cleaning, structuring, and transforming raw data into a format suitable for analysis)
 Stores data as indexes to allow random and faster access to data
 Cassandra: A scalable, NoSQL database designed to have no single point of failure
 pig / hive / mapreduce: process and analyze the stored data; the processing is done by parallel computing
 Pig
 Analyzes large amounts of data
 Operates on the client side of a cluster
 A procedural data flow language
 Hive
 Used for creating reports
 Operates on the server side of a cluster
 A declarative programming language (allows the user to express which data they wish to receive)
 impala / hue: tools used to access the analyzed/refined data:
 impala: allows non-technical users to search and access the data in Hadoop
 hue (Hadoop User Experience):
 allows you to upload, browse, and query data
 runs Pig jobs and workflows
 provides editors for several SQL query languages like Hive and MySQL
 HDP (Hortonworks Data Platform): provides tooling to download and set up big data apps
 `hadoop common`: the collection of common utilities and libraries that support the other Hadoop modules
1.8. disadvantages of hadoop
hadoop is bad when:
 processing transactions (random access)
 the work cannot be parallelized
 there are dependencies in the data (EX: record 1 depends on record 2)
 processing lots of small files
 doing intensive calculations with little data
1.9. fix the missing JAVA_HOME error when installing hadoop
- set the JAVA_HOME environment variable:
 export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
2. Spark
2.1. wtf is spark and basic properties
- Spark is an open-source, in-memory (= in-RAM) application framework for distributed data processing and iterative analysis (a series of repeated operations or calculations performed on a dataset) of massive data volumes
 written predominantly in Scala and runs on Java virtual machines (JVMs)
 a comprehensive, unified framework that enables programming flexibility with easy-to-use Python, Scala, and Java APIs
2.2. compare to mapreduce
 mapreduce involves lots of disk I/O, while spark takes advantage of RAM (so-called in-memory processing)
2.3. Distributed computing
 definition: a group/cluster of computers working together to appear as one system to the end user
 difference between distributed and parallel computing: in parallel computing the processors access shared memory, while in distributed computing each processor accesses its own private/distributed memory
 (meaningless) benefits:
 scalability and modular growth: easily add more nodes to the system
 fault tolerance and redundancy: one node may go down but the system keeps running
2.4. Functional Programming
Programming that focuses on functions, emphasizing "what" instead of "how"
 EX: compared to procedural programming (see the Python sketch below):
 functional: f(x) = x + 1 → feed n items into f to get the result array
 procedural: f(arr) = return arr with each item + 1 → take an array and return a new array with the modified values
 main benefit: parallelization
 Spark parallelizes computations using the lambda calculus.
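A tiny Python sketch of the contrast above (plain Python, no Spark needed):

# functional style: describe *what* to compute; map applies f to every item
f = lambda x: x + 1
result = list(map(f, [1, 2, 3]))   # [2, 3, 4]

# procedural style: describe *how*, step by step, with explicit looping and mutation
def add_one(arr):
    out = []
    for item in arr:
        out.append(item + 1)
    return out

result = add_one([1, 2, 3])        # [2, 3, 4]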
2.5. Parallel programming
is the simultaneous use of multiple compute resources to solve a computational problem by breaking the problem into discrete parts that can be solved concurrently
2.1. Spark workflow

Communication occurs between the driver and the executors. The driver contains the Spark jobs that the application needs to run and splits the jobs into tasks submitted to the executors. The driver receives the task results when the executors complete the tasks.
2.2. Spark components
 data storage: any Hadoop-compatible data source is acceptable (EX: HDFS)
 High-level programming APIs in Scala, Python, and Java
 the cluster management framework which handles the distributed computing aspects of Spark
(EX: YARN, Kubernetes)
2.3. Base engine in spark: spark core
is:
 fault tolerant
 used to perform large-scale parallel and distributed data processing by managing memory and task scheduling
 defines RDDs and other datatypes
 parallelizes a distributed collection of elements across the cluster
2.4. Spark RDDs
> resilient distributed datasets (RDDs)
 the fundamental data structure in Apache Spark
 fault tolerant, able to recover lost data partitions after node failures (thanks to the DAG of transformations and the immutability of the data)
hence an RDD can be:
 partitioned across the cluster's nodes
 operated on in parallel
 immutable (data cannot be changed once created)

 RDDs provide a low-level API for distributed data processing; they lack the optimizations that higher-level abstractions like DataFrames can offer
 RDDs support multiple file formats as input data
2.5. > RDD transformations and actions, DAG
- RDD transformation: transforms an RDD to create a new RDD
 "lazy": the result is only computed when it is evaluated by an action (see the sketch below)
 EX:
 map(func) # like map() over a Python list; func is the function that transforms each item
 filter(func) # filter
 distinct()
 flatMap(func) # similar to map, but maps each input item to zero or more output items; func should return a sequence
> RDD action: returns a value to the driver program (EX: the Python program) after running the transformations
 EX:
 reduce(func) # aggregates all RDD elements; func takes 2 args and returns one, and must be commutative and associative so it can be computed correctly in parallel
 take(n) # return the first n items
 collect() # return all elements as an array (more detail: collects all results from the partitions into the driver program)
 takeOrdered(n[, key=func])
> Directed Acyclic Graph (DAG)
 a graph data structure with vertices and edges
 in the Apache Spark DAG, the vertices represent RDDs and the edges represent operations such as transformations or actions
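A short PySpark sketch of the laziness described above (assuming a SparkContext named sc, as created in the code section at the end of these notes):

rdd = sc.parallelize(range(10))                   # distributed collection
squares = rdd.map(lambda x: x * x)                # transformation: nothing runs yet, the DAG just grows
evens = squares.filter(lambda x: x % 2 == 0)      # another lazy transformation
print(evens.collect())                            # action: the DAG is executed now -> [0, 4, 16, 36, 64]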
2.6. Spark Dataframe
 conceptually equivalent to a table in a relational database or a data frame in R/Python (a tabular structure with named columns), but with richer optimizations
 a higher-level abstraction built on top of RDDs
> a more user-friendly and optimized API for distributed data processing compared to RDDs
> under the hood, operations on DataFrames are translated into a series of transformations on RDDs
 an immutable distributed collection of data (like an RDD)
 supports numerous data formats as input (like an RDD)
 scalable
 each column must have a consistent data type across all rows
 a DF has row indexes (row labels) and column names (column labels); indexes can be numbers, dates, or strings/tuples
 User-Defined Functions (UDFs) operate one row at a time, and thus suffer from high serialization and invocation overhead.
 > serialization overhead: Serialization is the process of converting an object or data
structure into a format that can be easily stored or transmitted. In distributed computing
frameworks when data is moved between nodes in a cluster or when it's stored, it needs
to be serialized before transmission and deserialized upon reception. This is because data
in memory on one node might not be in the same programming language or format as on
another node. Serialization helps ensure that data can be effectively transmitted and
reconstructed.
 > invocation overhead: the additional time and resources required to initiate or call a function or operation. Invoking a function written in Python has more overhead than invoking one written in a language like Java or Scala, and calling and executing these functions one row at a time incurs a significant amount of additional processing time and resources.
 To address these performance issues, a common practice is to define UDFs in languages known for their performance, like Java and Scala, and then invoke those UDFs from Python.
> note: nowadays there are also tools for writing performant, vectorized UDFs directly in Python, such as pandas UDFs (see the sketch below and the pandas_udf example later in these notes)
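For contrast with the vectorized pandas UDF shown later, a minimal sketch of a plain row-at-a-time PySpark UDF (the DataFrame sdf and column wt are borrowed from the mtcars example further down, purely for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# the lambda is called once per row, so every value crosses the JVM/Python boundary individually
convert_wt = udf(lambda w: w * 0.45, FloatType())
sdf.withColumn("wt_metric", convert_wt(sdf["wt"])).show(5)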
2.7. Spark SQL
 a module for structured data processing
 provides a SQL interface and tools for working with structured data in Spark that can operate on both the RDD and DataFrame APIs
 many actions and transformations can be performed directly on DataFrames without explicitly using Spark SQL
 valuable when you need to execute complex SQL queries, join data from different sources, or work with structured data in a more declarative manner (instead of providing step-by-step instructions or explicitly detailing the procedure, you declare the desired outcome, and the underlying system is responsible for figuring out the most efficient way to achieve it)
 cost-based optimizer (the goal of the optimizer is to find an execution plan that minimizes the overall cost of executing the query, where cost is typically measured in time or resources) (handled automatically by the engine, don't worry about it)
 supports numerous data formats as input
 creating a table view in Spark SQL is required to run SQL queries programmatically on a DataFrame (see the sketch below):
 local scope: the view exists only within the current Spark session
 global scope: the view exists within the whole Spark application and is shareable across different Spark sessions
 scalable, fault tolerant
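A minimal sketch of the two view scopes (assuming the df DataFrame created in the code section later; global views live in Spark's reserved global_temp database):

df.createOrReplaceTempView("people")                      # session-scoped view
spark.sql("SELECT name FROM people").show()

df.createOrReplaceGlobalTempView("people")                # application-scoped view
spark.sql("SELECT name FROM global_temp.people").show()   # note the global_temp prefix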
2.8. Spark Dataset
Datasets are:
 Datasets are the newest Spark data abstraction, an extension of DataFrames that provides a type-safe (strongly typed Java Virtual Machine object), object-oriented programming interface.
> strongly typed language: a language in which the type of a variable is explicitly defined and enforced at compile time or runtime, and conversions between different types often require explicit casting; strong typing provides additional safety and helps prevent certain kinds of runtime errors
> support converting JVM objects to tabular representations
> extend the DataFrame API with type-safe and object-oriented capabilities
> improved memory usage and caching (because Spark understands the data structure of the Dataset and optimizes the layout within memory)
 immutable (as DF)
 (explanation missing) compute faster than RDDs
 offer the additional query optimization enabled by Catalyst and Tungsten (details in a later section)
 support high-level aggregate operations (EX: sum, min, max, join, group by)
 example code (Scala):
 val ds = Seq("Alpha", "Beta", "Gamma").toDS() // create a Dataset from a sequence (toDS/.as need import spark.implicits._, automatic in the spark-shell)
 val ds = spark.read.text("/text_folder/file.txt").as[String]
 case class Customer(name: String, id: Int, phone: Double)
 val ds_cust = spark.read.json("/customer.json").as[Customer]
2.9. Spark optimization: Catalyst and Tungsten
- Spark SQL optimization goals: reduce
 query time
 memory consumption
- Catalyst Optimizer
 the query optimization framework responsible for optimizing the logical and physical plans of Spark SQL queries
 operates on the logical plan, which represents the abstract syntax tree of the user's query, and transforms it into an optimized physical plan, which specifies the actual steps Spark should take to execute the query
 uses both rule-based and cost-based optimization
 rule-based: the SQL optimizer follows predefined rules, for example
 EX: is the table indexed?
 EX: does the query contain only the required columns?

> Catalyst applies a set of optimization rules to transform and improve the query plan. These rules cover aspects such as predicate pushdown, constant folding, and projection pruning.
> cost-based: measures and calculates cost based on the time and memory that the query consumes → selects the query path/plan that results in the lowest time and memory consumption
 example of rule-based vs cost-based: you are driving a car, how do you optimize the cost spent?
 a car owner can optimize the vehicle's performance with the right oil, tires, fuel, and so on (rule-based optimizer)
 plan a travel route to save both time and wear and tear on the car (cost-based optimizer)
 based on functional programming constructs in Scala
 allows developers to add custom optimization rules or introduce new features without modifying the core Spark codebase
 core phases of optimization in Catalyst: Analysis → Logical Optimization → Physical Planning → Code Generation (you can inspect the generated plans with explain(), see below)
 Logical Optimization: the rule-based optimization step of Spark SQL, applying rules such as constant folding, predicate pushdown, and projection pruning
 Physical Planning: the cost-based step that chooses the best physical plan
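A quick way to see what Catalyst produced, assuming the df DataFrame from the code section later:

# prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(df["age"] > 21).select("name").explain(True)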

- (skip) Tungsten


 Tungsten optimizes the performance of the underlying hardware, focusing on the CPU performance of Spark's execution engine rather than on IO, by using methods better suited to data processing on the JVM
 manages memory explicitly and does not rely on the JVM object model or garbage collection
 (keyword not covered yet) enables cache-friendly computation of algorithms and data structures using STRIDE-based memory access (instead of random memory access)
 (meaningless) supports on-demand JVM bytecode generation
 (term not covered yet) does not generate virtual function dispatches
 (?) places intermediate data in CPU registers
 (?) enables loop unrolling

2.10. Spark application process

 Driver Program:
 a single process that creates work for the cluster

 the driver runs the application's user code, creates work (a job is split into tasks: computations that can be performed in parallel), and sends it to the cluster
 the Spark Context
 starts when the application launches and must be created in the driver before any DataFrames or RDDs
 one SparkContext per application
 divides the jobs into tasks that can be performed in parallel and executed on the cluster
 tasks from a given job operate on different data subsets called partitions
 NOTE: it is not advised to collect a large data set to the driver, as it could easily use up the driver process memory; if the data set is large, apply a reduction before collecting
 the driver can run on a cluster node or on another machine as a client of the cluster
> deploy mode:
 client mode: the application submitter (EX: the user's machine/terminal) launches the driver process outside the cluster
 cluster mode: ... inside ...
 an executor (typically one per worker node)
 is a process running multiple threads to perform work concurrently and independently for the cluster
 a worker node is a node that can launch executor processes to run tasks
 a Spark executor uses a set portion of local resources (memory and compute cores), running one task per available core; each executor manages its data caching as dictated by the driver; in general, increasing the number of executors and available cores increases the cluster's parallelism
 when a task finishes, the executor puts the results in a new RDD partition or transfers them back to the driver
 stages and shuffles:
 when a task requires data from other partitions, Spark must perform a "shuffle" (the job is separated into 2 or more stages and the data is repartitioned across the cluster nodes)
 a stage is a set of tasks within a job that can be completed using the current local data partitions; subsequent tasks in later stages must wait for that stage to be completed before beginning execution, creating a dependency from one stage to the next
 shuffles are:
 costly: they require data serialization plus disk and network I/O to transfer data between partitions
 cluster manager:
 communicates with a particular cluster to acquire resources for the running application
 runs as a service outside the application
 while an application is running, the Spark Context creates tasks and communicates to the cluster manager what resources are needed; the cluster manager then reserves executor cores and memory; once the resources are reserved, tasks can be transferred to the executor processes to run
 types of cluster manager
 Spark standalone:
 installed with Spark by default; best for setting up a simple cluster
 Spark standalone is specifically designed to run Spark and is often the fastest way to set up a cluster and get running
 core components:
> one or more workers: they start an executor process with one or more reserved cores
> one master: can run on any cluster node; it connects workers to the cluster and keeps track of them

 YARN
 created by the Hadoop project; supports many other frameworks
 has dependencies, making it more complex to deploy than Spark standalone
 Mesos
 Kubernetes
 an open-source system for running containerized applications (software applications that have been packaged and isolated with all of their dependencies, libraries, and runtime environments into containers)
 more portable and easier to automate; simplifies dependency management and scaling the cluster as needed
 Spark local mode: runs simple Spark apps without the need to set up a cluster, useful for testing and toy apps

2.11. run spark app


- `spark-submit`:
 the command for submitting the app
 works the same way regardless of cluster manager type or application language (can submit Python and Java apps alike)
 workflow:
 parse the command line options
 read additional configuration specified in 'conf/spark-defaults.conf'
 connect to the cluster manager specified with the '--master' argument, or run in local mode
 transfer the application (JARs or Python files) and any additional files specified, to be distributed and run in the cluster
 syntax: spark-submit [options] <app-file> [<app args>] (example below)
options:
 (required) --master <cluster_manager_to_connect_to>
 (required when running a Java or Scala app) --class <class> # specify the program entry point
 --executor-cores <n> --executor-memory <size> # set CPU core and memory usage of the executors
 --deploy-mode <client|cluster> # where the driver is launched (default: client)
 --help # see the available options
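A hypothetical invocation, assuming a YARN cluster and a PySpark application file named my_app.py (both made up for illustration):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-cores 2 \
  --executor-memory 4g \
  my_app.py input_path output_path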

2.12. > application dependency


 for python:
 pass dependencies explicitly to the cluster:
 spark-submit ... --py-files <python_dependencies> <app file and app args>

2.13. Spark Shell

Spark Shell simplifies working with data by automatically initializing the SparkContext and
SparkSession variables and providing Spark API access.

2.14. Spark Configuration Types


 types of spark configuration:
 properties: adjust and control application behavior (including setting properties on the driver and sharing them with the cluster)
 environment variables: adjust settings on a per-machine basis
 logging: dictates what level of messages (such as info or errors) is logged to file or output to the driver during application execution
 you can set properties:
 programmatically, when creating the SparkSession or using a SparkConf object (see the sketch below)
 in the file 'conf/spark-defaults.conf'
 when launching 'spark-submit', with arguments such as '--master' or '--conf <key>=<value>'
> precedence (priority) order: programmatic > spark-submit > config file
 new terms:
 static configuration: configuration set programmatically in the application itself (EX: the app name)
 dynamic configuration: configuration set when launching the app with `spark-submit` (as options) (EX: which cluster manager to submit to)
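A minimal sketch of setting properties programmatically (static configuration); the app name and property value are just examples:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf() \
    .setAppName("config-demo") \
    .set("spark.executor.memory", "2g")   # example property

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get("spark.executor.memory"))   # -> 2g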

2.15. spark code example


2.15.1. Import and init spark context and spark section
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf # Spark API for Python
from pyspark.sql import SparkSession
# Spark Context and Spark session needed for SparkSQL and DataFrames
sc = SparkContext() # Creating a spark context class
spark = SparkSession \
.builder \
.appName("Python Spark DataFrames basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate() # Creating a spark session
spark # print out object info
# note: the Spark UI is available at port 4040
2.15.2. RDDs
data = range(1,30)
xrangeRDD = sc.parallelize(data, 4) # copy data to RDD with 4 partition
xrangeRDD # inspect the RDD object
# A transformation is an operation on an RDD that results in a new RDD. The new RDD is lazily
# evaluated (the calculation is not carried out when the new RDD is generated). The RDD will
# contain a series of transformations, or computation instructions, that will only be carried
# out when an action is called (or required).
subRDD = xrangeRDD.map(lambda x: x-1)
filteredRDD = subRDD.filter(lambda x : x<10)
# filteredRDD.cache() # cache the calculation (result)
print(filteredRDD.collect()) #retrieve the result at this RDD, output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
filteredRDD.count() # output: 10
2.15.3. Read df and DataFrames-SparkSQL operation
2.15.3.1. create/import DF
df = spark.read.json("people.json").cache() # read data from json file
sdf = spark.createDataFrame(mtcars) # create a Spark df from a pandas df (here assumed to be mtcars)
dataframe_1 = spark.createDataFrame(data, columns) # create a df from data (a list of tuples) and columns (a list of strings)
2.15.3.2. read
df.show() # show the first 20 rows by default; df.show(5) shows the first 5 rows
df.printSchema() # print schema
df.createTempView("people") # register the DataFrame as a local temporary view; use createGlobalTempView("people") for a global view
df.select("name").show() # print only specific col
df.select(df["name"]).show()
spark.sql("SELECT name FROM people").show()
2.15.3.3. update (create a new df from the current df)
tmp_df = sdf.withColumn('wtTon', sdf['wt'] * 0.45) # create a new column wtTon, a modified version of column wt
tmp_df.show(5)

tmp_df = sdf.withColumnRenamed("vs", "versus") # rename col

df.filter(df["age"] > 21).show() # filter() and where() functions are used for the same purpose.
spark.sql("SELECT age, name FROM people WHERE age > 21").show()

df.groupBy("age").count().show() # some other aggregate functions: max, min, countdistinct, avg, ...
spark.sql("SELECT age, COUNT(age) as count FROM people GROUP BY age").show()
car_counts = sdf.groupby(['cyl'])\
    .agg({"wt": "count"})\
    .sort("count(wt)", ascending=False) # count the cars per number of cylinders, most common first
car_counts.show(5)

combined_df = dataframe_1.join(dataframe_2, on="emp_id", how="inner") # inner join on the emp_id column

filled_df = dataframe_1.fillna({"salary": 3000}) # replace missing values in the salary column with 3000

2.15.3.4. UDFs

import pandas as pd
from pyspark.sql.functions import pandas_udf


# the line below is a decorator that turns the function into a vectorized (pandas) UDF returning floats
@pandas_udf("float")
def convert_wt(s: pd.Series) -> pd.Series:
    # the formula for converting from imperial to metric tons
    return s * 0.45
spark.udf.register("convert_weight", convert_wt)
spark.sql("SELECT *, wt AS weight_imperial, convert_weight(wt) as weight_metric FROM
cars").show()

3. Kubernetes
 a tool for running containerized applications on a cluster
 open source
 portable: runs the same way on any machine
 scalable, flexible

4. Example usage
4.1. install Hadoop and count words in a single txt file with MapReduce: link
4.2. deploy a Hadoop cluster using a pre-built docker compose setup and view it with a GUI: link
