
Big Data Management

The document discusses big data and its management. It defines big data as large volumes of structured, semi-structured and unstructured data that is generated quickly and is challenging to process using traditional systems. It then covers characteristics, frameworks like Hadoop and Spark, and the Hadoop ecosystem including components like HDFS, MapReduce, Pig and Hive.

Uploaded by

sehun twin

BIG DATA & MANAGEMENT
1. WHAT IS BIG DATA?
When does data become BIG?

Large volume of data, structured and unstructured

2.5 quintillion bytes of data are generated every day [discussed in Lesson 1]

2,500,000,000,000,000,000
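
To put that figure into more familiar storage units, the sketch below converts it using decimal (SI) prefixes. The function and constant names are illustrative and not part of the lesson.

```python
# Illustrative unit conversion using decimal (SI) prefixes.
UNIT_SIZES = {
    "TB": 10**12,  # terabyte
    "PB": 10**15,  # petabyte
    "EB": 10**18,  # exabyte
}

def convert_bytes(n_bytes: int, unit: str) -> float:
    """Express a byte count in the requested unit."""
    return n_bytes / UNIT_SIZES[unit]

daily_bytes = 2_500_000_000_000_000_000  # 2.5 quintillion bytes

print(convert_bytes(daily_bytes, "EB"))  # 2.5
print(convert_bytes(daily_bytes, "PB"))  # 2500.0
print(convert_bytes(daily_bytes, "TB"))  # 2500000.0
```
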
What is Big Data?

2.5 quintillion pennies would, if laid out flat, cover the Earth five times

Bill Gates’s projected fortune times 2.5 million, assuming he lives to see 2042
[$102 Billion ~ 419,985,000,000.00]

Can we process this data on traditional computing systems?
2. CHARACTERISTICS
How do you classify data as BIG Data?

Volume : Size

Velocity : High Speed of Accumulation

Variety : Nature [Structured, Semi-structured, Unstructured]

Veracity : Inconsistencies & Uncertainties (Quality)

Value : Information - Knowledge


Volume, Velocity: ~2500 exabytes per year

Variety:
- Excel files (S, structured)
- System logs (SS, semi-structured)
- CT scans (U, unstructured)

Veracity:
- Accuracy
- Trustworthiness
- Misdiagnosis

Value:
- Disease detection
- Drug detection
- Reduced cost
“A huge amount of complex, variously formatted data, generated at high speed, that cannot be handled or processed by traditional systems.”
3. MANAGEMENT
FRAMEWORKS / TOOLS

Popular: Hadoop, Storm, Hive and Spark

Promising: Flink and Heron

Most Useful: Presto and MapReduce

Others: Kafka, Tez, Impala, Beam, Apex, etc.
STORAGE - HADOOP

HDFS – Hadoop Distributed File System


400 MB

Machine A   Machine B   Machine C   Machine D
PART A      PART B      PART C      PART D
100 MB      100 MB      100 MB      100 MB
STORAGE - HADOOP

HDFS – Hadoop Distributed File System


400 MB

Each machine holds two 100 MB parts, so every part is stored twice:

Machine A        Machine B        Machine C        Machine D
PART A, PART D   PART B, PART C   PART C, PART B   PART D, PART A
PROCESSING – HADOOP

MapReduce – Parallel Processing

TASK A

Machine A   Machine B   Machine C   Machine D
TASK A1     TASK A2     TASK A3     TASK A4

RESULT = A1 + A2 + A3 + A4 = OUTPUT_TASK A
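
The idea of splitting TASK A into four subtasks and adding up their results can be sketched in a few lines of Python. This is only an illustration under stated assumptions: worker threads stand in for the four machines, the task is a simple sum, and the helper names split_task and run_task are invented here.

```python
from concurrent.futures import ThreadPoolExecutor

def split_task(data, n_parts):
    """Split data into n_parts nearly equal contiguous chunks (A1..An)."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def run_task(data, n_workers=4):
    """Sum each chunk in parallel, then combine the partial results."""
    chunks = split_task(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))  # A1, A2, A3, A4
    return partials, sum(partials)              # RESULT = A1 + A2 + A3 + A4

partials, result = run_task(list(range(1, 101)))
print(partials)  # [325, 950, 1575, 2200]
print(result)    # 5050
```
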
HADOOP - HDFS

Designed for storing huge datasets on commodity hardware

Name Node [Master]

400 MB file: PART A, PART B, PART C, PART D

Data Node [Slave]   Data Node [Slave]   Data Node [Slave]   Data Node [Slave]
Machine A           Machine B           Machine C           Machine D
PART A, PART D      PART B, PART C      PART C, PART B      PART D, PART A
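
The bookkeeping behind this picture (which machine stores which part) can be sketched as follows. This is a simplified illustration, not HDFS code: the round-robin placement policy and the function names are assumptions, and the resulting layout differs from the one drawn on the slide. The 100 MB block size and two copies follow the slides; HDFS itself defaults to 128 MB blocks and three replicas and uses rack-aware placement.

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return the sizes of the blocks a file is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_blocks(n_blocks, nodes, replication=2):
    """Map each node to the block indices it stores.

    Hypothetical round-robin policy: block i goes to node i and to the
    following replication - 1 nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {node: [] for node in nodes}
    for block in range(n_blocks):
        for r in range(replication):
            node = nodes[(block + r) % len(nodes)]
            placement[node].append(block)
    return placement

blocks = split_into_blocks(400, 100)
nodes = ["Machine A", "Machine B", "Machine C", "Machine D"]
placement = place_blocks(len(blocks), nodes)

print(blocks)  # [100, 100, 100, 100]
for node, stored in placement.items():
    print(node, stored)
```

With two copies of every block, losing any single machine still leaves one copy of each part on another machine.
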
HADOOP – MAPREDUCE

Infrastructure

Master Node: TASK A

Slave Node   Slave Node   Slave Node   Slave Node
Machine A    Machine B    Machine C    Machine D
TASK A1      TASK A2      TASK A3      TASK A4

RESULT = A1 + A2 + A3 + A4 = OUTPUT_TASK A
HADOOP – MAPREDUCE PROCESSING/PROGRAMMING

INPUT
Malaysia, Saudi Arabia, Comoros. Bangladesh, Algeria, Malaysia. Comoros. Algeria, Saudi Arabia

SPLIT
1: Malaysia, Saudi Arabia, Comoros
2: Bangladesh, Algeria, Malaysia
3: Comoros
4: Algeria, Saudi Arabia

MAP
1: (Malaysia, 1) (Saudi Arabia, 1) (Comoros, 1)
2: (Bangladesh, 1) (Algeria, 1) (Malaysia, 1)
3: (Comoros, 1)
4: (Algeria, 1) (Saudi Arabia, 1)

SHUFFLE & SORT
(Malaysia, 1) (Malaysia, 1)
(Saudi Arabia, 1) (Saudi Arabia, 1)
(Comoros, 1) (Comoros, 1)
(Bangladesh, 1)
(Algeria, 1) (Algeria, 1)

REDUCE
Malaysia, 2
Saudi Arabia, 2
Comoros, 2
Bangladesh, 1
Algeria, 2
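
The same four stages can be traced in plain Python. This is a single-process sketch of the data flow only, not Hadoop code; the function names are illustrative, and splitting on full stops and commas is an assumption that matches this particular input.

```python
from collections import defaultdict

def split_input(text):
    """SPLIT: cut the input into one split per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]

def map_phase(split):
    """MAP: emit a (name, 1) pair for every comma-separated entry."""
    return [(w.strip(), 1) for w in split.split(",") if w.strip()]

def shuffle_and_sort(pairs):
    """SHUFFLE & SORT: group the values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """REDUCE: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

text = ("Malaysia, Saudi Arabia, Comoros. Bangladesh, Algeria, Malaysia. "
        "Comoros. Algeria, Saudi Arabia")
splits = split_input(text)
mapped = [pair for s in splits for pair in map_phase(s)]
counts = reduce_phase(shuffle_and_sort(mapped))

print(counts)
# {'Algeria': 2, 'Bangladesh': 1, 'Comoros': 2, 'Malaysia': 2, 'Saudi Arabia': 2}
```
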
HADOOP - YARN

YARN – Yet Another Resource Negotiator


[Diagram: CLIENT A, CLIENT B, CLIENT C and CLIENT D connect to YARN, which connects to three Node Managers, each hosting an App Master and a Container]
4. HADOOP ECOSYSTEM
ECOSYSTEM

Core Hadoop

Query Engines

External Data Storage

CORE HADOOP
PIG [Procedural Language Platform]

High-level scripting language

Complex data transformation without Java

Simple SQL-like scripting language called Pig Latin

Works with data from many sources, including structured and unstructured data

Stores the results in the Hadoop Distributed File System (HDFS)

Pig scripts are translated into a series of MapReduce jobs before execution
Components

Pig Latin script language
Procedural data flow language
Examples: LOAD, STORE, etc.

A runtime engine
Compiler producing sequences of MapReduce jobs
Parsing, validation and compilation into a sequence of MapReduce jobs
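Example

-- Load the input as rows with three fields
A = LOAD 'myfile' AS (x, y, z);
-- Keep only the rows where x is positive
B = FILTER A BY x > 0;
-- Group the remaining rows by x
C = GROUP B BY x;
-- Count the rows in each group
D = FOREACH C GENERATE group, COUNT(B);
-- Write the counts to the output location
STORE D INTO 'output';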
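
For readers more used to general-purpose languages, the filter, group and count steps of this script correspond roughly to the Python below. It is only a sketch of the data flow on in-memory tuples: the sample rows and the function name are made up for illustration, and nothing here runs on Hadoop.

```python
from collections import Counter

def filter_group_count(rows):
    """Keep rows with x > 0, group them by x, and count rows per group."""
    kept = [row for row in rows if row[0] > 0]   # FILTER ... BY x > 0
    return dict(Counter(x for x, _, _ in kept))  # GROUP ... BY x, then COUNT

# Hypothetical (x, y, z) rows standing in for the contents of 'myfile'.
rows = [(1, "a", 10), (2, "b", 20), (1, "c", 30), (-5, "d", 40), (0, "e", 50)]

print(filter_group_count(rows))  # {1: 2, 2: 1}
```
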
Data Model

Nested Model

HIVE

Data warehouse infrastructure tool for processing structured data in Hadoop

It stores the schema in a database and the processed data in HDFS

It is designed for OLAP (online analytical processing)

It provides an SQL-like query language called HiveQL or HQL

It is familiar, fast, scalable, and extensible
Architecture

Data Flow
Data Modeling

Tables
Same as in an RDBMS
Partitions
Parts of a table that hold rows sharing the same value of a partition key
Buckets
Smaller divisions of the data for efficient querying
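
How rows end up in partitions and buckets can be pictured with a small Python sketch. It is an analogy rather than Hive's implementation: the sample rows and column names are invented, and CRC32 is used only as a stand-in for whatever hash function Hive applies to the bucketing column.

```python
import zlib
from collections import defaultdict

def bucket_for(key, n_buckets):
    """Pick a bucket from a stable hash of the key (CRC32 as a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_buckets

def organise(rows, partition_col, bucket_col, n_buckets):
    """Group rows as table[partition value][bucket number] -> list of rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        part = row[partition_col]
        bucket = bucket_for(row[bucket_col], n_buckets)
        table[part][bucket].append(row)
    return table

# Hypothetical employee rows, partitioned by department and bucketed by id.
rows = [
    {"emp_id": 1, "dept": "sales"},
    {"emp_id": 2, "dept": "sales"},
    {"emp_id": 3, "dept": "hr"},
    {"emp_id": 4, "dept": "hr"},
    {"emp_id": 5, "dept": "sales"},
]
table = organise(rows, "dept", "emp_id", n_buckets=2)

for dept, buckets in sorted(table.items()):
    for bucket, members in sorted(buckets.items()):
        print(dept, bucket, [m["emp_id"] for m in members])
```

A query that filters on dept only needs to read one partition, and a lookup by emp_id only needs to read one bucket inside it.
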
Example

create database office;
show databases;
drop database office;          -- works only if the database is empty
drop database office cascade;  -- use this form if the database is not empty
create database office;
use office;

APACHE AMBARI

Provisioning, managing, and monitoring Apache Hadoop clusters

Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs

Provisioning:

Step-by-step wizard for installing Hadoop services across any number of hosts

Handles configuration of Hadoop services for the cluster

Managing:

Central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster

Monitoring:

Dashboard for monitoring health and status of the Hadoop cluster

Leverages Ambari Metrics System for metrics collection

Leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.)
Architecture

MESOS – Another Resource Negotiator

Example

MESOS vs YARN

                                 MESOS              YARN

Language                         C++                Java

Scheduler                        Non-Monolithic     Monolithic

Scheduling                       Memory & CPU       Memory

Scalability                      Highly Scalable    Less Scalable

Data Centre Management           Complete           Hadoop Job

Availability, Fault Tolerance    Multiple Masters   YARN Only

Security                         Trusted Entities   Multiple Layers
SPARK

Speed : Run workloads up to 100x faster than Hadoop MapReduce

Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine

Ease of Use : Program using Java, Scala, Python, R, and SQL

Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells

Generality : Combine SQL, streaming, and complex analytics

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application

Runs Everywhere : Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources
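
To give a feel for what scheduling a DAG means, the toy sketch below orders the stages of a job so that each stage runs only after the stages it depends on, and groups stages that could run side by side. It is not Spark's scheduler: the stage names and the schedule function are hypothetical, and it relies on graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each stage maps to the set of stages it depends on.
stages = {
    "load_logs": set(),
    "load_users": set(),
    "parse_logs": {"load_logs"},
    "join": {"parse_logs", "load_users"},
    "aggregate": {"join"},
}

def schedule(dag):
    """Return waves of stages; stages in one wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every stage whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

for wave in schedule(stages):
    print(wave)
# ['load_logs', 'load_users']
# ['parse_logs']
# ['join']
# ['aggregate']
```
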
Architecture

Standalone
Mesos
YARN
Kubernetes

BARAKALLAH FEEKUM!
Any questions?

Feel free to contact me using the designated channels
