5-Overview of Big Data Technologies - Hadoop

The document discusses big data technologies, including Hadoop, Spark, and NoSQL. It provides an overview of Hadoop, describing its core components like HDFS and MapReduce. It also discusses Spark and how it improves upon MapReduce. Additionally, it covers different NoSQL database categories like key-value, document, and column-based stores and compares their features to SQL-based databases. The goal is to help readers understand these technologies and how to evaluate options for solving big data problems.

Uploaded by

Wong pi wen

Big Data Analytics & Technologies
CT047-3-M

Overview of Big Data Technologies - Hadoop
Topic & Structure of The Lesson

• The lesson covers:
• Overview of Big Data Technologies
– Hadoop-HDFS
– Hadoop-MapReduce
– NoSQL

CT047-3-M-BDAT - Big Data Analytics & Technologies Overview of Big Data Technologies Slide <2> of 9
Learning Outcomes

• At the end of this topic, you should be able to:
– Demonstrate the theories involved in big data technologies
– Critically evaluate and present technology choices to solve real-world big data and data science problems

Key Terms You Must Be Able To Use
• If you have mastered this topic, you should be able to use the
following terms correctly in your assignments and exams:

• Hadoop MapReduce
• Key-value
• Document-based
• NOSQL
• RDBMS

What Technology Do We Have for Big Data?

Hadoop for Big Data

• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management framework with scale-out storage and distributed processing.

Source: https://hadoop.apache.org/
Hadoop Creation History

Hadoop: Assumptions
• It is written with large clusters of computers in mind and is
built around the following assumptions:
• Hardware will fail.
• Processing will be run in batches. Thus there is an emphasis
on high throughput as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Hadoop Features
• Scalable: Can reliably store and process petabytes.
• Cost effective: Distributes the data and processing across clusters of commonly available computers (in thousands).
• Efficient: By distributing the data, it can process in parallel on the nodes where the data is located.
• Flexible: Can easily access new data sources and tap into different types of data (structured and unstructured).
• Reliable: Automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Apache Hadoop – key components
• Hadoop Common: Common utilities
• (Storage Component) Hadoop Distributed File System (HDFS): A
distributed file system that provides high-throughput access
– Many other data storage approaches are also in use
– E.g., Apache Cassandra, Apache HBase, Apache Accumulo (NSA-contributed)
• (Scheduling) Hadoop YARN: A framework for job scheduling and
cluster resource management.
• (Processing) Hadoop MapReduce (MR2): A YARN-based system for
parallel processing of large data sets
– Other execution engines increasingly in use, e.g., Spark
• Note:
– All of these key components are OSS under Apache 2.0 license

David A. Wheeler
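The MapReduce processing model listed above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps on a single machine, not Hadoop's own API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands for one line of a large file.
counts = word_count(["big data big clusters", "data moves to compute"])
print(counts)
```

In a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and the shuffle would move the grouped pairs across the network to the reducers; here everything runs in one process so that the data flow is easy to follow.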
RDBMS vs. Hadoop

Source: Hadoop: The Definitive Guide
Apache Spark

• A newer general-purpose framework that addresses many of the shortcomings of MapReduce
• It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, …
• Offers many other operations, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, …
– (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine
learning algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• Spark API is extremely simple to use
• Developed at AMPLab UC Berkeley, now by Databricks.com
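To give a feel for what some of the listed operations compute, here is a sketch in plain Python. It is deliberately not the Spark API: the functions `flat_map`, `group_by_key`, `reduce_by_key` and `sort_by_key` are hypothetical single-machine stand-ins that mirror the meaning of flatMap, groupByKey, reduceByKey and sortByKey on small lists of (key, value) pairs.

```python
from collections import defaultdict
from functools import reduce


def flat_map(items, func):
    """Apply func to every item and flatten the resulting lists."""
    return [out for item in items for out in func(item)]


def group_by_key(pairs):
    """Collect all values that share a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, func):
    """Combine the values of each key with a two-argument function."""
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}


def sort_by_key(pairs):
    """Order (key, value) pairs by key."""
    return sorted(pairs, key=lambda kv: kv[0])


# Hypothetical input lines.
lines = ["spark hadoop", "spark hdfs yarn"]
words = flat_map(lines, str.split)
pairs = [(word, 1) for word in words]
totals = reduce_by_key(pairs, lambda a, b: a + b)
ordered = sort_by_key(totals.items())
unique_words = sorted(set(words))  # the idea behind "distinct"
print(ordered)  # [('hadoop', 1), ('hdfs', 1), ('spark', 2), ('yarn', 1)]
```

Spark applies the same ideas to data sets partitioned across a cluster and can keep intermediate results in memory between steps, which is where the advantage for iterative algorithms comes from.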


NOSQL

• The Name:
– Stands for Not Only SQL
– The term NOSQL was introduced by Carlo Strozzi in 1998 to name his file-based database
– It was reintroduced by Eric Evans when an event was organized to discuss open-source distributed databases
– Eric states that “… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …”

Key features (advantages)
– non-relational
– don’t require a schema
– data are replicated to multiple nodes (so, identical & fault-tolerant) and can be partitioned:
• down nodes easily replaced
• no single point of failure
– horizontally scalable
– cheap, easy to implement (open-source)
– massive write performance
– fast key-value access

Disadvantages

– Don’t fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
– No declarative query language (e.g., SQL) → more programming
– No easy integration with other applications that support SQL
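The "more programming" point can be made concrete with a small sketch. Assuming a hypothetical set of order records spread over two partitions, the application itself has to write the grouping and ordering that a single SQL statement (for example a GROUP BY with ORDER BY) would express declaratively. All names and figures below are invented for illustration.

```python
from collections import Counter

# Hypothetical data: order records stored on two partitions (e.g. two nodes).
partitions = [
    [{"customer": "ann", "amount": 30}, {"customer": "bob", "amount": 20}],
    [{"customer": "ann", "amount": 15}, {"customer": "cat", "amount": 40}],
]


def total_per_customer(partitions):
    """Group-by written by hand: aggregate inside each partition, then merge."""
    totals = Counter()
    for partition in partitions:
        local = Counter()
        for row in partition:
            local[row["customer"]] += row["amount"]
        totals.update(local)  # merge the per-partition partial sums
    return dict(totals)


totals = total_per_customer(partitions)
# Order-by written by hand: sort the merged result by total, largest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('ann', 45), ('cat', 40), ('bob', 20)]
```

The per-partition step corresponds to the operations that such stores can still do "within partitions"; the merge across partitions is the part the application has to take over.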

Who is using them?

NOSQL categories

1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4J, InfoGrid
• “No-schema” is a common characteristic of most NOSQL storage systems
• Provide “flexible” data types
Key-value

• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon’s Dynamo paper
• Data model: (global) collection of key-value pairs
• Dynamo ring partitioning and replication
• Example: (DynamoDB)
– items having one or more attributes (name, value)
– an attribute can be single-valued or multi-valued, like a set
– items are combined into a table
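Ring partitioning and replication can be sketched as a small consistent-hashing ring. This is a simplified illustration under stated assumptions, not Dynamo's actual implementation: the class `HashRing`, the node names and the use of MD5 as the ring hash are all choices made for this example, and virtual nodes are left out.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a position on the ring with a stable hash."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical ring: each node owns the arc that ends at its position."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_position(node), node) for node in nodes)
        self.positions = [position for position, _ in self.ring]

    def nodes_for(self, key):
        """Return the first node clockwise from the key plus its successors."""
        start = bisect.bisect_right(self.positions, ring_position(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")
print(owners)  # two distinct nodes: the primary and one replica
```

Because a key's position depends only on its hash, adding or removing one node changes the owner of the keys on the neighbouring arc only, which is what makes replacing a failed node cheap.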
Key-value

• Basic API access:
– get(key): extract the value given a key
– put(key, value): create or update the value given its key
– delete(key): remove the key and its associated value
– execute(key, operation, parameters): invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
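The four calls above can be sketched as a tiny in-memory store. The class `KeyValueStore` is a hypothetical single-process stand-in, not the client API of any particular product; `execute` is modelled here, as an assumption, by calling a method of the stored Python object by name.

```python
class KeyValueStore:
    """Hypothetical in-memory store exposing get, put, delete and execute."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return the value stored under key, or None if the key is absent."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or overwrite the value stored under key."""
        self._data[key] = value

    def delete(self, key):
        """Remove the key and its value; a missing key is ignored."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Invoke a named operation on a structured value (list, set, dict)."""
        value = self._data[key]
        return getattr(value, operation)(*parameters)


store = KeyValueStore()
store.put("cart:7", ["milk"])
store.execute("cart:7", "append", "bread")  # list operation on the stored value
store.put("tags:7", {"new"})
store.execute("tags:7", "add", "vip")       # set operation on the stored value
store.delete("tags:7")
print(store.get("cart:7"))  # ['milk', 'bread']
print(store.get("tags:7"))  # None
```

Note that every access goes through the key: there is no way to ask for "all carts containing bread" without scanning, which is the trade-off behind the simple data model.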

Key-value

Pros:
– very fast
– very scalable (horizontally distributed to nodes based on
key)
– simple data model
– eventual consistency
– fault-tolerance

Cons:
– Can’t model more complex data structures such as objects
Document-based

• Can model more complex objects
• Inspired by Lotus Notes
• Data model: collection of documents
• Document: JSON (JavaScript Object Notation, a data model of key-value pairs that supports objects, records, structs, lists, arrays, maps, dates and Booleans, with nesting), XML, or other semi-structured formats.

Document-based

• Example: (MongoDB) document
– {Name: "Jaroslav",
Address: "Malostranske nám. 25, 118 00 Praha 1",
Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1"},
Phones: ["123-456-7890", "234-567-8963"]
}
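The same document can be written as a Python dictionary and round-tripped through JSON to show how nesting and lists are addressed. This sketch uses only the standard `json` module and does not talk to MongoDB; the variable names are chosen for this example.

```python
import json

# The example document, with a nested object and a list.
document = {
    "Name": "Jaroslav",
    "Address": "Malostranske nám. 25, 118 00 Praha 1",
    "Grandchildren": {"Claire": "7", "Barbara": "6", "Magda": "3",
                      "Kirsten": "1", "Otis": "3", "Richard": "1"},
    "Phones": ["123-456-7890", "234-567-8963"],
}

# Serialize to JSON text and parse it back, as a document store would on write and read.
serialized = json.dumps(document, ensure_ascii=False)
restored = json.loads(serialized)

first_phone = restored["Phones"][0]  # index into a list field
youngest = [name for name, age in restored["Grandchildren"].items() if age == "1"]
print(first_phone)  # 123-456-7890
print(youngest)     # ['Kirsten', 'Richard']
```

Unlike a plain key-value pair, the value here has structure the store can see, so fields such as `Phones` or `Grandchildren` can be read individually.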

Column-based
• Based on Google’s BigTable paper
• Like column-oriented relational databases (store data in column order)
• Tables similar to those of an RDBMS, but able to handle semi-structured data
• Data model:
– Collection of column families
– Column family = (key, value), where value = set of related columns (standard, super)
– Indexed by row key, column key and timestamp
• Allow key-value pairs to be stored (and retrieved on key) in a massively parallel system
– Storing principle: big hashed distributed tables
– Properties: partitioning (horizontally and/or vertically), high availability, etc.
– Completely transparent to the application

* Better: extendible records
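The data model above (values indexed by row key, column key and timestamp, grouped into column families) can be sketched with nested dictionaries. The class `ColumnFamilyTable` and its methods are hypothetical and run in a single process; they only illustrate the indexing scheme, not how BigTable, Cassandra or HBase store data on disk.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Hypothetical table: row key -> column family -> column -> versions."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, row_key, family, column, value, timestamp):
        """Store one timestamped version of a cell."""
        self._rows[row_key][family][column].append((timestamp, value))

    def get(self, row_key, family, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self._rows.get(row_key, {}).get(family, {}).get(column, [])
        candidates = [(ts, v) for ts, v in versions if timestamp is None or ts <= timestamp]
        return max(candidates, key=lambda tv: tv[0])[1] if candidates else None

    def read_family(self, row_key, family):
        """Fetch one column family of a row with the latest value per column."""
        columns = self._rows.get(row_key, {}).get(family, {})
        return {col: max(vers, key=lambda tv: tv[0])[1] for col, vers in columns.items()}


table = ColumnFamilyTable()
table.put("user1", "profile", "name", "Ann", timestamp=1)
table.put("user1", "profile", "name", "Ann B.", timestamp=5)
table.put("user1", "contact", "email", "ann@example.com", timestamp=2)

print(table.get("user1", "profile", "name"))               # Ann B.
print(table.get("user1", "profile", "name", timestamp=3))  # Ann
print(table.read_family("user1", "profile"))               # {'name': 'Ann B.'}
print(table.get("user1", "profile", "age"))                # None (cell never written)
```

Cells that were never written simply do not exist, so each row can carry a different set of columns without any schema change.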

Column-based
• One column family can have variable numbers of columns
• Cells within a column family are sorted “physically”
• Very sparse, most cells have null values
• Comparison: RDBMS vs column-based NOSQL
– Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue together
• Column-based NOSQL: only fetch column families of those columns that are required by a query (all columns in a column family are stored together on the disk, so multiple rows can be retrieved in one read operation → data locality)

Graph-based

• Focus on modeling the structure of data (interconnectivity)
• Scales to the complexity of data
• Inspired by mathematical Graph Theory (G=(E,V))
• Data model:
– (Property Graph) nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles
– Key-value pairs on both
• Interfaces and query languages vary
• Single-step vs path expressions vs full recursion
• Example:
– Neo4j, FlockDB, Pregel, InfoGrid …
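A property graph and the difference between a single-step query and a fully recursive one can be sketched as follows. The class `PropertyGraph`, its methods and the sample data are hypothetical; real graph databases expose their own interfaces and query languages, as noted above.

```python
from collections import deque


class PropertyGraph:
    """Hypothetical property graph: nodes and edges both carry key-value pairs."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, label, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target, **properties):
        self.edges.append((source, label, target, properties))

    def neighbors(self, node_id, label=None):
        """Single-step query: targets of outgoing edges, optionally by label."""
        return [target for source, lbl, target, _ in self.edges
                if source == node_id and (label is None or lbl == label)]

    def reachable(self, start, label=None):
        """Recursive query: all nodes reachable by following edges repeatedly."""
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in self.neighbors(queue.popleft(), label):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen - {start}


graph = PropertyGraph()
graph.add_node("ann", role="analyst")
graph.add_node("bob", role="engineer")
graph.add_node("cat", role="manager")
graph.add_node("acme", kind="company")
graph.add_edge("ann", "KNOWS", "bob", since=2019)
graph.add_edge("bob", "KNOWS", "cat", since=2021)
graph.add_edge("ann", "WORKS_AT", "acme")

print(graph.neighbors("ann", "KNOWS"))          # ['bob']
print(sorted(graph.reachable("ann", "KNOWS")))  # ['bob', 'cat']
```

The recursive query is a breadth-first traversal; it is the kind of question ("who is connected to whom, at any distance") that is awkward to express with joins over relational tables.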

Apache Hive

• “Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy”

Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive

• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-like language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
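To illustrate the kind of OLAP-style summary query that an SQL-like language makes easy, the sketch below runs a GROUP BY aggregation from Python. It uses the standard library's `sqlite3` module purely as a local stand-in so that the example is runnable; it does not connect to Hive, and the table `page_views` and its rows are invented. The SELECT statement uses only basic SQL constructs (SUM, GROUP BY, ORDER BY), which are assumed here to carry over to HiveQL.

```python
import sqlite3

# Stand-in database held in memory; with Hive the data would live in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("MY", 120), ("SG", 80), ("MY", 30), ("ID", 50)],  # hypothetical rows
)

# Summarize views per country, largest total first.
rows = conn.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY total_views DESC"
).fetchall()
conn.close()

print(rows)  # [('MY', 150), ('SG', 80), ('ID', 50)]
```

The point of the comparison is the query text rather than the engine: the analyst states what summary is wanted, and the system decides how to scan and aggregate the data.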

SQL-on-Hadoop

• Enables the use of SQL commands in Hadoop for accessing and processing big data.
• The Hive data warehouse is one of the earliest applications made to integrate SQL with Hadoop.
• Some other examples of such applications are Apache Drill, Apache Spark, H-SQL, BigSQL and Tez.
Quick Review Question

• What technology do we have for big data?
• Explain the difference between NoSQL and relational databases.
• Explain the categories of NoSQL.

Summary of Main Teaching Points

• Hadoop for Big Data


• Key features of NoSQL
• NOSQL categories
• SQL-on-Hadoop

Question and Answer Session

What we will cover next

• Hadoop – HDFS and MapReduce


– Hadoop Framework
– HDFS file system
– Hadoop Map reduce
– Hadoop Streaming
