
HADOOP ARCHITECTURE DESIGN
Putting the pieces together

Working backwards

■ Start with the end user’s needs, not with where your data comes from
– Sometimes you need to meet in the middle
■ What sort of access patterns do you anticipate from your end users?
– Analytical queries that span large date ranges?
– Huge amounts of small transactions for very specific rows of data?
– Both?
■ What availability do these users demand?
■ What consistency do these users demand?
Thinking about requirements

■ Just how big is your big data?
– Do you really need a cluster?
■ How much internal infrastructure and expertise is available?
– Should you use AWS or something similar?
– Do systems you already know fit the bill?
■ What about data retention?
– Do you need to keep data around forever, for auditing?
– Or do you need to purge it often, for privacy?
■ What about security?
– Check with Legal
More requirements to understand

■ Latency
– How quickly do end users need to get a response?
■ Milliseconds? Then something like HBase or Cassandra will be needed
■ Timeliness
– Can queries be based on day-old data? Minute-old?
■ Oozie-scheduled jobs in Hive / Pig / Spark etc may cut it
– Or must it be near-real-time?
■ Use Spark Streaming / Storm / Flink with Kafka or Flume
Judicious future-proofing

■ Once you decide where to store your “big data”, moving it will be really difficult later on
– Think carefully before choosing proprietary solutions or cloud-based storage
■ Will business analysts want your data in addition to end users (or vice versa)?
Cheat to win

■ Does your organization have existing components you can use?
– Don’t build a new data warehouse if you already have one!
– Rebuilding existing technology always has negative business value
■ What’s the least amount of infrastructure you need to build?
– Import existing data with Sqoop etc. if you can
– If relaxing a “requirement” saves lots of time and money – at least ask
EXAMPLE: TOP SELLERS
Designing a system to keep track of top-selling items

What we want to build

■ A system to track and display the top 10 best-selling items on an e-commerce website
What are our requirements? Work backwards!

■ There are millions of end users, generating thousands of queries per second
– It MUST be fast – page latency is important
– So, we need some distributed NoSQL solution
– Access pattern is simple: “Give me the current top N sellers in category X”
■ Hourly updates probably good enough (consistency not hugely important)
■ Must be highly available (customers don’t like broken websites)
■ So – we want partition-tolerance and availability more than consistency
Sounds like Cassandra

But how does data get into Cassandra?

■ Spark can talk to Cassandra…
■ And Spark Streaming can add things up over windows
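As a rough sketch of that idea, assuming a Kafka source (the next slide weighs Kafka against Flume), placeholder broker/topic/keyspace/table names, and the DataStax Spark Cassandra Connector on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

# Placeholder host, topic, keyspace, and table names throughout.
spark = (SparkSession.builder
         .appName("TopSellers")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

purchases = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "purchases")
             .load())

# Count purchases per item over a sliding one-hour window.
counts = (purchases
          .selectExpr("CAST(value AS STRING) AS item_id", "timestamp")
          .groupBy(window(col("timestamp"), "1 hour", "5 minutes"), col("item_id"))
          .count())

# Push each micro-batch of updated counts out to Cassandra
# (Cassandra writes are upserts, so append mode works here).
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(table="top_sellers", keyspace="store")
     .mode("append")
     .save())

counts.writeStream.outputMode("update").foreachBatch(write_to_cassandra).start()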
OK, how does data get into Spark Streaming?

■ Kafka or Flume – either works
■ Flume is purpose-built for HDFS, which so far we haven’t said we need
■ But Flume is also purpose-built for log ingestion, so it may be a good choice
– Log4j interceptor on the servers that process purchases?
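If you went the Kafka route instead, the producer side is simple. A hedged sketch using the kafka-python package; the broker address, topic name, and event fields are all assumptions:

from kafka import KafkaProducer
import json

# Hypothetical broker and topic; field names are illustrative only.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Emit one event per completed purchase.
producer.send("purchases", {"item_id": "B00123", "category": "books", "qty": 1})
producer.flush()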
Don’t forget about security

■ Purchase data is sensitive – get a security review
– Blasting around raw logs that include PII* is probably a really bad idea
– Strip out data you don’t need at the source
■ Security considerations may even force you into a totally different design
– Instead of ingesting logs as they are generated, some intermediate database or publisher may be involved where PII is scrubbed

*Personally Identifiable Information
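To make “strip it at the source” concrete, here is a minimal sketch of a scrubber that could run before logs ever leave the purchase servers. The patterns are illustrative only; a real security review would define PII far more carefully:

import re

# Illustrative patterns only; nowhere near exhaustive for real PII.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def scrub(line):
    """Redact obvious PII from a log line before it is shipped anywhere."""
    line = EMAIL.sub("<email>", line)
    line = CARD.sub("<card>", line)
    return line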


So, something like this might work:
Interestingly, you *could* build this without Hadoop at all

Purchase servers -> Flume -> Spark Streaming -> Cassandra -> Web app servers
(with Zookeeper coordinating the cluster)
But there’s more than one way to do it.

■ Maybe you have an existing purchase database
– Instead of streaming, hourly batch jobs would also meet your requirements
– Use Sqoop + Spark -> Cassandra perhaps?
■ Maybe you have in-house expertise to leverage
– Using HBase, MongoDB, or even Redis instead of Cassandra would probably be OK.
– Using Kafka instead of Flume – totally OK.
■ Do people need this data for analytical purposes too?
– Might consider storing on HDFS in addition to Cassandra.
EXAMPLE: MOVIE RECOMMENDATIONS
Other movies you may like…

Working backwards

■ Users want to discover movies they haven’t yet seen that they might enjoy
■ Their own behavior (ratings, purchases, views) is probably the best predictor
■ As before, availability and partition-tolerance are important. Consistency not so much.
Cassandra’s our first choice

■ But any NoSQL approach would do these days

How do movie recommendations get into Cassandra?
■ We need to do machine learning
– Spark MLlib
– Flink could also be an alternative
■ Timeliness requirements need to be thought out
– Real-time ML is a tall order – do you really need recommendations based on the rating you just left?
– That kinda would be nice.
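For reference, a batch MLlib job of the kind just mentioned might look like the sketch below; the input path and column names are assumptions. The next slide argues that precomputing recommendations for every user this way has problems.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("MovieRecs").getOrCreate()

# Placeholder path; assumes userId, movieId, and rating columns.
ratings = spark.read.parquet("hdfs:///data/ratings")

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Precompute top-10 recommendations for every user.
recs = model.recommendForAllUsers(10)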
Creative thinking

■ Pre-computing recommendations for every user
– Isn’t timely
– Wastes resources
■ Item-based collaborative filtering
– Store movies similar to other movies (these relationships don’t change quickly)
– At runtime, recommend movies similar to ones you’ve liked (based on real-time behavior data)
■ So we need something that can quickly look up movies similar to ones you’ve liked, at scale
– Could reside within the web app, but you probably want your own service for this
■ We also need to quickly get at your past ratings / views / etc.
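The runtime side of item-based collaborative filtering is cheap. A minimal sketch, assuming the user’s ratings and a movie-to-similar-movies mapping have already been fetched from a fast store:

from collections import defaultdict

def recommend(user_ratings, similar_movies, n=10):
    """user_ratings: {movie_id: rating}; similar_movies: {movie_id: [(other_id, score)]}."""
    scores = defaultdict(float)
    for movie_id, rating in user_ratings.items():
        for candidate, similarity in similar_movies.get(movie_id, []):
            if candidate not in user_ratings:  # skip movies they've already seen
                scores[candidate] += similarity * rating
    # Highest combined (similarity * rating) score first.
    return sorted(scores, key=scores.get, reverse=True)[:n]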
OK then.

■ So we’ll have some web service to create recommendations on demand
■ It’ll talk to a fast NoSQL data store with movie similarity data
■ And it also needs your past ratings / purchases / etc.
■ Movie similarities (which are expensive to compute) can be updated infrequently, based on log data with views / ratings / etc.
Something like this might work:

Web app servers -> Flume (behavior data) -> Spark Streaming -> HBase (user ratings)
HDFS -> Spark / MLlib (Oozie-scheduled) -> HBase (movie similarities)
HBase -> Recs service (on YARN / Slider?) -> Web app servers
EXERCISE: DESIGN WEB ANALYTICS
Track the number of sessions per day on a website

Your mission…

■ You work for a big website
■ Some manager wants a graph of the total number of sessions per day
■ And for some reason they don’t want to use an existing service for this!
Requirements

■ Only run daily, based on the previous day’s activity
■ Sessions are defined as traffic from the same IP address within a sliding one-hour window
– Hint: Spark Streaming etc. can handle “stateful” data like this (see the sketch below)
■ Let’s assume your existing web logs do not have session data in them
■ Data is only used for analytic purposes, internally
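To make that hint concrete: Spark Structured Streaming (3.2+) has session windows built in, which match this definition of a session almost exactly. A hedged sketch; the Kafka source, field names, and paths are all assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import session_window, col

spark = SparkSession.builder.appName("Sessionize").getOrCreate()

# Placeholder source: assume parsed web log events keyed by IP, with a timestamp.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "weblogs")
          .load()
          .selectExpr("CAST(key AS STRING) AS ip", "timestamp"))

# A session = events from one IP with no gap longer than an hour between them.
sessions = (events
            .withWatermark("timestamp", "1 hour")
            .groupBy(session_window(col("timestamp"), "1 hour"), col("ip"))
            .count())

# Append completed sessions to HDFS, where the daily query can pick them up.
(sessions.writeStream
 .format("parquet")
 .option("path", "hdfs:///sessions")
 .option("checkpointLocation", "hdfs:///checkpoints/sessions")
 .outputMode("append")
 .start())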
How would you do it?

■ Things to consider:
– A daily SQL query run automatically is all you really need
– But this query needs some table that contains session data
■ And that will need to be built up throughout the day
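That daily query could be as simple as the sketch below, run from an Oozie-scheduled Spark job against a Hive table; the table and column names are assumptions:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DailySessionCount")
         .enableHiveSupport()
         .getOrCreate())

# Count yesterday's sessions from a hypothetical 'sessions' table.
spark.sql("""
    SELECT to_date(session_start) AS day, COUNT(*) AS sessions
    FROM sessions
    WHERE to_date(session_start) = date_sub(current_date(), 1)
    GROUP BY to_date(session_start)
""").show()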
EXERCISE: (A) SOLUTION
One way to solve the daily session count problem

One way to do it:

Web servers -> Kafka -> Spark Streaming -> HDFS -> Hive (daily query, scheduled by Oozie)
There’s no “right answer.”

■ And it depends on a lot of things
– Have an existing sessions database that’s updated daily? Just use Sqoop to get at it
– In fact, then you might not even need Hive / HDFS.
