Hadoop Design
ARCHITECTURE DESIGN
Putting the pieces together
Working backwards
■ Start with the end user’s needs, not with where your data is coming from
– Sometimes you need to meet in the middle
■ What sort of access patterns do you anticipate from your end users?
– Analytical queries that span large date ranges?
– Huge amounts of small transactions for very specific rows of data?
– Both?
■ What availability do these users demand?
■ What consistency do these users demand?
Thinking about requirements
■ Latency
– How quickly do end users need to get a response?
■ Milliseconds? Then something like HBase or Cassandra will be needed
■ Timeliness
– Can queries be based on day-old data? Minute-old?
■ Oozie-scheduled jobs in Hive / Pig / Spark etc. may cut it
– Or must it be near-real-time?
■ Use Spark Streaming / Storm / Flink with Kafka or Flume
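As a rough illustration, the latency and timeliness questions above can be written down as a small decision helper. This is only a sketch: the `Requirements` fields, the function name, and both numeric cutoffs are assumptions made for the example, not rules from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    # Both fields are hypothetical knobs for this sketch.
    response_latency_ms: float  # how quickly an end user needs a response
    max_data_age_s: float       # how old the data behind a query may be


def suggest_components(req: Requirements) -> List[str]:
    """Map requirements to the tool families named in the bullets above."""
    suggestions = []
    # Assumed cutoff: sub-second responses call for a low-latency store.
    if req.response_latency_ms < 1000:
        suggestions.append("serving store: HBase or Cassandra")
    # Assumed cutoff: data allowed to be an hour old or more can be batch-built.
    if req.max_data_age_s >= 3600:
        suggestions.append("batch: Oozie-scheduled Hive / Pig / Spark jobs")
    else:
        suggestions.append(
            "streaming: Spark Streaming / Storm / Flink with Kafka or Flume"
        )
    return suggestions
```

For example, `suggest_components(Requirements(5, 86400))` suggests a serving store plus scheduled batch jobs, which is roughly the combination the later slides arrive at.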
Judicious future-proofing
■ Once you decide where to store your “big data”, moving it will be really difficult later on
– Think carefully before choosing proprietary solutions or cloud-based storage
■ Will business analysts want your data in addition to end users (or vice versa)?
Cheat to win
[Diagram: Zookeeper; behavior data; Flume; Spark Streaming; Oozie; Spark / MLLib; recs service (on YARN/Slider?); HBase (user ratings and movie similarities); HDFS]
EXERCISE: DESIGN WEB ANALYTICS
Track number of sessions per day on a website
Your mission…
■ Things to consider:
– A daily SQL query run automatically is all you really need
– But this query needs some table that contains session data
■ And that will need to be built up throughout the day
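The session data mentioned above has to be derived from raw page hits. One common way to do that is sketched below in plain Python, assuming each hit is a `(visitor_id, timestamp)` pair and that 30 minutes of inactivity ends a session. The function name, the input shape, and the timeout are all assumptions for illustration; the slides do not define a session.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity cutoff


def sessions_per_day(hits):
    """Count sessions per calendar day from (visitor_id, timestamp) hits.

    A new session starts when a visitor has no earlier hit, or when the gap
    since their previous hit exceeds SESSION_TIMEOUT. Each session is
    credited to the day on which it started.
    """
    last_seen = {}
    counts = Counter()
    # Process hits in time order so gaps are measured correctly.
    for visitor, ts in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous > SESSION_TIMEOUT:
            counts[ts.date()] += 1
        last_seen[visitor] = ts
    return dict(counts)
```

In a real pipeline the same logic would run incrementally as hits arrive during the day, rather than over a complete in-memory list.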
EXERCISE: (A) SOLUTION
One way to solve the daily session count problem.
One way to do it.
[Diagram: web servers; Kafka; Spark Streaming; Oozie; Hive; HDFS]
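The scheduled half of this design is just a daily SQL query over a session table. A minimal sketch of that query follows, using Python's built-in `sqlite3` as a stand-in for Hive so the example runs anywhere. The `sessions` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per session, with its start time as ISO text.
DAILY_COUNT_SQL = """
SELECT date(start_time) AS day, COUNT(*) AS num_sessions
FROM sessions
GROUP BY date(start_time)
ORDER BY day
"""


def make_demo_db():
    """Build an in-memory table with a few made-up sessions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (session_id TEXT, start_time TEXT)")
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?)",
        [
            ("s1", "2024-01-01 10:00:00"),
            ("s2", "2024-01-01 23:50:00"),
            ("s3", "2024-01-02 09:00:00"),
        ],
    )
    return conn


def daily_session_counts(conn):
    """Run the daily query and return (day, num_sessions) rows."""
    return conn.execute(DAILY_COUNT_SQL).fetchall()
```

On Hive the query would look much the same, though the date function differs (Hive offers `to_date()` rather than SQLite's `date()`), and a scheduler such as Oozie would run it once a day.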
There’s no “right answer.”