Big Data Analytics: Practical Through Practice
Prerequisite:
Students are expected to have prior knowledge of:
Hardware:
1. Every student should have a machine with at least 16 GB RAM, 6 CPU cores, and 60 GB of free storage.
Session -1
● Distributed compute & storage and how it differs from traditional databases.
● Vertical scaling vs. horizontal scaling.
● Introduction to Hadoop (HDFS, MapReduce), Hive & Spark.
● The classic MapReduce word count program (see the sketch after this list).
● Creating Linux virtual machines and understanding different network configurations.
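For reference, here is a minimal sketch of the classic word count written as a Hadoop Streaming job in Python; the two scripts below (mapper.py and reducer.py) are illustrative, and a compiled Java MapReduce version works the same way conceptually.

    # mapper.py - reads lines from stdin and emits "word<TAB>1" for every word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - Hadoop Streaming sorts mapper output by key, so equal words arrive together
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The two scripts would be submitted together with the hadoop-streaming jar, passing mapper.py as the mapper and reducer.py as the reducer.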
Assignment:
1. Try to set up a 2-node HDFS cluster on your own. Resource
2. Understand SSH & ssh-keygen. By the next session you should be able to log in
to the VMs from the host machine without a password.
Session -2
● Setting up a 2-node HDFS cluster (with YARN).
● The different services in HDFS and their responsibilities.
● Web UIs of the different services.
Assignment: Ingest the 10 GB Stack Overflow dataset into Microsoft SQL Server (you
need to set it up on your own) and then ingest the data into HDFS; a sketch follows below.
Resource1, Resource2, Resource3, Resource4
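One possible route for the HDFS half of the assignment is Spark's JDBC reader; the sketch below assumes the Microsoft SQL Server JDBC driver jar is on the Spark classpath (e.g. via --jars), and the host, database, table and credential values are placeholders.

    from pyspark.sql import SparkSession

    # Read one table from SQL Server over JDBC and land it in HDFS as Parquet.
    spark = SparkSession.builder.appName("so-ingest").getOrCreate()

    posts = (spark.read.format("jdbc")
             .option("url", "jdbc:sqlserver://localhost:1433;databaseName=stackoverflow")
             .option("dbtable", "dbo.Posts")
             .option("user", "sa")
             .option("password", "<password>")
             .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
             .load())

    posts.write.mode("overwrite").parquet("hdfs:///data/stackoverflow/posts")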
Session -4
● Ingesting Stack Overflow data into HDFS and understanding the Stack Overflow data
model.
Assignment:
1. OLAP vs OLTP?
2. Understanding normalisation, slowly changing dimensions, data clustering, and
data partitioning. Resource
3. Fact vs dimension tables.
Session -5
● Jupyter notebook setup.
● Introduction to PySpark in a Jupyter notebook (without YARN) and writing basic
transformations on Stack Overflow data (see the sketch after this list).
● Using PySpark with YARN and writing basic transformations.
● Spark UI and the lineage graph / execution plan.
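For orientation, here is a minimal sketch of the kind of basic transformations covered in this session, assuming the posts were landed as Parquet (the path is a placeholder) and using column names from the public Stack Overflow dump (PostTypeId, Tags, Score).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("so-transforms").getOrCreate()

    # Adjust the path and column names to match your own ingestion.
    posts = spark.read.parquet("hdfs:///data/stackoverflow/posts")

    questions = posts.filter(F.col("PostTypeId") == 1)

    # Split the "<tag1><tag2>" format into rows and aggregate per tag.
    top_tags = (questions
                .withColumn("tag", F.explode(F.split(F.regexp_replace("Tags", "[<>]", " "), "\\s+")))
                .filter(F.col("tag") != "")
                .groupBy("tag")
                .agg(F.count("*").alias("question_count"), F.avg("Score").alias("avg_score"))
                .orderBy(F.desc("question_count")))

    top_tags.show(20, truncate=False)
    # top_tags.explain() prints the execution plan discussed alongside the Spark UI.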
Assignment:
1. What is a Spark standalone cluster vs. YARN?
2. What is Spark client mode vs. cluster mode? Execute a Spark job in both
modes and report your observations.
3. What is the difference between using Spark with Scala vs. PySpark?
Session -6
● Setting up Apache Toree for ease of access.
● Writing Spark batch jobs to answer the business questions below.
○ To be filled……
Assignment: Come up with an OLAP data model for Stack Overflow and update
the Spark jobs created earlier to work with your updated data model.
Session -7
● Random peer review of the updated data model and the reasoning behind the decisions taken
(each student will explain their updated ER diagram and give supporting points or
reasons for their changes).
● Students should also try to show the Spark job performance improvement achieved with their
new OLAP design.
Mini Project:
A smart home contains a fan and a temperature sensor. The goal is to turn the fan
on/off from a smartphone based on the temperature value, using an MQTT broker for
communication between the smart home and the mobile device (a sketch follows after the
resources below). To be completed before Session -10.
Outcome: Students will understand why streaming matters for big data and IoT use
cases. MQTT is a lightweight alternative to a Kafka broker, and this knowledge helps
students connect the dots and understand streaming much better.
Resources:
Setting up mqtt broker in AWS instance
Nodemcu + micropython + DHT
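As a starting point, here is a minimal sketch of the control loop using the paho-mqtt client; the broker host, topic names and the 30 °C threshold are all illustrative assumptions, and the callback style follows paho-mqtt 1.x.

    import paho.mqtt.client as mqtt

    # Subscribe to the temperature topic and publish fan commands back.
    BROKER = "broker.example.com"
    TEMP_TOPIC = "home/livingroom/temperature"
    FAN_TOPIC = "home/livingroom/fan"

    def on_connect(client, userdata, flags, rc):
        client.subscribe(TEMP_TOPIC)

    def on_message(client, userdata, msg):
        temperature = float(msg.payload.decode())
        client.publish(FAN_TOPIC, "ON" if temperature > 30.0 else "OFF")

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect(BROKER, 1883)
    client.loop_forever()

On the device side, the NodeMCU would publish its DHT sensor readings to the temperature topic and subscribe to the fan topic; the mobile app can subscribe to both.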
Session -8
● Introduction to Hive.
● Setting up the Hive server.
● Creating Hive external tables for the Stack Overflow databases (a sketch follows below).
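A minimal sketch of an external table definition, issued here through Spark SQL with Hive support enabled (the same DDL can also be run from the Hive CLI or Beeline); the database name, columns and HDFS location are placeholders matching the earlier ingestion sketch, and a running Hive metastore is assumed.

    from pyspark.sql import SparkSession

    # Register an external Hive table over the Parquet files already in HDFS.
    spark = (SparkSession.builder
             .appName("so-hive")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS stackoverflow")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS stackoverflow.posts (
            Id BIGINT, PostTypeId INT, Score INT, Tags STRING, CreationDate TIMESTAMP
        )
        STORED AS PARQUET
        LOCATION 'hdfs:///data/stackoverflow/posts'
    """)

    spark.sql("SELECT COUNT(*) FROM stackoverflow.posts").show()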
Assignment:
1. What is a JDBC server and what is a JDBC connection string?
2. Create partitions on the Stack Overflow Hive tables.
3. Rewrite the Spark jobs from Session -5 to read and write data through Hive instead of
directly against HDFS.
Session -9
● Introduction to streaming.
● Setting up a Kafka cluster with ZooKeeper.
Assignment: Create a Python module that consumes from the Twitter API and writes into
a SQLite database (a sketch follows below). resource
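A minimal sketch of such a module, assuming tweepy as the Twitter client; the bearer token, search query and database path are placeholders.

    import sqlite3
    import tweepy  # one possible Twitter client library

    def fetch_and_store(bearer_token, query="#bigdata", db_path="tweets.db"):
        """Pull recent tweets for a query and persist them into a local SQLite table."""
        client = tweepy.Client(bearer_token=bearer_token)
        response = client.search_recent_tweets(query=query, max_results=100)

        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")
        for tweet in response.data or []:
            conn.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?)", (tweet.id, tweet.text))
        conn.commit()
        conn.close()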
Session -10
● Introduction to the Twitter API and setup.
● Creating a Kafka topic and consuming it with the Kafka CLI (a Python alternative is sketched below).
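The session itself uses the Kafka CLI tools; for students who prefer to script these steps, below is a rough Python equivalent using the kafka-python package (broker address and topic name are placeholders).

    from kafka import KafkaConsumer
    from kafka.admin import KafkaAdminClient, NewTopic

    # Python equivalent of the kafka-topics.sh / kafka-console-consumer.sh steps.
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([NewTopic(name="tweets", num_partitions=3, replication_factor=1)])

    consumer = KafkaConsumer("tweets",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value.decode("utf-8"))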
Session -11
● Introduction to Spark streaming jobs.
● Ingesting Twitter data into Kafka topics and processing it using Spark streaming jobs.
Assignment: Create a Python job that ingests the temperature data from the MQTT broker into a
Kafka topic and writes the average temperature per hour into a Hive table (a sketch of the
aggregation follows below).
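A minimal sketch of the hourly-average half of the assignment with Spark Structured Streaming, assuming the MQTT-to-Kafka bridge (similar to the mini-project client) is already publishing plain numeric readings to a Kafka topic named temperature; the broker, topic, checkpoint path and table name are placeholders, and writeStream.toTable needs Spark 3.1+.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("temp-hourly-avg")
             .enableHiveSupport()
             .getOrCreate())

    # Read raw readings from Kafka; each message value is a plain numeric string.
    readings = (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "temperature")
                .load()
                .select(F.col("value").cast("string").cast("double").alias("temperature"),
                        F.col("timestamp")))

    # One-hour tumbling windows give the average temperature per hour.
    hourly_avg = (readings
                  .withWatermark("timestamp", "1 hour")
                  .groupBy(F.window("timestamp", "1 hour"))
                  .agg(F.avg("temperature").alias("avg_temperature")))

    # toTable needs Spark 3.1+; on older versions use foreachBatch with saveAsTable instead.
    query = (hourly_avg.writeStream
             .outputMode("append")
             .option("checkpointLocation", "hdfs:///checkpoints/temp_hourly")
             .toTable("default.temperature_hourly"))
    query.awaitTermination()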
Session -12
● Writing Spark streaming jobs for real-time analytics; below is a list of analytics
business questions.
○ To be filled…..
Assignment / Project Idea: Come up with use cases or problems that can be solved
with big data and the tools you learned in this course. The problem you choose
should have readily available data, or data that can be gathered by realistic means.
Session -13
● Intro to Apache Airflow.
● Apache Airflow installation.
● Writing a simple shell job.
Assignment: Write an Airflow DAG that calls the Twitter API every day at 12 am IST to get all
of the previous day's tweets with the hashtag #elonmusk (a sketch follows below).
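A minimal sketch of such a DAG in Airflow 2.x style; the fetch_tweets callable is a placeholder that would call the Twitter search API (for example via tweepy) for the previous day's #elonmusk tweets and persist them.

    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def fetch_tweets(ds, **kwargs):
        # ds is the logical date (YYYY-MM-DD); query the Twitter API for that day's
        # #elonmusk tweets here and persist them (HDFS, SQLite, ...).
        print(f"Fetching #elonmusk tweets for {ds}")

    with DAG(
        dag_id="elonmusk_tweets_daily",
        start_date=pendulum.datetime(2023, 1, 1, tz="Asia/Kolkata"),
        schedule_interval="0 0 * * *",  # midnight in the DAG's timezone, i.e. 12 am IST
        catchup=False,
    ) as dag:
        PythonOperator(task_id="fetch_tweets", python_callable=fetch_tweets)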
Session -14
● Get all the tweets for #cryptocurrency, ingest them into HDFS at an hourly cadence,
and write a Spark batch job that counts the number of tweets that talk about bitcoin
and their ratio to the total (a sketch follows below).
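The hourly ingestion can reuse the Airflow pattern from Session -13; a minimal sketch of the batch counting job follows, assuming each tweet was written to HDFS as a JSON record with a text field (the path is a placeholder).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Count how many #cryptocurrency tweets mention bitcoin and the overall ratio.
    spark = SparkSession.builder.appName("bitcoin-ratio").getOrCreate()

    tweets = spark.read.json("hdfs:///data/tweets/cryptocurrency/")

    total = tweets.count()
    bitcoin = tweets.filter(F.lower(F.col("text")).contains("bitcoin")).count()

    ratio = bitcoin / total if total else 0.0
    print(f"bitcoin tweets: {bitcoin} of {total} (ratio {ratio:.3f})")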