
BIG DATA WITH HADOOP AND SPARK

Course Objectives:
1. Understand the concepts of big data and its impact on businesses.
2. Learn about the Hadoop ecosystem and its components, such as HDFS, MapReduce, Hive,
Pig, and HBase.
3. Gain hands-on experience with Hadoop and Spark, including writing applications and running
them on a cluster.
4. Learn about the different types of big data analytics and how to use Hadoop and Spark to
perform them.
5. Be able to apply Hadoop and Spark to real-world big data problems.

Course Outcomes:
At the end of the course students will be able to:
CO1: Apply advanced data processing techniques with Hadoop and Spark.
CO2: Design scalable data storage and management using HDFS.
CO3: Explain core distributed computing concepts.
CO4: Build real-time data processing pipelines.

UNIT-I:

1. Introduction to Big Data and Hadoop (7 Hours)

(a) What is Big Data?
(b) The Rise of Bytes
(c) Data Explosion and its Sources
(d) Types of Data – Structured, Semi-structured, Unstructured data
(e) Characteristics of Big Data
(f) Limitations of Traditional Large-Scale Systems
(g) Use Cases for Big Data
(h) Challenges of Big Data
(i) Hadoop Introduction - What is Hadoop? Why Hadoop?
(j) Supported Operating Systems
(k) Organizations using Hadoop
(l) Hadoop Job Trends
(m) History of Hadoop
(n) Hadoop Core Components – MapReduce & HDFS

UNIT-II

2. HDFS Architecture (4 Hours)

(a) Regular File System vs. HDFS
(b) HDFS Architecture
(c) Components of HDFS - NameNode, DataNode, SecondaryNameNode
(d) HDFS Features - Fault Tolerance, Horizontal Scaling, Data Replication, Rack Awareness
(e) Anatomy of a file write on HDFS
(f) Anatomy of a file read on HDFS
(g) Hands-on with Hadoop HDFS, WebUI and Linux Terminal Commands
(h) HDFS File System Operations
(i) NameNode Metadata, File System Namespace, NameNode Operation
(j) Data Block Split
(k) Benefits of the Data Block Approach
(l) Topology, Data Replication Representation
(m) HDFS Programming Basics – Java API (see the sketch after this list)
(n) Hadoop Configuration API
(o) HDFS API Overview
(p) When Hadoop is not suitable
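
As a preview of the HDFS Programming Basics topic, here is a minimal sketch in Scala (Scala interoperates directly with Hadoop's Java API). The NameNode URI and the paths are placeholders, not values from the course materials:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBasics {
  def main(args: Array[String]): Unit = {
    // Configuration normally picks up core-site.xml from the classpath;
    // the NameNode URI below is a placeholder for a real cluster.
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://localhost:9000")
    val fs = FileSystem.get(conf)

    val dir = new Path("/user/demo")
    if (!fs.exists(dir)) fs.mkdirs(dir)                 // like `hdfs dfs -mkdir`

    // Write a small file, then list the directory
    // (compare with the anatomy of a file write above).
    val out = fs.create(new Path(dir, "hello.txt"))
    out.writeBytes("hello, HDFS\n")
    out.close()

    fs.listStatus(dir).foreach(s => println(s.getPath)) // like `hdfs dfs -ls`
    fs.close()
  }
}
```

The same operations map one-to-one onto the terminal commands used in the hands-on session.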

3. MapReduce (2.5 Hours)

(a) What is MapReduce and why it is popular
(b) MapReduce Framework – Introduction, Driver, Mapper, Reducer, Combiner, Split, Shuffle & Sort (see the sketch after this list)
(c) Hadoop 1.0 Limitations
(d) MapReduce Limitations
(e) YARN Architecture
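
To make the Mapper/Reducer split concrete, a minimal word-count sketch in Scala against Hadoop's Java MapReduce API; the driver (job setup, input/output paths) is omitted and the class names are illustrative:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.jdk.CollectionConverters._   // Scala 2.13 Java-interop converters

// Map stage: emit (word, 1) for every token in the input split.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
      word.set(t)
      context.write(word, one)   // shuffled and sorted by key before reduce
    }
}

// Reduce stage: sum the counts the shuffle grouped under each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}
```

A Combiner, when added, typically reuses the same reducer logic to pre-aggregate map output before the shuffle.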

UNIT-III

4. Hive (3.5 Hours)


(a) Limitations of MapReduce
(b) Need for High Level Languages
(c) Analytical OLAP - Data warehousing with Apache Hive
(d) What is Hive?
(e) Hive Query Language
(f) Background of Hive
(g) Hive Installation and Configuration
(h) Hive Architecture, Data Types, Data Model, Examples
(i) Create/Show Database, Drop Tables
(j) SELECT, INSERT, OVERWRITE, EXPLAIN

(k) CREATE, ALTER, DROP, TRUNCATE, JOINS
(l) SerDe (Serialization / Deserialization)
(m) Partitions and Buckets (see the sketch after this list)
(n) Limitations of Hive
(o) SQL vs. Hive
(p) File Formats – Avro, Parquet, and ORC
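
The course works through these statements in Hive's own shell; as a minimal sketch, the same DDL and query can also be issued from Scala via Spark's Hive support (the table name, schema, and bucket count are illustrative, and a reachable Hive metastore is assumed):

```scala
import org.apache.spark.sql.SparkSession

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-sketch")
      .enableHiveSupport()   // requires Hive classes and a metastore
      .getOrCreate()

    // Partitioned, bucketed table stored as ORC (cf. topics m and p above).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
        |PARTITIONED BY (yr INT)
        |CLUSTERED BY (id) INTO 4 BUCKETS
        |STORED AS ORC""".stripMargin)

    spark.sql("SELECT yr, SUM(amount) FROM sales GROUP BY yr").show()
    spark.stop()
  }
}
```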

5. Scala (Object-Oriented and Functional Programming) (2.5 Hours)


(a) Getting started With Scala.
(b) Scala Background, Scala vs. Java, and Basics.
(c) Interactive Scala – REPL, data types, variables, expressions, simple functions.
(d) Running the program with Scala Compiler.
(e) Explore the type lattice and use type inference
(f) Define Methods and Pattern Matching (see the sketch after this list).
(g) Scala set up on Windows.
(h) Scala set up on Unix.
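
A few self-contained lines showing type inference, a simple method, and pattern matching, roughly as they would be typed into the REPL (the values are arbitrary):

```scala
object ScalaBasics extends App {
  // Type inference: Int and String are inferred, not declared.
  val answer   = 42
  val greeting = "hello, Scala"

  // A simple method with an explicit result type.
  def square(x: Int): Int = x * x

  // Pattern matching used as an expression.
  def describe(x: Any): String = x match {
    case 0         => "zero"
    case n: Int    => s"the Int $n, squared: ${square(n)}"
    case s: String => s"a String of length ${s.length}"
    case _         => "something else"
  }

  println(describe(answer))    // the Int 42, squared: 1764
  println(describe(greeting))  // a String of length 12
}
```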
UNIT-IV

6. Functional Programming, Object-Oriented Programming, Integrations (2.5 Hours)


(a) Classes and Properties. Objects. Packaging and Imports. Traits.
(b) Objects, classes, inheritance, lists with multiple related types, apply (see the sketch after this list)
(c) What is SBT? Integration of Scala in Eclipse IDE. Integration of SBT with Eclipse.
(d) Batch versus real-time data processing
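
A minimal sketch tying topics (a) and (b) together: a trait, two classes implementing it, a list holding both related types, and a companion object supplying apply as a factory (the names are illustrative):

```scala
trait Shape { def area: Double }

class Circle(val r: Double) extends Shape {
  def area: Double = math.Pi * r * r
}

class Rect(val w: Double, val h: Double) extends Shape {
  def area: Double = w * h
}

object Rect {
  // Factory via apply: Rect(2, 3) instead of new Rect(2, 3).
  def apply(w: Double, h: Double): Rect = new Rect(w, h)
}

object ShapesDemo extends App {
  // One list, multiple related types, unified by the trait.
  val shapes: List[Shape] = List(new Circle(1.0), Rect(2.0, 3.0))
  shapes.foreach(s => println(s.area))  // 3.14159..., 6.0
}
```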

7. Spark Core (3 Hours)


(a) Introduction to Spark, Spark versus Hadoop
(b) Architecture of Spark.
(c) Data Partitioning and Parallelism
(d) Coding Spark jobs in Scala
(e) Exploring the Spark shell; creating a SparkContext.
(f) RDD Programming. Operations on RDDs.
(g) Transformations and Actions (see the sketch after this list)
(h) Loading Data and Saving Data.
(i) Key-Value Pair RDDs.
(j) Root-cause analysis (RCA) for Spark application failures
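
A minimal RDD sketch: a key-value pair RDD built by lazy transformations and materialized by an action. Here local[*] keeps the example self-contained; a real job would be submitted to a cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))

    // Transformations are lazy; nothing executes until an action runs.
    val counts = lines
      .flatMap(_.split("\\s+"))   // transformation
      .map(word => (word, 1))     // builds a key-value pair RDD
      .reduceByKey(_ + _)         // transformation that implies a shuffle

    counts.collect().foreach(println)  // action: triggers the whole job
    sc.stop()
  }
}
```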

UNIT-V

8. Spark SQL (2 Hours)


(a) Introduction to Apache Spark SQL
(b) The SQL context
(c) Importing and saving data
(d) Processing Text, JSON, and Parquet files
(e) DataFrames (see the sketch after this list)
(f) Using Hive
(g) PySpark and ML demo with use cases
(h) Connectivity with MySQL
(i) Error Handling
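
A minimal DataFrame/Spark SQL sketch; the file paths and the name/age columns are placeholder assumptions, not files supplied with the course:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read JSON (schema inferred), register a view, query it with SQL.
    val df = spark.read.json("people.json")
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    // Save in the columnar Parquet format for later queries.
    df.write.mode("overwrite").parquet("people.parquet")
    spark.stop()
  }
}
```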
9. Spark Streaming (1.5 Hours)

(a) Introduction to Spark Streaming.
(b) Architecture of Spark Streaming
(c) Processing Distributed Log Files in Real Time
(d) Discretized Streams (DStreams) – streams of RDDs.
(e) Applying Transformations and Actions on Streaming Data (see the sketch after this list)
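
A minimal DStream sketch: five-second micro-batches of word counts from a socket source. The host and port are placeholders; the stream can be fed with a tool such as netcat (nc -lk 9999):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setAppName("stream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)  // applied to the RDD inside every batch
    counts.print()         // output action, runs once per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```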

10. Kafka (1.5 Hours)


(a) Understanding Kafka Cluster
(b) Installing and Configuring Kafka Cluster
(c) Kafka Producer. Kafka Consumer
(d) Producer and Consumer in Action
(e) Reading Data from Kafka
(f) Lab: Implement a Kafka Producer and Consumer using real-time streaming data (see the sketch after this list)
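
A minimal sketch of the producer/consumer pair, using the kafka-clients Java API from Scala; the broker address, topic name, and group id are placeholders:

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.jdk.CollectionConverters._

object KafkaSketch {
  def main(args: Array[String]): Unit = {
    // Producer: String keys/values, one record to the "events" topic.
    val pProps = new Properties()
    pProps.put("bootstrap.servers", "localhost:9092")  // placeholder broker
    pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](pProps)
    producer.send(new ProducerRecord[String, String]("events", "key-1", "hello, Kafka"))
    producer.close()

    // Consumer: subscribe to the same topic and poll once.
    val cProps = new Properties()
    cProps.put("bootstrap.servers", "localhost:9092")
    cProps.put("group.id", "demo-group")
    cProps.put("auto.offset.reset", "earliest")
    cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](cProps)
    consumer.subscribe(List("events").asJava)
    consumer.poll(Duration.ofSeconds(2)).asScala
      .foreach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }
}
```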

TOTAL: 30 Hours


Text Books:
1. Raj Kamal and Preeti Saxena, Big Data Analytics: Introduction to Hadoop, Spark, and Machine-Learning, McGraw Hill Education, 2019.
2. Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, O'Reilly Media, 2015.

Reference Books:
1. Dirk deRoos, Paul C. Zikopoulos, Roman B. Melnyk, Bruce Brown, and Rafael Coss, Hadoop For Dummies.
2. Srinath Perera and Thilina Gunarathne, Hadoop MapReduce Cookbook.

