#LifeKoKaroLift

Lecture 1
Big Data Analytics (INT421)

Module Name: Big Data Analytics
Course: Hadoop, Hive, Spark
Lecture On: Basics of Big Data Analytics - Day 1
Instructor: Sneha Sharma
Poll 1

How familiar are you with Big Data and Hadoop, on a scale of 1-5?

• Practice in teams of 4 students
• Industry expert mentoring to learn better
• Get personalised feedback for improvements
Course Assessment Model

Marks break-up:
• Attendance: 5
• CA (Assignment (Case Based) + Test + Test): 25
• MTT: 20
• ETE: 50
• Total: 100
Detail of Academic Tasks
• AT1: Assignment (case based)
• AT2: Class Test
• AT3: Class Test

(AT1 is compulsory; the better of AT2 and AT3 will be considered.)
Course Outcomes

CO1: Evaluate the features and limitations of Hadoop's HDFS (Hadoop Distributed File System), and compare it with conventional data processing systems, in order to make informed decisions about selecting the appropriate data storage and processing system for specific business needs.

CO2: Design and implement a Hive data management system for a given business case, including creating a database and internal and external tables, and performing operations on the tables such as sorting, distributing, and clustering, in order to efficiently manage large volumes of structured data.

CO3: Compare and contrast the Spark and MapReduce frameworks, and evaluate the benefits of using Spark for big data processing tasks.

CO4: Apply the concepts of paired RDDs and structured APIs such as DataFrames and Datasets in Apache Spark, and perform operations such as data manipulation, window functions, and descriptive statistics.

CO5: Evaluate the importance of optimizing Spark jobs for efficient performance, and analyze Spark jobs to identify potential performance bottlenecks such as disk I/O, network I/O, and shuffles.

CO6: Design and implement an optimized Spark job that utilizes best practices for working with Apache Spark in a production environment.
Today’s Agenda

● Big Data Analytics
● Big Data Processing
● Introduction to Hadoop
● Hadoop Architecture
● Hadoop Ecosystem
Introduction to Big Data Analytics

WHAT IS BIG DATA?

• Big data refers to a vast amount of structured, semi-structured, and unstructured data that is generated at high velocity and volume.
• The term "big data" is not defined by a specific size but rather by the scale and complexity of the data involved.
• Sources: social media, IoT devices, web and mobile applications
• Types of data: structured, semi-structured, unstructured

The primary characteristics of big data are often referred to as the "3Vs" (volume, velocity, variety), commonly extended to "5Vs" with veracity and value:

• Volume - the sheer amount of data
• Velocity - the speed at which data is generated
• Variety - the types of data
• Veracity - the quality and accuracy of the data
• Value - the usefulness of the data
Big Data Processing

A typical big data pipeline moves through these stages (a minimal sketch follows below):
• Data collection
• Data storage
• Data preprocessing
• Data processing
• Data analysis
• Data visualization
• Decision making

Value: The ultimate goal of big data analytics is to extract value and derive actionable insights from the data. Despite dealing with large volumes of data with varying velocity and variety, the primary focus is on discovering meaningful patterns and trends that can lead to data-driven decisions, better business strategies, improved operational efficiency, and other valuable outcomes.
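To make the stages concrete, here is a minimal, plain-Python sketch of the same pipeline on a toy in-memory dataset. The data, threshold, and variable names are illustrative assumptions; at real scale these steps would run on HDFS, Hive, or Spark, which the following slides introduce.

```python
# A toy walk-through of the pipeline stages above (illustrative only).
raw = ["12", "7", " 19", "", "oops", "4"]        # data collection

stored = list(raw)                                # data storage (in memory here)

cleaned = []                                      # data preprocessing
for value in stored:
    value = value.strip()
    if value.isdigit():                           # drop empty/corrupt records
        cleaned.append(int(value))

total = sum(cleaned)                              # data processing
average = total / len(cleaned)                    # data analysis

print(f"n={len(cleaned)} total={total} avg={average:.1f}")  # reporting

if average > 10:                                  # decision making (assumed threshold)
    print("Average exceeds threshold; investigate further.")
```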
Introduction to Hadoop

What is Hadoop?

Hadoop is an open-source, Java-based framework that manages the storage and processing of large amounts of data for applications. Hadoop uses distributed storage and parallel processing to handle big data. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Introduction to Hive

What is Hive?

Apache Hive is a data warehouse software project built on top of the Hadoop ecosystem. It provides an SQL-like interface to query and analyze large datasets stored in Hadoop's distributed file system (HDFS) or other compatible storage systems.

Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data queries, transformations, and analyses in a familiar syntax. HiveQL statements are compiled into MapReduce jobs, which are then executed on the Hadoop cluster to process the data.
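As a minimal sketch of what "an SQL-like interface" means in practice, the snippet below submits a HiveQL query from Python via the PyHive library. It assumes HiveServer2 is running on localhost:10000 and that a table named page_views exists; both are illustrative assumptions, not part of the course setup.

```python
# Minimal sketch: querying Hive from Python over HiveServer2 (assumed setup).
from pyhive import hive  # pip install pyhive

conn = hive.connect(host="localhost", port=10000)  # HiveServer2 default port
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles a statement like this into
# MapReduce (or Tez/Spark) jobs that run on the cluster.
cursor.execute("""
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    GROUP BY user_id
    ORDER BY views DESC
    LIMIT 10
""")

for user_id, views in cursor.fetchall():
    print(user_id, views)

cursor.close()
conn.close()
```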
Introduction to Spark

What is Spark?

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
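A minimal PySpark sketch of the classic word count, run locally. It assumes a local Spark installation and an input file named input.txt (both illustrative assumptions). Note that the intermediate RDDs stay in memory rather than being written to disk between steps, which is the in-memory computing the slide describes.

```python
# Word count with PySpark, running on the local machine (assumed setup).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                  # read lines as an RDD
      .flatMap(lambda line: line.split())     # split lines into words
      .map(lambda word: (word, 1))            # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```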
HDFS

What is HDFS?

HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.
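A minimal sketch of day-to-day HDFS usage: shelling out from Python to the standard `hdfs dfs` command-line tool. It assumes a running Hadoop installation with `hdfs` on the PATH; the paths and file names are illustrative.

```python
# Basic HDFS file operations via the `hdfs dfs` CLI (assumed Hadoop install).
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/student/demo")        # create a directory
hdfs("-put", "input.txt", "/user/student/demo/")  # copy a local file into HDFS
hdfs("-ls", "/user/student/demo")                 # list the directory contents
```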
Mapper

What is Mapper?

MapReduce is a programming model that is mainly divided into two phases, the Map phase and the Reduce phase. It is designed to process data in parallel across multiple machines (nodes). Hadoop Java programs consist of a Mapper class and a Reducer class, along with a driver class. The Hadoop Mapper is a function or task that processes all input records from a file and generates output that serves as input for the Reducer. It produces this output by emitting new key-value pairs.
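The course uses Java's Mapper class, but a minimal way to see the idea is a word-count mapper written for Hadoop Streaming, which lets the Map phase be an ordinary Python script reading stdin and emitting key-value pairs on stdout. This is an illustrative sketch, not the Java Mapper itself.

```python
#!/usr/bin/env python3
# mapper.py -- word-count Map phase for Hadoop Streaming (illustrative).
import sys

# Each input line arrives on stdin; emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```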
Reducer

What is Reducer?

The Reducer processes the output of the mapper. After processing the data, it produces a new set of output, which HDFS finally stores.

The Hadoop Reducer takes the set of intermediate key-value pairs produced by the mapper as input and runs a Reducer function on each of them. This (key, value) data can be aggregated, filtered, and combined in a number of ways for a wide range of processing.
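The matching word-count reducer for the Hadoop Streaming sketch above. Hadoop sorts the mapper output by key before the Reduce phase, so all lines for one word arrive together; the script sums counts until the key changes. Again, this is an illustrative Python sketch of the Reducer's role, not the course's Java Reducer class.

```python
#!/usr/bin/env python3
# reducer.py -- word-count Reduce phase for Hadoop Streaming (illustrative).
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rsplit("\t", 1)   # input lines look like "word<TAB>1"
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:             # flush the final key
    print(f"{current_word}\t{current_count}")
```

The two scripts are typically wired together with the Hadoop Streaming jar, along the lines of `hadoop jar .../hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>` (the jar's path varies by installation).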
Key Takeaway

● Big Data Analytics
● Hive in Hadoop
● Spark
● Hadoop
● Hadoop Distributed File System (HDFS)
● Mapper
● Reducer
#LifeKoKaroLift

Thank You!

Kindly follow the steps provided in the videos below to download and install Hadoop, Hive, and Derby:

https://www.youtube.com/watch?v=knAS0w-jiUk&ab_channel=IvyProSchool

https://www.youtube.com/watch?v=CRX6OOUFxyQ&ab_channel=UnboxingBigData
