Unit 1 Topic 0 Introduction to Big Data

Uploaded by

Sam Dubey
Copyright © All Rights Reserved

Introduction to Big Data

Dr. Anil Kumar Dubey
Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar Pradesh, Lucknow
Basic: Data
Data can be defined as figures or facts that can be stored in or used by a computer.
Conti…
Data Measurement    Size
Bit                 Single binary digit (1 or 0)
Byte                8 bits
Kilobyte (KB)       1,024 bytes
Megabyte (MB)       1,024 kilobytes
Gigabyte (GB)       1,024 megabytes
Terabyte (TB)       1,024 gigabytes
Petabyte (PB)       1,024 terabytes
Exabyte (EB)        1,024 petabytes
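
The table above can be turned into a small calculation. The following is a minimal Python sketch, not part of the original slides; the names UNITS, bytes_in and bits_in are chosen here only for illustration. It assumes the 1,024-based steps used in the table.

```python
# Unit names in increasing order; each step is 1,024 times the previous one,
# as in the table above. The names are illustrative.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte",
         "Terabyte", "Petabyte", "Exabyte"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` (1,024-based)."""
    return 1024 ** UNITS.index(unit)


def bits_in(unit: str) -> int:
    """Return the number of bits in one `unit` (8 bits per byte)."""
    return 8 * bytes_in(unit)


for unit in UNITS:
    print(f"1 {unit} = {bytes_in(unit):,} bytes")
```

With these definitions one petabyte is 1,024^5 bytes, which is slightly more than 10^15 bytes.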
Basic: Big Data
Data that is very large in size is called Big Data.

We normally work with data measured in megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code). Data measured in petabytes, i.e. on the order of 10^15 bytes, is called Big Data.
Conti…
Big Data is a collection of data that is huge in volume and still growing exponentially with time. It is data of such size and complexity that no traditional data management tool can store or process it efficiently.

“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Example of Big Data
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data comes mainly from photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the data generated reaches many petabytes.
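
To see how the jet engine figure reaches petabytes, here is a rough back-of-the-envelope sketch in Python. Only the 10 TB per 30 minutes figure comes from the text. The flight count, engines per flight and flight duration are hypothetical placeholders chosen for illustration, not real statistics.

```python
TB_PER_ENGINE_PER_HALF_HOUR = 10   # from the text ("10+ terabytes"), used as a lower bound
FLIGHTS_PER_DAY = 5_000            # hypothetical placeholder
ENGINES_PER_FLIGHT = 1             # hypothetical simplification
HALF_HOURS_PER_FLIGHT = 1          # hypothetical: each flight lasts 30 minutes


def daily_engine_data_tb(flights, engines, half_hours, tb_per_half_hour):
    """Total terabytes generated per day under the stated assumptions."""
    return flights * engines * half_hours * tb_per_half_hour


total_tb = daily_engine_data_tb(FLIGHTS_PER_DAY, ENGINES_PER_FLIGHT,
                                HALF_HOURS_PER_FLIGHT,
                                TB_PER_ENGINE_PER_HALF_HOUR)
total_pb = total_tb / 1024  # 1 PB = 1,024 TB

print(f"{total_tb:,} TB per day, about {total_pb:.1f} PB")
```

With these placeholder inputs the estimate is 50,000 TB, roughly 49 PB, per day. Longer flights or more engines would only increase it.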
Sources of Big Data
Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data every day, as they have billions of users worldwide.

E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.

Weather stations: Weather stations and satellites produce very large amounts of data, which are stored
Conti…
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly. For this they store the data of their millions of users.

Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Advantages of Big Data
Big Data analytics tools can help predict outcomes more accurately, allowing businesses and organizations to make better decisions while optimizing their operational efficiency and reducing risks.

Big Data provides insights into customer pain points and allows companies to improve their products and services.
Conti…
Big Data analytics can help companies generate more sales leads, which can in turn boost revenue.

Big Data insights let you study customer behavior, understand customer trends, and provide a highly personalized experience.
Applications of Big Data
Healthcare
With the help of predictive analytics, medical professionals are able to provide personalized healthcare services to individual patients.

Apart from that, fitness wearables, telemedicine, and remote monitoring, all powered by Big Data and AI, are helping change lives for the better.
Conti…
Academia
Education is no longer limited to the physical bounds of the classroom; there are numerous online courses to learn from.

Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.
Conti…
Banking
The banking sector relies on Big Data for fraud detection.

Big Data tools can efficiently detect fraudulent acts in real time, such as misuse of credit/debit cards, archival of inspection tracks, faulty alteration of customer stats, etc.
Conti…
Manufacturing
According to the TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improved supply strategies and product quality.

In the manufacturing sector, Big Data helps create a transparent infrastructure, making it possible to predict uncertainties and incompetencies that can affect the business adversely.
Conti…
IT
IT companies around the world are among the largest users of Big Data. They use it to optimize their functioning, enhance employee productivity, and minimize risks in business operations.

By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions to even the most complex problems.
Conti…
Transportation
Big Data analytics holds immense value for the transportation industry.

In countries across the world, both private and government-run transportation companies use Big Data technologies to optimize route planning, control traffic, manage road congestion, and improve services.
Conti…
Retail
• Big Data has changed the way traditional brick-and-mortar retail stores work.
• Over the years, retailers have collected vast amounts of data from local demographic surveys, POS scanners, RFID, customer loyalty cards, store inventory, and so on.
• Now they have started to leverage this data to create personalized customer experiences, boost sales, increase revenue, and deliver outstanding customer service.
• Retailers are even using smart sensors and Wi-Fi to track the movement of customers, the most frequented aisles, how long customers linger in the aisles,
Big Data Analysis Tools and Software
The tools used to store and analyze large numbers of data sets, and to process this complex data, are known as big data tools.

• Xplenty
• Atlas.ti
• Analytics
• Microsoft HDInsight
• Talend
• R-Programming
Xplenty
A cloud-based ETL solution providing simple visualized data pipelines for automated data flows across a wide range of sources and destinations. Xplenty's on-platform transformation tools allow you to clean, normalize, and transform data while adhering to compliance best practices.

Features:
• Powerful, code-free, on-platform data transformation offering
• REST API connector: pull in data from any source that has a REST API
• Destination flexibility: send data to databases, data warehouses, and Salesforce
• Security focused: field-level data encryption and masking to meet compliance requirements
Atlas.ti
All-in-one research software. This big data analytics tool gives you all-in-one access to the entire range of platforms. It is used for qualitative data analysis and mixed methods research in academic, market, and user experience research.

Features:
• Can export information on each source of data.
• Offers an integrated way of working with your data.
• Allows you to rename a Code in the Margin Area.
• Helps you handle projects that contain thousands of documents and coded data segments.
Analytics
A tool that provides visual analysis and dashboarding. It allows you to connect multiple data sources, including business applications, databases, cloud drives, and more.

Features:
• Offers visual analysis and dashboarding.
• Helps to analyze data in depth.
• Provides collaborative review and analysis.
• Can embed reports in websites, applications, blogs, and more.
Azure HDInsight
A Spark and Hadoop service in the cloud. It provides big data cloud offerings in two categories, Standard and Premium. It provides an enterprise-scale cluster on which an organization can run its big data workloads.

Features:
• Reliable analytics with an industry-leading SLA
• Enterprise-grade security and monitoring
• Protects data assets and extends on-premises security and governance controls to the cloud
• High-productivity platform for developers and scientists
• Integration with leading productivity applications
• Deploys Hadoop in the cloud without purchasing new hardware
Talend
Big data analytics software that simplifies and automates big data integration. Its graphical wizard generates native code. It also supports big data integration, master data management, and data quality checks.

Features:
• Accelerates time to value for big data projects
• Simplifies ETL and ELT for big data
• Talend Big Data Platform simplifies using MapReduce and Spark by generating native code
• Smarter data quality with machine learning and natural language processing
• Agile DevOps to speed up big data projects
R-Programming
A language for statistical computing and graphics. It is also used for big data analysis. It provides a wide variety of statistical tests.

Features:
• Effective data handling and storage facility
• A suite of operators for calculations on arrays, in particular matrices
• A coherent, integrated collection of big data tools for data analysis
• Graphical facilities for data analysis, with display either on-screen or on hardcopy
Others
• Apache Hadoop: A framework that allows you to store big data in a distributed environment for parallel processing.
• Apache Pig: A platform used for analyzing large datasets by representing them as data flows. Pig is designed to provide an abstraction over MapReduce, which reduces the complexity of writing a MapReduce program.
• Apache HBase: A multidimensional, distributed, open-source NoSQL database written in Java. It runs on top of HDFS, providing Bigtable-like capabilities for Hadoop.
• Apache Spark: An open-source, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
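
To make the MapReduce idea mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It only imitates the map, shuffle and reduce steps on a word count. It does not use the Hadoop, Pig or Spark APIs, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data is big", "data is growing"]))
# {'big': 2, 'data': 2, 'is': 2, 'growing': 1}
```

In a real cluster the map and reduce steps run in parallel on different machines. Tools such as Pig exist so that this plumbing does not have to be written by hand.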
Big Data Case studies
Walmart leverages Big Data and Data Mining to create personalized product recommendations for its customers.

With the help of these two emerging technologies, Walmart can uncover valuable patterns showing the most frequently bought products, the most popular products, and even the most popular product bundles (products that complement each other and are usually purchased together).
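
One simple way to find products that are usually purchased together is to count how often each pair of products appears in the same basket. The sketch below is a toy illustration of that idea in Python, using made-up baskets. It is not Walmart's actual method, and the function name frequent_pairs is an assumption.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, top_n=3):
    """Count how often each pair of products occurs in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) counts each pair once per basket, in a fixed order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(top_n)


# Made-up example baskets, for illustration only
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

print(frequent_pairs(baskets, top_n=1))
# [(('bread', 'butter'), 3)]
```

Here bread and butter appear together in three of the four baskets, so they would be a candidate product bundle.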
Conti…
Based on these insights, Walmart creates attractive and customized recommendations for individual users.

By effectively implementing Data Mining techniques, the retail giant has increased its conversion rates and improved its customer service substantially.

Furthermore, Walmart uses Hadoop and NoSQL technologies to allow customers to access real-time
Conti…
Uber is one of the major cab service providers in the world.

It leverages customer data to track and identify the services that are most popular and most used.

Once this data is collected, Uber uses data analytics to analyze customers' usage patterns and determine which services should be given more emphasis.
Conti…
Apart from this, Uber uses Big Data in another unique way.

Uber closely studies the demand for and supply of its services and changes cab fares accordingly.

This is the surge pricing mechanism, and it works something like this: suppose you are in a hurry and have to book a cab from a crowded location. Uber may then charge you double the normal amount!
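
The idea of fares rising with demand can be illustrated with a toy calculation. The Python sketch below is purely hypothetical: the formula, the cap of 2.0 (matching the "double" in the example above) and the function names are assumptions made for illustration, not Uber's actual pricing algorithm.

```python
def surge_multiplier(ride_requests, available_cabs, cap=2.0):
    """Hypothetical multiplier: the demand/supply ratio, kept between 1.0 and `cap`."""
    if available_cabs <= 0:
        return cap  # no supply at all: apply the maximum multiplier
    ratio = ride_requests / available_cabs
    return max(1.0, min(cap, ratio))


def fare(base_fare, ride_requests, available_cabs):
    """Fare after applying the hypothetical surge multiplier."""
    return round(base_fare * surge_multiplier(ride_requests, available_cabs), 2)


print(fare(200, ride_requests=50, available_cabs=100))   # quiet area: 200.0
print(fare(200, ride_requests=300, available_cabs=100))  # crowded area: 400.0
```

When requests are fewer than available cabs the normal fare applies. When requests far exceed cabs the fare is capped at double.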
Conti…
Netflix is one of the most popular on-demand online video streaming platforms, used by people around the world.

Netflix is a major proponent of the recommendation engine.

It collects customer data to understand the specific needs, preferences, and taste patterns of users.
Conti…
Today, Netflix has become so vast that it is even creating unique content for users. Data is the secret ingredient that fuels both its recommendation engines and its new content decisions.

The most pivotal data points used by Netflix include the titles users watch, user ratings, preferred genres, and how often users stop playback, to name a few.


Hadoop Block
Example Practice
Question: A file of size 612 MB is stored using the default block configuration (128 MB). Compute how many blocks will be created and the size of the last block.
Conti…
Answer:
(128 × 4 + 100 = 612)

◦ Five blocks are created

◦ The first four blocks are 128 MB in size, and the fifth block is 100 MB

◦ So the last block size is 100 MB
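
The same calculation can be written as a small Python function. This is a minimal sketch that assumes sizes in whole megabytes and the 128 MB default block size used in the example. The name hdfs_blocks is chosen here for illustration and is not a Hadoop API.

```python
def hdfs_blocks(file_size_mb, block_size_mb=128):
    """Return (number of blocks, size of the last block in MB) for a file."""
    if file_size_mb <= 0 or block_size_mb <= 0:
        raise ValueError("sizes must be positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    if remainder == 0:
        # The file fills every block exactly, so the last block is a full one
        return full_blocks, block_size_mb
    # One extra, partially filled block holds the remainder
    return full_blocks + 1, remainder


print(hdfs_blocks(612))  # (5, 100)
```

For 612 MB, divmod gives 4 full blocks with 100 MB left over, so five blocks are created and the last one holds 100 MB.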


Practice
Question: How many blocks will be created for a file of size 812 MB using the default block configuration? Also compute the size of the last block.

Answer:
a. No. of blocks = 5, last block size = 172 MB
b. No. of blocks = 6, last block size = 44 MB
c. No. of blocks = 7, last block size = 44 MB
d. No. of blocks = 6, last block size = 128 MB
Practice
Question: How many blocks will be created for a file of size 400 MB using the default block configuration? Also compute the size of the last block.

Answer: ?
Conti…
Practice
Question: How many blocks will be created for a file of size 2 GB 500 MB using the default block configuration? Also compute the size of the last block.

Answer: ?
THANK
YOU
