
Unit 1

Big Data Analytics


Course Code: CDCSC11
Faculty: Dr. Vandana Bhatia
Part 1: Contents
❑Introduction to Big Data
❑Databases and their evolution
❑Convergence of key trends
❑Unstructured data
❑Industry examples of Big Data:
➢ Web analytics
➢ Big data and marketing
➢ Fraud and big data
➢ Risk and big data
➢ Credit risk management
➢ Big data and algorithmic trading
➢ Big data and healthcare
➢ Big data in medicine
➢ Advertising and big data
Part 2: Contents

❑Big data technologies:
➢ Introduction to Hadoop
➢ Open-source technologies
➢ Cloud and big data mobile business intelligence
➢ Crowdsourcing analytics
➢ Inter- and trans-firewall analytics


Understanding Big Data
Objectives:
• To understand what big data is
• To know various types of data
• To understand examples
• To explore various applications of Big Data
What is Big Data
• Simply: data of very big size

• Can't be processed with usual tools

• Distributed architecture needed

• Structured / unstructured

❑ According to Gartner: big data is huge-volume, fast-velocity, and different-variety information assets that demand an innovative platform for enhanced insights and decision making.

❑ The authors of Big Data: A Revolution explain it as a way to solve all the unsolved problems related to data management and handling that the industry earlier used to live with. With big data analytics, you can also unlock hidden patterns, get a 360-degree view of customers, and better understand their needs.
What is Big Data: Types and Examples

Characteristics of Big Data: Volume, Velocity, Variety
Big Data Categories
Big Data: Volume
This refers to data that is tremendously large. As you can see from the image, the volume of data is rising exponentially. In 2016, the data created was only 8 ZB, and it is expected that by 2020 the data would rise up to 40 ZB, which is extremely large.

Big Data: Variety
A reason for this rapid growth of data volume is that the data is coming from different sources in various formats. The data is categorized as follows:
Big Data: Velocity
The speed of data accumulation also plays a role in determining whether the data is categorized into big data or normal data.
Big Data: Value
Value deals with mechanisms to draw the correct meaning out of data. First of all, you need to mine the data, i.e., turn raw data into useful data. Then, an analysis is done on the data that you have cleaned or retrieved out of the raw data. Finally, you need to make sure whatever analysis you have done benefits your business, such as finding insights and results that were not possible earlier.
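To make the mine-clean-analyze chain concrete, here is a minimal, hedged Python sketch; the records, field names, and the final "insight" are made-up illustrations, not part of the slides:

```python
# Raw "mined" records: some are malformed or missing values (invented data).
raw = [{"order": 1, "amount": "250"}, {"order": 2, "amount": None},
       {"order": 3, "amount": "120"}, {"order": 4, "amount": "bad"}]

def clean(records):
    # Turn raw data into useful data: keep only records with parseable amounts.
    out = []
    for r in records:
        try:
            out.append({"order": r["order"], "amount": float(r["amount"])})
        except (TypeError, ValueError):
            continue  # drop unusable records
    return out

cleaned = clean(raw)
# Analysis step: a simple aggregate that could feed a business decision.
print(sum(r["amount"] for r in cleaned) / len(cleaned))  # average order value
```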
Big Data: Veracity
• Veracity is the trustworthiness and quality of data.
• It is necessary that the veracity of the data is maintained. For example, think about Facebook posts, with hashtags, abbreviations, images, videos, etc., which can make them unreliable and hamper the quality of their content.
• Collecting loads and loads of data is of no use if its quality and trustworthiness are not up to the mark.
Applications of Big Data: Finance
Banking:
o Since there is a massive amount of data that is gushing in from innumerable
sources, banks need to find uncommon and unconventional ways in order to
manage big data.
o It’s also essential to examine customer requirements, render services according
to their specifications, and reduce risks while sustaining regulatory compliance.
Stock Exchange:
o NYSE generates about one terabyte of new trade data every single day.
o So imagine: if one terabyte of data is generated every day, how much data there would be to process over a whole year.
Applications of Big Data:
Social Network
• Social media is currently considered the largest data generator.
• Statistics show that around 500+ terabytes of new data are generated in the databases of social media every day, particularly in the case of Facebook.
• The data generated mainly consists of videos, photos, message exchanges, etc. A single activity on any social media site generates a lot of data, which is again stored and processed whenever required.
• Since the data stored is in terabytes, it would take a lot of time to process with our legacy systems. Big Data is a solution to this problem.
Applications of Big Data:
Healthcare

• Nowadays, doctors rely mostly on patients' clinical records, which means that a lot of data needs to be gathered, and for many different patients.
• Obviously, it is not possible for old or traditional data storage methods to store this data.
• Since there is a large amount of data coming from different sources in various formats, the need to handle this large amount of data has increased.
Applications of Big Data:
E-Commerce
• Maintaining customer relationships is of the utmost importance in the e-commerce industry.
• E-commerce websites use different marketing ideas to retail their merchandise to customers, to manage transactions, and to implement better tactics, applying innovative ideas with Big Data to improve business.
• Flipkart:
▪ Flipkart is a huge e-commerce website dealing with lots of traffic on a daily basis.
▪ But when there is a pre-announced sale on Flipkart, traffic grows so sharply that it can actually crash the website.
▪ So, to handle this kind of traffic and data, Flipkart uses Big Data.
▪ Big Data can actually help in organizing and analyzing the data for further use.
Applications of Big Data:
Education

• The education sector holds a lot of information with regard to curriculum, students, and faculty.
• This information is analyzed to get insights that can enhance the operational adequacy of the educational organization.
• Collecting and analyzing information about a student, such as attendance, test scores, grades, and other issues, takes up a lot of data.
• So, big data offers a progressive framework wherein this data can be stored and analyzed, making it easier for institutes to work with.
Analyzing Limitations and Solutions of Existing Data Analytics
Objectives:
• Understanding Big Data Analytics
• Difference between data analytics and Big Data analytics
• Limitations
• Solutions
Big Data Challenges
Big Data Analytics
➢Big Data Analytics examines large and different types of data in order to uncover hidden patterns, insights, and correlations.
➢Big Data Analytics is helping large companies facilitate their growth and development.
➢It majorly includes applying various data mining algorithms on a certain dataset.
Big Data Analytics Use Cases
• Real-Time Intelligence
• Data Discovery
• Business Reporting
Big Data Analytics Reference Architectures
Why put Big Data and analytics together?
➢Big data provides gigantic statistical samples, which enhance analytic tool results.
➢Analytic tools and databases can now handle big data.
➢The economics of analytics is now more embraceable than ever.
➢There's a lot to learn from messy data, as long as it's big.
➢Big data is a special asset that merits leverage.
➢Analytics based on large data samples reveals and leverages business change.
Drivers and Enablers
Big Data sits at the convergence of three forces: Business Need, Technology Advances, and Analytical Platforms.
Technologies for Big Data (and Analytics)
• Data warehouses
• Appliances
• Analytical sandboxes
• In-memory analytics
• In-database analytics
• Columnar databases
Technologies for Big Data (and Analytics)
• Streaming and Complex Event Processing (CEP) engines
• Cloud-based services
• Non-relational databases
• Hadoop/MapReduce
Part 2: Contents

❑Big data technologies:
➢ Introduction to Hadoop
➢ Open-source technologies
➢ Cloud and big data mobile business intelligence
➢ Crowdsourcing analytics
➢ Inter- and trans-firewall analytics
Introduction to Hadoop
Hadoop/MapReduce
• Grew out of the efforts of Google, Yahoo, and others to handle massive volumes of data
• Handles multi-structured data
• Processes data across commodity parallel servers
• Open-source software from the Apache Software Foundation
Understanding Hadoop and its features
• Hadoop was created by Doug Cutting in order to build his search engine called Nutch. He was joined by Mike Cafarella.
• Hadoop was based on three papers published by Google: Google File System, Google MapReduce, and Google Bigtable.
• It is named after the toy elephant of Doug Cutting's son.
• Hadoop is under the Apache License, which means you can use it anywhere without having to worry about licensing.
• It is quite powerful, popular, and well supported.
• It is a framework to handle Big Data.
Hadoop Ecosystem
Understanding Hadoop and its features
• Started as a single project, Hadoop is now an umbrella of projects.
• All of the projects under the Apache Hadoop umbrella should follow three characteristics:
1. Distributed - they should be able to utilize multiple machines in order to solve a problem.
2. Scalable - if needed, it should be very easy to add more machines.
3. Reliable - if some of the machines fail, it should still work fine.
These are the three criteria for all projects or components to be under Apache Hadoop.
• Hadoop is written in Java so that it can run on all kinds of devices.
Hadoop Ecosystem
Apache Hadoop is a suite of components. Let us take a look at each of these components briefly. We will cover the details in depth during the full course.

HDFS
• HDFS, or Hadoop Distributed File System, is the most important component because the entire ecosystem depends upon it. It is based on the Google File System.
• It is basically a file system which runs on many computers to provide humongous storage. If you want to store your petabytes of data in the form of files, you can use HDFS.

YARN
• YARN, or Yet Another Resource Negotiator, keeps track of all the resources (CPU, memory) of the machines in the network and runs the applications. Any application which wants to run in a distributed fashion would interact with YARN.

HBase
• HBase provides humongous storage in the form of a database table. So, to manage humongous records, you would like to use HBase.
• HBase is a kind of NoSQL datastore.

MapReduce
• MapReduce is a framework for distributed computing. It utilizes YARN to execute programs and has a very good sorting engine.
• The programs are written in two parts: Map and Reduce. The Map part transforms the raw data into key-value pairs, and the Reduce part groups and combines data based on the key.
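To make the Map and Reduce parts concrete, here is a minimal word-count sketch in plain Python; no Hadoop cluster or Hadoop API is involved, and the function names are illustrative only:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: transform raw text into (key, value) pairs, here (word, 1).
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/group: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: combine the grouped values per key.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data needs big tools", "big data is big"]
print(reduce_phase(map_phase(lines)))  # {'big': 4, 'data': 2, ...}
```

On a real cluster, Hadoop runs many map and reduce tasks in parallel across machines, and the framework itself performs the grouping (shuffle) step between the two phases.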
Hive
• Writing code in MapReduce is very time-consuming. So, Apache Hive makes it possible to write your logic in SQL, which it internally converts into MapReduce. So, you can process humongous structured or semi-structured data with simple SQL using Hive (see the sketch after this list).

Sqoop
• Sqoop is used to transport data between Hadoop and SQL databases. Sqoop utilizes MapReduce to efficiently transport data using many machines in a network.

Oozie
• Since a project might involve many components, there is a need for a workflow engine to execute work in sequence.
• For example, a typical project might involve importing data from SQL Server, running some Hive queries, doing predictions with Mahout, and saving data back to an SQL Server.
• This kind of workflow can be easily accomplished with Oozie.

User Interaction
• A user can talk to the various components of Hadoop using the command-line interface, web interfaces, APIs, or Oozie. We will cover each of these components in detail later.
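As a rough illustration of the Hive idea above (writing aggregation logic in SQL instead of MapReduce), here is a runnable sketch using Python's built-in sqlite3 as a stand-in; the table and data are invented, and real Hive would run HiveQL over files in HDFS, compiling the query into MapReduce jobs:

```python
import sqlite3

# Stand-in for a Hive table; in Hive, this data would live as files in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")])

# The same GROUP BY logic that would otherwise need a Map and a Reduce phase;
# Hive compiles a query like this into MapReduce jobs behind the scenes.
for url, views in conn.execute(
        "SELECT url, COUNT(*) FROM page_views GROUP BY url"):
    print(url, views)
```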
Pig (Latin)
• Pig Latin is a simplified SQL-like language to express your ETL needs in a stepwise fashion. Pig is the engine that translates Pig Latin into MapReduce and executes it on Hadoop.

Mahout
• Mahout is a library of machine learning algorithms that run in a distributed fashion. Since machine learning algorithms are complex and time-consuming, Mahout breaks down the work such that it gets executed on MapReduce running on many machines.

ZooKeeper
• Apache ZooKeeper is an independent component which is used by various distributed frameworks such as HDFS, HBase, Kafka, and YARN. It is used for coordination between various components. It provides a distributed configuration service, synchronization service, and naming registry for large distributed systems.

Flume
• Flume makes it possible to continuously pump unstructured data from many sources to a central store such as HDFS.
• If you have many machines continuously generating data such as web server logs, you can use Flume to aggregate the data at a central place such as HDFS, as sketched below.
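To illustrate the aggregation pattern Flume automates (this is not Flume's actual API), a tiny hedged Python sketch with invented hostnames and log lines:

```python
# Hypothetical per-machine web server logs; in reality these would be
# files or network streams that Flume agents watch continuously.
sources = {
    "web-01": ["GET /home 200", "GET /cart 500"],
    "web-02": ["GET /home 200"],
}

# The core idea: merge many distributed streams into one central sink
# (Flume's sink would typically be HDFS; here it is just a list).
central_sink = []
for host, lines in sources.items():
    for line in lines:
        # Tag each event with its source so the origin is not lost.
        central_sink.append(f"{host} {line}")

print(central_sink)
```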
Hadoop 2.x Core Components
• The Hadoop 2.0 feature HDFS Federation allows horizontal scaling of the Hadoop Distributed File System (HDFS). This is one of the features most sought after by enterprise-class Hadoop users such as Amazon and eBay. HDFS Federation supports multiple NameNodes and namespaces.
• Hadoop 2.x has the following three major components:
• HDFS
• YARN
• MapReduce
Hadoop 2.x Architecture
HDFS
Hadoop 3.x Core Components
Why Hadoop 3.x
• With Java 7 attaining end of life in 2015, there was a need to raise the minimum runtime version to Java 8 with a new Hadoop release, so that the new release is supported by Oracle with security fixes and so that Hadoop can upgrade its dependencies to modern versions.
• With Hadoop 2.0, the shell scripts were difficult to understand: developers had to read almost all of them to work out the correct environment variable for an option and how to set it, whether java.library.path, the Java classpath, or GC options.
• With support for only 2 NameNodes, Hadoop 2 did not provide the maximum level of fault tolerance, but the release of Hadoop 3.x brings additional fault tolerance, as it offers multiple NameNodes.
• Replication is a costly affair in Hadoop 2, as it follows a 3x replication scheme, leading to 200% additional storage space and resource overhead. Hadoop 3.0 incorporates erasure coding in place of replication, consuming comparatively less storage space whilst providing the same level of fault tolerance.
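The storage numbers above can be checked with quick arithmetic. A minimal Python sketch; the 6-data/3-parity Reed-Solomon split is a common example erasure-coding configuration, used here purely as an illustration:

```python
def storage_overhead(data_blocks, total_blocks):
    # Extra storage beyond the raw data, as a percentage.
    return (total_blocks - data_blocks) / data_blocks * 100

# 3x replication: one data block is stored as 3 full copies.
print(storage_overhead(1, 3))   # 200.0 (% extra), as stated above

# Erasure coding, e.g. Reed-Solomon with 6 data + 3 parity blocks,
# still tolerates the loss of any 3 blocks but costs far less.
print(storage_overhead(6, 9))   # 50.0 (% extra)
```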
Hadoop 3.x Architecture
Data Replication in 3.x
Difference between 1.x, 2.x and 3.x Hadoop components
Difference between Hadoop 1.x, 2.x
Difference between Hadoop 2.x, 3.x
Open Source
Technologies
Open-source technologies
• Open source is a term that originally referred to open-source software (OSS).
• Open-source software is code that is designed to be publicly accessible: anyone can see, modify, and distribute the code as they see fit.
• Open-source software is developed in a decentralized and collaborative way, relying on peer review and community production.
• Open-source software is often cheaper, more flexible, and has more longevity than its proprietary peers because it is developed by communities rather than a single author or company.
Open-Source Big Data Tools
• Hadoop
• Atlas.ti
• Apache Storm
• Qubole
• Cassandra
• CouchDB
• Stats iQ
• Flink
• Cloudera
• RapidMiner
• DataCleaner
https://www.guru99.com/big-data-tools.html
Business Analytics
• Data Mining
• Reporting
• Performance metrics and benchmarking
• Descriptive Analysis
• Querying
• Statistical Analysis
• Data Visualization
• Data Preparation
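As a tiny, hedged illustration of two items from this list (descriptive analysis and querying), with invented daily sales figures:

```python
import statistics

# Hypothetical daily sales figures for a quick descriptive analysis.
sales = [120, 95, 140, 180, 130, 95, 160]

# Descriptive statistics: summarize what happened.
print("mean:", statistics.mean(sales))
print("median:", statistics.median(sales))
print("stdev:", round(statistics.stdev(sales), 1))

# Querying: filter the records that meet a business condition.
print("days above 130:", [s for s in sales if s > 130])
```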
Cloud and Big Data Mobile Business Intelligence
• Mobile business intelligence is software that extends desktop business intelligence (BI) applications so they can be used on a mobile device.
• Business intelligence is a composition of software systems that helps generate meaningful and useful information, enabling the user to gain keen insight into the company and to know about trends, patterns, technologies, and reports.
• Big data is often assumed to be just huge, unstructured data, whereas big data is not only huge: it is also about the composition of the data, the operations performed on it, and the value added in developing it.
Cloud and Big Data Mobile Business Intelligence
• The cloud can help you process and analyze your big data faster, leading to insights that can improve your products and business.
Crowdsourcing Analytics
• Crowdsourcing, a combination of "crowd" and "outsourcing", was coined by Jeff Howe in Wired magazine in 2006.
• It is a sourcing model that uses the depth of experience and ideas of an open group instead of an organization's own employees.
• Crowdsourcing taps into the global world of ideas, helping companies work through a rapid design process.
• You outsource to large crowds in an effort to make sure your products or services are right.
• The claimed upsides of crowdsourcing include improved cost, speed, quality, flexibility, scalability, and diversity.
• It has been used by startups, large corporations, and non-profit organizations, and to create common goods.
• Crowdsourcing is a case of ICT-enabled collaboration, aggregation, cooperation, consensus, and creativity.
• It is a new way of doing work where, if the conditions are right, the crowd can outperform individual specialists.
• Geographically scattered individuals connected by the web can cooperate to deliver strategies and results that are acceptable to most.
Advantages
• Save costs
• Save time
• Evolving innovation
• Reduce risk
• Increased efficiency
Key elements of crowdsourcing
• An organization that has a task it needs performed,
• A community (crowd) that is willing to perform the task voluntarily,
• An ICT environment that enables the work to occur and the community to interact with the organization,
• Shared benefit for the organization and the community.
Crowdsourcing Big Data
• Crowdsourcing is an innovative methodology in the era of big data, as it improves distributed processing and big data analysis.
• Crowdsourcing big data enables organizations to save their internal resources. Why hire over-qualified staff for big data processes that a crowdsourced workforce can handle more efficiently, quickly, and cost-effectively?
• Crowdsourcing big data enables organizations to benefit from the human element. Content moderation and sentiment analysis of feedback from clients, social updates, reviews, or comments by a crowdsourced workforce results in highly accurate, relevant, and meaningful insights compared with machines.
• The distributed nature of crowdsourcing ensures that big data is processed at a speed which would not be possible to achieve in-house.
• Organizations can build applications based on real-time analysis, as a crowdsourced workforce produces big data analysis in real time. Enterprises do not have to worry about being unfashionably late to the big data party.
Crowdsourcing in Big Data Analytics
• Generally, a data scientist spends 78% of their time preparing data for big data analysis. Therefore, a smart and cost-effective strategy for big data organizations is to hand unstructured data sets over to a well-managed crowdsourcing platform, so the crowd can tell more about the information contained within the collected data points. For instance, before the analysis the crowd can tell whether a data point is a Tweet or an update from Facebook, and whether it carries a negative, positive, or neutral meaning.
• The crowd gives structure (document editing, audio transcription, image annotation) to big data, thereby helping analysts improve their predictive models by 25%.
• Crowdsourcing alongside big data analysis can help uncover hidden insights from scattered but connected information quickly.
• Big data problems can be solved with more accuracy with crowdsourcing as a reliable medium.
• The results from the crowd can be used by data scientists to improve the efficiency of machine learning algorithms, as sketched below.
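As one concrete way crowd results can feed machine learning, here is a minimal Python sketch of majority-vote label aggregation; the posts, labels, and three-worker setup are invented for illustration:

```python
from collections import Counter

# Hypothetical crowd labels: three workers each tag the sentiment of a post.
crowd_labels = {
    "post_1": ["positive", "positive", "neutral"],
    "post_2": ["negative", "negative", "negative"],
}

def majority_vote(labels):
    # Aggregate noisy crowd answers into one training label per item.
    return Counter(labels).most_common(1)[0][0]

# The aggregated labels can then serve as training data for a classifier.
training_set = {post: majority_vote(votes) for post, votes in crowd_labels.items()}
print(training_set)  # {'post_1': 'positive', 'post_2': 'negative'}
```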
Crowdsourcing Context
• Crowd — an individual or groups working on an activity and completing it with zero visibility to other individuals or groups.
• Community — individuals or groups working on an activity with some level of visibility to other individuals and groups.
• Competition — individuals or groups working on and completing an activity independently (only a single winner).
• Collaboration — individuals or groups working on parts of an activity and contributing to its completion (everyone wins).
Inter and Trans Firewall Analytics
• Over the last 100 years, supply chains have evolved to connect multiple companies and enable them to collaborate to create enormous value for the end-consumer, via concepts like CPFR, VMI, etc.
• Decision sciences is witnessing a similar trend, as enterprises are beginning to collaborate on insights across the value chain.
• We call this trend the move from intra- to inter- and trans-firewall analytics.