
BIG DATA ANALYTICS
UNIT-1
Contents:
• Overview of Big Data
• State of the Practice in Analytics
• Roles in the New Big Data Ecosystem
• Examples of Big Data Analytics
• Data Analytics Lifecycle
Big Data Definition
• No single standard definition…
"Big Data" is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it.
The 7 V’s of Big Data
1. Volume
2. Velocity
3. Variety
4. Variability
5. Veracity
6. Value
7. Visualization
Volume
• In big data, Volume represents the sheer scale of data
generated and collected from various sources, including
transaction logs, social media, IoT sensors, and customer
interactions.
• The volume of data has been growing exponentially,
pushing organizations to:
• Implement scalable storage solutions like cloud data
lakes and distributed databases.
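As a minimal sketch of one scalable-storage pattern (not a setup prescribed by these slides), the snippet below writes a small dataset as date-partitioned Parquet files, the layout that many cloud data lakes build on. The local path, column names, and sample rows are illustrative assumptions, and the pyarrow package is assumed to be installed.

```python
# Sketch: writing data as partitioned Parquet files, a common data-lake layout.
# Path, column names, and sample rows are placeholders, not from the slides.
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 14.50, 3.25],
})

# Each distinct event_date becomes its own directory of Parquet files,
# so downstream engines can read only the partitions they need.
df.to_parquet("events_parquet/", partition_cols=["event_date"], engine="pyarrow")
```

Writing to actual cloud object storage (for example an s3:// path) would additionally require a filesystem library such as s3fs.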
Velocity: The speed at which data accumulates also
plays a role in determining whether data is
categorized as big data or conventional data.
Velocity
• Modern businesses generate data in real-time, from high-
frequency stock trading data to real-time IoT sensor
readings.
• To derive actionable insights promptly, big data systems
need to:
• Leverage real-time data processing capabilities with
tools like Apache Kafka, which allows for rapid data
ingestion and streaming analytics.
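As a hedged illustration of the streaming ingestion described above, the following Python sketch uses the kafka-python client against an assumed local broker at localhost:9092 and a hypothetical topic named sensor-readings; none of these names come from the slides.

```python
# Minimal sketch of streaming ingestion with kafka-python.
# Broker address and topic name are placeholder assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor_id": 7, "temperature": 21.4})
producer.flush()

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:       # blocks until a record arrives
    print(message.value)       # each record is processed as it streams in
    break                      # stop after one record in this sketch
```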
Variety
• Harnessing Diverse Data Types
Big data often comes from a variety of sources and exists
in diverse formats, including structured, semi-structured,
and unstructured data. Variety emphasizes the need to
handle multiple data types, such as:
• Structured data, which is highly organized, often found in
relational databases, and easily processed by traditional
analytics tools.
• Semi-structured data, like JSON files and XML,
containing some organizational structure but requiring
transformation for analysis.
• Unstructured data, such as images, videos, and social
media posts, which requires advanced tools like natural
language processing (NLP) and computer vision to
extract meaningful insights.
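A small sketch of handling semi-structured data, assuming pandas is available: nested JSON records (invented purely for illustration) are flattened into a structured table that traditional analytics tools can process.

```python
# Sketch: flattening semi-structured JSON records into a structured table.
# The sample records are invented for illustration.
import pandas as pd

records = [
    {"id": 1, "user": {"name": "Ann", "country": "IN"}, "tags": ["sale", "web"]},
    {"id": 2, "user": {"name": "Raj", "country": "US"}, "tags": ["mobile"]},
]

# json_normalize expands nested fields into flat columns such as
# "user.name" and "user.country", ready for tabular analysis.
df = pd.json_normalize(records)
print(df.columns.tolist())
print(df.head())
```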
Variability
• Variability deals with the inconsistencies and irregularities
often found in big data.
• This challenge arises when data varies in quality,
relevance, or structure due to factors like changing
formats, noise, or missing values.
• Addressing variability involves:
Data cleaning processes to correct inconsistencies, fill in
missing values, and remove noise that could skew results.
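A minimal sketch of the cleaning steps mentioned above, assuming pandas; the column names and sample values are invented.

```python
# Sketch of basic data cleaning: deduplication and missing-value handling.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", None, "south"],
    "amount": [100.0, 100.0, np.nan, 250.0],
})

df = df.drop_duplicates()                      # remove repeated records (noise)
df["region"] = df["region"].fillna("unknown")  # fill missing categorical values
df["amount"] = df["amount"].fillna(df["amount"].median())  # impute numeric gaps
print(df)
```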
Veracity
• Since data packets may get lost during processing, work
sometimes has to start again from the raw-data mining stage
in order to convert the data into valuable data.
• Veracity pertains to the accuracy, trustworthiness, and
credibility of the data used in analytics.
• Inaccurate data can lead to flawed insights, which is why
data veracity is fundamental to reliable analytics:
Data validation processes are essential to ensure data
accurately reflects real-world events and aligns with
business objectives
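A simple sketch of rule-based data validation, assuming pandas; the columns, sample rows, and rules are illustrative rather than taken from the slides.

```python
# Sketch: checking a dataset against simple validation rules before analysis.
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 2, 4], "price": [9.5, -3.0, 12.0, 7.25]})

checks = {
    "unique order ids": df["order_id"].is_unique,
    "no negative prices": (df["price"] >= 0).all(),
    "no missing prices": df["price"].notna().all(),
}
for rule, passed in checks.items():
    print(f"{rule}: {'PASS' if passed else 'FAIL'}")
```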
Value
• Ultimately, Value is the most crucial "V" in big data
analytics, as it pertains to deriving meaningful insights
that benefit the organization.
• For data to be valuable, analytics should:
Align with business objectives, focusing on areas
where data can drive operational improvements,
customer satisfaction, or revenue growth.
Turning raw data into useful data is itself a process; analysis is
then performed on the data that has been cleaned or retrieved
from the raw data.
State of the Practice in Analytics
• Customer churn: the percentage of customers that stopped
using your company's product or service during a certain time
frame (a small calculation sketch follows this list).
• Fraud: criminal deception intended to result in financial or
personal gain.
• Default: failure to fulfil an obligation, especially to repay a
loan or to appear in a law court.
• Anti-money laundering (AML): a set of laws, regulations, and
procedures intended to prevent criminals from disguising
illegally obtained funds as legitimate income. Though AML laws
cover a relatively limited range of transactions and criminal
behaviors, their implications are far-reaching.
• SAS is a statistical software suite developed by SAS Institute
for advanced analytics, multivariate analysis, business
intelligence, criminal investigation, data management, and
predictive analytics.
• Fair lending: the unbiased treatment of applicants in lending
decisions.
• Basel II is a set of international banking regulations put forth
by the Basel Committee on Banking Supervision, which leveled
the international regulatory field with uniform rules and
guidelines.
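The churn-rate sketch referenced above, using invented figures:

```python
# Churn rate = customers lost during the period / customers at the start.
# The counts below are made up for illustration.
customers_at_start = 1200
customers_lost = 90

churn_rate = customers_lost / customers_at_start
print(f"Quarterly churn rate: {churn_rate:.1%}")   # 7.5%
```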
Emerging Big Data Ecosystem and a New Approach to Analytics:

1. Data devices
2. Data collectors
3. Data aggregators
4. Data users and buyers
• Data devices:
The "SensorNet" gathers data from multiple locations and continuously
generates new data about this data. For each gigabyte of new data created, an
additional petabyte of data is created about that data.
• Data collectors:
 Data results from a cable TV provider tracking the shows a person
watches, which TV channels someone will and will not pay to watch on
demand, and the prices someone is willing to pay for premium TV content.
 Retail stores tracking the path a customer takes through their store while
pushing a shopping cart with an RFID (radio frequency identification) chip,
so they can gauge which products get the most foot traffic using geospatial
data collected from the RFID chips.
• Data aggregators:
These organizations compile the data collected from the various entities
in the "SensorNet" or the "Internet of Things," including the devices and
usage patterns tracked by government agencies, retail stores, and
websites. In turn, they can choose to transform and package the data as
products to sell to list brokers, who may want to generate marketing lists
of people who may be good targets for specific ad campaigns.

• Data users and buyers:
These groups directly benefit from the data collected and aggregated
by others within the data value chain.
DATA ANALYTICS INTRODUCTION
Introduction
• Data: Data is a set of values of qualitative or quantitative
variables. It is information in raw or unorganized form. It may
be a fact, figure, characters, symbols, etc.
• Analytics: Analytics is the discovery, interpretation, and
communication of meaningful patterns or summaries in data.
• Data Analytics (DA) is the process of examining data sets in
order to draw conclusions about the information they contain.
• Analytics is not a tool or technology; rather, it is a way of
thinking and acting on data.
Stages of Analytical Evolution
Descriptive analytics
• Descriptive analytics is a statistical method that is used
to search and summarize historical data in order to
identify patterns or meaning.
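A small sketch of descriptive analytics in pandas, summarizing invented historical sales data:

```python
# Sketch: summarizing historical data to surface patterns (descriptive analytics).
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "revenue": [120, 80, 150, 90, 200],
})

print(sales["revenue"].describe())              # count, mean, std, min, quartiles, max
print(sales.groupby("month")["revenue"].sum())  # totals per month reveal the trend
```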
Prescriptive analytics
• Prescriptive analytics is related to both descriptive and
predictive analytics.
• Prescriptive analytics focuses on finding the best course
of action for a given situation and belongs to a portfolio of
analytic capabilities that includes descriptive and predictive
analytics.
Predictive analytics
• Predictive analytics is the branch of advanced
analytics that is used to make predictions about
unknown future events.
• Predictive analytics uses many techniques, such as data
mining, statistics, modelling, machine learning, and
artificial intelligence, to analyze current data and make
predictions about the future.
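A hedged sketch of predictive analytics with scikit-learn: a classifier is trained on synthetic historical data and then used to predict an unseen outcome. The features, labels, and model choice are fabricated purely for illustration.

```python
# Sketch: train on labelled historical data, predict an unknown future outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # e.g. usage, tenure, complaints
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # e.g. churned / not churned

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print("prediction for a new customer:", model.predict([[0.2, -1.0, 0.5]])[0])
```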
Diagnostic analytics:
• Diagnostic analytics is a form of advanced analytics which
examines data or content to answer the question "Why
did it happen?"
• It answers questions such as: Why are things happening?
What is driving metrics to go up, or down, or anything along
those lines?
Cognitive analytics:
• Cognitive analytics applies human-like intelligence to
certain tasks, such as understanding not only the words in
a text, but the full context of what is being written or
spoken, or recognizing objects in an image within large
amounts of information.
Analytics life cycle
1. Problem Identification
2. Hypothesis Formulation
3. Data Collection
4. Data Exploration/Preparation
5. Model Building
6. Model Validation and Evaluation
Problem Identification
• A problem is a situation which is judged to need correcting or solving.
• Problems can be identified through:
1. Comparative/benchmarking studies
2. Performance reporting
3. Asking some basic questions:
a) Who is affected by the problem?
b) What will happen if the problem is not solved?
c) When and where does the problem occur?
d) Why is the problem occurring?
e) How are people currently handling the problem?
Hypothesis formulation
1. Frame the questions which need to be answered.
2. Develop a comprehensive list of all possible issues
related to the problem.
3. Reduce the list by eliminating duplicates and combining
overlapping issues.
4. Using consensus building, narrow the list down to the major issues.
Data collection:
1. Using data that has already been collected by others.
2. Systematically selecting and observing characteristics of
people, objects, and events.
3. Orally questioning respondents, either individually or as a
group.
4. Collecting data based on answers provided by the
respondents in written format.
Data Exploration
1. Importing data
2. Variable identification
3. Data cleaning
4. Summarizing data
5. Selecting a subset of data
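A sketch mapping the five steps above to typical pandas operations; the column names and sample values are assumptions, and an inline CSV string stands in for a real data file so the sketch runs self-contained.

```python
# Sketch of the data-exploration steps with pandas.
import io
import pandas as pd

# In practice this would be pd.read_csv("sales.csv"); the inline string below
# is a stand-in so the example is self-contained.
raw = io.StringIO("region,revenue\nnorth,1200\nnorth,1200\nsouth,\neast,800\n")

df = pd.read_csv(raw)                     # 1. importing data
print(df.dtypes)                          # 2. variable identification
df = df.drop_duplicates().dropna()        # 3. data cleaning (simplified)
print(df.describe(include="all"))         # 4. summarizing data
high_value = df[df["revenue"] > 1000]     # 5. selecting a subset of data
print(high_value)
```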
Model Building
• Building a model is a very iterative process because there
is no such thing as a final and perfect solution.
• Many machine learning and statistical techniques
are available on traditional technology platforms.
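A sketch of the iterative model-building loop, assuming scikit-learn: two candidate models (chosen arbitrarily here) are compared with cross-validation on synthetic data before one is selected.

```python
# Sketch: iterating over candidate models and comparing them with cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

candidates = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(max_depth=3),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```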
Model validation and Evaluation
• Like model building, the process of validating a model is also
iterative. There are many ways to do it:
• Confusion matrix
• Confidence interval
• ROC curve
• Chi-square
• Root mean square error
• Gain and lift chart
• A Confusion Matrix is a table that is often used to
describe the performance of a classification model (or
"classifier") on a set of test data for which the true values
are known. The confusion matrix itself is relatively simple
to understand, but the related terminology can be
confusing.
• A Receiver Operating Characteristic curve, or ROC curve,
is a graphical plot that illustrates the diagnostic ability of a
binary classifier system as its discrimination threshold is
varied.
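A short sketch of two of the listed evaluation methods, the confusion matrix and the ROC curve, computed with scikit-learn on synthetic predictions:

```python
# Sketch: confusion matrix and ROC/AUC for a binary classifier's predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # known labels
y_score = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])  # model probabilities
y_pred = (y_score >= 0.5).astype(int)                           # threshold at 0.5

print(confusion_matrix(y_true, y_pred))        # rows: actual, columns: predicted
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))  # area under the ROC curve
```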
