Big Data Unit-2
Data Science
Definition of Data Science
• Data science is the study of data: it involves developing methods of
recording, storing, and analyzing data to effectively extract useful information.
• Its aim is to gain insight and knowledge from any type of data,
both structured and unstructured.
• It is an area of study that combines domain expertise, programming skills,
and knowledge of mathematics and statistics to extract meaningful
insights from data.
Lifecycle of Data Science (Stages)
• Capture the data: Acquire data from identified internal and external
sources, extract it, and maintain it in a data warehouse.
• Process the data: Raw data contains many inconsistencies and incorrect
formats, which must be cleaned and converted into a form suitable for further analysis.
• Explore the data: Once the data is cleaned, we can understand the
information contained within it at a high level.
• Analyze the data: This is done by applying machine learning, statistical
models, and algorithms.
• Communicate results: The key findings (outputs) are communicated to
all stakeholders. This helps decide whether the project is a
success or a failure based on the results from the model.
Data Analysis
• It is the process of evaluating, cleaning, transforming, and modeling
data with the aim of discovering useful information, drawing conclusions, and
supporting decision making.
• Organizations use data analysis when they want to create a strong
business architecture, prepare solid business cases, conduct risk
assessments, identify market dynamics, assess product performance,
etc.
Basics of Data Analytics
• It is defined as the process of transforming data into actions through
analysis and insight, in the context of organizational decision making
and problem solving.
• It is a powerful scientific and statistical tool for analyzing raw
data and remodeling the information to obtain knowledge.
Data Analytics Process
1. Business Understanding
Define Objectives: Identify the business problem or goal.
Determine Scope: Establish project boundaries, resources, and timeline.
Identify Success Criteria: Define what success looks like for the project.
2. Data Understanding
Collect Data: Gather the initial data relevant to the problem.
Explore Data: Examine the data's properties and quality before preparation.
3. Data Preparation
Clean Data: Address missing values, remove duplicates, and correct errors.
Transform Data: Normalize, scale, and encode data for analysis.
Integrate Data: Combine data from multiple sources into a cohesive dataset.
4. Data Modeling
Select Model: Choose appropriate modeling techniques.
Train Model: Build the model using training data.
Validate Model: Evaluate model performance on test data.
5. Data Evaluation
Assess Performance: Ensure the model meets business objectives.
Business Validation: Confirm the model’s results align with business needs.
Interpret Results: Ensure the model is understandable to stakeholders.
Refine Model: Improve the model based on evaluation results.
6. Deployment
Plan Deployment: Develop a strategy for implementing the model.
Deploy Model: Integrate the model into the production environment.
Monitor Performance: Continuously track the model’s effectiveness.
Maintain Model: Update and retrain the model as needed.
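Steps 4 and 5 above (modeling and evaluation) can be sketched in a few lines of plain Python. This is only an illustration: the dataset, the function names, and the choice of a least-squares line as the "model" are all hypothetical, not part of the slides.

```python
# Minimal sketch of the Data Modeling and Data Evaluation steps:
# train a least-squares line on training data, validate on test data.
def fit_linear(xs, ys):
    """Train step: fit y = a + b*x by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def mse(xs, ys, a, b):
    """Validate step: mean squared error on held-out data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, split into training and test sets
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5, 6], [10.1, 12.2]

a, b = fit_linear(train_x, train_y)   # Data Modeling
error = mse(test_x, test_y, a, b)     # Data Evaluation
```

In practice the "Assess Performance" step would compare `error` against a threshold agreed in the Business Understanding phase before deployment.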
Types of Analytics
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Descriptive Analytics
• Descriptive analytics is a branch of data science focused on analyzing
historical data to understand what has happened in the past. It involves
summarizing and interpreting data to generate insights that describe
past events and trends. Here are the key aspects of descriptive analytics:
• Data Aggregation: Collecting and compiling data from various sources to
create a comprehensive dataset.
• Data Summarization: Using statistical methods to summarize the data,
such as calculating averages, totals, and percentages.
• Data Visualization: Creating charts, graphs, and other visual
representations to make the data easier to understand.
• Pattern Recognition: Identifying patterns and trends in the data to
provide a clearer picture of past performance.
Key Techniques:
• Summary Statistics: Measures such as mean, median, mode, range,
and standard deviation.
• Data Visualization Tools: Bar charts, histograms, pie charts, line
graphs, scatter plots, and heat maps.
• Reporting: Creating dashboards and reports that provide insights into
the data.
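The summary statistics listed above can be computed directly with Python's standard `statistics` module; the sample below is hypothetical.

```python
import statistics

# Summary statistics for a small hypothetical sample
data = [4, 8, 6, 5, 3, 8, 9, 5, 8]

mean_value = statistics.mean(data)      # arithmetic average
median_value = statistics.median(data)  # middle value when sorted
mode_value = statistics.mode(data)      # most frequent value
spread = statistics.stdev(data)         # sample standard deviation
```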
Tools
• Ms-excel,
• MATLAB,
• SPSS,
• STATA,
• GOOGLE ANALYTICS
Applications:
• Business Intelligence: Helping businesses understand their sales,
marketing effectiveness, customer behavior, and operational
performance.
• Healthcare: Analyzing patient data to track disease outbreaks,
treatment outcomes, and hospital performance.
• Finance: Summarizing financial data to understand market trends,
investment performance, and risk management.
• Retail: Understanding customer purchase patterns, inventory levels,
and sales trends.
Diagnostic Analytics
• Diagnostic analytics is a deeper level of data analysis that goes beyond descriptive analytics.
While descriptive analytics focuses on summarizing what has happened, diagnostic analytics
aims to explain why it happened. Here are the main features of diagnostic analytics:
• Drill-Down Analysis: Exploring detailed data to uncover the root causes of specific outcomes.
• Data Relationships: Examining relationships and correlations between different variables to
identify causative factors.
• Hypothesis Testing: Formulating and testing hypotheses to confirm potential causes of
observed patterns.
• Comparative Analysis: Comparing data across different periods, groups, or conditions to
understand variations and their underlying reasons.
Key Techniques:
• Correlation Analysis: Measuring the strength and direction of
relationships between variables.
• Data Mining: Using algorithms to discover patterns and relationships
in large datasets.
• Root Cause Analysis: Systematically identifying the fundamental
causes of a problem.
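A drill-down analysis of the kind described above can be as simple as grouping records and comparing periods. The sketch below uses hypothetical quarterly sales records to locate which region is responsible for an overall drop.

```python
# Drill-down sketch: total sales dropped between Q1 and Q2, so break
# the hypothetical records down by region to find where the drop is.
sales = [
    {"region": "North", "q1": 120, "q2": 118},
    {"region": "South", "q1": 150, "q2": 95},
    {"region": "East",  "q1": 110, "q2": 112},
]

# Decline per region (positive = sales fell)
drops = {r["region"]: r["q1"] - r["q2"] for r in sales}
worst = max(drops, key=drops.get)  # region with the largest decline
```

Here the drill-down immediately narrows the root-cause investigation to one region instead of the whole business.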
Applications:
• Business: Understanding why sales dropped in a particular quarter or
why a marketing campaign succeeded or failed.
• Healthcare: Investigating the causes of a sudden increase in patient
readmissions or the factors contributing to successful treatment
outcomes.
• Finance: Analyzing the reasons behind market movements or the
factors influencing credit risk.
• Manufacturing: Identifying the root causes of production defects or
equipment failures.
Predictive analytics
• Predictive analytics involves using historical data, statistical algorithms, and
machine learning techniques to identify the likelihood (probability) of future
outcomes based on past data. It goes beyond descriptive and diagnostic
analytics by providing actionable insights about future trends and behaviors.
Here are the key aspects of predictive analytics:
• Data Collection and Preparation: Gathering historical data and preparing it for
analysis by cleaning, transforming, and integrating data from different sources.
• Model Building: Developing statistical models and machine learning algorithms
that can predict future events based on historical data.
• Validation and Testing: Evaluating the accuracy and reliability of predictive
models using techniques such as cross-validation and testing on unseen data.
• Deployment: Implementing predictive models in real-time systems to make
predictions and inform decision-making.
Key Techniques:
Regression Analysis: Modeling the relationship between a dependent
variable and one or more independent variables to predict future
values.
Classification: Assigning items to predefined categories or classes
based on input data.
Time Series Analysis: Analyzing time-ordered data points to forecast
future trends.
Machine Learning Algorithms: Techniques such as decision trees,
random forests, neural networks, and support vector machines.
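As a toy illustration of the time-series technique above, the snippet below forecasts the next point of a hypothetical monthly sales series with a simple moving average; real predictive models are of course more sophisticated.

```python
# Time-series sketch: predict the next value as the mean of the
# most recent `window` observations (simple moving average).
def moving_average_forecast(series, window=3):
    recent = series[-window:]
    return sum(recent) / len(recent)

monthly_sales = [100, 104, 101, 108, 110, 115]  # hypothetical data
forecast = moving_average_forecast(monthly_sales)  # (108+110+115)/3
```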
Applications:
• Business: Forecasting sales, predicting customer churn, and
identifying potential leads.
• Healthcare: Predicting patient outcomes, disease outbreaks, and
hospital admissions.
• Finance: Risk assessment, credit scoring, and stock price forecasting.
• Marketing: Personalizing marketing campaigns, predicting customer
behavior, and optimizing pricing strategies.
• Manufacturing: Predicting equipment failures, optimizing
maintenance schedules, and improving supply chain management.
Prescriptive analytics
• Prescriptive analytics is the most advanced type of data analytics. It
goes beyond descriptive, diagnostic, and predictive analytics by not only
predicting future outcomes but also recommending actions to achieve
desired results. Here are the key aspects of prescriptive analytics:
• Data Integration: Combining data from various sources to create a
comprehensive dataset for analysis.
• Predictive Modeling: Using predictive analytics to forecast future
outcomes and identify potential opportunities and risks.
• Optimization: Applying mathematical and computational models to
determine the best course of action from a set of possible alternatives.
• Simulation: Using models to simulate different scenarios and their
potential outcomes to evaluate the impact of various decisions.
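The optimization aspect above, in its simplest form, is choosing the action with the best expected outcome from a set of alternatives. The sketch below evaluates hypothetical pricing actions by expected payoff; the actions, probabilities, and margins are invented for illustration.

```python
# Prescriptive sketch: score each candidate action by its expected
# payoff (probability of a sale times margin) and recommend the best.
actions = {
    "discount_10": {"prob_sale": 0.50, "margin": 40},
    "discount_20": {"prob_sale": 0.70, "margin": 25},
    "no_discount": {"prob_sale": 0.30, "margin": 60},
}

expected = {name: a["prob_sale"] * a["margin"] for name, a in actions.items()}
best_action = max(expected, key=expected.get)  # recommended action
```

Real prescriptive systems replace this enumeration with mathematical optimization (e.g. linear programming) over many constrained alternatives.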
Key Techniques:
• Optimization and simulation, as described above, together with decision-analysis methods.
Normal Distribution
• The probability density function of the normal distribution is:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Where,
x is the variable
μ is the mean
σ is the standard deviation
e is Euler's number, i.e. ≈ 2.718
Question 1: Calculate the probability density of the normal
distribution using the following data: x = 3, μ = 4 and σ = 2.
Solution: Given, variable x = 3,
Mean μ = 4, and
Standard deviation σ = 2.
By the formula of the probability density of the normal distribution, we can
write:
f(3) = (1 / (2√(2π))) · e^(−(3 − 4)² / (2 × 2²)) = (1 / (2√(2π))) · e^(−1/8) ≈ 0.176
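The worked example (x = 3, μ = 4, σ = 2) can be checked numerically with the standard `math` module:

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution at x."""
    coeff = 1 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

p = normal_pdf(3, 4, 2)  # density for the worked example, about 0.176
```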
Binomial Distribution
• It is a discrete probability distribution with the formula:
P(X = r) = nCr · p^r · q^(n−r)
Where, q = 1 − p
n = the number of trials
p = probability of success
r = number of successes
Where to Apply the Binomial Distribution (Criteria)
• Finite number of trials.
• Two possible outcomes, which are mutually exclusive.
• Each trial is independent.
• The probability of success is the same in each trial.
Example
• Tossing a coin 10 times and finding the probability of getting heads 3 times.
P(X = x) = [n! / ((n − x)! · x!)] · p^x · q^(n−x)
• x = 3
• n = 10
• p = 1/2
• q = 1/2
• P(X = 3) = [10! / (7! · 3!)] · (1/2)^3 · (1/2)^7 = 120/1024 ≈ 0.117
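The coin-toss example above can be verified with `math.comb` (Python 3.8+):

```python
import math

def binomial_pmf(r, n, p):
    """P(X = r) = C(n, r) * p^r * (1-p)^(n-r)."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

p_three_heads = binomial_pmf(3, 10, 0.5)  # 120/1024 = 0.1171875
```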
Properties of Binomial Distribution
• Main Parameters: Two parameters n and p.
Poisson Distribution
• It is also a discrete probability distribution.
• The French mathematician Dr. Siméon Poisson discovered it.
• It is used in scenarios where the probability of an event happening is very
small, i.e. the event is rare.
• This means that the probability of success is very small and the value of n is
very large.
• Probabilities are calculated for a certain time period.
• Example:
• Probability of defective items in a manufacturing company in a month.
• Probability of an earthquake occurring in a year.
• Formula:
P(X = r) = (e^(−m) · m^r) / r!, where m = np is the mean.
• Example: with n = 1000 and p = 3/100, m = np = 1000 × 0.03 = 30.
• e ≈ 2.718
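Continuing the n = 1000, p = 3/100 example, the Poisson probability of observing exactly 30 rare events can be computed directly:

```python
import math

def poisson_pmf(r, m):
    """P(X = r) = e^(-m) * m^r / r!  with mean m = n*p."""
    return math.exp(-m) * m ** r / math.factorial(r)

m = 1000 * (3 / 100)       # n = 1000, p = 3/100 gives mean m = 30
p_30 = poisson_pmf(30, m)  # probability of exactly 30 occurrences
```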
Properties of Poisson Distribution
• Main Parameter: ‘m’ is the only parameter, so it is called a uniparametric
distribution.
Correlation
• It is a statistical technique used to show whether, and how strongly, pairs of
variables are related.
• Correlation is a measure of the strength of the linear relationship between two
quantitative variables.
• The most common measure of correlation is Pearson's correlation coefficient, denoted as ‘r’.
The value of ‘r’ ranges from −1 to 1, where:
• r = 1 indicates a perfect positive correlation (as one variable increases, the other also
increases).
• r = −1 indicates a perfect negative correlation (as one variable increases, the other
decreases).
• r = 0 indicates no correlation (no linear relationship between the variables).
Correlation Formula:
• To calculate the Pearson correlation coefficient (r) between Study
Hours (X) and Exam Scores (Y):
• Compute the means of X and Y.
• Subtract the means from each X and Y value to get the deviations
from the mean.
• Multiply the deviations of X and Y for each pair and sum them up.
• Compute the sum of squared deviations for both X and Y.
• Divide the sum of products by the square root of the product of the two sums of
squared deviations:
r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² · Σ(Y − Ȳ)²)
• Means:
X̄ = (2 + 4 + 6 + 8 + 10) / 5 = 6
Ȳ = (70 + 75 + 80 + 85 + 90) / 5 = 80
• Deviations:
X−X‾=[−4,−2,0,2,4]
Y−Y‾=[−10,−5,0,5,10]
• Product of Deviations:
(−4)×(−10)=40
(−2)×(−5)=10
0×0=0
2×5=10
4×10=40
Sum = 100
• Sum of Squared Deviations:
Σ(X − X̄)² = 16 + 4 + 0 + 4 + 16 = 40
Σ(Y − Ȳ)² = 100 + 25 + 0 + 25 + 100 = 250
• Correlation Coefficient:
r = 100 / √(40 × 250) = 100 / 100 = 1 (a perfect positive correlation)
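The study-hours computation above can be checked by coding the deviation formula directly:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient via the deviation formula."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

# Study Hours (X) and Exam Scores (Y) from the example above
r = pearson_r([2, 4, 6, 8, 10], [70, 75, 80, 85, 90])
```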
Regression
• Regression is a fundamental technique in data science used for
predicting a continuous outcome variable based on one or more
predictor variables. The main goal of regression analysis is to understand
the relationship between the dependent variable (response) and one or
more independent variables (predictors).
• Regression analysis is a set of statistical processes for estimating the relationship
between a dependent variable (outcome variable) and one or more
independent variables (predictors, covariates, or features).
• It is used for two conceptually distinct purposes:
1. Prediction and forecasting.
2. Inferring causal relationships between independent and dependent
variables.
Types of Regression
• Linear Regression:
• Linear regression analysis is used to predict the value of a variable based on the value of
another variable.
• Simple Linear Regression: Involves one independent variable to predict the dependent
variable:
Y = β0 + β1X + ϵ
Where:
• Y is the dependent variable.
• X is the independent variable.
• β0 is the intercept.
• β1 is the slope coefficient.
• ϵ (epsilon) is the error term.
• Multiple Linear Regression: Involves two or more independent variables:
Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ
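Simple linear regression can be implemented directly from its least-squares formulas. The sketch below fits the study-hours/exam-scores data used in the correlation example; the function name is illustrative.

```python
def simple_linear_regression(xs, ys):
    """Least-squares estimates of beta0 (intercept) and beta1 (slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    beta0 = my - beta1 * mx
    return beta0, beta1

# Study-hours data: the fitted line is Y = 65 + 2.5 X
b0, b1 = simple_linear_regression([2, 4, 6, 8, 10], [70, 75, 80, 85, 90])
predicted_score = b0 + b1 * 7  # predict the score for 7 study hours
```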
• Polynomial Regression
• Extends linear regression by allowing the relationship between the
independent and dependent variables to be modeled as an n-th
degree polynomial:
• Y = β0 + β1X + β2X² + … + βnXⁿ + ϵ
• Stepwise Regression: It is a method used in regression analysis to select the
most significant predictor variables from a larger set of potential variables.
The process is "stepwise" because it adds or removes predictors one at a
time, based on specific criteria, to build the most effective model.
• Quantile Regression: is an extension of traditional linear regression that
allows for the modeling of different quantiles (percentiles) of the
dependent variable (response) rather than just the mean. In other words,
instead of predicting the average value of the response variable given a set
of predictors, quantile regression allows you to predict the value of the
response variable at specific quantiles (e.g., the 25th percentile, median or
50th percentile, 75th percentile, etc.).
• Bayesian Regression: Bayesian Regression is a way to do regression
analysis using the principles of Bayesian statistics. In simple terms, it's
a method where you estimate the relationship between variables
while also considering prior beliefs or knowledge about the model
parameters.