Big Data Unit-2

Brief Introduction to Data Science, Life Cycle of Data Science, Data Analytics and their types, Population and their types, Sample and their types, Probability and their types, Distribution and their types, Correlation and their types, Regression and their types.

Introduction to Data Science
Unit-2
Definition of Data Science
• It is the study of data that involves developing methods of recording,
storing and analyzing data to effectively extract useful information.
• The aim of data science is to gain insight and knowledge from any type
of data, i.e. both structured and unstructured.
• It is an area of study that combines domain expertise, programming skills
and knowledge of mathematics and statistics to extract meaningful
insights from data.
Lifecycle of Data Science (Stages)
• Capture the data: Acquire data from identified internal and external
sources, extract the data, and maintain it in a data warehouse.
• Process: Raw data has many inconsistencies and incorrect data formats,
which need to be cleaned and converted into a form suitable for further analysis.
• Explore the data: After the data is cleaned, we can understand the
information contained within it at a high level.
• Analyze the data: This is done by applying machine learning, statistical
models, and algorithms.
• Communicate Results: The key findings (output) are communicated to
all stakeholders. This helps decide whether the results of the project are a
success or a failure based on the inputs from the model.
Data Analysis
• It is the process of evaluating, cleaning, transforming and modeling
data with the aim of discovering useful information, drawing conclusions and
supporting the decision-making process.
• Data analysis is used in organizations when they want to create a strong
business architecture, prepare solid business cases, conduct risk
assessments, identify market dynamics, assess product performance,
etc.
Basics of Data Analytics
• It is defined as the process of transforming data into actions through
analysis and perception in the context of organizational decision making
and problem-solving.
• It is a powerful scientific and statistical tool for the analysis of raw
data to remodel the information and obtain knowledge.
Data Analytics Process
1. Business Understanding
Define Objectives: Identify the business problem or goal.
Determine Scope: Establish project boundaries, resources, and timeline.
Identify Success Criteria: Define the objectives and goals, i.e. what success looks like for the project.

2. Data Exploration (Examination)


Collect Data: Gather data from relevant sources.
Initial Analysis: Perform basic statistical analysis.
Visualize Data: Use charts and graphs to identify patterns and outliers.
Understand Relationships: Explore correlations between variables.

3. Data Preparation
Clean Data: Address missing values, remove duplicates, and correct errors.
Transform Data: Normalize, scale, and encode data for analysis.
Integrate Data: Combine data from multiple sources into a cohesive dataset.
Data Analytics Process
4. Data Modeling
Select Model: Choose appropriate modeling techniques.
Train Model: Build the model using training data.
Validate Model: Evaluate model performance on test data.

5. Data Evaluation
Assess Performance: Ensure the model meets business objectives.
Business Validation: Confirm the model’s results align with business needs.
Interpret Results: Ensure the model is understandable to stakeholders.
Refine Model: Improve the model based on evaluation results.

6. Deployment
Plan Deployment: Develop a strategy for implementing the model.
Deploy Model: Integrate the model into the production environment.
Monitor Performance: Continuously track the model’s effectiveness.
Maintain Model: Update and retrain the model as needed.
Types of Analytics
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Descriptive Analytics
• Descriptive analytics is a branch of data science focused on analyzing
historical data to understand what has happened in the past. It involves
summarizing and interpreting data to generate insights that describe
past events and trends. Here are the key aspects of descriptive analytics:
• Data Aggregation: Collecting and compiling data from various sources to
create a comprehensive dataset.
• Data Summarization: Using statistical methods to summarize the data,
such as calculating averages, totals, and percentages.
• Data Visualization: Creating charts, graphs, and other visual
representations to make the data easier to understand.
• Pattern Recognition: Identifying patterns and trends in the data to
provide a clearer picture of past performance.
Key Techniques:
• Summary Statistics: Measures such as mean, median, mode, range,
and standard deviation.
• Data Visualization Tools: Bar charts, histograms, pie charts, line
graphs, scatter plots, and heat maps.
• Reporting: Creating dashboards and reports that provide insights into
the data.
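The summary statistics listed above can be sketched with Python's built-in statistics module. The sales figures here are hypothetical, chosen only for illustration:

```python
import statistics

# Hypothetical monthly sales figures for illustration
sales = [120, 135, 150, 135, 160, 145, 135, 170]

print("Mean:", statistics.mean(sales))                 # average value
print("Median:", statistics.median(sales))             # middle value of the sorted data
print("Mode:", statistics.mode(sales))                 # most frequent value
print("Range:", max(sales) - min(sales))               # spread between extremes
print("Std dev:", round(statistics.stdev(sales), 2))   # sample standard deviation
```

Running this prints a mean of 143.75, a median of 140, a mode of 135, a range of 50, and a standard deviation of about 15.98.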
Tools
• MS Excel
• MATLAB
• SPSS
• STATA
• Google Analytics
Applications:
• Business Intelligence: Helping businesses understand their sales,
marketing effectiveness, customer behavior, and operational
performance.
• Healthcare: Analyzing patient data to track disease outbreaks,
treatment outcomes, and hospital performance.
• Finance: Summarizing financial data to understand market trends,
investment performance, and risk management.
• Retail: Understanding customer purchase patterns, inventory levels,
and sales trends.
Diagnostic Analytics
• Diagnostic analytics is a deeper level of data analysis that goes beyond descriptive analytics.
While descriptive analytics focuses on summarizing what has happened, diagnostic analytics
aims to explain why it happened. Here are the main features of diagnostic analytics:
• Drill-Down Analysis: Exploring detailed data to uncover the root causes of specific outcomes.
• Data Relationships: Examining relationships and correlations between different variables to
identify causative factors.
• Hypothesis Testing: Formulating and testing hypotheses to confirm potential causes of
observed patterns.
• Comparative Analysis: Comparing data across different periods, groups, or conditions to
understand variations and their underlying reasons.
Key Techniques:
• Correlation Analysis: Measuring the strength and direction of
relationships between variables.
• Data Mining: Using algorithms to discover patterns and relationships
in large datasets.
• Root Cause Analysis: Systematically identifying the fundamental
causes of a problem.
Applications:
• Business: Understanding why sales dropped in a particular quarter or
why a marketing campaign succeeded or failed.
• Healthcare: Investigating the causes of a sudden increase in patient
readmissions or the factors contributing to successful treatment
outcomes.
• Finance: Analyzing the reasons behind market movements or the
factors influencing credit risk.
• Manufacturing: Identifying the root causes of production defects or
equipment failures.
Predictive analytics
• Predictive analytics involves using historical data, statistical algorithms, and
machine learning techniques to identify the likelihood (probability) of future
outcomes based on past data. It goes beyond descriptive and diagnostic
analytics by providing actionable insights about future trends and behaviors.
Here are the key aspects of predictive analytics:
• Data Collection and Preparation: Gathering historical data and preparing it for
analysis by cleaning, transforming, and integrating data from different sources.
• Model Building: Developing statistical models and machine learning algorithms
that can predict future events based on historical data.
• Validation and Testing: Evaluating the accuracy and reliability of predictive
models using techniques such as cross-validation and testing on unseen data.
• Deployment: Implementing predictive models in real-time systems to make
predictions and inform decision-making.
Key Techniques:
Regression Analysis: Modeling the relationship between a dependent
variable and one or more independent variables to predict future
values.
Classification: Assigning items to predefined categories or classes
based on input data.
Time Series Analysis: Analyzing time-ordered data points to forecast
future trends.
Machine Learning Algorithms: Techniques such as decision trees,
random forests, neural networks, and support vector machines.
Applications:
• Business: Forecasting sales, predicting customer churn, and
identifying potential leads.
• Healthcare: Predicting patient outcomes, disease outbreaks, and
hospital admissions.
• Finance: Risk assessment, credit scoring, and stock price forecasting.
• Marketing: Personalizing marketing campaigns, predicting customer
behavior, and optimizing pricing strategies.
• Manufacturing: Predicting equipment failures, optimizing
maintenance schedules, and improving supply chain management.
Prescriptive analytics
• Prescriptive analytics is the most advanced type of data analytics. It
goes beyond descriptive, diagnostic, and predictive analytics by not only
predicting future outcomes but also recommending actions to achieve
desired results. Here are the key aspects of prescriptive analytics:
• Data Integration: Combining data from various sources to create a
comprehensive dataset for analysis.
• Predictive Modeling: Using predictive analytics to forecast future
outcomes and identify potential opportunities and risks.
• Optimization: Applying mathematical and computational models to
determine the best course of action from a set of possible alternatives.
• Simulation: Using models to simulate different scenarios and their
potential outcomes to evaluate the impact of various decisions.
Key Techniques:
• Optimization Algorithms: Techniques such as linear programming,
mixed-integer programming, and nonlinear programming to find the
optimal solution to a problem.
• Heuristic Methods: Approaches such as genetic algorithms, simulated
annealing, and ant colony optimization for solving complex
optimization problems.
• Decision Analysis: Techniques such as decision trees, influence
diagrams, and utility theory to evaluate and compare different
decision options.
Applications:
• Supply Chain Management: Optimizing inventory levels, production
schedules, and logistics to minimize costs and maximize efficiency.
• Healthcare: Recommending treatment plans, optimizing hospital
resource allocation, and improving patient care management.
• Finance: Optimizing investment portfolios, managing risk, and
improving financial planning.
• Marketing: Personalizing marketing campaigns, optimizing pricing
strategies, and improving customer segmentation.
• Manufacturing: Optimizing production processes, improving quality
control, and reducing downtime.
Population
• It is the set of all possible states of a random variable. The size of a
population is either infinite or finite. It is an entire group about which
some information is required to be ascertained (discovered).
• It is the number of things we are observing: humans, events, animals,
etc. It has parameters such as the mean, median, mode, and
standard deviation, among others.
• Population attributes are called parameters.
• Parameters are denoted using Greek letters (μ, σ) while statistics are
denoted using Roman letters (x̄, s).
Types of population
• Finite Population
• Infinite Population
• Existent Population
• Hypothetical Population
• Finite Population:
Definition: A population that contains a fixed, countable number of elements.
Example: The number of students in a particular school. If the school has 1,000
students, the population is finite because you can count the exact number of
students.
• Infinite Population:
Definition: A population that is theoretically unlimited or so large that it is
considered infinite for practical purposes.
Example: The number of stars in the universe or the number of grains of sand on a
beach. These populations are so vast that they are treated as infinite.
• Existent Population:
Definition: A population that exists in reality and can be observed or measured.
Example: The current employees of a company. These employees are real and can
be counted and observed directly.
• Hypothetical Population:
Definition: A population that is imagined or theoretical, often used in statistical
modeling to understand potential outcomes or distributions.
Example: All possible outcomes of rolling a fair die an infinite number of times. This
population doesn't physically exist but is used to understand the distribution and
probability of each outcome.
Sample
• A sample is only a portion of the population.
• It includes one or more observations drawn from the
population; a measurable characteristic of a sample is called a statistic.
• Without a population, a sample can't exist.
• Population and sample are two different terms, but they are related to
each other: the population is used to draw samples.
• E.g. some people living in India are a sample of the population of India.
Types of sampling
• Probability Sampling
• Non-Probability Sampling
Probability Sampling
• In probability sampling, every member of the population has a known,
non-zero chance of being selected. This method allows for the calculation
of sampling errors and helps ensure the sample is representative of the
population. Here are some common types of probability sampling:
• Simple Random Sampling:
• Description: Every member of the population has an equal chance of being
selected.
• Example: Drawing names from a hat where each name has an equal chance of
being picked.
• Stratified Sampling:
• Description: The population is divided into subgroups (strata) based on a specific
characteristic, and random samples are taken from each stratum.
• Example: Dividing a population into age groups (e.g., 18-25, 26-35, etc.) and
randomly sampling individuals from each group.
• Cluster Sampling:
• Description: The population is divided into clusters, often based on
geographical areas or naturally occurring groups. A random sample of clusters
is selected, and then all members of chosen clusters are surveyed.
• Example: Selecting a random sample of schools in a city and then surveying
all students within those selected schools.
• Systematic Sampling:
• Description: Every nth member of the population is selected after a random
starting point.
• Example: Choosing every 10th person on a list after randomly selecting a
starting point.
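Three of the probability-sampling schemes above can be sketched with Python's random module. The population of 100 member IDs, the two strata, and the sample sizes are all hypothetical, chosen only to make the mechanics visible:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sampling: every member has an equal chance of selection
simple = random.sample(population, 10)

# Systematic sampling: every 10th member after a random starting point
start = random.randrange(10)
systematic = population[start::10]

# Stratified sampling: split into two illustrative strata, sample from each
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

print(len(simple), len(systematic), len(stratified))  # 10 10 10
```

Note how systematic sampling fixes the gap between selected members at 10, while simple random sampling imposes no such structure.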
Non-probability sampling
• In non-probability sampling, not all members of the population have a
known or equal chance of being selected. This method is often easier
and less expensive but can lead to biased results. Here are some
common types of non-probability sampling:
• Convenience Sampling:
• Description: Samples are selected based on ease of access and proximity.
• Example: Surveying people at a mall because they are readily available.
• Judgmental (Purposive) Sampling:
• Description: Samples are selected based on the researcher’s judgment about
which individuals are most appropriate for the study.
• Example: Choosing experts in a field to participate in a study because they
have specific knowledge relevant to the research.
• Quota Sampling:
Description: The population is divided into subgroups, and samples are
taken from each subgroup to meet a specified quota.
Example: Ensuring that a survey includes 50 men and 50 women by
selecting participants until the quota for each group is met.
• Snowball Sampling:
Description: Sampling that starts with a small group of
participants and grows the sample as they refer others.
Example: Surveying a network of professionals by asking each
participant to refer others in the same profession.
Probability
• It means possibility.
• It is a branch of mathematics that deals with the occurrence of a random event.
• Its value is expressed between zero and one.
• To find the probability of a single event occurring, first we should know the total
number of possible outcomes.
• Probability of an event = Number of ways it can happen / Total number of outcomes.
• If an event is certain, its probability is 1; if an event is impossible, its probability is 0.
• E.g. Rolling a die
• What's the probability of rolling an even number? There are three favorable
outcomes (2, 4, and 6) and six total outcomes (1, 2, 3, 4, 5, and 6), so the
probability is 3/6, or 50%.
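The die example can be checked directly; this sketch counts favorable outcomes and uses exact fractions so no rounding is involved:

```python
from fractions import Fraction

# Probability of an event = favorable outcomes / total outcomes
outcomes = [1, 2, 3, 4, 5, 6]                     # rolling a fair six-sided die
favorable = [o for o in outcomes if o % 2 == 0]   # even numbers: 2, 4, 6

p_even = Fraction(len(favorable), len(outcomes))
print(p_even)  # 1/2, i.e. 50%
```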
Types of Probability
• Classical
• Empirical
• Subjective
Classical Probability
• Definition: Classical probability, also known as theoretical probability,
is based on the assumption that all outcomes in a sample space are
equally likely to occur.
• Formula: P(E) = Number of favorable outcomes / Total number of possible outcomes
• Example: The probability of rolling a 3 on a fair six-sided die is 1/6
because there is one favorable outcome and six possible outcomes.
Empirical (or Experimental)
Probability
• Definition: Empirical probability is based on observations or
experiments. It is the ratio of the number of times an event occurs to
the total number of trials.
• Formula: P(E) = Number of times event E occurs / Total number of trials
• Example: If you flip a coin 100 times and it lands on heads 48 times,
the empirical probability of landing on heads is 48/100 = 0.48
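Empirical probability can also be simulated rather than collected. This sketch flips a fair coin 1000 times; the seed is arbitrary, chosen only so the run is reproducible:

```python
import random

random.seed(7)  # arbitrary fixed seed for reproducibility
trials = 1000
heads = sum(random.choice(["H", "T"]) == "H" for _ in range(trials))

empirical_p = heads / trials
print(empirical_p)  # close to the theoretical 0.5, but rarely exactly equal
```

As the number of trials grows, the empirical probability tends toward the classical value of 0.5.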
Subjective Probability
• Definition: Subjective probability is based on personal judgment,
experience, intuition, or belief rather than on empirical data or a
precise calculation.
• Example: A weather forecaster might say there is a 70% chance of
rain tomorrow based on their experience and available data.
Distribution
• Distribution refers to the way values of a variable or data set are
spread or distributed.
• It provides a summary of the frequency of individual values or ranges
of values for a variable.
• Types of Distribution:
• Normal Distribution
• Binomial Distribution
• Poisson Distribution
Normal Distribution
• Also known as Gaussian distribution.
• It describes how the values of a variable are distributed and is
characterized by its bell-shaped curve.
• A normal distribution is a symmetric distribution where most of the
data points cluster around a central peak, with values tapering
off (decreasing) as they move away from the center.
• It is defined by two main parameters: the mean (average) and the
standard deviation (a measure of the spread or dispersion).
Formula

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where,

x is the variable
μ is the mean
σ is the standard deviation
e is Euler's number, i.e. 2.718
Question 1: Calculate the probability density function of the normal
distribution using the following data: x = 3, μ = 4 and σ = 2.
Solution: Given, variable x = 3,
Mean = 4 and
Standard deviation = 2
By the formula of the probability density of the normal distribution, we can
write:
f(3) = (1 / (2√(2π))) · e^(−(3 − 4)² / (2 · 2²)) = (1 / 5.013) · e^(−1/8) ≈ 0.1995 × 0.8825

Hence, f(3, 4, 2) = 0.176
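The worked example can be verified with a small sketch of the density formula:

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution at x."""
    coeff = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

print(round(normal_pdf(3, 4, 2), 3))  # 0.176, matching the worked example
```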


Binomial Distribution
• It is used in scenarios where the outcomes of an experiment are in
the form of success and failure, i.e. a set of only two alternatives,
hence the name binomial.
• Example:
• A new drug is introduced and an experiment is conducted to identify
whether the drug cures the disease or not.
• Identifying the probability of winning a lottery ticket.
• Tossing a coin 10 times and identifying the probability of getting heads 3
times.
• Formula:
P(r) = nCr · p^r · (1 − p)^(n − r)
or
P(X) = [n! / ((n − X)! · X!)] · p^X · q^(n − X)

Where, q = 1 − p
n = the number of trials (experiments)
p = probability of success
r (or X) = number of successes
Where to apply binomial
distribution(Criteria)
• Finite number of trials.
• Two possible outcomes, which are mutually exclusive.
• Each trial is independent.
• The probability of success is the same in each trial.
Example
• Tossing a coin 10 times and identifying the probability of getting heads 3 times.
P(X) = [n! / ((n − X)! · X!)] · p^X · q^(n − X)

• X = 3
• n = 10
• p = 1/2
• q = 1/2
• P(3) = [10! / (7! · 3!)] · (1/2)³ · (1/2)⁷ = 120 / 1024 ≈ 0.117
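Substituting these values, the example can be computed with a short sketch (math.comb gives nCr):

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 3 heads in 10 fair coin tosses
p3 = binomial_pmf(3, 10, 0.5)
print(round(p3, 4))  # 0.1172, i.e. 120/1024
```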
Properties of Binomial Distribution
• Main Parameters: Two parameters n and p.
Poisson Distribution
• It is also a discrete probability distribution.
• The French mathematician Dr. Siméon Poisson discovered it.
• It is used in scenarios where the probability of an event happening is very small,
i.e. the chances of the event happening are rare.
• This means that the probability of success is very small and the value of n is
very large.
• Probabilities are calculated for a certain time period.
• Example:
• Probability of defective items in a manufacturing company for a month.
• Probability of an earthquake occurring in a year.
• Formula:

P(x) = (e^(−m) · m^x) / x!

m = np, where n = total number of items or events.

• Where
• x = 0, 1, 2, 3...
• e is Euler's number (e = 2.718)
• m is the average rate or expected value; for a Poisson distribution m = variance, and m (λ) > 0
Where to apply Poisson Distribution
• Infinite number of trials, i.e. 'n' is very large.
• The probability of success, i.e. 'p', is small.
• E.g. 3% of a company's manufactured products are defective. Find the
probability that out of 1000 products, 2 products are defective.

P(x) = (e^(−m) · m^x) / x!

• m = np
• p = 3/100 = 0.03
• n = 1000
• m = 1000 × 0.03 = 30
• e = 2.718
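The defective-products example can be computed with a sketch of the Poisson formula:

```python
import math

def poisson_pmf(x, m):
    """P(X = x) for a Poisson distribution with mean m."""
    return (math.exp(-m) * m**x) / math.factorial(x)

# 3% defect rate over 1000 products gives m = np = 30
m = 1000 * 0.03
p2 = poisson_pmf(2, m)
print(p2)  # ≈ 4.2e-11: with 30 expected defects, seeing only 2 is extremely rare
```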
Properties of Poisson Distribution
• Main Parameter: 'm' is the only parameter, so it is called a uniparametric
distribution.
Correlation
• It is a statistical technique used to show whether, and how strongly, pairs of variables are
related.
• Correlation is a measure of the strength of the linear relationship between two
quantitative variables.
• The most common measure of correlation is Pearson's correlation coefficient, denoted as 'r'.
The value of 'r' ranges from −1 to 1, where:
• r = 1 indicates a perfect positive correlation (as one variable increases, the other also
increases).
• r = −1 indicates a perfect negative correlation (as one variable increases, the other
decreases).
• r = 0 indicates no correlation (no linear relationship between the variables).
Correlation Formula:
• To calculate the Pearson correlation coefficient (r) between Study
Hours (X) and Exam Scores (Y):
• Compute the means of X and Y.
• Subtract the means from each X and Y value to get the deviations
from the mean.
• Multiply the deviations of X and Y for each pair and sum them up.
• Compute the sum of squared deviations for both X and Y.
• Divide the sum of products by the square root of the product of the two
sums of squared deviations.
• Data: X = [2, 4, 6, 8, 10], Y = [70, 75, 80, 85, 90]
• Means:
X̄ = (2 + 4 + 6 + 8 + 10) / 5 = 6
Ȳ = (70 + 75 + 80 + 85 + 90) / 5 = 80
• Deviations:
X − X̄ = [−4, −2, 0, 2, 4]
Y − Ȳ = [−10, −5, 0, 5, 10]
• Product of Deviations:
(−4) × (−10) = 40
(−2) × (−5) = 10
0 × 0 = 0
2 × 5 = 10
4 × 10 = 40
Sum = 100
• Sum of Squared Deviations:
∑(X − X̄)² = 16 + 4 + 0 + 4 + 16 = 40
∑(Y − Ȳ)² = 100 + 25 + 0 + 25 + 100 = 250
• Result:
r = 100 / √(40 × 250) = 100 / 100 = 1, a perfect positive correlation.
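The whole calculation can be reproduced in a few lines of Python:

```python
import math

X = [2, 4, 6, 8, 10]      # study hours
Y = [70, 75, 80, 85, 90]  # exam scores

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

# Pearson's r = sum of products of deviations / sqrt(product of squared-deviation sums)
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
den = math.sqrt(sum((x - mean_x) ** 2 for x in X) * sum((y - mean_y) ** 2 for y in Y))
r = num / den
print(r)  # 1.0, a perfect positive correlation
```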
Regression
• Regression is a fundamental technique in data science used for
predicting a continuous outcome variable based on one or more
predictor variables. The main goal of regression analysis is to understand
the relationship between the dependent variable (response) and one or
more independent variables (predictors).
• Regression analysis is a set of statistical processes for estimating the relationship
between a dependent variable (outcome variable) and one or more
independent variables (predictors, covariates, or features).
• It is used for two conceptually distinct purposes:
1. Prediction and forecasting.
2. Inferring causal relationships between independent and dependent
variables.
Types of Regression
• Linear Regression:
• Linear regression analysis is used to predict the value of a variable based on the value of
another variable.
• Simple Linear Regression: Involves one independent variable to predict the dependent
variable. Y = β0 + β1X + ϵ
Where:
• Y is the dependent variable.
• X is the independent variable.
• β0 is the intercept.
• β1 is the slope coefficient.
• ϵ(epsilon) is the error term
• Multiple Linear Regression: Involves two or more independent variables.
Y=β0+β1X1+β2X2+…+βnXn+ϵ
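A simple linear regression can be fitted by ordinary least squares in plain Python. The data here is hypothetical, chosen to lie near the line Y ≈ 2X:

```python
# Simple linear regression by ordinary least squares (hypothetical data)
X = [1, 2, 3, 4, 5]
Y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Slope b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²); intercept b0 = ȳ - b1·x̄
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / \
     sum((x - mean_x) ** 2 for x in X)
b0 = mean_y - b1 * mean_x

print(round(b1, 2), round(b0, 2))  # 1.99 0.09: slope ≈ 2, intercept ≈ 0
predicted = b0 + b1 * 6            # predict Y for a new X = 6
```

The closed-form slope and intercept above are exactly what larger libraries compute for the one-predictor case; multiple linear regression generalizes this to a matrix solution.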
• Polynomial Regression
• Extends linear regression by allowing the relationship between the
independent and dependent variables to be modeled as an n-th
degree polynomial.
• Y = β0 + β1X + β2X² + … + βnXⁿ + ϵ
• Stepwise Regression: It is a method used in regression analysis to select the
most significant predictor variables from a larger set of potential variables.
The process is "stepwise" because it adds or removes predictors one at a
time, based on specific criteria, to build the most effective model.
• Quantile Regression: is an extension of traditional linear regression that
allows for the modeling of different quantiles (percentiles) of the
dependent variable (response) rather than just the mean. In other words,
instead of predicting the average value of the response variable given a set
of predictors, quantile regression allows you to predict the value of the
response variable at specific quantiles (e.g., the 25th percentile, median or
50th percentile, 75th percentile, etc.).
• Bayesian Regression: Bayesian Regression is a way to do regression
analysis using the principles of Bayesian statistics. In simple terms, it's
a method where you estimate the relationship between variables
while also considering prior beliefs or knowledge about the model
parameters.
