
Probability, Statistics & Hypothesis
Understanding Data

• All facts are data.


• Human interpretable
• Diffused data.
• Operational data
• Non-operational data
Big Data

• Big data is data whose volume is much larger than that of 'small data'
and is characterized as follows:
• Volume
• Velocity
• Variety
• Veracity
• Validity
• Value
Types of Data

• In Big Data, there are three kinds of data.


• Structured data – data organized in a fixed schema, such as rows and columns of a relational table.
• Unstructured data – data with no predefined format, such as text, images, audio and video.
• Semi-structured data – data that is partially organized, such as JSON or XML files.
Big Data Architecture Layers

• There are four main Big Data architecture layers:


• Data Ingestion
• Data Processing
• Data Storage
• Data Visualization
• The Big data processing cycle involves data management that consists
of the following steps:
• Data collection
• Data preprocessing
• Application of machine learning algorithms
• Interpretation and visualization of the results of the machine learning algorithm
Process of Machine Learning
• ML is a process that starts with defining the data and ends with a
model that achieves some defined level of accuracy.
• Define the problem (Problem Domain)
• Data collection
• Data preparation
• Splitting data in training and testing
• Algorithm selection
• Performance of ML algorithm
Define the problem (Problem Domain)
• Problem: Find out whether the input image is of a human or not.
• To define a problem, it is divided into
• Task (T),
• Experience (E)
• Performance (P)
• Task (T): Classify an image to determine whether it contains a human or not.
• Experience (E): Images labeled as containing a human or not.
• Performance (P): Error rate, i.e., out of all classified images, the
percentage that are wrong predictions.
• Lower error rate leads to higher accuracy.
Data collection
• Data can be collected from
• Open/public data source,
• Social media
• Academic research.
• Government or institutional data
• Good quality data yields better results.
• Good data is one that has the following properties:
• Timeliness
• Relevancy
• Accuracy
• Reliability
• Knowledge about the data
Data preparation
• Preprocessing of data involves the following processes:
• Cleaning: Involves identifying and rectifying errors or inconsistencies in the
dataset.
• Handling missing values
• Removing duplicates
• Correcting inconsistent data
• Dealing with outliers
• Binning methods
• Smoothing by bin means
• Smoothing by bin medians
• Smoothing by bin boundaries

• Formatting:
• Converting categorical data
• Date and time handling
Data preparation
• Sampling:
• Random Sampling:
• Over-sampling and Under-sampling
• Bootstrapping:
• Decomposition:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
• Scaling:
• Normalization (Min-Max Scaling)
• Standardization (Z-score Scaling)
• Robust Scaling:
Data preprocessing
• In the real world, the available data is often dirty. 'Dirty' means:
• Incomplete data
• Outlier data
• Data with inconsistent values
• Inaccurate data
• Data with missing values
• Duplicate data
Example
• The 'bad' or 'dirty' data can be observed in the following patient table.
Example
• Consider the set: S = (12, 14, 19, 22, 24, 26, 28, 31, 32). Apply various
binning techniques and show the result.
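A minimal Python sketch of one possible solution is shown below; the choice of equal-frequency bins of size 3 is an assumption made for illustration.

```python
# Equal-frequency binning of S into bins of size 3, followed by the
# three common smoothing methods.
S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
bin_size = 3
bins = [S[i:i + bin_size] for i in range(0, len(S), bin_size)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin medians: every value is replaced by the bin median
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the nearest
# bin boundary (ties are sent to the lower boundary here)
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)       # [[15.0, 15.0, 15.0], [24.0, 24.0, 24.0], [30.33.., ..]]
print(by_medians)     # [[14, 14, 14], [24, 24, 24], [31, 31, 31]]
print(by_boundaries)  # [[12, 12, 19], [22, 22, 26], [28, 32, 32]]
```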
Split data in training and testing

• For training and evaluating the model, the data is typically divided into three parts: a training set, a validation set, and a test set.
Algorithm Selection
• Selection of algorithm depends on the problem definition.
• For example, classifying emails as 'spam' or 'not spam'
requires an algorithm that takes the input variables and gives an
output of SPAM / NOT SPAM.
Data Transformations
• Data transformation performs operations such as normalization to improve
the performance of data mining algorithms.
• Normalization technique:
• Min-Max Procedure
• z-Score Normalization
Min-Max Procedure
• It is a normalization technique where each value v of a variable V is normalized by
subtracting the minimum value and dividing by the range, mapping the values to a new
range, say 0-1:
v' = (v − min(V)) / (max(V) − min(V)) * (new_max − new_min) + new_min
Example
• Consider the set: V= (88, 90, 92, 94). Apply Min-Max procedure and
map the marks to a new range 0-1.
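A minimal Python sketch of this calculation:

```python
# Min-Max normalization of V = (88, 90, 92, 94) to the new range [0, 1].
V = [88, 90, 92, 94]
v_min, v_max = min(V), max(V)
new_min, new_max = 0.0, 1.0

normalized = [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
              for v in V]
print(normalized)  # [0.0, 0.333..., 0.666..., 1.0]
```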
z-Score Normalization
• This procedure works by taking the difference between the field value
and the mean value, and scaling this difference by the standard deviation
of the attribute:
z = (v − µ) / σ
Example
• Consider the mark list V = {10, 20, 30} and convert the marks to z-scores.
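A minimal Python sketch of this conversion; whether the sample or population standard deviation is used depends on convention, and the sample version is assumed here.

```python
import statistics

V = [10, 20, 30]
mean = statistics.mean(V)   # 20
std = statistics.stdev(V)   # sample std = 10 (population std would be ~8.165)

z_scores = [(v - mean) / std for v in V]
print(z_scores)  # [-1.0, 0.0, 1.0]
```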
Data Types

• Data types are classified as:


Data Types
• Another way of classifying data is based on the number of variables used in the
dataset.
• The data can be classified as
• Univariate data
• Bivariate data
• Multivariate data
Univariate

• Univariate data involves only one variable.


Bivariate data
• This type of data involves two different variables.
• The analysis of this type of data deals with causes and relationships.
• Bivariate statistics
• Covariance

• Correlation
Example
• Find the covariance and correlation of data
X= (1, 2, 3, 4, 5) and Y= (1, 4, 9, 16, 25).
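A minimal NumPy sketch of this computation; note that np.cov divides by n − 1 by default, while bias=True gives the population (divide by n) version.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 9, 16, 25])

cov_sample = np.cov(X, Y)[0, 1]                  # 15.0 (divide by n-1)
cov_population = np.cov(X, Y, bias=True)[0, 1]   # 12.0 (divide by n)
corr = np.corrcoef(X, Y)[0, 1]                   # ~0.9811

print(cov_sample, cov_population, corr)
```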
Multivariate data
• When the data involves three or more variables, it is categorized
under multivariate.
• Analysis techniques are regression analysis, path analysis, factor
analysis, cluster analysis and multivariate analysis of variance
(MANOVA).
Central tendency

• Popular measures of central tendency are


• Mean: The sum of all values divided by the total number of values
• Median: The middle number in an ordered dataset.
• Mode: The most frequent value.
Dispersion
• The spread of a set of data around the central tendency is called
dispersion.
• Dispersion is represented by
• Variance
• Standard deviation
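A minimal Python sketch of these measures; the data values below are made up purely for illustration.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 22, 22, 30]   # illustrative values

mean = statistics.mean(data)       # sum of values / number of values
median = statistics.median(data)   # middle value of the ordered data
mode = statistics.mode(data)       # most frequent value

variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation

print(mean, median, mode, variance, std_dev)
```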
Machine Learning and Importance of Probability and Statistics

• Statistics - In machine learning, statistics is used to analyze data. It
helps find unseen patterns.
• Probability distributions - In machine learning, they are used to calculate
confidence intervals for parameters and to calculate critical regions for
hypothesis tests.
Random variable

• A random variable ‘X’ is a process by which a (real) number
x(s) is assigned to each possible outcome s of a statistical
experiment.
• RVs are of two types:
• Discrete RVs
• Continuous RVs
• The nth moment of a random variable X is defined as E[Xⁿ], the expected value of the nth power of X.
Mean square value

• When n = 2, the second moment E[X²] is called the mean square value of the random variable X.
Central moments
• The central moments are the moments of the difference
between the random variable X and its mean m_x.
• Thus the nth central moment is defined as E[(X − m_x)ⁿ].
Variance of random variable
• The second central moment, i.e. n = 2, E[(X − m_x)²], is called the
variance of the random variable X.
Standard Deviation
• The square root of the variance is called the standard
deviation of the random variable X.
Standard deviation = √(variance)
Cumulative Distribution Function

• The CDF of a RV X is defined as the probability that the RV
X takes values less than or equal to x:
F_X(x) = P(X ≤ x)
• Properties of CDF: 0 ≤ F_X(x) ≤ 1; F_X is non-decreasing; F_X(−∞) = 0; F_X(+∞) = 1.
Probability Density Function

• The probability density function (PDF) is defined as the derivative of the CDF:
f_X(x) = dF_X(x) / dx
Gaussian or Normal Distribution
• Gaussian Distribution is also called Normal Distribution.
• The PDF for a Gaussian random variable with mean m and standard deviation σ is given as:
f_X(x) = (1 / (σ√(2π))) exp(−(x − m)² / (2σ²))
Properties of PDF
• It is a non-negative function for all values of x

• The area under the PDF curve is always unity

• The peak value occurs at x = m (i.e. mean value).

• The plot of Gaussian PDF has even symmetry around mean value
Probability

• Marginal probability is the probability of an event irrespective of the


outcome of another variable.
• Joint probability is the probability of two events occurring
simultaneously.
• Conditional probability is the probability of one event occurring in
the presence of a second event.
Marginal probability

• P (F) =?
Joint Probability of Two Variables

• Find the probability that an employee is a rank-1 officer and male.
• Joint probability refers to the occurrence of 2 or more events together.
Conditional probability

• Find the probability that an employee is a rank-1 officer given that he is
male.
Bayes Theorem in ML
• Bayes' theorem was given by Thomas Bayes in the 18th century.
• Bayes' Theorem is a method to determine conditional probabilities –
that is, the probability of one event occurring given that another event
has already occurred.
• Bayes Theorem is a mathematical formula that describes how to
update the probability of a hypothesis (or event) based on new
evidence.
• It allows the model to revise its predictions as new data becomes
available.
• Used widely in Machine Learning (ML), particularly in classification
problems and probabilistic models.
Bayes Theorem in ML

P(A | B) = P(B | A) ∗ P(A) / P(B)
• The above equation is called as Bayes Rule or Bayes Theorem.
• P(A|B) is called the posterior probability, which we need to calculate. It is
defined as the updated probability after considering the evidence.
• P(B|A) is called the likelihood. It is the probability of the evidence when the
hypothesis is true.
• P(A) is called the prior probability, the probability of the hypothesis before
considering the evidence.
• P(B) is called the marginal probability. It is defined as the total probability of
the evidence under all hypotheses.
• Hence, Bayes Theorem can be written as:
posterior = likelihood * prior / evidence
Example
• Diagnose whether someone has a certain disease, given the following
information:
• Prior Probability: Based on the general population, the prior
probability of someone having this disease is 1% (i.e., 1 out of 100
people has it).
• Evidence: A medical test is conducted, which is not perfect. The test has:
• A True Positive Rate of 90%, meaning if the person has the disease, the test
will correctly identify it 90% of the time.
• A False Positive Rate of 10%, meaning if the person does not have the disease,
the test will incorrectly say they have it 10% of the time.
Find out the updated probability (posterior probability) that the
person has the disease given that the test result is positive.
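A worked sketch of this example in Python:

```python
p_disease = 0.01              # prior P(Disease)
p_pos_given_disease = 0.90    # likelihood (true positive rate)
p_pos_given_healthy = 0.10    # false positive rate

# Marginal probability of a positive test (the evidence)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))   # 0.108

posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)  # ~0.083, about an 8.3% chance of disease despite the positive test
```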
Bayes’ Theorem Examples

• A man is known to speak lies 1 out of 4 times. He throws a die and
reports that it is a six. Find the probability that it is actually a six.

• What is the probability that a patient has the disease meningitis given a
stiff neck? A doctor is aware that the disease meningitis causes a patient to
have a stiff neck 80% of the time. He is also aware of
some more facts, which are given as follows:
• The known probability that a patient has the meningitis disease is 1/30,000.
• The known probability that a patient has a stiff neck is 2%.
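Worked sketches of both problems using Bayes' rule; the die example assumes that when the man lies about a non-six he reports a six.

```python
# 1) Probability that the throw is actually a six, given he reports a six
p_six = 1 / 6
p_truth = 3 / 4                          # he lies 1 out of 4 times
p_report_given_six = p_truth             # he tells the truth about a six
p_report_given_not_six = 1 - p_truth     # he lies and claims a six
p_report = (p_report_given_six * p_six
            + p_report_given_not_six * (1 - p_six))
print(p_report_given_six * p_six / p_report)   # 3/8 = 0.375

# 2) Probability of meningitis given a stiff neck
p_stiff_given_men = 0.8
p_men = 1 / 30000
p_stiff = 0.02
print(p_stiff_given_men * p_men / p_stiff)     # ~0.00133
```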
What is a Bayesian Network?
• Bayesian Network (BN): A graphical model that represents
probabilistic relationships among variables using a directed acyclic
graph (DAG).

• Components:
• Nodes: Represent random variables (e.g., weather, disease, sensor readings).
• Edges: Represent dependencies or conditional relationships between
variables.
• Conditional Probability Tables (CPTs): Define the probability of a node
given its parent nodes.
Applications of Bayesian Networks

• Medical Diagnosis
• Decision Support Systems
• Risk Assessment
• Machine Learning
• Natural Language Processing
Advantages of Bayesian Networks

• Modeling Uncertainty
• Visual Representation
• Learning from Data
Example
Example
Bayesian Networks Example
• Example: Harry installed a new burglary alarm at his home to detect
burglary. The alarm responds reliably to a burglary but also
responds to minor earthquakes. Harry has two neighbors, David and
Sophia, who have taken the responsibility to inform Harry at work when
they hear the alarm. David always calls Harry when he hears the
alarm, but sometimes he gets confused with the phone ringing and calls
then too. On the other hand, Sophia likes to listen to loud
music, so sometimes she misses hearing the alarm. Here we would like
to compute the probability of the burglary alarm.
• Calculate the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both David and
Sophia have called Harry.
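A sketch of the joint-probability computation for this query. The conditional probability values below are assumptions taken from the commonly used textbook version of this example, since the slide's actual CPTs are not reproduced here.

```python
p_not_burglary = 1 - 0.002            # P(~B)  (assumed CPT value)
p_not_earthquake = 1 - 0.001          # P(~E)  (assumed CPT value)
p_alarm_given_not_b_not_e = 0.001     # P(A | ~B, ~E)
p_david_calls_given_alarm = 0.91      # P(D | A)
p_sophia_calls_given_alarm = 0.75     # P(S | A)

# Chain rule over the DAG:
# P(D, S, A, ~B, ~E) = P(~B) * P(~E) * P(A | ~B, ~E) * P(D | A) * P(S | A)
joint = (p_not_burglary * p_not_earthquake * p_alarm_given_not_b_not_e
         * p_david_calls_given_alarm * p_sophia_calls_given_alarm)
print(joint)  # ~0.00068 with these assumed values
```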
Bayesian Networks
Hypothesis
• A hypothesis is an assumption or prediction based on some evidence
that can be tested.
• It is specifically used in supervised machine learning, where an ML
model learns a function that best maps the inputs to the corresponding
outputs with the help of an available dataset.
• Types of Hypothesis
• Null Hypothesis (H0)
• Alternative Hypothesis (H1)
Hypothesis testing

• There are two types of hypothesis tests,


• Parametric
• Tests – Z, t, F
• Non-parametric
• Tests – chi-square
Significance level α
• The significance level, often denoted as alpha (α), is a threshold used in
hypothesis testing to determine whether the null hypothesis should be rejected.
• Typically, two values are considered for the level of significance: 5% and 1%.
Z-test
• It assumes a normal distribution of data whose population variance is known.
• The sample size is assumed to be large.
• The focus is to test the population mean.
• The z-statistic is given as:
z = (X̄ − µ) / (σ / √n)
• X̄ is the sample mean,
• n is the sample size,
• µ is the mean of the population,
• σ is the standard deviation of the population.
Examples
• A company has developed a vaccine that is supposed to increase immunity
level. The standard deviation of immunity level in the general population is
20. The vaccine is tested on 40 patients, and a mean immunity level of
96.25 is obtained. Using an alpha value of 0.05, is this immunity level significantly
different from the population mean of 100?
• The population of GATE scores is known to have a standard deviation of
9. The Engineering department hopes to receive applicants with GATE
scores over 300. This year, the mean GATE score for the 40 applicants was
303.8. Using a value of α = 0.05, is this new mean significantly greater than
the desired mean of 300?
• A Swami Vivekanand school science teacher claims that students in his
section will score higher marks than those in his colleague's section. The
mean science score for 60 students in his section is 22.1, and the standard
deviation is 4.8. The mean science score for 40 students in the colleague's section is
18.8, and the standard deviation is 8.1. At α = 0.05, can the teacher's claim
be supported?
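A sketch of the one-sample z-test applied to the first two problems (the third is a two-sample comparison and is not shown); SciPy is used only to look up the normal tail probability.

```python
import math
from scipy.stats import norm

def z_test(sample_mean, pop_mean, pop_std, n):
    return (sample_mean - pop_mean) / (pop_std / math.sqrt(n))

# Vaccine example: two-tailed test against a population mean of 100
z_vaccine = z_test(96.25, 100, 20, 40)    # ~ -1.19
p_vaccine = 2 * norm.sf(abs(z_vaccine))   # ~0.24 > 0.05 -> fail to reject H0

# GATE example: one-tailed (greater-than) test against 300
z_gate = z_test(303.8, 300, 9, 40)        # ~ 2.67
p_gate = norm.sf(z_gate)                  # ~0.004 < 0.05 -> reject H0

print(z_vaccine, p_vaccine, z_gate, p_gate)
```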
One Sample Test
• In this test, the mean of one group is checked against a set value
that can be either a theoretical value or the population mean.
t = (x̄ − µ) / (s_x / √n)
• Here, t is the t-statistic, x̄ is the mean of the sample, and µ is the theoretical
value or population mean.
• s_x is the standard deviation of the sample, and n is the sample size.
• The degree of freedom is n − 1.
Example
• The following data represents the marks of 10 students: 9.5, 10, 8, 7, 11, 7, 6.5, 8.5,
10.5, 12. Does the mean value for these students differ significantly from the mean value of the
general population (12)? Evaluate the role of chance. (α = 0.05)
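A sketch of this test using SciPy's one-sample t-test:

```python
from scipy.stats import ttest_1samp

marks = [9.5, 10, 8, 7, 11, 7, 6.5, 8.5, 10.5, 12]
pop_mean = 12

t_stat, p_value = ttest_1samp(marks, pop_mean)
print(t_stat, p_value)  # t ~ -5.03, p << 0.05 -> the mean differs significantly from 12
```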

Independent Two Sample Test
• The t-statistic for two groups A and B is computed as follows:
t = (mean(A) − mean(B)) / √( s² (1/N1 + 1/N2) )
• Here, mean(A) and mean(B) are the means of the two different samples.
• N1 and N2 are the sample sizes of the two groups A and B.
• s² is the pooled variance of the two samples, and the degree of freedom is
N1 + N2 − 2.
• Then, the t-statistic is compared with the t-critical value.
Chi- Square test
• The Chi-Square test is a non-parametric test.
• It measures the statistical significance of the difference between observed and expected
frequencies; each observation is independent of the others, and the test statistic
approximately follows a Chi-Square distribution.
• This comparison is used to calculate the value of the Chi-Square statistic as:
χ² = Σ (O − E)² / E
• E is the expected frequency, O is the observed frequency, and the degree of
freedom is C − 1, where C is the number of categories.
• The Chi-Square test also helps detect dependencies between attributes and
remove redundant values.
Example
• Consider the following Table, where the machine learning course registration is
done by both boys and girls. There are 50 boys and 50 girls in the class and the
registration of the course is given in the table. Apply Chi-Square test and find out
whether any differences exist between boys and girls for course registration.
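A sketch of the test using SciPy; the registration counts below are hypothetical placeholders, since the slide's table is not reproduced here, and chi2_contingency uses (rows − 1)(columns − 1) degrees of freedom for a contingency table.

```python
from scipy.stats import chi2_contingency

#           registered, not registered
observed = [[35, 15],    # boys  (50 total) - hypothetical counts
            [20, 30]]    # girls (50 total) - hypothetical counts

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)  # expected frequencies under independence
```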
Concept learning

• Concept learning is the process of acquiring knowledge about categories, ideas,
or things based on shared features.
• In the context of concept learning, "shared features" refer to the common
characteristics or attributes that are present in all instances of a category.
• Purpose: It helps to group similar objects, events, or ideas together and
distinguish them from other categories.
• Importance: It is essential for problem-solving, categorization, and language learning.
• Concept learning requires three things:
• Input -Training dataset which is a set of training instances, each labeled with
the name of a concept or category to which it belongs.
• Output - Target concept or target function f. It is a mapping function f(x) from
input x to output y. Its purpose is to determine the specific or common features
that identify an object; in other words, to find the hypothesis that determines
the target concept. For example, the specific set of features that identify an elephant
among all animals.
• Test - New instances to test the learned model.
• Formally, it is defined as: "Given a set of hypotheses, the learner
searches through the hypothesis space to identify the best hypothesis
that matches the target concept."
Hypothesis space (H):
• Hypothesis space is defined as a set of all possible legal hypotheses;
hence it is also known as a hypothesis set. It is used by supervised
machine learning algorithms to determine the best possible hypothesis
to describe the target function or best maps input to output.
Searching the Hypothesis Space
• There are two ways of learning the hypothesis,
• Specialization - General to Specific learning
• This learning methodology will search through the hypothesis space for an
approximate hypothesis by specializing the most general hypothesis.
• Generalization - Specific to General learning
• This learning methodology will search through the hypothesis space for an
approximate hypothesis by generalizing the most specific hypothesis.
General to Specific

• This approach starts with a general idea and then narrows down to
specific instances or observations.
• It is often associated with deductive reasoning, where you begin with
a general principle and apply it to a specific case.
• Example:
• General: "All birds have wings."
• Specific: "This animal is a bird, so it must have wings."
Specific to General

• This approach starts with specific observations or data and then makes
broader generalizations.
• It's associated with inductive reasoning, where conclusions are drawn
based on the analysis of specific instances or patterns.
• Example:
• Specific: "I have seen five different swans, and all of them are white."
• General: "All swans must be white."
Hypothesis Space Search by Find-S Algorithm
• The Find-S algorithm is a simple machine learning algorithm used for
concept learning.
• The Find-S algorithm starts with the most specific hypothesis.
• This algorithm considers only the positive instances and eliminates
negative instances while generating the hypothesis.
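A minimal sketch of Find-S in Python; the toy training data below is made up for illustration.

```python
def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs."""
    hypothesis = None
    for attributes, label in examples:
        if label != "yes":            # Find-S ignores negative examples
            continue
        if hypothesis is None:        # start from the first positive example
            hypothesis = list(attributes)
        else:                         # generalize attributes that disagree
            hypothesis = [h if h == a else "?"
                          for h, a in zip(hypothesis, attributes)]
    return hypothesis

training_data = [
    (("sunny", "warm", "normal", "strong"), "yes"),
    (("sunny", "warm", "high",   "strong"), "yes"),
    (("rainy", "cold", "high",   "strong"), "no"),
    (("sunny", "warm", "high",   "strong"), "yes"),
]

print(find_s(training_data))  # ['sunny', 'warm', '?', 'strong']
```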
Bias and Variance

• In Machine Learning (ML), bias and variance are two fundamental


concepts that help us understand the performance of a model.
• Bias
• Variance
Bias
• Bias refers to the difference between the model's expected prediction
and the true value.
• Types of Bias:
• High Bias: The model is too simple and fails to capture the underlying
patterns in the data. This results in poor performance on both training and test
data. (underfitting)
• Low Bias: The model is complex and able to capture the underlying patterns
in the data. However, it may also capture noise in the data, leading to
overfitting.
Variance
• Variance refers to the variability of the model's predictions. It
measures how much the model's predictions change when it is trained
on different subsets of the data.
• Types of Variance:
• High Variance: The model is too complex capturing not just the
patterns but also the noise. This results in poor performance on test
data.
• Low Variance: The model is simple and fails to capture the
underlying patterns in the data. This results in poor performance on
both training and test data.
Trade-off between Bias and Variance
• There is a fundamental trade-off between bias and variance.
• As we increase the complexity of the model:
• Bias decreases (the model becomes less biased)
• Variance increases (the model becomes more prone to overfitting)
• Conversely, as we decrease the complexity of the model:
• Bias increases (the model becomes more biased)
• Variance decreases (the model becomes less prone to overfitting)
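A small illustrative sketch (not from the slides): fitting polynomials of increasing degree to noisy data usually shows training error falling while test error eventually rises, which is the trade-off described above.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (high bias); a high degree typically overfits
    # (high variance); a moderate degree balances the two.
    print(degree, round(train_err, 4), round(test_err, 4))
```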
