
FDA-Unit II-Notes

The document provides an overview of descriptive statistics, explaining its purpose in making informed decisions and predictions based on data analysis. It outlines the steps involved in statistical methods, the importance of gathering data, and the distinction between populations and samples. Additionally, it covers types of data, measurement levels, and measures of central tendency, including mean, median, and mode.

Uploaded by

Barkha Kumari
Copyright
© All Rights Reserved

Unit II

Descriptive Statistics
What is Statistics Used for?
• Statistics is used in all kinds of science and business applications.
• Statistics gives us more accurate knowledge which helps us make better decisions.
• Statistics can focus on making predictions about what will happen in the future. It can also
focus on explaining how different things are connected.
Statistics is mainly divided into 2 parts:
1. Descriptive
2. Inferential

Typical Steps of Statistical Methods


The typical steps are:
• Gathering data
• Describing and visualizing data
• Making conclusions
How is Statistics Used?
• Statistics can be used to explain things in a precise way. You can use
it to understand and make conclusions about the group that you
want to know more about. This group is called the population.
A population could be many different kinds of groups.
It could be:
• All of the people in a country
• All the businesses in an industry
• All the customers of a business
• All people that play football who are older than 45
• and so on - it just depends on what you want to know about.
• Gathering data about the population will give you
a sample. This is a part of the whole population.
Statistical methods are then used on that sample.
• The results of the statistical methods from the sample are
used to make conclusions about the population.
Gathering Data
• Gathering data is the first step in statistical analysis.
• Say for example that you want to know something about all the
people in France.
• The population is then all of the people in France.
• It is usually too much effort to gather information about all of the
members of a population (e.g. all 67+ million people living in
France). It is often much easier to collect data from a smaller group of that
population and analyze that. This smaller group is called a sample.
Statistics - Describing Data
Descriptive Statistics
The information (data) from your sample or population
can be visualized with graphs or summarized by
numbers. This will show key information in a simpler way
than just looking at raw data. It can help us understand
how the data is distributed.
Statistics - Making Conclusions
• Using statistics to make conclusions about a population
is called statistical inference.
Descriptive Statistics:
Descriptive statistics summarize and organize
characteristics of a data set. A data set is a collection of
responses or observations from a sample or entire
population.
Population (N) and sample (n):
The population is the complete set of
observations (rows) and features (columns). It is
usually very large in real-world problems.
The sample is a part, or subset, of the population. Statistical
studies are conducted on the sample to learn about
the population it comes from.
• Population(N): Everything in the group that we want to learn
about.
• Sample(n): A part of the population.

Examples of populations and a sample from those populations:


• For good statistical analysis, the sample needs to be as "similar" as possible to
the population. If they are similar enough, we say that the sample
is representative of the population.
• The sample is used to make conclusions about the whole population. If the
sample is not similar enough to the whole population, the conclusions could be
useless.
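The idea of drawing a sample and comparing it with its population can be sketched in Python. This is a minimal illustration only: the population of ages is synthetic, and the names `population`, `sample` and the sizes 10,000 and 200 are assumptions chosen for the example.

```python
import random
import statistics

# Hypothetical population: ages of 10,000 people (synthetic data, for illustration only).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(18, 90) for _ in range(10_000)]

# Simple random sample without replacement: every member is equally likely to be chosen.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"N = {len(population)}, population mean = {population_mean:.2f}")
print(f"n = {len(sample)}, sample mean = {sample_mean:.2f}")
```

With a simple random sample the two means are usually close, which is what "representative" means in practice. A sample drawn only from one part of the population (for example, only the oldest members) would not have this property.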
Types of Data
There are different types of data in statistics that are collected, analyzed,
interpreted and presented. Data are the individual pieces of factual information
that are recorded and used in the analysis process. The two processes of
data analysis are interpretation and presentation. Statistics are the result of data
analysis.
Data can be of different types, and different types require different statistical methods to
analyze.
Types of data
There are two main types of data:
Qualitative (or 'categorical')
Quantitative (or 'numerical')
These main types also have different sub-types depending on their measurement
level.
Qualitative or Categorical Data
• Qualitative data, also known as categorical data, is
data that fits into categories. Qualitative data are not
numerical. Categorical information involves categorical
variables that describe features such as a person’s gender,
home town etc. Categorical measures are defined in terms of
natural language specifications, not in terms of numbers.
• Sometimes categorical data can hold numerical values
(quantitative values), but those values do not have a mathematical
meaning. Examples of categorical data are birthdate, favourite
sport, school postcode. Here, the birthdate and school postcode
contain numbers, but arithmetic on them is not
meaningful.
Examples:
• Brands
• Nationality
• Professions
• With categorical data we can calculate statistics
like proportions. For example, the proportion of Indian
people in the world, or the percent of people who prefer
one brand to another.
The Nominal and Ordinal data types are classified under
categorical data.
Nominal Data
Nominal data is defined as data that is used for naming or
labelling variables, without any quantitative value. It is
sometimes called “named” data – a meaning coined from
the word nominal. The nominal data are examined using the
grouping method. In this method, the data are grouped into
categories, and then the frequency or the percentage of the
data can be calculated. Examples of nominal data include
country, gender, race, hair color etc. of a group of people.
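The grouping method described above (count each category, then convert counts to proportions) can be sketched as follows. The survey answers in `brands` are hypothetical and exist only to make the example runnable.

```python
from collections import Counter

# Hypothetical survey answers (a nominal variable): preferred brand of 10 people.
brands = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

counts = Counter(brands)   # frequency of each category
total = len(brands)
proportions = {brand: count / total for brand, count in counts.items()}

for brand, share in sorted(proportions.items()):
    print(f"{brand}: {counts[brand]} of {total} = {share:.0%}")
```

Only counting and comparing categories is meaningful here; averaging the labels themselves would not be.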
Ordinal data
Ordinal data/variable is a type of data that follows a natural order.
Ordinal data is a type of categorical data with an order. The
variables in ordinal data are listed in an ordered manner. The
ordinal variables are usually numbered, so as to indicate the order
of the list. However, the numbers are not mathematically
measured or determined but are merely assigned as labels for
opinions.
Examples of ordinal data include:
• Position in class as “First”, “Second” or “Third”
• Income as “Low”, “Medium” or “High”
• Performance as “Good”, “Better” or “Best”
Quantitative or Numerical Data
Quantitative data is also known as numerical data which
represents the numerical value (i.e., how much, how often,
how many). Numerical data gives information about the
quantities of a specific thing. Some examples of numerical
data are height, length, size, weight, and so on. The
quantitative data can be classified into two different types
based on the data sets.
The two different classifications of numerical data are
1. Discrete data
2. Continuous data.
Discrete Data
• Discrete data can take only distinct, separate values. Discrete data
has a finite or countable number of possible values. Those values
cannot be subdivided meaningfully. Here, things can be counted
in whole numbers.
• Example: Number of students in the class
Continuous Data
• Continuous data is data that can be measured. It has an infinite
number of possible values within a given
range.
• Example: Temperature, Height, Weight.
Measurement Levels
A variable has one of four different levels of measurement:
• Nominal
• Ordinal
• Interval
• Ratio.
The four different levels of measurement are:
• Nominal–Latin for name only (Republican, Democrat, Green, Libertarian)
• Ordinal–Think ordered levels or ranks (small–8oz, medium–12oz, large–32oz)
• Interval–Equal intervals among levels (1 dollar to 2 dollars is the same interval
as 88 dollars to 89 dollars)
• Ratio–Let the “o” in ratio remind you of a zero in the scale (Day 0, day 1, day 2,
day 3, …)
• The first level of measurement is nominal level of measurement.
• In this level of measurement, the numbers in the variable are used only to
classify the data.
• In this level of measurement, words, letters, and alpha-numeric symbols can
be used.
Suppose there are data about people belonging to three different gender
categories.
In this case,
• A person belonging to the female gender could be classified as F,
• A person belonging to the male gender could be classified as M,
• A transgender person could be classified as T.
This type of classification is the nominal level of measurement.
The second level of measurement is the ordinal level of measurement.
This level of measurement depicts some ordered relationship among the
variable’s observations.
Suppose,
• A student scores the highest grade in the class, 100. In this case, he would
be assigned the first rank.
• Then, another classmate scores the second highest grade, 92; she would
be assigned the second rank.
• A third student scores 81 and would be assigned the third rank, and so
on.
The ordinal level of measurement indicates an ordering of the measurements.
The third level of measurement is the interval level of measurement.
The interval level of measurement not only classifies and orders the
measurements,
but it also specifies that the differences between adjacent points on the
scale are equal all along the scale, from low to high.
For example, in a measurement of anxiety in students, the interval
between the scores 10 and 11
is the same as the interval between the scores 40
and 41.
A popular example of this level of measurement is temperature in
degrees Celsius, where, for example, the difference between 94 °C and 96 °C is
the same as the difference between 100 °C and 102 °C.
The fourth level of measurement is the ratio level of measurement.
In this level of measurement, the observations, in addition to having equal
intervals, have a true zero point.
The true zero in the scale distinguishes this level from the other levels of
measurement, although its other properties are the same as those of the interval level of
measurement.
In the ratio level of measurement, the divisions between the points on the scale
have an equivalent distance between them.
On a ratio scale, a zero means there’s a total absence of the variable of interest.
For example, the number of children in a household or years of work experience
are ratio variables: A respondent can have no children in their household or zero
years of work experience.
Measures of Central Tendency
• In statistics, the central tendency is a descriptive summary of a data
set. It reflects the center of the data distribution through a single value.
It does not provide information
about individual values in the dataset; it only gives a summary of
the dataset. The central tendency of a dataset is usually described
using one of several statistical measures.
Definition
• The central tendency is stated as the statistical measure that represents
the single value of the entire distribution or a dataset. It aims to provide
an accurate description of the entire data in the distribution.
• Measures of central tendency are numbers that tend to cluster around
the “middle” of a set of values. Three such middle numbers are
the mean, the median, and the mode.
Measures of Central Tendency
• The central tendency of the dataset can be found out
using the three important measures namely mean,
median and mode.
Mean
• Mean is the arithmetic average of the data set. It is calculated by dividing
the sum of all the data points by the number of data points in the data set.
• It is a point in a data set that is the average of all the data points we have in a
set.
• In statistics, mean is the most common and frequently used method to measure
the center of a data set.
• It’s a fundamental yet essential part of the statistical analysis of data.
• If we calculate the average value of the whole population, then
it is called the population mean. Sometimes the population data is vast, and we
cannot perform analysis on the whole data set.
• In that case, we take a sample out of it and take its average. That sample
represents the population, and the mean of this part of the data is called the
sample mean.
An important note is that the mean is an average
value, so it always falls between the minimum and
maximum values in the data set. The mean need not
be one of the numbers in the data set, although it is
sometimes equal to one of them.
Mean Formula For Ungrouped Data or Individual Series
• The formula to find the mean of ungrouped data is given below:
• Suppose x1, x2, x3, …, xn are n observations of a data set. Then the
mean of these values is:
Mean, x̄ = (x1 + x2 + x3 + … + xn)/n = ∑xi/n

• Here,
• xi = ith observation, 1 ≤ i ≤ n
• ∑xi = Sum of observations
• n = Number of observations
Examples
Question 1: Find the mean of the following data set.
10, 20, 36, 12, 35, 40, 36, 30, 36, 40
Solution:
• Given,
• xi = 10, 20, 36, 12, 35, 40, 36, 30, 36, 40
• n = 10
• Mean = ∑xi/n
• = (10 + 20 + 36 + 12 + 35 + 40 + 36 + 30 + 36 + 40)/10
• = 295/10
• = 29.5
• Therefore, the mean of the given data set is 29.5.
Example: The heights of 5 people are 142 cm, 150 cm, 149 cm, 156
cm, and 153 cm. Find the mean height.
Solution:
• Mean height, x̄ = (142 + 150 + 149 + 156 + 153)/5
= 750/5
= 150
• Mean, x̄ = 150 cm
• Thus, the mean height is 150 cm.
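The two worked examples above can be checked with a short Python sketch of the formula Mean = ∑xi/n. The function name `mean` and the variable names are illustrative choices, not part of the notes.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)

marks = [10, 20, 36, 12, 35, 40, 36, 30, 36, 40]   # data from Question 1
heights_cm = [142, 150, 149, 156, 153]             # data from the heights example

print(mean(marks))       # 29.5
print(mean(heights_cm))  # 150.0
```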
When a data set is large, a frequency distribution table is
often used to display the data in an organized way. A
frequency distribution table lists the data values, as well
as the number of times each value appears in the data
set. In a discrete frequency distribution the arithmetic
mean may be computed by any one of the following
methods:
• Direct Method
• Assumed Mean Method
• Step-deviation Method
Mean Formula For Grouped Data
• There are three methods to find the mean for grouped
data, depending on the size of the data. They are:
• Direct Method
• Assumed Mean Method
• Step-deviation Method
Let us go through the formulas in these three methods
given below:
Direct Method
• Suppose x1, x2, x3, …, xn are n observations with respective frequencies f1, f2, f3,
…, fn. This means the observation x1 occurs f1 times, x2 occurs f2 times,
x3 occurs f3 times and so on. Hence, the formula to calculate the mean in the
direct method is:
Mean, x̄ = (∑fixi)/(∑fi)
Here,
• ∑fixi = Sum of all the observations (each value multiplied by its frequency)
• ∑fi = Sum of frequencies = total number of observations
• This method is used when the values xi and frequencies fi are numerically
small.
Mean, x̄ = (∑xi fi)/(∑fi)
= 360/40
=9
• Thus, Mean = 9
Exercise: Solve the above example using the assumed mean method.
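A minimal Python sketch of the direct method, x̄ = (∑fixi)/(∑fi), is given below. The frequency table is hypothetical: it was chosen only so that its totals match the ones in the example above (∑fixi = 360, ∑fi = 40), and it is not the table from the original example.

```python
def grouped_mean_direct(values, freqs):
    """Direct method: sum of f_i * x_i divided by the sum of f_i."""
    total_f = sum(freqs)
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(values, freqs)) / total_f

# Hypothetical frequency table (assumption): value x_i occurs f_i times.
x = [5, 7, 9, 11, 13]
f = [4, 8, 16, 8, 4]

print(sum(fi * xi for xi, fi in zip(x, f)))  # 360, the sum of f_i * x_i
print(sum(f))                                # 40, the sum of f_i
print(grouped_mean_direct(x, f))             # 9.0
```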
Assumed Mean Method
• In this method, we generally assume a value as the
mean (namely a). This value is taken for calculating the
deviations based on which the formula is defined. Also,
the data will be in the form of a frequency distribution
table with classes.
• Thus, the formula to find the mean in the assumed mean
method is:
Mean, x̄ = a + (∑fidi)/(∑fi)
Here,
• a = assumed mean
• fi = frequency of ith class
• di = xi – a = deviation of ith class
• Σfi = N = Total number of observations
• xi = class mark = (upper class limit + lower class limit)/2. If the data are given in class intervals (i.e. a continuous series), use this midpoint of each class.
Assumed Mean Method Examples
• If xi and fi are numerically large, the assumed mean method is preferred. Below
are some examples of calculating the mean of grouped data by this method.
Example 1:
• The following table gives information about the marks obtained by 110 students
in an examination. Find the mean marks of the students using the assumed mean
method.
Assumed mean = a = 25
Mean of the data:
• Mean = a + (∑fidi/∑fi)
• = 25 + (-10/110)
• = 25 - (1/11)
• = (275 - 1)/11
• = 274/11
• ≈ 24.9
Hence, the mean marks of the students are 24.9.
Example 2:
• The table below gives information about the percentage distribution
of female employees in a company of various branches and a number
of departments. Find the mean percentage of female employees by
the assumed mean method.
Assumed mean = a = 40
• Mean = a + (∑fidi/∑fi)
• = 40 + (360/35)
• = 40 + (72/7)
• ≈ 40 + 10.29
• ≈ 50.29
Hence, the mean percentage of female employees is approximately 50.29.
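The assumed mean method can be sketched in Python as follows. The class marks and frequencies below are hypothetical (the original tables are not reproduced here); they were picked so that, with a = 25, the totals match Example 1 (∑fidi = -10, ∑fi = 110). The sketch also shows that the choice of a does not change the result.

```python
def assumed_mean_method(class_marks, freqs, a):
    """Mean = a + (sum of f_i * d_i) / (sum of f_i), where d_i = x_i - a."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    sum_fd = sum(f * (x - a) for x, f in zip(class_marks, freqs))
    return a + sum_fd / n

# Hypothetical table (assumption): classes 0-10, 10-20, ..., 40-50.
class_marks = [5, 15, 25, 35, 45]      # midpoints of the classes
freqs = [10, 21, 49, 20, 10]           # 110 students in total

print(round(assumed_mean_method(class_marks, freqs, a=25), 2))  # 24.91
print(round(assumed_mean_method(class_marks, freqs, a=35), 2))  # 24.91, same mean
```

Choosing a close to the middle of the data simply keeps the deviations di small, which makes hand calculation easier.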
Practice Questions on Assumed Mean Method
Solve the following questions using the formula of assumed mean
method.
1. Find the mean of the following data by assumed mean method.
2. The given distribution shows the number of runs scored by some top
batsmen of the world in one-day international cricket matches. Find the
mean of the data.
3. Find the mean of the following data using the assumed mean
method formula.
Step-deviation Method
• When the data values are large, the step-deviation method is used to find the mean. The formula is given
by:
Mean, x̄ = a + h × (∑fiui/∑fi)

Here,
• a = assumed mean
• fi = frequency of ith class
• di = xi – a = deviation of ith class
• h = class width
• ui = (xi – a)/h = di/h
• Σfi = N = Total number of observations
• xi = class mark = (upper class limit + lower class limit)/2
Example: Consider the following example to understand this method.
Find the mean of the following using the step-deviation method.
Solution: To find the mean, we first have to find the class marks
and decide a (assumed mean). Let a = 35. Here h (class width) = 10.
• Using the mean formula:
• x̄ = a + h × (∑fiui/∑fi)
• = 35 + (16/50) × 10 = 35 + 3.2 = 38.2

Mean = 38.2
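The step-deviation calculation can be sketched in Python. The table below is hypothetical: it is chosen so that, with assumed mean 35 and class width 10, the totals agree with the solution above (∑fiui = 16, ∑fi = 50), giving a mean of 38.2.

```python
def step_deviation_mean(class_marks, freqs, a, h):
    """Mean = a + h * (sum of f_i * u_i) / (sum of f_i), where u_i = (x_i - a) / h."""
    n = sum(freqs)
    if n == 0 or h == 0:
        raise ValueError("total frequency and class width must be non-zero")
    sum_fu = sum(f * ((x - a) / h) for x, f in zip(class_marks, freqs))
    return a + h * sum_fu / n

# Hypothetical table (assumption): classes 10-20, 20-30, ..., 60-70.
class_marks = [15, 25, 35, 45, 55, 65]
freqs = [5, 9, 13, 13, 8, 2]            # 50 observations in total

result = step_deviation_mean(class_marks, freqs, a=35, h=10)
print(round(result, 1))  # 38.2
```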
What is Median?
• Generally median represents the mid-value of the given set of data
when arranged in a particular order.
Median
• The value of the middlemost observation, obtained after arranging
the data in ascending or descending order, is called the median of the
data.
• For example, consider the data: 4, 4, 6, 3, 2. Let's arrange this data in
ascending order: 2, 3, 4, 4, 6. There are 5 observations. Thus, median
= middle value i.e. 4.
Case 1: Ungrouped Data
Step 1: Arrange the data in ascending or descending order.
Step 2: Let the total number of observations be n.
To find the median, we need to consider if n is even or odd. If n is odd,
then use the formula:
Median = ((n + 1)/2)th observation
Example 1: Let's consider the data: 56, 67, 54, 34, 78, 43, 23. What is
the median?
Solution:
Arranging in ascending order, we get: 23, 34, 43, 54, 56, 67, 78.
Here, n (number of observations) = 7
So, (7 + 1)/2 = 4
∴ Median = 4th observation
Median = 54
If n is even, then use the formula:
Median = [(n/2)th obs.+ ((n/2) + 1)th obs.]/2

Example 2: Let's consider the data: 50, 67, 24, 34, 78, 43. What is the median?
Solution:
Arranging in ascending order, we get: 24, 34, 43, 50, 67, 78.
Here, n (no.of observations) = 6
6/2 = 3
Using the median formula,
Median = (3rd obs. + 4th obs.)/2
= (43 + 50)/2
Median = 46.5
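Both cases (odd and even n) can be written as one small Python function. This is a sketch of the rule described above; the function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1)/2)th observation, which is index mid when counting from 0.
        return ordered[mid]
    # Even n: average of the (n/2)th and ((n/2) + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([56, 67, 54, 34, 78, 43, 23]))  # 54
print(median([50, 67, 24, 34, 78, 43]))      # 46.5
```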
• Case 2: Grouped Data
• When the data is continuous and in the form of a frequency
distribution, the median is found as shown below:
• Step 1: Find the median class.
• Let n = total number of observations i.e. ∑ fi
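• Note: The median class is the class in which the (n/2)th observation lies, i.e. the first class whose cumulative frequency is at least n/2.
• Step 2: Use the following formula to find the median.
• Median = l + [(n/2 − c)/f] × h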
• where,
• l = lower limit of median class
• c = cumulative frequency of the class preceding the median class
• f = frequency of the median class
• h = class size
Solution: We need to calculate the cumulative frequencies to find the
median.
Calculation table:
N = 50
N/2 = 50/2 = 25
Median Class = (20 - 30)
l = 20, f = 22, c = 14, h = 10
Using the median formula:
Median = 20 + [(25 - 14)/22] × 10
= 20 + (11/22) × 10
= 20 + 5 = 25
∴ Median = 25
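The grouped-data procedure (find the median class from the cumulative frequencies, then apply Median = l + [(n/2 − c)/f] × h) can be sketched in Python. The frequency table is hypothetical; it is chosen only to be consistent with the values used in the solution above (N = 50, l = 20, f = 22, c = 14, h = 10).

```python
def grouped_median(lower_limits, freqs, h):
    """Median of a continuous frequency distribution with equal class width h."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # c: cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            # This is the median class: the first class whose cumulative frequency reaches n/2.
            return lower + (half - cumulative) / f * h
        cumulative += f
    raise ValueError("median class not found")

# Hypothetical table (assumption): classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 9, 22, 10, 4]

print(grouped_median(lower_limits, freqs, h=10))  # 25.0
```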
Mode
• The value which appears most often in the given data i.e. the observation with the highest
frequency is called a mode of data.
Case 1: Ungrouped Data
• For ungrouped data, we just need to identify the observation which occurs maximum times.
• Mode = Observation with maximum frequency
• For example in the data: 6, 8, 9, 3, 4, 6, 7, 6, 3, the value 6 appears the most number of
times.
• Thus, mode = 6. An easy way to remember mode is: Most Often Data Entered. Note: A data
set may have no mode, 1 mode, or more than 1 mode. Depending upon the number of modes
the data has, it can be called unimodal, bimodal, trimodal, or multimodal.
• The example discussed above has only 1 mode, so it is unimodal.
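In Python, the standard library function `statistics.multimode` (available from Python 3.8) returns every value that ties for the highest frequency, which makes the unimodal and bimodal cases easy to see. The second data set below is a made-up example of bimodal data.

```python
from collections import Counter
from statistics import multimode

data = [6, 8, 9, 3, 4, 6, 7, 6, 3]      # data from the example above
print(Counter(data).most_common(1))     # [(6, 3)]: the value 6 occurs 3 times
print(multimode(data))                  # [6], so the data is unimodal

bimodal = [1, 1, 2, 2, 3]               # hypothetical data with two modes
print(multimode(bimodal))               # [1, 2]
```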
Case 2: Grouped Data
When the data is continuous, the mode can be found using the
following steps:
Step 1: Find the modal class, i.e. the class with maximum frequency.
Step 2: Find the mode using the following formula:
Mode = l + [(fm − f1)/(2fm − f1 − f2)] × h
where,
• l = lower limit of modal class,
• fm = frequency of modal class,
• f1 = frequency of class preceding modal class,
• f2 = frequency of class succeeding modal class,
• h = class width
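Solution:
The highest frequency = 12, so the modal class is 40-60.
l = lower limit of modal class = 40
fm = frequency of modal class = 12
f1 = frequency of class preceding modal class = 10
f2 = frequency of class succeeding modal class = 6
h = class width = 20
Using the mode formula, Mode = l + [(fm − f1)/(2fm − f1 − f2)] × h:
Mode = 40 + [(12 − 10)/(2 × 12 − 10 − 6)] × 20
= 40 + (2/8) × 20
= 40 + 5 = 45
∴ Mode = 45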
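The grouped-data mode can be checked with a short Python sketch of the standard formula Mode = l + [(fm − f1)/(2fm − f1 − f2)] × h, using the values listed in the solution. The function and parameter names are illustrative.

```python
def grouped_mode(l, fm, f1, f2, h):
    """Mode of grouped data from the modal class and its two neighbouring classes."""
    denominator = 2 * fm - f1 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*fm - f1 - f2 is zero")
    return l + (fm - f1) / denominator * h

# Values from the solution: modal class 40-60.
print(grouped_mode(l=40, fm=12, f1=10, f2=6, h=20))  # 45.0
```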
Relation Between Mean, Median and Mode
• The three measures of central value, i.e. mean, median,
and mode, are approximately connected, for moderately skewed distributions, by the following
relation (called the empirical relationship).
• 2Mean + Mode = 3Median
• For instance, if we are asked to calculate the mean,
median, and mode of continuous grouped data, then we
can calculate mean and median using the formulas as
discussed in the previous sections and then find mode
using the empirical relation.
For example, we have data whose mode = 65 and
median = 61.6.
Then, we can find the mean using the above mean,
median, and mode relation.
2Mean + Mode = 3 Median
∴2Mean = 3 × 61.6 - 65
∴2Mean = 119.8
⇒ Mean = 119.8/2
⇒ Mean = 59.9
Skewness
Skewness in statistics is a measure of the asymmetry
of a random variable’s distribution, i.e. its deviation
from a symmetric distribution (like the Normal Distribution).
In a Normal Distribution, we know that:
Median = Mode = Mean
• The blue curve is a Normal Distribution.
The yellow histogram shows some data that
follows it closely, but not perfectly. It is often called a bell curve, because it looks like a
bell.
Skewness in statistics can be divided into two categories.
They are:
• Positive Skewness
• Negative Skewness
Positive Skewness
• The extreme data values are higher in a positive skew
distribution, which increases the mean value of the data set. To
put it another way, a positive skew distribution has the tail on
the right side.
• It means that, Mean > Median > Mode in positive skewness
Negative Skewness
• The extreme data values are smaller in negative skewness,
which lowers the dataset’s mean value. A negative skew
distribution is one with the tail on the left side.
• Hence, in negative Skewness, Mean < Median < Mode.
Skewness Formula in Statistics
• Skewness is a measure used in statistics that helps
reveal the asymmetry of a probability distribution. It can
be positive, negative, or zero (for a symmetric distribution).
To calculate the skewness, we first have to find the
mean and variance of the given data.
• The skewness formula is given by:
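One widely used definition is the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments of the data. The Python sketch below assumes that definition, which may differ from the formula intended in these notes (other versions divide by n − 1 or use Pearson's mean-minus-mode form). The three data sets are made up for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    if n == 0:
        raise ValueError("skewness of an empty data set is undefined")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance (second central moment)
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]   # one large value pulls the tail to the right
left_tail = [1, 8, 9, 9, 10]    # one small value pulls the tail to the left

print(skewness(symmetric))        # 0.0
print(skewness(right_tail) > 0)   # True: positive skew
print(skewness(left_tail) < 0)    # True: negative skew
```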
Variance and Standard Deviation
Variance and standard deviation are two important measures in statistics.
Variance is a measure of how data points vary from the
mean, whereas standard deviation is a measure of the
spread of the data around the mean, in the original units. The basic difference
between the two is that standard deviation is expressed in the
same units as the mean of the data, while the variance is
expressed in squared units.
Variance
In layman’s terms, the variance is a measure of how
far a set of data is spread out from its mean or average
value. It is denoted as ‘σ²’.
Properties of Variance
• It is always non-negative since each term in the variance sum
is squared and therefore the result is either positive or zero.
• Variance always has squared units. For example, the variance
of a set of weights estimated in kilograms will be given in kg
squared. Since the population variance is squared, we cannot
compare it directly with the mean or the data themselves.
Standard Deviation
• The spread of statistical data is measured by the standard deviation.
It measures how far the data typically deviate from their mean or average
position. It is computed as the square root of the variance, that is, from
the deviations of the data points from the mean. It is denoted by the symbol ‘σ’.
Properties of Standard Deviation
• It is the square root of the mean of the squared deviations of all values in a
data set from their mean, and is also called the root-mean-square deviation.
• The smallest value of the standard deviation is 0 since it cannot be
negative.
• When the data values of a group are similar, the standard deviation
will be very low or close to zero. But when the data values differ widely from
each other, the standard deviation is high or far from zero.
Variance and Standard Deviation Formula

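The formulas for the variance and the standard deviation for both population
and sample data sets are given below:
• Population variance: σ² = Σ(xi – μ)²/N
• Population standard deviation: σ = √[Σ(xi – μ)²/N]
• Sample variance: s² = Σ(xi – x̄)²/(n – 1)
• Sample standard deviation: s = √[Σ(xi – x̄)²/(n – 1)]
Here μ is the population mean, x̄ is the sample mean, N is the population size and n is the sample size.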
Variance and Standard deviation Relationship
Variance is the average of the squared deviations from
the mean, while standard deviation is the
square root of the variance.
Both measures describe variability in a
distribution, but their units differ: standard deviation is
expressed in the same units as the original values,
whereas the variance is expressed in squared units.
Example
Question: If a die is rolled, then find the variance and
standard deviation of the possible outcomes.

Solution: When a die is rolled, there are 6 possible outcomes. So
the size of the sample space is n = 6 and the data set = {1, 2, 3, 4, 5, 6}.
To find the variance, first, we need to calculate the mean of the data
set.
• Mean, x̅ = (1+2+3+4+5+6)/6 = 3.5
• We can put the values of the data and the mean in the formula to get:
• σ² = Σ(xi – x̅)²/n
• σ² = ⅙ (6.25+2.25+0.25+0.25+2.25+6.25)
• σ² = 17.5/6 ≈ 2.917 (variance)
• Now, the standard deviation is
σ = √2.917 ≈ 1.708
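The die example can be reproduced with Python's standard `statistics` module. `pvariance` and `pstdev` divide by n (population formulas, as used in the solution above), while `variance` divides by n − 1 (sample formula); the last line is included only to show the difference.

```python
import statistics

outcomes = [1, 2, 3, 4, 5, 6]

pop_var = statistics.pvariance(outcomes)   # divides by n, as in the worked solution
pop_sd = statistics.pstdev(outcomes)       # square root of the population variance
sample_var = statistics.variance(outcomes) # divides by n - 1 instead

print(round(pop_var, 3))   # 2.917
print(round(pop_sd, 3))    # 1.708
print(sample_var)          # 3.5
```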
Coefficient of Variation Formula
In statistics, the coefficient of variation (CV), also known as
relative standard deviation (RSD), is a standardized measure of the
dispersion of a probability distribution or frequency distribution.
It is the ratio of the standard deviation to the mean:
CV = (standard deviation / mean), often expressed as a percentage by multiplying by 100.
When the value of the coefficient of variation is lower, it means the
data has less variability and high stability.
Example : Find the coefficient of variation of the
following sample set of numbers.
• {1, 5, 6, 8, 10, 40, 65, 88}.
Solution:
• Given sample set: {1, 5, 6, 8, 10, 40, 65, 88}.
• Sample mean = (1 + 5 + 6 + 8 + 10 + 40 + 65 + 88)/8
= 223/8 = 27.875
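The remaining steps of this example can be sketched in Python, assuming the usual definition CV = standard deviation / mean and the sample standard deviation (n − 1 in the denominator), since the data is described as a sample set.

```python
import statistics

data = [1, 5, 6, 8, 10, 40, 65, 88]

sample_mean = statistics.mean(data)    # 223 / 8
sample_sd = statistics.stdev(data)     # sample standard deviation (n - 1)
cv = sample_sd / sample_mean           # coefficient of variation as a ratio

print(sample_mean)           # 27.875
print(round(sample_sd, 2))   # 32.9
print(round(cv, 2))          # 1.18, i.e. about 118%
```

A CV above 1 means the standard deviation is larger than the mean, so this sample is highly variable.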
Covariance and Correlation
Covariance and correlation are two mathematical concepts used in statistics.
Both terms are used to describe how two variables relate to each other.
Covariance is a measure of how two variables change together.
Correlation is calculated as the covariance of the two variables divided by the
product of their standard deviations. Covariance can be
positive, negative, or zero. A positive covariance means that the two
variables tend to increase or decrease together.
A negative covariance means that the two variables tend to move in
opposite directions.
A zero covariance means that the two variables have no linear relationship.
Correlation can only be between -1 and 1. A correlation of -1 means that the two
variables are perfectly negatively correlated: as one variable
increases, the other decreases along an exact straight line. A correlation of 1 means that the two variables
are perfectly positively correlated: as one variable increases,
the other also increases along an exact straight line. A correlation of 0 means that the two variables have no
linear relationship.
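The relationship between covariance and correlation can be sketched in Python. The sketch assumes the population form of covariance (dividing by the number of pairs); a sample version would divide by n − 1 instead. The data in `x` and `y` is hypothetical.

```python
import math

def covariance(xs, ys):
    """Population covariance: average product of the paired deviations from the means."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))   # covariance of a variable with itself is its variance
    sd_y = math.sqrt(covariance(ys, ys))
    # Undefined (division by zero) if either variable is constant.
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical paired data (assumption).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(covariance(x, y))                               # 1.2, positive: they tend to rise together
print(round(correlation(x, y), 4))                    # 0.7746
print(round(correlation(x, [2 * v for v in x]), 4))   # 1.0, exact straight line upwards
print(round(correlation(x, [-v for v in x]), 4))      # -1.0, exact straight line downwards
```

Unlike covariance, the correlation does not change when a variable is rescaled (for example from metres to centimetres), which is why it is easier to compare across data sets.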
Covariance Formula
Cov (x,y) = 8/2 = 4
Hence, the covariance for the above data is 4.
