
Foundations or Research Analysis

This document defines data and provides an overview of how to classify and analyze different types of data. It discusses: 1) Two broad classifications of data based on source: primary data collected directly and secondary data collected from other sources. 2) Statistical classifications of categorical and measurement data, and how each is measured. 3) Scaling theory classifications of nominal, ordinal, interval, and ratio data based on the type of information and mathematical operations they allow. 4) Descriptive statistics measures used to analyze data, including measures of central tendency (mean, median, mode), dispersion (range, quartile deviation, mean absolute deviation, standard deviation), and skewness.

What is Data?

• Observations of a set of variables


• Lowest level of abstraction from which information is derived

• Each Discipline has evolved its own method of classification of data

• Two Broad Classifications of Data Based on Source


– Primary Data:
• Data collected directly from the original source
– Secondary Data:
• Data collected from other, already existing sources

Classification :: Statistics
• Categorical Data
– The Objects are grouped into categories based on some Qualitative Trait
– The resultant data are merely labels or categories
– Example:
• Hair Color: Brown / Black / Red
• Attitude Towards Smoking: Favor / Neutral / Against
• Measurement Data
– The Objects are “measured” on some Quantitative Trait
– The resultant data is a set of numbers
– Example:
• Age of the Students
• JEMAT Score
• Number of Students Not Attending Class

Categorical Data
• Nominal Data
– A type of categorical data in which numbers act as a label without having
any specific meaning
– Example:
• Male : 1
• Female : 2
• Ordinal Data
– A type of categorical data in which numbers act as a guide to the level of
importance of the object
– Example:
• Mild
• Moderate
• Severe

Measurement Data
• Discrete Data
– Only Certain Values are Possible
– There are gaps between the possible values
– Are generated through the process of Counting
– Example:
• Number of students in the class
• Number of Employees Absent from Work
• Continuous Data
– Any value within an interval is possible with a suitable measuring device
– Theoretically, the number can be accurate to any desired number of
decimal places
– Are generated through the process of Measurement
– Example:
• Height in cm
• Time to complete the assignment

Classification :: Scaling Theory
• Nominal Data (Order: No, Distance: No, Origin: No)
– A type of categorical data in which numbers act as a label without having
any specific meaning
– Example:
• Male : 1
• Female : 2
• Ordinal Data (Order: Yes, Distance: No, Origin: No)
– A type of categorical data in which numbers act as a guide to the level of
importance of the object
– Example:
• Mild
• Moderate
• Severe


Classification :: Scaling Theory
• Interval Data (Order: Yes, Distance: Yes, Origin: No)
– Quantitative data that does not have a real zero point
– Allows comparison within the scale but cannot be compared outside the scale
– Used in social research, though most researchers are not clear about the
interval scale
– Example:
• Definitely Will Buy / Probably Will Buy / May or May not Buy / Probably Will not
Buy / Definitely Will not Buy
• Ratio Data (Order: Yes, Distance: Yes, Origin: Yes)
– Quantitative data that has a real zero point
– Allows conversion to another scale while preserving the ratio of magnitudes
– Example:
• Distance in Kms


Why understand Data?
• The type of Analysis depends on the Type of data you
have collected
• General Guideline is as follows (each scale also allows everything listed for the scales above it):
– Nominal Data: Mode, Chi-Square
– Ordinal Data: + Median / Percentiles
– Interval Data: + Mean / SD / Correlation / Regression / ANOVA
– Ratio Data: + Geometric Mean / Harmonic Mean / Coefficient of Variation / Logarithms
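
The guideline above is cumulative: each scale adds statistics to those already allowed for the weaker scales. A minimal Python sketch of that lookup follows. The scale names and statistics come from the slide; the names `SCALES`, `NEW_STATISTICS` and `permissible_statistics` are illustrative assumptions, not part of any library.

```python
# Scales ordered from weakest to strongest, as in the guideline.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that become available at each scale (taken from the slide).
NEW_STATISTICS = {
    "nominal": ["mode", "chi-square"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation", "correlation", "regression", "ANOVA"],
    "ratio": ["geometric mean", "harmonic mean", "coefficient of variation", "logarithms"],
}


def permissible_statistics(scale: str) -> list[str]:
    """Return every statistic allowed at `scale`, including those
    inherited from the weaker scales below it."""
    level = SCALES.index(scale)  # raises ValueError for an unknown scale
    allowed = []
    for name in SCALES[: level + 1]:
        allowed.extend(NEW_STATISTICS[name])
    return allowed


print(permissible_statistics("ordinal"))
# ['mode', 'chi-square', 'median', 'percentiles']
```

The mean does not appear for ordinal data, which is the practical point of the slide: decide the scale before choosing the analysis.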

Some Points to Remember
• Researchers tend to use Interval Scales
• Data need not be comparable with other studies
• Data has to make sense in your context
• Students fail to understand the importance of Data
– Wrong Approach
• “I have collected the data… now what do I do?”
– Right Approach
• “What data do I need? Why do I need it? Where will I get it? How will I
analyse it to get the answer?”

Descriptive Statistics
:: A Quick Review

Measures of Central Tendency
• Central tendency is “loosely” defined as the concept of
location of the center of a distribution of data
• Three basic measures
– Arithmetic Mean
– Median
– Mode

Arithmetic Mean
• Advantages:
– Easy to Compute
– Affected by every value in the set of observations
– Defined by rigid mathematical formulation
– It is relatively reliable
– It represents the “center of gravity” of the data
• Disadvantages:
– Unduly affected by extremely small and / or large values
– Cannot be calculated for data with open-ended classes
– Is a good measure only when the distribution is fairly symmetric

Median
• Advantages
– Refers to the “Middle Value” of the distribution
– It is a “positional measure”
– Useful in case of open ended class
– Not seriously affected by Extreme Values
– Most appropriate for dealing with Qualitative Rank Data
– Has a series of related positional measures like Quartiles, Deciles,
Percentiles
• Disadvantages:
– It does not take every value into consideration
– It is not capable of algebraic treatment
– It is erratic if the number of items is small

Mode
• Advantages:
– It is the most typical or representative value of a distribution
– Not unduly affected by extreme values
– It can be used to describe qualitative phenomenon
• Disadvantages:
– Mode may not be there in a distribution or may be present more
than once in a distribution
– Not capable of algebraic treatment
– It is not rigidly defined for calculation

Relation Between the 3 Measures
• In moderately skewed distribution:
Mode = 3 Median – 2 Mean
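
As a rough illustration, the three measures and the empirical relation can be computed with Python's standard `statistics` module. The sample below is hypothetical and was chosen so that the relation holds exactly; on real data it is only an approximation.

```python
import statistics

# Hypothetical, moderately right-skewed sample (not from the slides).
scores = [1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8]

mean = statistics.mean(scores)      # arithmetic mean
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

# Empirical relation for moderately skewed data: Mode = 3 Median - 2 Mean.
estimated_mode = 3 * median - 2 * mean

print(f"mean={mean}, median={median}, mode={mode}")
print(f"mode estimated from the relation: {estimated_mode}")
```

Here the mean (4.5) lies above the median (4), which lies above the mode (3), the usual ordering for a positively skewed distribution.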

Measures of Dispersion
• Dispersion is defined as the degree to which data tends to
spread about a central value
• Four Absolute Measures, each with a Relative counterpart
– Range: Coefficient of Range
– Quartile Deviation: Coefficient of Quartile Deviation
– Mean Absolute Deviation: Coefficient of MAD
– Standard Deviation: Coefficient of Variation

• Range and QD are positional measures of dispersion
• MAD and SD are calculated measures of dispersion

Range
• Range = Largest Value – Smallest Value

• Advantages
– Simplest to understand and compute
• Disadvantages:
– Not based on each and every item in the data
– Does not take into account the shape of distribution
– Cannot be computed in case of open ended classes
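
A minimal sketch of the range and its relative counterpart, assuming the usual textbook definitions: Range = L – S and Coefficient of Range = (L – S) / (L + S), where L and S are the largest and smallest values. The function names and the data are illustrative.

```python
def value_range(values):
    """Absolute measure: largest value minus smallest value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Relative measure: (L - S) / (L + S); assumes L + S is not zero."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


heights_cm = [150, 155, 160, 165, 190]  # hypothetical data
print(value_range(heights_cm))           # 40
print(coefficient_of_range(heights_cm))
```

Only the two extreme observations enter the calculation, which is why the range ignores the shape of the distribution.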

Quartile Deviation
• Inter Quartile Range (IQR) = Q3 – Q1

• Quartile Deviation (Semi IQR) = (Q3 – Q1) / 2

• Coefficient of QD = (Q3 – Q1) / (Q3 + Q1)
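
The quartile-based measures can be sketched with `statistics.quantiles` from the Python standard library, assuming the usual definitions IQR = Q3 – Q1, QD = IQR / 2 and Coefficient of QD = (Q3 – Q1) / (Q3 + Q1). Quartile conventions differ between textbooks and software; this sketch uses the library's default ("exclusive") method, and the data are hypothetical.

```python
import statistics

# Hypothetical sample with one extreme value (40).
values = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 40]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1                        # inter quartile range
qd = iqr / 2                         # quartile deviation (semi IQR)
coefficient_qd = iqr / (q3 + q1)     # relative measure

print(q1, q3, iqr, qd, coefficient_qd)
```

The extreme value 40 does not move Q1 or Q3, so the quartile deviation stays at 4 while the range is 38.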

Quartile Deviation
• Advantages:
– Can measure variation in open ended distributions
– It is extremely useful in case of erratic or badly skewed data
– It is not affected by extreme values
• Disadvantages:
– Ignores 50% of the data
– Is not capable of mathematical manipulation
– Is not considered as a measure of dispersion:
• Effectively shows the distance between two positional points

Mean Absolute Deviation
• Mean Absolute Deviation (MAD) defined as:
MAD = Σ |x – A| / n, where A is the Mean or the Median
• Coefficient of MAD defined as:
= MAD / Median or MAD / Mean
• Advantages:
– Simple to understand and compute
– Based on each and every item in the data
– Less affected by extreme values than other measures
• Disadvantage:
– It is not capable of mathematical treatment
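
A short sketch of the mean absolute deviation, assuming it is the average of the absolute deviations taken about a chosen centre (the mean or the median). The function name and the data are illustrative.

```python
import statistics


def mean_absolute_deviation(values, center):
    """Average absolute distance of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


data = [2, 4, 6, 8, 10]  # hypothetical data; mean and median are both 6
center = statistics.mean(data)

mad = mean_absolute_deviation(data, center)
coefficient_mad = mad / center  # MAD / Mean

print(mad, coefficient_mad)
```

Passing `statistics.median(data)` as `center` gives the median-based version and the corresponding coefficient MAD / Median.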

Standard Deviation
• Defined as “Root Mean Squared Deviation from Mean”:
SD = √( Σ (x – Mean)² / n )

• Coefficient of Variation = (SD / Mean) × 100
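
A minimal sketch using the standard library. `statistics.pstdev` divides by n, which matches the "root mean squared deviation from mean" definition; `statistics.stdev` would divide by n – 1 instead. The coefficient of variation is assumed here to be expressed as a percentage of the mean, and the data are hypothetical.

```python
import statistics

marks = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = statistics.mean(marks)
sd = statistics.pstdev(marks)      # root mean squared deviation from the mean
cv = sd / mean * 100               # coefficient of variation, in percent

print(f"mean={mean}, sd={sd}, cv={cv}%")
```

Because the coefficient of variation is unit-free, it can compare the spread of variables measured in different units.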

Standard Deviation
• Advantages:
– Best Measure of Dispersion
– Possible to calculate the combined standard deviation of two or
more groups
– Chebycheff’s Theorem (Chebycheff, 1821-1894)
• Whatever the distribution, at least 75% of the values will fall
within +/- 2 SD of the mean and at least 89% will
fall within +/- 3 SD of the mean
– Has relation with other measures (approximately, for a normal distribution):
• QD = 0.667 SD
• MD = 0.80 SD
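
The 75% and 89% figures follow from the general bound 1 – 1/k² for values within k standard deviations of the mean. A small sketch, with illustrative function names and a hypothetical skewed sample, compares the bound with the share actually observed:

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum share of values within k SD of the mean (for k > 1)."""
    return 1 - 1 / k**2


def share_within(values, k):
    """Observed share of values within k population SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)


data = [1, 1, 2, 2, 3, 3, 4, 5, 8, 30]  # hypothetical skewed sample
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), share_within(data, k))
```

The bound makes no assumption about the shape of the distribution, so the observed share is never below it, even for this badly skewed sample.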

Skewness
• Refers to the asymmetry in the shape of the distribution

• Important to test skewness in data analysis, as skewed
data suggest that the assumption of normality is violated

Skewness - Measures
• Karl Pearson’s Measure of Skewness:
Sk = (Mean – Mode) / Standard Deviation
OR
Sk = 3 (Mean – Median) / Standard Deviation
- Skewness coefficient > 0 is positively skewed
- Skewness coefficient < 0 is negatively skewed
- Skewness coefficient = 0 is symmetrical

• Bowley’s Measure

• Moments Measure
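
The three measures can be sketched as follows. The formulas for Bowley's measure and the moments measure are not shown on the slide; this sketch assumes the common textbook forms, (Q3 + Q1 – 2 Median) / (Q3 – Q1) and the third central moment divided by SD cubed. Function names and data are illustrative.

```python
import statistics


def pearson_skewness(values):
    """Karl Pearson's measure: 3 (Mean - Median) / SD."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.pstdev(values)


def bowley_skewness(values):
    """Quartile-based measure (assumed form): (Q3 + Q1 - 2 Median) / (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)


def moment_skewness(values):
    """Moments measure (assumed form): third central moment / SD cubed."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


skewed = [1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8]  # hypothetical, long right tail
symmetric = [1, 2, 3, 4, 5]                     # hypothetical, symmetric

for sample in (skewed, symmetric):
    print(pearson_skewness(sample), bowley_skewness(sample), moment_skewness(sample))
```

All three coefficients are positive for the right-tailed sample and zero for the symmetric one, matching the sign rule above, although their magnitudes are not directly comparable.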

Kurtosis
• Kurtosis means “Bulginess”
• Refers to the degree of flatness or peaked-ness in the
region about the mode of the distribution:
– Lepto-Kurtic : If the curve is more peaked than Normal Curve
– Meso-Kurtic : If the curve is the same as the Normal Curve
– Platy-Kurtic : If the curve is less peaked than Normal Curve

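• Kurtosis by itself does not violate normality: the Normal Curve is
meso-kurtic, and only a marked departure from it (lepto- or platy-kurtic)
points to non-normality
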
• Important to check Kurtosis because it shows the
distribution of data around the mode

KURTOSIS - Measures

• Kurtosis = μ4 / μ2², where μ2 and μ4 are the second and fourth central moments

• Excess Kurtosis = Kurtosis – 3

Interpretation
• A normal distribution has kurtosis exactly 3 (excess
kurtosis exactly 0). Any distribution with kurtosis ≈3
(excess ≈0) is called mesokurtic.
• A distribution with kurtosis <3 (excess kurtosis <0) is
called platykurtic. Compared to a normal distribution, its
central peak is lower and broader, and its tails are shorter
and thinner.
• A distribution with kurtosis >3 (excess kurtosis >0) is
called leptokurtic. Compared to a normal distribution, its
central peak is higher and sharper, and its tails are longer
and fatter.
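
A small sketch of this interpretation, assuming kurtosis is computed as the fourth central moment divided by the square of the second central moment. The `tolerance` used to decide what counts as "approximately 3" is an arbitrary illustrative cut-off, and both samples are hypothetical.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify(values, tolerance=0.1):
    """Label a sample by its excess kurtosis (kurtosis - 3).
    `tolerance` is an assumed cut-off, not a standard value."""
    excess = kurtosis(values) - 3
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"


peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # sharp peak with two extreme values
flat = list(range(1, 10))                    # evenly spread values

print(kurtosis(peaked), classify(peaked))
print(kurtosis(flat), classify(flat))
```

The first sample has kurtosis 5 (excess 2) because a few extreme values dominate the fourth moment; the evenly spread sample falls below 3.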
[Figures: example curves for Leptokurtic, Mesokurtic and Platykurtic distributions]
Uses of Skewness and Kurtosis
• Most stock prices and asset returns are positively or
negatively skewed. Skewed data can be used to judge
whether a given or future data point is likely to be more or less
than the mean. Skewness is basically related to asymmetries (or
risks) in information. Higher risks lead to higher
returns
• Kurtosis is used to describe the risk of extreme values around the mean.
For example, if past data yields a leptokurtic distribution,
most returns will be close to the mean, but the longer and fatter
tails mean that extreme gains or losses are more likely than under a
normal distribution. With a platykurtic distribution the tails are
thinner, so extreme returns (or losses) are less likely.
What is Descriptive Statistics?
• The following need to be reported:
– Arithmetic Mean
– Median
– Mode
– Standard Deviation
– Variance
– Kurtosis
– Skewness
– Range
– Minimum
– Maximum
– Sum
– Count
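
A sketch of such a report using only the Python standard library. It assumes population formulas (dividing by n) for the variance and standard deviation, and moment-based skewness and kurtosis; statistical packages often report sample (n – 1) versions and excess kurtosis instead. The function name `describe` and the data are illustrative.

```python
import statistics


def describe(values):
    """Return the descriptive statistics listed above as a dictionary.
    Assumes at least two distinct values (variance must be non-zero)."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = statistics.pvariance(values)                   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n       # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n       # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.multimode(values),           # list: there may be several modes
        "standard_deviation": statistics.pstdev(values),
        "variance": m2,
        "kurtosis": m4 / m2 ** 2,
        "skewness": m3 / m2 ** 1.5,
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


report = describe([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data
for name, value in report.items():
    print(f"{name}: {value}")
```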
