Chapter 4 Data Management

This document provides an overview of key concepts in statistics including descriptive statistics, the normal distribution, hypothesis testing, regression, and correlation. It reviews topics like population and sample, variables and scales of measurement. It also covers measures of central tendency (mean, median, mode) and variability (range, variance, standard deviation). Additionally, it discusses the standard normal distribution including properties of the normal curve, calculating z-scores, and finding probabilities by determining the area under the curve.

Uploaded by

abuboreggie133

DATA MANAGEMENT
Day 4 – PM
December 13, 2017

PREPARED BY JOSEPH G. TABAN

1
CONTENTS

Review: Descriptive Statistics
Normal Distribution
Hypothesis Testing
Regression and Correlation
Use of available technology related to statistics

2
REVIEW ON BASIC CONCEPTS
IN STATISTICS

Preliminaries
Population and Sample
Types of Variable
Scales of Measurement

3
POPULATION AND SAMPLE

4
VARIABLE

This refers to some specific characteristic of a subject that assumes one or more different values.

Types of variable:
Quantitative Variable
Qualitative Variable

5
SCALES OF MEASUREMENT

Nominal Scales
Ordinal Scales
Interval Scales
Ratio Scales
6
EXERCISE:

Identify the type and scale of measurement of the given variables:

7
REPRESENTATION OF DATA

What is the appropriate graphical representation of a given data set?
Types of Graphs
Histogram
Bar graphs
Line graphs
Pie charts
Area graphs
X-Y plots

8
REPRESENTATION OF DATA

Use of Excel for Graphical Representation of Data

9
What can you say about
the following graphs?

10
WHAT’S WRONG?

11
WHAT’S WRONG?

12
WHAT’S WRONG?

13
WHAT’S WRONG?

14
MEASURE OF CENTRAL TENDENCY

The purpose of central tendency is to determine the single value that best represents the entire distribution of scores. The three standard measures of central tendency are the
Mean
Median
Mode

Show how to compute each measure.


15
MEASURE OF CENTRAL TENDENCY

Which measure of central tendency is best used in answering each question?
What would a student sharing a house with friends expect to pay in rent each month?
What are the eating preferences (based on a list of foods) of freshman students?
At what age do people usually get married for the first time?
16
MEASURE OF CENTRAL TENDENCY

Choosing a measure of central tendency depends on:
the level of measurement of the variable concerned (nominal, ordinal, interval or ratio);
what is to be done with the figure obtained.

The mean is suitable only for ratio and interval data. For ordinal variables, where the data can be ranked but one cannot validly talk of 'equal differences' between values, the median, which is based on ranking, may be used. Where it is not even possible to rank the data, as in the case of a nominal variable, the mode may be the only measure available.
17
EXERCISE

Imagine that you received the following data on the vocabulary test of a group of students:

22 23 23 23
23 23 24 25
29 30 30 30
30 30 31 32
33 33 34 35
36 36 37 37
• Compute the mean, mode, and median of the data and decide which of the three you believe best represents the central tendency of the data.
• Use Excel to verify the computed values.

18
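As a cross-check on the hand computation and the Excel values, the three measures for the vocabulary scores can also be computed in Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative and not part of the original exercise.

```python
from statistics import mean, median, multimode

# Vocabulary test scores copied from the exercise.
scores = [
    22, 23, 23, 23, 23, 23, 24, 25,
    29, 30, 30, 30, 30, 30, 31, 32,
    33, 33, 34, 35, 36, 36, 37, 37,
]

score_mean = mean(scores)        # sum of the scores divided by the number of scores
score_median = median(scores)    # average of the two middle scores, since n = 24 is even
score_modes = multimode(scores)  # every score tied for the highest frequency

print(f"mean   = {score_mean:.2f}")
print(f"median = {score_median}")
print(f"modes  = {score_modes}")
```

`multimode` is used instead of `mode` because these data have two most frequent scores (23 and 30), which is worth keeping in mind when deciding which measure best represents the group.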
MEASURES OF LOCATION

These are values which divide the distribution into a given number of equal parts.
Types:
Quartiles
Deciles
Percentiles

19
A percentile is a point in the distribution below which a given percent of cases lie.

If P70 of a 100-item test is 80, what does it mean?

20
EXERCISE

Locate Q1, D3, and P40 from the following data:

18, 10, 12, 27, 25, 35, 12, 26,
24, 18, 15, 30, 34, 26, 21, 14

21
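One way to check the answers to the exercise on Q1, D3, and P40 is sketched below. The slides do not say which positioning rule to use, so this assumes the common textbook convention that the k-th of m cut points sits at position k(n + 1)/m in the sorted data, with linear interpolation; in Python's `statistics.quantiles` that corresponds to `method="exclusive"`. Other conventions, including some Excel functions, can give slightly different values.

```python
from statistics import quantiles

data = [18, 10, 12, 27, 25, 35, 12, 26,
        24, 18, 15, 30, 34, 26, 21, 14]

# Assumed convention: position k(n + 1)/m in the sorted data, interpolating
# linearly between neighbouring values. quantiles() sorts the data itself.
q1 = quantiles(data, n=4, method="exclusive")[0]      # first quartile
d3 = quantiles(data, n=10, method="exclusive")[2]     # third decile
p40 = quantiles(data, n=100, method="exclusive")[39]  # 40th percentile

print(f"Q1 = {q1}, D3 = {d3}, P40 = {p40}")
```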
MIND WORK

Imagine that you conducted an in-service course for teachers. To receive university credit for the course, the teachers must take examinations, in this case a midterm and a final. The midterm was a multiple-choice test of 50 items and the final exam presented teachers with 10 problem situations to solve. Sue, like most teachers, was a whiz at taking multiple-choice exams, but bombed out on the problem-solving final exam. She received a 48 on the midterm and a 1 on the final. Becky didn't do so well on the midterm. She kept thinking of exceptions to answers on the multiple-choice exam. Her score was 39. However, she really did shine on the final, scoring a 10.

22
Since you expect students to do well on both exams, you reason that Becky has done a creditable job on each and Sue has not. Becky gets the higher grade. Yet, if you add the points together, Sue has 49 and Becky has 49. The question is whether the points are really equal.

Should Sue also do this bit of arithmetic, she might come to your office to complain of the injustice of it all. How will you show her that the value of each point on the two tests is different?

23
MEASURE OF VARIABILITY

This provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
Range
Variance
Standard Deviation
24
MEASURE OF VARIABILITY

• Consider the following data as scores of students in 8 quizzes in Math.

Group A    Group B
11         20
8          10
10         1
9          8
8          0
12         30
10         13
11         6

• Compute the range, variance, and standard deviation.
• What can you say about the two groups of students?

25
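The three measures for Group A and Group B can be checked with a short script. This sketch assumes the scores are treated as a sample, so the variance divides by n − 1 (`statistics.variance` and `statistics.stdev`); if they are treated as a whole population, `pvariance` and `pstdev` would be used instead. The helper name `describe` is illustrative.

```python
from statistics import stdev, variance

group_a = [11, 8, 10, 9, 8, 12, 10, 11]
group_b = [20, 10, 1, 8, 0, 30, 13, 6]

def describe(scores):
    """Return (range, sample variance, sample standard deviation)."""
    return max(scores) - min(scores), variance(scores), stdev(scores)

range_a, var_a, sd_a = describe(group_a)
range_b, var_b, sd_b = describe(group_b)

print(f"Group A: range={range_a}, variance={var_a:.3f}, sd={sd_a:.3f}")
print(f"Group B: range={range_b}, variance={var_b:.3f}, sd={sd_b:.3f}")
```

Both groups have similar means, but Group B's scores are far more spread out, which is what all three measures reflect.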
CHARACTERISTICS OF THE MEASURES
OF VARIABILITY

The larger the standard deviation, the more widely the scores are spread around the measure of central tendency.
Adding a constant to each score does not change the standard deviation.
Multiplying each score by a constant causes the standard deviation to be multiplied by the absolute value of that constant.
26
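The two properties of the standard deviation about adding and multiplying by a constant can be verified numerically. The scores, the added constant 5, and the multiplier 3 below are arbitrary illustrative choices, not values from the slides.

```python
from math import isclose
from statistics import pstdev

scores = [11, 8, 10, 9, 8, 12, 10, 11]  # illustrative scores

sd_original = pstdev(scores)
sd_shifted = pstdev([x + 5 for x in scores])  # add a constant to every score
sd_scaled = pstdev([3 * x for x in scores])   # multiply every score by a constant

shift_unchanged = isclose(sd_shifted, sd_original)      # spread is unaffected by a shift
scale_multiplied = isclose(sd_scaled, 3 * sd_original)  # spread scales with the multiplier

print(shift_unchanged, scale_multiplied)
```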
THE STANDARD NORMAL DISTRIBUTION

1. Properties of the normal curve
2. Mean and standard deviation of the normal curve
3. Calculating z-scores
4. Area under the curve
5. Probability

27
NORMAL CURVE

The curve is symmetric about the mean.
Mean = Median = Mode
The tails or ends are asymptotic relative to the horizontal axis.
Each half represents 50% of the total area.
The total area under the normal curve is 1 or 100%.
Areas can be thought of as probabilities. Areas can be written as percents. Areas cannot be negative.
The area under the normal curve may be subdivided into standard deviations, at least 3 units to the left and 3 units to the right of the vertical line at the mean.

28
THE HISTOGRAM AND THE
NORMAL CURVE

29
NORMAL DISTRIBUTION

The 68-95-99.7 Rule

In the normal distribution with mean µ and standard deviation σ:
68% of the observations fall within σ of the mean µ.
95% of the observations fall within 2σ of the mean µ.
99.7% of the observations fall within 3σ of the mean µ.

[Figure: Normal Density Plot of f(x) against x, with σ, 2σ, and 3σ marked on either side of the mean]
30
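The percentages in the 68-95-99.7 rule are rounded areas under the normal curve. A small sketch with Python's `statistics.NormalDist` shows where they come from; the helper name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4f}")
```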
Z-SCORES

Z-scores are a way of determining the position of a single score under the normal curve.
They are measured in standard deviations relative to the mean of the curve.
The z-score can be used to determine an area under the curve, which is interpreted as a probability.
Formula: $z = \dfrac{x - \bar{x}}{s}$

31
USING THE NORMAL CURVE: Z SCORES

Steps:

To find areas, first compute z-scores.
Substitute the score of interest for x.
Use the sample mean for $\bar{x}$ and the sample standard deviation for s.
The formula changes a “raw” score (x) to a standardized score (z).
32
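The z-score steps can be written as a one-line function. The numbers in the example call (a raw score of 85, a sample mean of 70, and a sample standard deviation of 10) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def z_score(x, sample_mean, sample_sd):
    """Convert a raw score x to a standardized score z."""
    return (x - sample_mean) / sample_sd

# Hypothetical values: the score lies 15 points above a mean of 70, with s = 10.
z = z_score(85, 70, 10)
print(z)  # the score is z standard deviations above the mean
```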
FINDING PROBABILITIES

Areas under the curve can also be expressed as probabilities.
Probabilities are proportions and range from 0.00 to 1.00.

The higher the value, the greater the probability (the more likely the event).
For instance, a .95 probability of rain is higher than a .05 probability that it will rain!
33
THREE DIFFERENT AREA
CALCULATIONS:

1. FIND THE AREA TO THE LEFT OF Z

2. FIND THE AREA TO THE RIGHT OF Z

3. FIND THE AREA BETWEEN 𝒛 𝟏 AND 𝒛 𝟐

34
Obtaining Area under Standard Normal Curve

1. Find the area to the left of za
Graphically: Shade the area to the left of za, P(Z < a).
Solution: Use Table to find the row and column that correspond to za. The area is the value where the row and column intersect.

2. Find the area to the right of za
Graphically: Shade the area to the right of za, P(Z > a) or 1 – P(Z < a).
Solution: Use Table to find the area to the left of za. The area to the right of za is 1 – area to the left of za.

3. Find the area between za and zb
Graphically: Shade the area between za and zb, P(a < Z < b).
Solution: Use Table to find the area to the left of za and to the left of zb. The area between is the area for zb – the area for za.

35
EXAMPLE 1
Determine the area under the standard normal curve that lies to the left of

A. Z = -3.49
= 0.0002
B. Z = -1.99
= 0.0233
C. Z = 0.92
= 0.8212
D. Z = 2.90
= 0.9981

36
EXAMPLE 2

Determine the area under the standard normal curve that lies to the right of

a) Z = -3.49
= 0.9998
b) Z = -0.55
= 0.7088
c) Z = 2.23
= 0.0129
d) Z = 3.45
= 0.0003

37
EXAMPLE 3

Find the indicated probability of the standard normal random variable Z

a) P(-2.55 < Z < 2.55)
= 0.9892
b) P(-0.55 < Z < 0)
= 0.2088
c) P(-1.04 < Z < 2.76)
= 0.8479
38
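The three area calculations used in Examples 1 to 3 can be reproduced without a printed table. The sketch below uses the cumulative distribution function of `statistics.NormalDist` in place of the table lookup; the function names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_left(a):
    """P(Z < a): the value a z-table gives directly."""
    return Z.cdf(a)

def area_right(a):
    """P(Z > a) = 1 - P(Z < a)."""
    return 1 - Z.cdf(a)

def area_between(a, b):
    """P(a < Z < b) = P(Z < b) - P(Z < a)."""
    return Z.cdf(b) - Z.cdf(a)

print(round(area_left(-1.99), 4))            # Example 1, item B
print(round(area_right(2.23), 4))            # Example 2, item c
print(round(area_between(-1.04, 2.76), 4))   # Example 3, item c
```

Rounded to four decimal places, these agree with the table-based answers in the examples.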
EXERCISE

Worksheet

39
SIMPLE TEST OF
HYPOTHESIS
Objectives:
1. Define a hypothesis
2. Differentiate between Null and
Alternative Hypothesis
3. State hypothesis for a particular
study/problem
4. Differentiate the types of hypothesis
testing
5. Follow the steps in hypothesis testing
6. Compare means by hypothesis
testing using different test statistic
40
WHAT IS A HYPOTHESIS?

It is an educated guess.
It is a tentative generalization.
Statistical hypothesis: a guess or prediction made by a researcher regarding the possible outcome of the study.
41
TWO TYPES OF STATISTICAL
HYPOTHESIS:

A. Null Hypothesis (Ho)
It is the hypothesis to be tested, which one hopes to reject.
It states equality, or no significant difference or relationship, between the variables.
B. Alternative Hypothesis (Ha)
It generally represents the idea which the researcher wants to prove.
Exercise: Stating the Ho and Ha.
42
2 TYPES OF HYPOTHESIS TESTING

1. One-tailed test. It is a directional test with the region of rejection lying on either the left or the right tail of the normal curve.
a. Right-directional test
(Ha uses comparatives such as greater than, more than, higher than, better than, superior to, exceeds, etc.)
b. Left-directional test
(Ha uses comparatives such as smaller than, less than, lower than, inferior to, below, etc.)
2. Two-tailed test. It is a non-directional test with the region of rejection lying on both tails of the normal curve.
(Ha uses words such as not equal to, significantly different, etc.)

43
STATISTICAL ERRORS

Type I error. It is the error committed when the null hypothesis is rejected when in fact it is true and the alternative is false.
Type II error. It is the error committed when the null hypothesis is accepted when in fact it is false and the alternative is true.

Facts          Decision
               Accept Ho    Reject Ho
Ho is True     Correct      Type I
Ho is False    Type II      Correct

44
LEVEL OF SIGNIFICANCE

Alpha (α): it is used to designate the probability of committing a Type I error.
Beta (β): it is used to designate the probability of committing a Type II error.
Note: Alpha is the size of the rejection region, so the acceptance region has size 1 − α. Beta is the probability that the test statistic falls in the acceptance region when Ho is false.
What does it mean when you set α = 0.05?

45
STEPS IN HYPOTHESIS TESTING

1. Formulate Ho and Ha.
2. Set the level of significance (α), then determine the type of hypothesis test and the tabular (critical) value.
3. Set the criterion (when to reject Ho).
4. Determine and compute the test statistic.
5. Make your decision.
6. Formulate your conclusion.

46
CRITERIA FOR REJECTING HO

Using tabular value of z

1. One-tailed test (right-directional)
Reject Ho if Z computed ≥ Z tabular
2. One-tailed test (left-directional)
Reject Ho if Z computed ≤ Z tabular
3. Two-tailed test (Zc is positive)
Reject Ho if Z computed ≥ Z tabular
4. Two-tailed test (Zc is negative)
Reject Ho if Z computed ≤ Z tabular

47
TYPES OF TEST STATISTIC FOR HYPOTHESIS
TEST CONCERNING MEANS

A. Z-test (used when n is large, n ≥ 30)
1. Z-test for comparing hypothesized and sample mean
2. Z-test for comparing 2 sample means
a. When the population standard deviation is given
b. When the sample standard deviation is given

49
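A z-test comparing a hypothesized mean with a sample mean can be sketched as follows. The slides do not give the formula or any data, so this assumes the usual statistic z = (x̄ − μ0) / (σ / √n), and every number in the example (hypothesized mean 75, sample mean 78, standard deviation 10, n = 36, α = 0.05) is hypothetical. The test is right-directional, so Ho is rejected when the computed z is at least the tabular z.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, hypothesized_mean, sd, n):
    """Computed z for comparing a sample mean with a hypothesized mean."""
    return (sample_mean - hypothesized_mean) / (sd / sqrt(n))

# Hypothetical study: Ho: mean = 75, Ha: mean > 75 (right-directional test).
alpha = 0.05
z_computed = one_sample_z(sample_mean=78, hypothesized_mean=75, sd=10, n=36)
z_tabular = NormalDist().inv_cdf(1 - alpha)  # critical value cutting off alpha in the right tail

reject_h0 = z_computed >= z_tabular
print(f"z computed = {z_computed:.3f}, z tabular = {z_tabular:.3f}, reject Ho: {reject_h0}")
```

For a left-directional test the tabular value would be `NormalDist().inv_cdf(alpha)` with the inequality reversed, and for a two-tailed test α would be split between the two tails.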
CORRELATION
AND
REGRESSION

50
CORRELATION AND REGRESSION

A scatter plot is used to show a rough estimate of the relationship between two variables.
Correlation
Measures the strength of the association between two variables (bivariate data)
Only concerned with the strength of the relationship
No causal effect is implied

Bivariate data
Are data sets in which each subject has two observations associated with it.
51
TYPES

POSITIVE CORRELATION – exists when high scores in one variable are associated with high scores in the second variable or low scores in one variable are associated with low scores in the other.
NEGATIVE CORRELATION – exists when high scores in one variable are associated with low scores in the second or vice versa.
ZERO CORRELATION – exists when the points on the scatter diagram are spread in a random manner.
PERFECT CORRELATION – all points lie on a straight line.

52
THE STRENGTH OR DEGREE OF THE RELATIONSHIP IS BASED ON THE FOLLOWING RANGES OF THE CORRELATION COEFFICIENT:

Ranges of r          Degree/strength of relationship
±1.00                perfect relationship
±0.90 to ±0.99       very strong/very high
±0.70 to ±0.89       strong/high
±0.40 to ±0.69       moderate/substantial
±0.20 to ±0.39       weak/small
±0.01 to ±0.19       almost negligible to slight
0                    no correlation
SCATTER PLOT EXAMPLES

[Figure: scatter plots of y against x in two groups of panels, titled "Strong relationships" and "Weak relationships"]
54
SCATTER PLOT EXAMPLES

[Figure: scatter plot titled "No relationship"]
55
CORRELATION COEFFICIENT

A descriptive measure usually denoted by r, which ranges from -1 to 1.
It measures the degree of relationship between two variables.

56
FEATURES OF R
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the
negative linear relationship
The closer to 1, the stronger the
positive linear relationship
The closer to 0, the weaker the
linear relationship
57
EXAMPLES OF APPROXIMATE
R VALUES

[Figure: five scatter plots of y against x, labeled r = -1, r = -.6, r = 0, r = +.3, and r = +1]

58
CALCULATING THE
CORRELATION COEFFICIENT

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
59
CALCULATION EXAMPLE

Tree Height (y)   Trunk Diameter (x)   xy           y²            x²
35                8                    280          1225          64
49                9                    441          2401          81
27                7                    189          729           49
33                6                    198          1089          36
60                13                   780          3600          169
21                7                    147          441           49
45                11                   495          2025          121
51                12                   612          2601          144
Σy = 321          Σx = 73              Σxy = 3142   Σy² = 14111   Σx² = 713
CALCULATION EXAMPLE
(continued)

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886$$

r = 0.886 → relatively strong positive linear association between x and y

[Figure: scatter plot of Tree Height, y (axis 0 to 70) against Trunk Diameter, x (axis 0 to 14)]
61
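The hand calculation for the tree data can be checked with a direct translation of the computational formula for r. This is a sketch; the function and variable names are illustrative.

```python
from math import sqrt

heights = [35, 49, 27, 33, 60, 21, 45, 51]   # tree height, y
diameters = [8, 9, 7, 6, 13, 7, 11, 12]      # trunk diameter, x

def pearson_r(x, y):
    """Sample correlation coefficient from the raw sums, as in the slide formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(diameters, heights)
print(f"r = {r:.3f}")
```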
EXERCISE

Identify the correlation for each given pair of variables:

Temperature and air conditioning cost
School attendance and achievement
Investment period and interest earned
Weight and IQ
Temperature and ice cream sales
Age and agility
Amount of exercise and body weight
62
Pearson product-moment correlation coefficient
Coefficient of determination = R squared
Indicates the proportion of the variance in one variable that can be associated with the variance in the other variable.

63
COEFFICIENT OF DETERMINATION, R²

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

64
COEFFICIENT OF DETERMINATION, R²
(continued)

Note: In the single independent variable case, the coefficient of determination is

$R^2 = r^2$

where:
R² = Coefficient of determination
r = Simple correlation coefficient

65
INTRODUCTION TO REGRESSION
ANALYSIS

Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
66
SIMPLE LINEAR REGRESSION MODEL

Only one independent variable, x
Relationship between x and y is described by a linear function
Changes in y are assumed to be caused by changes in x

67
TYPES OF REGRESSION MODELS

[Figure: four scatter plot panels titled "Positive Linear Relationship", "Negative Linear Relationship", "Relationship NOT Linear", and "No Relationship"]

68
COEFFICIENT OF DETERMINATION, R²
(continued)
Coefficient of determination

$$R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$$

Note: In the single independent variable case, the coefficient of determination is

$R^2 = r^2$

where:
R² = Coefficient of determination
r = Simple correlation coefficient
69
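The relationship between SSR/SST and r² can be illustrated with the tree height and trunk diameter data from the correlation example. This sketch fits the least-squares line by the usual slope and intercept formulas, which the slides do not spell out, and then compares SSR/SST with the square of r = 0.886. All names are illustrative.

```python
heights = [35, 49, 27, 33, 60, 21, 45, 51]   # dependent variable, y
diameters = [8, 9, 7, 6, 13, 7, 11, 12]      # independent variable, x

n = len(diameters)
mean_x = sum(diameters) / n
mean_y = sum(heights) / n

# Least-squares slope and intercept (standard formulas, assumed here).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(diameters, heights))
         / sum((x - mean_x) ** 2 for x in diameters))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in diameters]
ssr = sum((p - mean_y) ** 2 for p in predicted)  # sum of squares explained by regression
sst = sum((y - mean_y) ** 2 for y in heights)    # total sum of squares

r_squared = ssr / sst
print(f"R^2 = {r_squared:.3f}")  # compare with 0.886 squared
```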
EXAMPLES OF APPROXIMATE
R² VALUES

R² = 1
Perfect linear relationship between x and y:
100% of the variation in y is explained by variation in x

[Figure: two scatter plots of y against x, each labeled R² = 1]
70
EXAMPLES OF APPROXIMATE
R² VALUES

0 < R² < 1
Weaker linear relationship between x and y:
Some but not all of the variation in y is explained by variation in x

[Figure: two scatter plots of y against x illustrating 0 < R² < 1]
71
EXAMPLES OF APPROXIMATE
R² VALUES

R² = 0
No linear relationship between x and y:
The value of y does not depend on x. (None of the variation in y is explained by variation in x)

[Figure: scatter plot of y against x, labeled R² = 0]

72
EXAMPLE

The relationship between the number of sales calls and the number of units sold is given by r = 0.759.
The coefficient of determination is r squared = 0.576.
This means that 57.6% of the variation in the number of units sold is explained, or accounted for, by the variation in the number of sales calls.
73
Correlation is a measure of the linear relationship between two variables and does not mean there is a causal relationship between them.

Example (explain that there is no causal relationship between the variables; other factors must have been the causes):
IQ level and starting menstrual period among females
Entrance test result and grades

74
REGRESSION ANALYSIS

The process of developing an equation
Preliminaries
Regression equation
How well a regression line fits the data
R² = 1: perfect fit
R² = 0: no linear relationship
0 < R² < 0.5: not well fit

75
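A regression equation can be developed and used for prediction in a few lines. The sketch below reuses the tree height and trunk diameter data from the correlation example and assumes the usual least-squares formulas for the slope and intercept. The function names and the diameter of 10 used for the prediction are illustrative.

```python
heights = [35, 49, 27, 33, 60, 21, 45, 51]   # dependent variable, y
diameters = [8, 9, 7, 6, 13, 7, 11, 12]      # independent variable, x

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(diameters, heights)

def predict_height(diameter):
    """Predicted tree height for a given trunk diameter."""
    return intercept + slope * diameter

print(f"height = {intercept:.3f} + {slope:.3f} * diameter")
print(f"predicted height at diameter 10: {predict_height(10):.2f}")
```

Predictions are most trustworthy for diameters inside the observed range (6 to 13 here); the same line and its R² can be obtained in Excel from a scatter chart trendline.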
Use of Excel
for the Analysis of Data

COMPUTER HANDS-ON

76
Open forum for
clarification and ideas

77
Design a plan or make a
project proposal,
due on
December 15, 2017

78
