FDSA UNIT 2

Fundamentals of Data Science and Analytics (Mailam Engineering College)

AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS UNIT 2

UNIT 2 – DESCRIPTIVE ANALYTICS


SYLLABUS:
Frequency distributions – Outliers – interpreting distributions – graphs
– averages – describing variability – interquartile range – variability for
qualitative and ranked data – Normal distributions – z scores –
correlation – scatter plots – regression – regression line – least squares
regression line – standard error of estimate – interpretation of r² –
multiple regression equations – regression toward the mean.

PART A
1. What is Statistics?
 Statistics is a branch of applied mathematics that involves the collection,
description, analysis, and inference of conclusions from quantitative data.

2. What are the categories of Statistics? Define them.


• Descriptive statistics
Descriptive statistics provides tools (tables, graphs, averages, ranges,
correlations) for organizing and summarizing the inevitable variability in
collections of actual observations or scores.
• Inferential statistics
Inferential statistics provides tools (a variety of tests and estimates) for
generalizing beyond collections of actual observations.

3. Define Populations and Samples.


 Population refers to any complete collection of observations or potential
observations.
 Sample refers to any smaller collection of actual observations drawn from
a population

4. What is Random Sampling in Statistics?


 Random sampling is a procedure designed to ensure that each potential
observation in the population has an equal chance of being selected in a
survey.

5. What is Random Assignment in Statistics?


 Random Assignment is a procedure designed to ensure that each person
has an equal chance of being assigned to any group in an experiment.

6. Define Data in Statistics.


 A collection of actual observations or scores in a survey or an experiment

PREPARED BY: Dr. S. ARTHEESWARI, Prof. & HEAD/AI&DS

7. What are the types of data in statistical analysis?


 Qualitative data
 Ranked data
 Quantitative data

8. Define Qualitative data


 A set of observations where any single observation is a word, letter, or
numerical code that represents a class or category.

9. Define Ranked data


 A set of observations where any single observation is a number that
indicates relative standing

10. Define Quantitative data


 A set of observations where any single observation is a number that
represents an amount or a count.

11. What are Levels of Measurement?


 Levels of measurement specify the extent to which a number (or word or
letter) actually represents some attribute and, therefore, has implications
for the appropriateness of various arithmetic operations and statistical
procedures.
 There are three levels of measurement
 nominal
 ordinal
 interval/ratio

12. What is nominal measurement?


 Nominal measurement is classification—that is, sorting observations into
different classes or categories.
 Words, letters, or numerical codes reflect only differences in kind, not
differences in amount.
 Examples of nominal measurement include classifying mood disorders as
manic, bipolar, or depressive.

13. What is Ordinal measurement?


• Ordinal measurement is order.
• The relative standing of ranked data reflects differences in degree
based on order.
• Ranks do not guarantee equal distances. For example, it is inappropriate
to conclude that the arithmetic mean of ranks 1 and 3 equals rank 2, since
this assumes that the actual distance between ranks 1 and 2 equals the
distance between ranks 2 and 3.


14. What is Interval/Ratio measurement?


• Interval/ratio measurement is equal intervals and a true zero.
• Amounts or counts of quantitative data reflect differences in degree based
on equal intervals and a true zero.
• A scale with equal intervals but no true zero is interval, not ratio. For
example, a reading of 0 on the Fahrenheit temperature scale does not
reflect the complete absence of heat, that is, the absence of any
molecular motion.

15. Give the pictorial representation of types of data and levels of
measurement.

16. What is a Variable in statistical analysis?


 A variable is a characteristic or property that can take on different values.

17. Define Constant in statistical analysis.


 A Constant is a characteristic or property that can take on only one
value.

18. What is meant by Discrete and Continuous Variables?


 A discrete variable consists of isolated numbers separated by gaps.
 Examples include most counts, such as the number of children in a
family.
 A continuous variable consists of numbers whose values, at least in
theory, have no restrictions.
 Examples include amounts, such as weights of male statistics students.
19. What is meant by Independent and Dependent Variables?
 Independent Variable
o An independent variable is a treatment manipulated by the
investigator.

 Dependent Variable
o A dependent Variable is a variable that is believed to have been
influenced by the independent variable.

20. What is a frequency distribution, and what is it used for?


 A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class.
 A frequency distribution helps us to detect any pattern in the data
(assuming a pattern exists) by superimposing some order on the
inevitable variability among observations.

21. Give the guidelines for frequency distributions.


Essential
1. Each observation should be included in one, and only one, class.
2. List all classes, even those with zero frequencies.
3. All classes should have equal intervals.
Optional
4. All classes should have both an upper boundary and a lower boundary.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10,
particularly 5 and 10 or multiples of 5 and 10.
6. The lower boundary of each class interval should be a multiple of the class
interval.
7. Aim for a total of approximately 10 classes.

22. Define outliers.


 An outlier is an extremely high or extremely low data point relative to the
nearest data point and the rest of the neighboring co-existing values in a
data graph or dataset.

23. What is frequency distribution for Ungrouped Data and grouped Data?
 Frequency Distribution for Ungrouped Data
 A frequency distribution produced whenever observations are sorted
into classes of single values.
 Frequency Distribution for Grouped Data
 A frequency distribution produced whenever observations are sorted
into classes of more than one value

24. Define Unit of Measurement.


 The smallest possible difference between scores

25. Define Real Limits of Class Intervals.


 Located at the midpoint of the gap between adjacent tabled boundaries.


26. What is Relative Frequency Distribution?


 A frequency distribution showing the frequency of each class as a fraction
of the total frequency for the entire distribution.

27. What is Cumulative Frequency Distribution?


 A frequency distribution showing the total number of observations in each
class and all lower-ranked classes.

28. Define Percentile Rank of an Observation.


 Percentage of scores in the entire distribution with equal or smaller values
than that score.

29. What do you mean by correlation?


 Correlation is a statistical measure that expresses the extent to which two
variables are linearly related.
 A correlation reflects the strength and/or direction of the relationship
between two (or more) variables.
 The direction of a correlation can be either positive or negative.

30. What are the three features of a correlation?


• Correlations have three important characteristics:
the direction of the relationship,
the form (shape) of the relationship, and
the degree (strength) of the relationship between two variables.

31. What are the 4 types of correlation?


 Pearson correlation,
 Kendall rank correlation,
 Spearman correlation,
 Point-Biserial correlation.

32. What is importance of correlation?


 The correlation coefficient helps in measuring the extent of the
relationship between two variables.
 Correlation analysis facilitates the understanding of economic
behavior and helps in locating the critically important variables on
which others depend.


33. Why is correlation used in research?


 A correlational research design investigates relationships between
variables without the researcher controlling or manipulating any of
them.
 A correlation reflects the strength and/or direction of the relationship
between two (or more) variables. The direction of a correlation can be
either positive or negative.

34. What are the benefits of correlation in statistics?


 The main benefits of correlation analysis are that it helps companies
determine which variables they want to investigate further, and it
allows for rapid hypothesis testing.
 The main type of correlation analysis uses Pearson's r formula to
identify the degree of the linear relationship between two variables

35. What is a positive relationship?


• When relatively low values are paired with relatively low values, and
relatively high values are paired with relatively high values, the
relationship is positive.

36. What is a negative relationship?


• When relatively low values are paired with relatively high values, and
relatively high values are paired with relatively low values, the
relationship is negative.

37. What is little or no relationship?


• A dot cluster that lacks any apparent slope reflects little or no
relationship.

38.What is strong or weak relationship?


 Having established that a relationship is either positive or negative,
note how closely the dot cluster approximates a straight line.
 The more closely the dot cluster approximates a straight line, the
stronger (the more regular) the relationship will be.

39.What is Perfect Relationship?


 A dot cluster that equals (rather than merely approximates) a straight
line reflects a perfect relationship between two variables. In practice,
perfect relationships are most unlikely.


40.What is Linear Relationship?


 A relationship that can be described best with a straight line.

41.How do you describe correlation on a scatter plot?


 We often see patterns or relationships in scatter plots.
 When the y variable tends to increase as the x variable increases,
there is a positive correlation between the variables.
 When the y variable tends to decrease as the x variable increases,
there is a negative correlation between the variables.

42.Define correlation coefficient.


 A correlation coefficient is a number between –1 and 1 that
describes the relationship between pairs of variables.

43. Write a short note on the Pearson correlation coefficient (r).


• A number between –1.00 and +1.00 that describes the linear
relationship between pairs of quantitative variables.

44. What are the key properties of r?


• The Pearson correlation coefficient, r, can equal any value between
–1.00 and +1.00. Furthermore, the following two properties apply:
o The sign of r indicates the type of linear relationship, whether
positive or negative.
o The numerical value of r, without regard to sign, indicates the
strength of the linear relationship.
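
The two properties can be checked numerically. The sketch below is illustrative only: the function name pearson_r and the small data sets are hypothetical and do not come from these notes. It computes r as the sum of products of deviations divided by the square root of the product of the two sums of squares.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired quantitative scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations and the two sums of squares.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)


# Hypothetical paired scores (not from the notes).
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # points fall exactly on a rising line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # points fall exactly on a falling line
r_mid = pearson_r(x, [2, 1, 4, 3, 5])    # positive but imperfect relationship
print(r_pos, r_neg, round(r_mid, 2))
```

The sign of each result reflects the direction of the relationship, and the absolute value reflects its strength. The function raises ZeroDivisionError if either variable is constant, because r is undefined in that case.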

45.Draw the graph for Effect of range restriction on the value of r.


46. What is the computational formula for the correlation coefficient r?
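
The formula for this answer did not survive extraction. A standard computational form of the Pearson coefficient is given below as a reconstruction, not as the notes' own typesetting. The symbols are assumptions: SS_x and SS_y denote the sums of squares for X and Y, SP_xy the sum of products, and n the number of pairs.

```latex
r = \frac{SP_{xy}}{\sqrt{SS_x \, SS_y}}, \qquad
\begin{aligned}
SS_x    &= \sum X^2 - \frac{\left(\sum X\right)^2}{n}, \\
SS_y    &= \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}, \\
SP_{xy} &= \sum XY - \frac{\left(\sum X\right)\left(\sum Y\right)}{n}.
\end{aligned}
```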

47. What is a correlation matrix?


• A table showing correlations for all possible pairs of variables.

48. How do regression and correlation differ?


 The most commonly used techniques for investigating the
relationship between two quantitative variables are correlation and
linear regression.
 Correlation quantifies the strength of the linear relationship
between a pair of variables, whereas regression expresses the
relationship in the form of an equation

49. What is meant by a regression line?


• A regression line is the straight line that best describes a linear
relationship between two variables. In the example of cards sent and
cards received, the line is straight rather than curved because that
relationship is linear.
• It is used to predict the value of some continuous response variable.

50. What is meant by predictive error?


 In statistics, prediction error refers to the difference between the
predicted values made by some model and the actual values.


51. What is logistic regression?


• Logistic regression is used to predict the value of some binary
response variable. One common way to measure the prediction error of
a logistic regression model is with a metric known as the total
misclassification rate.

52. What is a least squares regression line?


• If the data show a linear relationship between two variables, the
line that best fits this linear relationship is known as the least
squares regression line. It minimizes the sum of the squared vertical
distances from the data points to the regression line.

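53. Write the least squares regression equation.


• An equation pinpoints the exact least squares regression line for
any scatterplot. Most generally, this equation reads Y′ = bX + a,
where Y′ is the predicted value, X is the known value, and b and a are
the slope and Y-intercept calculated from the data.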

54. What is the least squares regression equation?


 The equation that minimizes the total of all squared prediction
errors for known Y scores in the original correlation analysis.

55. How will you find the values of b and a in the least squares regression
equation?
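
The formulas for this answer are missing from the extracted text. A commonly used form, consistent with the symbols b, a, and r used elsewhere in these notes, is sketched below. SS_x, SS_y, and the means of X and Y are assumed notation.

```latex
b = r \sqrt{\frac{SS_y}{SS_x}}, \qquad
a = \bar{Y} - b\,\bar{X}, \qquad
\text{so that } Y' = bX + a .
```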

56. What is the standard error of estimate (s_y|x)?


• A rough measure of the average amount of predictive error.


57. Write the formula for finding the standard error of estimate
(definition formula).
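
The formula itself is not present in the extracted text. A standard definition formula is given below as a reconstruction. Y′ denotes the predicted value from the regression equation, and SS_y|x (the sum of squared prediction errors) is assumed notation.

```latex
s_{y|x} = \sqrt{\frac{SS_{y|x}}{n-2}}
        = \sqrt{\frac{\sum \left(Y - Y'\right)^2}{n-2}}
```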

58. Write the formula for finding the standard error of estimate
(computation formula).
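
The formula itself is not present in the extracted text. A standard computation formula is given below as a reconstruction. It avoids calculating each prediction error by using the sum of squares for Y and the correlation coefficient r; SS_y is assumed notation.

```latex
s_{y|x} = \sqrt{\frac{SS_y \left(1 - r^2\right)}{n-2}}, \qquad
SS_y = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}
```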

59. Write a short note on the squared correlation coefficient, r².


• The squared correlation coefficient, r², provides not only a key
interpretation of the correlation coefficient but also a measure of
predictive accuracy that supplements the standard error of estimate,
s_y|x.
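
The link between the regression line, the standard error of estimate, and r² can be illustrated with a small calculation. This is a minimal sketch under stated assumptions: the function name least_squares and the five data pairs are hypothetical, not taken from these notes. It fits Y′ = bX + a, measures the squared prediction errors, and shows that r² equals the proportion of the variability in Y that the predictions account for.

```python
import math


def least_squares(xs, ys):
    """Fit Y' = bX + a and report r, r squared, and the standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x                  # slope; equal to r * sqrt(ss_y / ss_x)
    a = mean_y - b * mean_x           # Y-intercept
    r = sp_xy / math.sqrt(ss_x * ss_y)
    # Sum of squared prediction errors around the regression line.
    ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ss_error / (n - 2))   # standard error of estimate
    return {"b": b, "a": a, "r": r, "r2": r ** 2,
            "ss_y": ss_y, "ss_error": ss_error, "s_yx": s_yx}


# Hypothetical data (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
fit = least_squares(x, y)
explained = 1 - fit["ss_error"] / fit["ss_y"]   # proportion of SS_y explained
print(round(fit["b"], 2), round(fit["a"], 2))
print(round(fit["r2"], 2), round(explained, 2), round(fit["s_yx"], 2))
```

For these made-up pairs the two proportions agree, which is the sense in which r² measures predictive accuracy: it is the share of the total squared variability in Y that disappears when predictions from the regression line replace the mean of Y.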

60. Write a note on the multiple regression equation.


• A least squares equation that contains more than one predictor or X
variable.

61. What is meant by regression toward the mean?


• A tendency for scores, particularly extreme scores, to shrink toward
the mean.

62. Elucidate the regression fallacy.


• The regression fallacy occurs whenever regression toward the mean is
interpreted as a real, rather than a chance, effect.


PART B
1. Explain in detail about types of Data.

THREE TYPES OF DATA


 Qualitative data consist of words (Yes or No), letters (Y or N), or numerical
codes (0 or 1) that represent a class or category.
 Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent
relative standing within a group.
 Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that
represent an amount or a count.

Example Table 2.1

• The weights reported by 53 male students in Table 2.1 are quantitative
data, since any single observation, such as 160 lbs, represents an amount of
weight.
• If the weights had been replaced with ranks, beginning with a rank of 1 for
the lightest weight of 133 lbs and ending with a rank of 53 for the
heaviest weight of 245 lbs, these numbers would have been ranked data.

Table 2.2

 Finally, the Y and N replies of students in Table 2.2 are qualitative data,
since any single observation is a letter that represents a class of replies.


2. Explain in detail about Types of Variables?

TYPES OF VARIABLES

Definition
• A variable is a characteristic or property that can take on different values.
• Accordingly, the weights can be described not only as quantitative data but
also as observations for a quantitative variable, since the various weights
take on different numerical values.
• By the same token, the Y and N replies can be described as observations for
a qualitative variable, since the replies to the Facebook profile question
take on different values of either Yes or No.
• Any single observation can be described as a constant, since it takes on
only one value.

Discrete and Continuous Variables


 Quantitative variables can be divided into
o Discrete
o Continuous
 A discrete variable consists of isolated numbers separated by gaps.
o Example
o counts, such as the number of children in a family;
o The number of foreign countries you have visited;
 A continuous variable consists of numbers whose values have no
restrictions.
Example
o amounts, such as weights of male statistics students;
o durations, such as the reaction times of grade school children to a fire
alarm;
o Standardized test scores, such as those on the Scholastic Aptitude Test
(SAT).
• Approximate Numbers
• When values for continuous variables are rounded off, the resulting numbers
are approximate, never exact.
• For example, the weights of the male statistics students are approximate
because they have been rounded to the nearest pound.
• A student whose weight is listed as 150 lbs could actually weigh between
149.5 and 150.5 lbs.

Independent and Dependent Variables


• Most studies ask about the presence or absence of a relationship between
two or more variables.
• An experiment is a study in which the investigator decides who receives
the special treatment.


Independent Variable
 An independent variable is the treatment manipulated by the
investigator.
Dependent Variable
 A variable that is believed to have been influenced by the independent
variable.

Observational Studies
• An observational study focuses on detecting relationships between variables
not manipulated by the investigator.
Example:
 The independent variable is qualitative, with nominal measurement,
whereas the dependent variable (number of communication breakdowns)
is quantitative.

Confounding Variable
 An uncontrolled variable that compromises the interpretation of a study is
known as a confounding variable.

3. Explain in detail about Describing Data with Tables and Graphs.


DESCRIBING DATA WITH TABLES AND GRAPHS
Tables
 Frequency distributions for quantitative data
 Outliers
 Relative frequency distributions
 Cumulative frequency distributions
 Frequency distributions for qualitative (nominal) data
 Interpreting distributions constructed by others
Graphs
 Graphs for quantitative data
 Typical shapes
 A graph for qualitative (nominal) data


FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA

 Definition
 Frequency Distribution
o Ungrouped Data
o Grouped Data
 Guidelines
 Example Problems
 Types of Frequency Distribution
o Relative FD
o Cumulative FD
 Frequency Distributions For Qualitative (Nominal) Data

 Definition
 A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class.

 Frequency Distribution for Ungrouped Data


A frequency distribution produced whenever observations are sorted into
classes of single values (refer to Table 2.3).
Table 2.3


• Frequency distributions for ungrouped data are much more informative
when the number of possible values is less than about 20.
• Otherwise, if there are 20 or more possible values, use a frequency
distribution for grouped data.
Table 2.4

• Example: In Table 2.4, students in a theater arts appreciation class rated
the classic film The Wizard of Oz on a 10-point scale, ranging from 1
(poor) to 10 (excellent). Since the number of possible values is relatively
small (only 10), it is appropriate to construct a frequency distribution for
ungrouped data.
Solution: Refer to Table 2.5.
Table 2.5

 Frequency Distribution for Grouped Data


 A frequency distribution produced whenever observations are sorted into
classes of more than one value.
 The general structure of this frequency distribution is
o Data are grouped into class intervals with 10 possible values each in
Table 2.6.
o The bottom class includes the smallest observation (133), and the top
class includes the largest observation (245).
o The distance between bottom and top is occupied by an orderly series
of classes.
o The frequency (f) column shows the frequency of observations in each
class and, at the bottom, the total number of observations in all
classes.


o Example – Refer Table 2.6


Table 2.6

 GUIDELINES
 The “Guidelines for Frequency Distributions” box lists seven rules for
producing a well-constructed frequency distribution.
 The first three rules are essential and should not be violated. The last
four rules are optional and can be modified or ignored.

GUIDELINES FOR FREQUENCY DISTRIBUTIONS


Essential
1. Each observation should be included in one, and only one, class.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to
use 130–140, 140–150, 150–160, etc., in which, because the
boundaries of classes overlap, an observation of 140 (or 150) could be
assigned to either of two classes.
2. List all classes, even those with zero frequencies.
Example: Listed in Table 2.6 is the class 210–219 and its frequency of
zero. It would be incorrect to skip this class because of its zero
frequency.
3. All classes should have equal intervals.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to
use 130–139, 140–159, etc., in which the second class interval (140–
159) is twice as wide as the first class interval (130–139).


Optional
4. All classes should have both an upper boundary and a lower
boundary.
Example: 240–249. Less preferred would be 240–above, in which no
maximum value can be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2,
3, . . . 10, particularly 5 and 10 or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a
convenient number. Less preferred would be 130–142, 143–155, etc.,
in which the class interval of 13 is not a convenient number.
6. The lower boundary of each class interval should be a multiple
of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130,
140, are multiples of 10, the class interval. Less preferred would be
135–144, 145–154, etc., in which the lower boundaries of 135 and
145 are not multiples of 10, the class interval.
7. Aim for a total of approximately 10 classes.

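Gaps between Classes


• The size of the gap between classes should always equal one unit of
measurement; that is, it should always equal the smallest possible
difference between scores within a particular set of data.
• For weights reported to the nearest pound, the unit of measurement is 1 lb,
so 130–139 is followed by 140–149. If the weights had instead been reported
to the nearest tenth of a pound, the lowest class interval would be
130.0–139.9 (not 130–139), the next class interval would be 140.0–149.9
(not 140–149), and so on.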

Real Limits of Class Intervals


 The real limits are located at the midpoint of the gap between adjacent
tabled boundaries;
 That is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper
tabled boundary.
Example:
• The real limits for 140–149 are 139.5 (140 minus one-half of the unit of
measurement of 1) and 149.5 (149 plus one-half of the unit of
measurement of 1), so the actual width of the class interval is 10
(149.5 – 139.5 = 10).
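
The rule for real limits can be written as a short helper. This is only a sketch: the function name real_limits and its unit parameter are hypothetical, chosen for illustration. It subtracts half a unit of measurement from the lower tabled boundary and adds half a unit to the upper tabled boundary.

```python
def real_limits(lower, upper, unit=1):
    """Return the real lower and upper limits of a tabled class interval."""
    half = unit / 2
    return lower - half, upper + half


# The class 140-149 from the example above, with a unit of measurement of 1.
low, high = real_limits(140, 149)
width = high - low
print(low, high, width)  # 139.5 149.5 10.0
```

The width computed from the real limits (10) matches the class interval, whereas the difference between the tabled boundaries (149 – 140) would understate it by one unit.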


Example 2.1
The IQ scores for a group of 35 high school dropouts are as follows:
Table 2.7

(a) Construct a frequency distribution for grouped data.


(b) Specify the real limits for the lowest class interval in this frequency
Distribution.

Solution
a. Construction of a Frequency Distribution
1. Find the range, that is, the difference between the largest and smallest
observations.
Example (from Table 2.7):
The range of IQ scores in the above table is 123 – 69 = 54.
2. Find the class interval required to span the range by dividing the range
by the desired number of classes (ordinarily 10).

Example: Class interval = 54/10=5.4


3. Round off to the nearest convenient interval (such as 1, 2, 3, . . . 10,
particularly 5 or 10 or multiples of 5 or 10).
Example: nearest convenient interval = 5
4. Determine where the lowest class should begin. (Ordinarily, this number
should be a multiple of the class interval.)
Example: the smallest score is 69, and therefore the lowest class should
begin at 65, since 65 is a multiple of 5 (the class interval).
5. Determine where the lowest class should end by adding the class interval
to the lower boundary and then subtracting one unit of measurement.
Example: Add 5 to 65 and then subtract 1, the unit of measurement, to
obtain 69—the number at which the lowest class should end.


6. Working upward, list as many equivalent classes as are required to
include the largest observation.
Example: 65–69, 70–74, . . . 120–124, so that the last class includes 123,
the largest score.
7. Indicate with a tally the class in which each observation falls.
For example, in the weight data of Table 2.1, the first score, 160, produces
a tally next to 160–169; the next score, 193, produces a tally next to
190–199; and so on.
8. Replace the tally count for each class with a number—the frequency
(f)—and show the total of all frequencies. (Tally marks are not usually
shown in the final frequency distribution.)
9. Supply headings for both columns and a title for the table.
Output – Refer Table 2.8
Table 2.8

Solution b: The real limits for the lowest class interval (65–69) in this
frequency distribution are 64.5 and 69.5.
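
The construction steps above can be sketched in code. The scores below are hypothetical: Table 2.7 is not reproduced in this text, so the list only borrows the smallest and largest values (69 and 123) quoted in the solution. The function name grouped_frequency_distribution is likewise an assumption. The class width is passed in explicitly because rounding to a convenient interval (step 3) is a judgment call.

```python
def grouped_frequency_distribution(scores, width, unit=1):
    """Return [(lower, upper, f), ...] from the lowest class to the highest."""
    # Step 4: start the lowest class at a multiple of the class width.
    lower = (min(scores) // width) * width
    classes = []
    # Step 6: keep adding classes until the largest observation is included.
    while lower <= max(scores):
        # Step 5: each class ends one unit of measurement below the next class.
        upper = lower + width - unit
        # Steps 7 and 8: tally the observations and store the frequency.
        f = sum(1 for s in scores if lower <= s <= upper)
        classes.append((lower, upper, f))   # zero-frequency classes are kept
        lower += width
    return classes


# Hypothetical IQ scores (made up for illustration).
iq_scores = [69, 72, 85, 91, 91, 94, 98, 102, 104, 107, 115, 123]
score_range = max(iq_scores) - min(iq_scores)   # step 1: 123 - 69 = 54
raw_width = score_range / 10                    # step 2: 5.4, rounded to 5
table = grouped_frequency_distribution(iq_scores, width=5)

# Step 9: supply headings and print from the highest class to the lowest.
print("IQ        f")
for lower, upper, f in reversed(table):
    print(f"{lower}-{upper}".ljust(10) + str(f))
print("Total".ljust(10) + str(sum(f for _, _, f in table)))
```

Each score falls in exactly one class, all classes have the same width, and classes with zero frequency still appear, which are the three essential guidelines.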
Example 2.2
What are some possible poor features of the following frequency
distribution?


Solution
• Not all observations can be assigned to one and only one class (because of
the gap between 20–22 and 25–30 and the overlap between 25–30 and 30–34).
• Not all classes are equal in width (25–30 versus 30–34).
• Not all classes have both boundaries (35–above).

 Types of Frequency distribution


1. Relative Frequency Distributions
• Relative frequency distributions show the frequency of each class as a
part or fraction of the total frequency for the entire distribution.
Constructing Relative Frequency Distributions
• To convert a frequency distribution into a relative frequency
distribution, divide the frequency for each class by the total frequency
for the entire distribution.
Percentages or Proportions
• A proportion always varies between 0 and 1, whereas a percentage
always varies between 0 percent and 100 percent.
• To convert the relative frequencies from proportions to percentages,
multiply each proportion by 100;
• That is, move the decimal point two places to the right.
• For example, multiply .06 by 100 to obtain 6 percent.
Example – Table 2.9
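
The conversion described above is a single division per class. In the sketch below the frequencies are an assumption: Table 2.9 is not reproduced in this text, so the counts were chosen to agree with the figures these notes quote for the weight data (3, 1, and 17 for the three lowest classes, zero for 210–219, and a total of 53).

```python
# Class intervals from lowest to highest, with assumed frequencies.
classes = ["130-139", "140-149", "150-159", "160-169", "170-179", "180-189",
           "190-199", "200-209", "210-219", "220-229", "230-239", "240-249"]
freqs = [3, 1, 17, 12, 7, 3, 4, 2, 0, 3, 0, 1]

total = sum(freqs)
# Proportion: class frequency divided by the total frequency.
proportions = [round(f / total, 2) for f in freqs]
# Percentage: the same fraction multiplied by 100.
percentages = [round(100 * f / total) for f in freqs]

for label, f, p, pct in zip(classes, freqs, proportions, percentages):
    print(f"{label}  f={f:2d}  proportion={p:.2f}  percent={pct}")
print("Total frequency:", total)
```

Because each value is rounded, the rounded proportions or percentages need not add up to exactly 1 or 100.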

2. Cumulative Frequency Distributions


• Cumulative frequency distributions show the total number of
observations in each class and in all lower-ranked classes.
• Cumulative frequencies are usually converted to cumulative
percentages. Cumulative percentages are often referred to as
percentile ranks.


Constructing Cumulative Frequency Distributions


• To convert a frequency distribution into a cumulative frequency
distribution, add the frequency of each class to the sum of the
frequencies of all classes ranked below it.
 Begin with the lowest-ranked class in the frequency distribution and
work upward, finding the cumulative frequencies in ascending order.
Example – Table 2.10

 The cumulative frequency for the class 130–139 is 3, since there are
no classes ranked lower.
 The cumulative frequency for the class 140–149 is 4, since 1 is the
frequency for that class and 3 is the frequency of all lower-ranked
classes.
 The cumulative frequency for the class 150–159 is 21, since 17 is the
frequency for that class and 4 is the sum of the frequencies of all
lower-ranked classes.

Cumulative Percentages
• To obtain this cumulative percentage (75%), the cumulative frequency
of 40 for the class 170–179 should be divided by the total frequency of
53 for the entire distribution.
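
The running totals and cumulative percentages can be reproduced as follows. This is a sketch under an assumption: Table 2.10 is not reproduced in this text, so the class frequencies were chosen to agree with the values quoted above (3, 1, and 17 for the lowest classes, a cumulative frequency of 40 for 170–179, and a total of 53).

```python
from itertools import accumulate

# Assumed frequencies for the classes 130-139 up to 240-249, lowest first.
freqs = [3, 1, 17, 12, 7, 3, 4, 2, 0, 3, 0, 1]

# Running total: each class plus all lower-ranked classes.
cum_f = list(accumulate(freqs))
total = cum_f[-1]
# Cumulative percent: cumulative frequency divided by the total, times 100.
cum_pct = [round(100 * c / total) for c in cum_f]

print(cum_f[:3])              # the three lowest classes
print(cum_f[4], cum_pct[4])   # the class 170-179
```

Working upward from the lowest class, the cumulative frequency for the highest class always equals the total frequency, and its cumulative percent is 100.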


Example 2.3
GRE scores for a group of graduate school applicants are
distributed as follows:

Convert the distribution of GRE scores shown to a cumulative
frequency distribution and cumulative percent.
Solution


Percentile Ranks
 Cumulative percentages are referred to as percentile ranks.
 The percentile rank of a score indicates the percentage of scores in the
entire distribution with similar or smaller values than that score.
Types of Percentile Rank
 The assignment of exact percentile ranks requires that cumulative
percentages be obtained from frequency distributions for ungrouped
data.
 The assignment of approximate percentile ranks requires that
cumulative percentages be obtained from frequency distributions for
grouped data.

Example 2.4
Referring to Table 2.10, find the approximate percentile rank of
any weight in the class 200–209.
Solution –
The approximate percentile rank for weights between 200 and 209 lbs
is 92 (because 92 is the cumulative percent for this interval).

 FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA


 When, among a set of observations, any single observation is a word,
letter, or numerical code, the data are qualitative.

Example 2.5
Movie ratings reflect ordinal measurement because they can be
ordered from most to least restrictive: NC-17, R, PG-13, PG, and G.
The ratings of some films shown recently in San Francisco are as
follows:

(a) Construct a frequency distribution.
(b) Convert to relative frequencies, expressed as percentages.
(c) Construct a cumulative frequency distribution.
(d) Find the approximate percentile rank for those films with a PG
rating.


4. How are frequency distributions interpreted?


Interpreting Distributions
 When inspecting a distribution for the first time, train yourself to
look at the entire table, not just the distribution.
 Read the title, column headings, and any footnotes.
 Identify where the data come from. Is a source cited?
 Next, focus on the form of the frequency distribution. Is it well
constructed?
 For quantitative data, check whether the total number of classes seems
to avoid either over- or under-summarizing the data.
 After these preliminaries, inspect the content of the frequency
distribution.
 What is the approximate range? Does it seem reasonable?
 Disregard the inevitable irregularities that accompany a frequency
distribution and focus on its overall appearance or shape.
 Do the frequencies arrange themselves around a single peak (high point)
or several peaks?
 Is the distribution fairly balanced around its peak?
 When interpreting distributions, including distributions constructed by
someone else, keep an open mind.

5. Give a brief summary about outliers.


OUTLIERS
o An outlier is a very extreme score.
Handling Outliers
 Check for Accuracy
o Whenever you encounter an outrageously extreme value, such as a GPA of
0.06, attempt to verify its accuracy.


o For instance, was a respectable GPA of 3.06 recorded erroneously as
0.06?
o If the outlier survives an accuracy check, it should be treated as a
legitimate score.
 Might Exclude from Summaries
o You might choose to segregate an outlier from any summary of the data
and report it separately.
 Might Enhance Understanding
o Insofar as a valid outlier can be viewed as the product of special
circumstances, it might help you to understand the data.

Example 2.6:
Identify any outliers in each of the following sets of data collected
from nine college students

Solution
o Outliers are a summer income of $25,700; an age of 61; and a family
size of 18. No outliers for GPA.

6. Discuss in detail about graphs and graphs for Quantitative data.


GRAPHS

 Graphs for Quantitative Data


o Histograms
o Frequency Polygon
o Stem and Leaf Displays
 Graphs for Qualitative Data
o Bar Graph
 Shapes
o Normal
o Bimodal
o Positively Skewed
o Negatively Skewed


 GRAPHS FOR QUANTITATIVE DATA


1. Histograms
 A bar-type graph for quantitative data.
 The common boundaries between adjacent bars emphasize the
continuity of the data, as with continuous variables.
Important features of histograms
 Equal units along the horizontal axis (the X axis) reflect the various
class intervals of the frequency distribution.
 Equal units along the vertical axis (the Y axis) reflect increases in
frequency.
 The intersection of the two axes defines the origin at which both
numerical scales equal 0.
 Numerical scales always increase from left to right along the horizontal
axis and from bottom to top along the vertical axis.
 The body of the histogram consists of a series of bars whose heights
reflect the frequencies for the various classes.
 Example Figure 2.1

Figure 2.1
2. Frequency Polygon
 A frequency polygon, or line graph for quantitative data, also emphasizes
the continuity of continuous variables.
 Frequency polygons may be constructed directly from frequency
distributions.
Transformation steps from histogram to frequency polygon
1. Begin with the histogram for the weight distribution.
2. Place dots at the midpoints of each bar top or, in the absence of bar
tops, at the midpoints for classes on the horizontal axis, and connect
them with straight lines.
3. First, extend the upper tail to the midpoint of the first unoccupied class
on the upper flank of the histogram. Then extend the lower tail to the
midpoint of the first unoccupied class on the lower flank of the
histogram. Now all of the area under the frequency polygon is enclosed
completely.
4. Finally, erase all of the histogram bars, leaving only the frequency
polygon.
 Example Figure 2.2

Figure 2.2 Transition from histogram to frequency polygon.


3. Stem and Leaf Displays


 A stem and leaf display is a device for sorting quantitative data on the
basis of leading and trailing digits.
 It is another technique for summarizing quantitative data.

Constructing a Display
 To construct the stem and leaf display for these data, first note that,
when counting by tens, the weights range from the 130s to the 240s.
 Arrange a column of numbers, the stems, beginning with 13
(representing the 130s) and ending with 24 (representing the 240s).
 Draw a vertical line to separate the stems, which represent multiples of
10, from the space to be occupied by the leaves, which represent
multiples of 1.
 Next, enter each raw score into the stem and leaf display.

Interpretation
 The weight data have been sorted by the stems. All weights in the 130s
are listed together; all of those in the 140s are listed together, and so on.

As suggested by the shaded coding in Table 2.11, the first raw score
of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of
193 reappears as a leaf of 3 on a stem of 19, and the third raw score
of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each
raw score reappears as a leaf on its appropriate stem.
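
The sorting rule described above can be sketched as follows. The function name `stem_and_leaf` is an assumption made for this sketch, and only the three raw scores quoted in the text are used because the full data set is not reproduced here.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Group each score's trailing digit (leaf) under its leading digits (stem)."""
    display = defaultdict(list)
    for score in scores:
        stem, leaf = divmod(score, 10)  # e.g. 193 -> stem 19, leaf 3
        display[stem].append(leaf)
    return dict(display)


# The three raw scores quoted in the text, in their original order.
weights = [160, 193, 226]
display = stem_and_leaf(weights)

# Print every stem in the observed range, including stems with no leaves.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing empty stems keeps the vertical scale even, so gaps in the data remain visible in the display.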


Example 2.7
Construct a stem and leaf display for the following IQ scores
obtained from a group of four-year-old children.

Solution
Stem Leaf

 A GRAPH FOR QUALITATIVE (NOMINAL) DATA


 A bar-type graph for qualitative data. Gaps between adjacent bars
emphasize the discontinuous nature of the data.

Bar Graph
 Equal segments along the vertical axis reflect increases in frequency.
The body of the bar graph consists of a series of bars whose heights
reflect the frequencies for the various words or classes. Refer Figure 2.7.

Figure 2.7 Bar Graph


CONSTRUCTING GRAPHS
1. Decide on the appropriate type of graph, recalling that histograms and
frequency polygons are appropriate for quantitative data, while bar graphs
are appropriate for qualitative data.
2. Draw the horizontal axis, then the vertical axis, remembering that the
vertical axis should be about as tall as the horizontal axis is wide.
3. Identify the string of class intervals that eventually will be
superimposed on the horizontal axis. For qualitative data or ungrouped
quantitative data, just use the classes suggested by the data. For grouped
quantitative data, proceed as if you were creating a set of class intervals
for a frequency distribution.
4. Superimpose the string of class intervals (with gaps for bar graphs) along
the entire length of the horizontal axis.
5. Along the entire length of the vertical axis, superimpose a progression
of convenient numbers, beginning at the bottom with 0 and ending at the
top with a number as large as or slightly larger than the maximum
observed frequency.
6. Using the scaled axes, construct bars (or dots and lines) to reflect the
frequency of observations within each class interval.
7. Supply labels for both axes and a title for the graph.

 TYPICAL SHAPES
 Whether expressed as a histogram, a frequency polygon, or a stem and
leaf display, an important characteristic of a frequency distribution is its
shape.
 The more typical shapes for smoothed frequency polygons are normal,
bimodal, positively skewed, and negatively skewed.
Normal
 The familiar bell-shaped silhouette of the normal curve can be
superimposed on many frequency distributions, including those for scores
on standardized tests and even the popping times of individual kernels in
a batch of popcorn. Figure 2.3 represents the normal curve.

Figure 2.3 Normal


Bimodal


 Any distribution that approximates the bimodal shape might reflect the
coexistence of two different types of observations in the same distribution.
 For instance, the distribution of the ages of residents in a neighborhood
consisting largely of either new parents or their infants has a bimodal
shape. Figure 2.4 represents the bimodal curve.

Figure 2.4 Bimodal

Positively Skewed Distribution


 A distribution that includes a few extreme observations in the positive
direction (to the right of the majority of observations).
 Figure 2.5 represents positively skewed distribution curve.

Figure 2.5 Positively Skewed Distribution

Negatively Skewed Distribution


 A distribution that includes a few extreme observations in the negative
direction (to the left of the majority of observations).
 Figure 2.6 represents the negatively skewed distribution curve.

Figure 2.6 Negatively Skewed Distribution


Example 2.8
Describe the probable shape (normal, bimodal, positively skewed, or
negatively skewed) for each of the following distributions:
(a) female beauty contestants’ scores on a masculinity test, with a
higher score indicating a greater degree of masculinity
(b) scores on a standardized IQ test for a group of people selected
from the general population
(c) test scores for a group of high school students on a very difficult
college - level math exam
(d) reading achievement scores for a third-grade class consisting of
about equal numbers of regular students and learning-challenged
students
(e) scores of students at the Eastman School of Music on a test of
music aptitude (designed for use with the general population)
Solution

7 Explain in detail about describing data with Averages.

Averages or Measures of central tendency


 Averages for Quantitative Data
 Mode
 Median
 Mean
 Averages for Qualitative Data
 Averages for Ranked Data

MEASURES OF CENTRAL TENDENCY


 Numbers or words that attempt to describe, most generally, the middle or
typical value for a distribution.

 AVERAGES FOR QUANTITATIVE DATA


MODE
 The mode reflects the value of the most frequently occurring score.
 Distributions with two obvious peaks, even though they are not exactly the
same height, are referred to as bimodal.
 Distributions with more than two peaks are referred to as multimodal.
 Refer Figure 2.8


Figure 2.8 - Modes

Example 2.9:
Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
Answer - mode = 63

Example 2.10:
The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9. Find the mode for these data.
Answer - mode = 27.4
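
Both answers can be checked with Python's standard `statistics` module. This is a minimal sketch; `multimode` is used instead of `mode` because it returns every most-frequent value, which also handles bimodal and multimodal data.

```python
from statistics import multimode

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns a list of all values tied for the highest frequency.
print(multimode(retirement_ages))  # [63]
print(multimode(gas_mileage))      # [27.4]
```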

MEDIAN
 The median reflects the middle value when observations are ordered from
least to most.

FINDING THE MEDIAN


A. INSTRUCTIONS
1. Order scores from least to most.
2. Find the middle position by adding one to the total number of scores
and dividing by 2.
3. If the middle position is a whole number, as in the left-hand panel
below, use this number to count into the set of ordered scores.
4. The value of the median equals the value of the score located at the
middle position.


5. If the middle position is not a whole number, as in the right-hand panel
below, use the two nearest whole numbers to count into the set of ordered
scores.
6. The value of the median equals the value midway between those of the two
middlemost scores; to find the midway value, add the two given values
and divide by 2.

Figure 2.9 Median

Example 2.11:
Find the median for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.

median = 63 (the sixth of the 11 ordered scores)

Example 2.12:
Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.

median = 27.15 (halfway between 26.9 and 27.4)
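
The six instructions for finding the median translate directly into code. The sketch below follows those steps literally; the function name `median` is chosen for this sketch, and the standard library's `statistics.median` gives the same results.

```python
def median(scores):
    """Find the median by locating the middle position, (n + 1) / 2."""
    ordered = sorted(scores)               # order scores from least to most
    position = (len(ordered) + 1) / 2      # middle position (1-based)
    if position.is_integer():
        return ordered[int(position) - 1]  # whole number: take that score
    lower = ordered[int(position) - 1]     # score just below the middle position
    upper = ordered[int(position)]         # score just above the middle position
    return (lower + upper) / 2             # value midway between the two


retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(median(retirement_ages))         # 63
print(round(median(gas_mileage), 2))   # 27.15
```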


MEAN
 The mean is found by adding all scores and then dividing by the number of
scores.
Mean = sum of all scores / number of scores

Types of mean
 population mean
 sample mean
Sample Mean
 Sample Size (n) - The total number of scores in the sample.
 Sample Mean is obtained by dividing the sum of all scores in the sample by
the number of scores in the sample.

$$ \bar{X} = \frac{\sum X}{n} $$

 “X-bar equals the sum of the variable X divided by the sample size n.”

Population Mean (μ)


 Population Size (N) - The total number of scores in the population.
 Population Mean (μ) is obtained by dividing the sum for all scores in the
population by the number of scores in the population.
 The population mean is represented by μ (pronounced “mu”):

$$ \mu = \frac{\sum X}{N} $$

where the uppercase letter N refers to the population size.

Example 2.13
Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.

Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
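
As a quick check of Example 2.13, the mean can be computed as the sum of all scores divided by the number of scores. The helper name `mean` is chosen for this sketch.

```python
def mean(scores):
    """Sum of all scores divided by the number of scores."""
    return sum(scores) / len(scores)


retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(round(mean(retirement_ages), 2))  # 61.09 (672 / 11)
print(round(mean(gas_mileage), 2))      # 27.22 (163.3 / 6)
```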


Mean as Balance Point


 The mean serves as the balance point for its frequency distribution.

If Distribution Is Not Skewed


 When a distribution of scores is not too skewed, the values of the
mode, median, and mean are similar, and any of them can be used to
describe the central tendency of the distribution.

If Distribution Is Skewed

Figure 2.10 - Mode, median, and mean in positively and negatively
skewed distributions.

Example 2.14
Indicate whether the following skewed distributions are positively
skewed because the mean exceeds the median or negatively
skewed because the median exceeds the mean.
a. a distribution of test scores on an easy test, with most
students scoring high and a few students scoring low
Solution
- negatively skewed because the median exceeds the mean


b. a distribution of ages of college students, with most students
in their late teens or early twenties and a few students in
their fifties or sixties
Solution
- positively skewed because the mean exceeds the median
c. a distribution of loose change carried by classmates, with
most carrying less than $1 and with some carrying $3 or $4
worth of loose change
Solution - positively skewed
d. a distribution of the sizes of crowds in attendance at a
popular movie theater, with most audiences at or near
capacity
Solution - negatively skewed
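
The rule used in Example 2.14 (compare the mean with the median) can be sketched in code. Both data sets below are hypothetical values invented for illustration; they are not taken from the example, and the function name `skew_direction` is an assumption.

```python
from statistics import mean, median


def skew_direction(scores):
    """Classify skew by comparing the mean with the median."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positively skewed"  # a few high scores pull the mean upward
    if m < md:
        return "negatively skewed"  # a few low scores pull the mean downward
    return "not skewed by this rule"


# Hypothetical ages: mostly late teens or early twenties, two older students.
ages = [18, 19, 19, 20, 20, 21, 22, 55, 62]
# Hypothetical scores on an easy test: mostly high, two low.
easy_test = [35, 60, 88, 90, 92, 94, 95, 96, 98]

print(skew_direction(ages))       # positively skewed
print(skew_direction(easy_test))  # negatively skewed
```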

 AVERAGES FOR QUALITATIVE DATA


 The mode always can be used with qualitative data.
 The median can be used whenever it is possible to order qualitative data
from least to most because the level of measurement is ordinal.

Example 2.15
College students were surveyed about where they would most
like to spend their spring break: Daytona Beach (DB), Cancun,
Mexico (C), South Padre Island (SP), Lake Havasu (LH), or other
(O). The results were as follows:

Find the mode and, if possible, the median.


Solution –
mode = DB (Daytona Beach)
Impossible to find the median when qualitative data are unordered,
with only nominal measurement.

 AVERAGES FOR RANKED DATA


 When the data consist of a series of ranks, with its ordinal level of
measurement, the median rank always can be obtained.
 It’s simply the middlemost or average of the two middlemost ranks.


8 Explain in detail about describing data with variability.

Measures of Variability
 Variability for Quantitative Data
 Range
 Variance
 Standard Deviation
 Variability for Qualitative Data
 Variability for Ranked Data

MEASURES OF VARIABILITY
 Variability is the description of the amount by which scores are dispersed
or scattered in a distribution.
 There are many ways to describe variability or spread including:
 Range
 Variance
 Standard Deviation

 VARIABILITY FOR QUANTITATIVE DATA

RANGE
 It is the difference between the largest and smallest scores.

FIGURE 2.11 - Three distributions with the same mean (10) but
different amounts of variability. Numbers in the boxes indicate
distances from the mean.
In Figure 2.11, distribution A, the least variable, has the smallest range
of 0 (from 10 to 10); distribution B, the moderately variable, has an
intermediate range of 2 (from 11 to 9); and distribution C, the most
variable, has the largest range of 6 (from 13 to 7).


VARIANCE
 The variance is the mean of all squared deviation scores.
 A deviation from the mean is how far a score lies from the mean.
 Variance is the square of the standard deviation.

Example
To get the variance, square the standard deviation.
s = 95.5
s² = 95.5 × 95.5 = 9120.25
The variance of the data is 9120.25.
 Variance reflects the degree of spread in the data set.
 The more spread the data, the larger the variance is in relation to the
mean.

SUM OF SQUARES (SS)


o Sum of Squares (SS) - The sum of squared deviation scores.

 Sum of Squares Formulas for Population

Definition formula:

$$ SS = \sum (X - \mu)^2 $$

where SS represents the sum of squares, Σ directs us to sum over the
expression to its right, and (X − μ)² denotes each of the squared deviation
scores.
 Steps:
1. Subtract the population mean, μ, from each original score, X, to
obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².

Computation formula:

$$ SS = \sum X^2 - \frac{(\sum X)^2}{N} $$

 where ΣX², the sum of the squared X scores, is obtained by first
squaring each X score and then summing all squared X scores;
 (ΣX)², the square of the sum of all X scores, is obtained by first adding all
X scores and then squaring the sum of all X scores;
 N is the population size.
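
The two routes to the sum of squares (definition and computation) should agree on any data set. The sketch below checks this on a small set of illustrative scores; the scores and the function names are assumptions chosen for this example.

```python
def ss_definition(scores):
    """SS as the sum of squared deviations from the mean."""
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores)


def ss_computation(scores):
    """SS as the sum of squared scores minus (sum of scores)^2 / N."""
    n = len(scores)
    return sum(x ** 2 for x in scores) - sum(scores) ** 2 / n


# Illustrative scores with a mean of 10.
scores = [13, 10, 11, 7, 9, 11, 9]

print(ss_definition(scores))   # 22.0
print(ss_computation(scores))  # 22.0
```

The computation form avoids working out each deviation, which is convenient when the mean is not a whole number.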


 Sum of Squares Formulas for Sample

$$ SS = \sum (X - \bar{X})^2 $$

$$ SS = \sum X^2 - \frac{(\sum X)^2}{n} $$

where X̄, the sample mean, replaces μ, the population mean, and n,
the sample size, replaces N, the population size.

STANDARD DEVIATION
 The square root of the mean of all squared deviations from the mean, that is,
standard deviation = √variance

Standard Deviation for Population (σ)

variance = sum of all squared deviation scores / number of scores

$$ \sigma^2 = \frac{SS}{N} $$

where σ² represents the population variance, SS is the sum of squared
deviations for the population, and N is the population size.

$$ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}} $$

where σ represents the population standard deviation.


Table 2.22: Calculation of Population Standard Deviation (σ) (Definition Formula)

Table 2.23: Calculation of Population Standard Deviation (σ) (Computation Formula)

Standard Deviation for Sample (s)

Table 2.24: Calculation of Sample Standard Deviation (s) (Definition Formula)

Table 2.25: Calculation of Sample Standard Deviation (s) (Computation Formula)


 DEGREES OF FREEDOM (df)


 Degrees of freedom (df) refers to the number of values that are free to
vary, given one or more mathematical restrictions.
 Using degrees of freedom, the formulas for the sample variance and
standard deviation can be rewritten as:

$$ s^2 = \frac{SS}{n - 1} = \frac{SS}{df} $$

$$ s = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{SS}{df}} $$

where s² and s represent the sample variance and standard deviation, SS is
the sum of squares, and df is the degrees of freedom and equals n − 1.
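
The effect of dividing by df = n − 1 rather than by n can be seen in a short sketch. The scores are illustrative values chosen for this example; `statistics.stdev` and `statistics.pstdev` are the standard-library sample and population standard deviations and serve only as a cross-check.

```python
from math import sqrt
from statistics import pstdev, stdev

scores = [13, 10, 11, 7, 9, 11, 9]  # illustrative sample

n = len(scores)
sample_mean = sum(scores) / n
ss = sum((x - sample_mean) ** 2 for x in scores)  # sum of squares

df = n - 1                  # one value is fixed once the mean is known
sample_variance = ss / df   # s squared = SS / df
sample_sd = sqrt(sample_variance)
population_sd = sqrt(ss / n)  # for comparison: divide by n instead

print(round(sample_sd, 4), round(stdev(scores), 4))
print(round(population_sd, 4), round(pstdev(scores), 4))
```

Dividing by n − 1 always gives a slightly larger value than dividing by n, and the difference shrinks as the sample size grows.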

 INTERQUARTILE RANGE (IQR)


 The interquartile range (IQR) is the range for the middle 50 percent of the
scores.

Table 2.26: Calculation of the IQR


Note: Measures of variability are virtually nonexistent for qualitative
and ranked data.

Example 2.16
Using the computation formula for the sum of squares,
calculate the population standard deviation for the scores in (a) and
the sample standard deviation for the scores in (b).
(a) 1, 3, 7, 2, 0, 4, 7, 3

(b) 10, 8, 5, 0, 1, 1, 7, 9, 2
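
One way to work Example 2.16 is sketched below, using the computation formula for the sum of squares. The function names are assumptions made for this sketch; the rounded results follow from the scores given in the example.

```python
from math import sqrt


def sum_of_squares(scores):
    """Computation formula: sum of X^2 minus (sum of X)^2 / number of scores."""
    return sum(x * x for x in scores) - sum(scores) ** 2 / len(scores)


def population_sd(scores):
    """Population standard deviation: square root of SS / N."""
    return sqrt(sum_of_squares(scores) / len(scores))


def sample_sd(scores):
    """Sample standard deviation: square root of SS / (n - 1)."""
    return sqrt(sum_of_squares(scores) / (len(scores) - 1))


scores_a = [1, 3, 7, 2, 0, 4, 7, 3]
scores_b = [10, 8, 5, 0, 1, 1, 7, 9, 2]

print(round(population_sd(scores_a), 2))  # 2.39
print(round(sample_sd(scores_b), 2))      # 3.87
```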

Example 2.17
Days absent from school for a sample of 10 first-grade children are:
8, 5, 7, 1, 4, 0, 5, 7, 2, 9.
a) Before calculating the standard deviation, decide whether the
definitional or computational formula would be more efficient.
Why?
Solution - computation formula since the mean is not a whole number.
b) Use the more efficient formula to calculate the sample standard
deviation.

Example 2.18
As a first step toward modifying his study habits, Phil keeps daily
records of his study time.
a. During the first two weeks, Phil’s mean study time equals 20 hours
per week. If he studied 22 hours during the first week, how many
hours did he study during the second week?
Solution - 18 hours
b. During the first four weeks, Phil’s mean study time equals 21
hours. If he studied 22, 18, and 21 hours during the first, second,
and third weeks, respectively, how many hours did he study during
the fourth week?
Solution - 23 hours
c. If the information in (a) and (b) is to be used to estimate some
unknown population characteristic, the notion of degrees of
freedom can be introduced. How many degrees of freedom are
associated with (a) and (b)?
Solution - df = 1 in (a) and df = 3 in (b)
d. Describe the mathematical restriction that causes a loss of degrees
of freedom in (a) and (b).
Solution - When all observations are expressed as deviations from
their mean, the sum of all deviations must equal zero.

Example 2.19
Determine the values of the range and the IQR for the following sets of
data.
a. Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
range = 25; IQR = 65 – 60 = 5
b. Residence changes: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4
range = 11; IQR = 4 – 1 = 3
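
The answers to Example 2.19 can be reproduced with a short sketch. The quartile rule used here (count in (n + 1) / 4 positions from each end of the ordered scores, rounding to the nearest whole number) is an assumption, chosen because it reproduces the answers above; other quartile conventions can give slightly different values.

```python
def value_range(scores):
    """Difference between the largest and smallest scores."""
    return max(scores) - min(scores)


def iqr(scores):
    """Range of the middle 50 percent, using a simple counting rule."""
    ordered = sorted(scores)
    # Number of positions to count in from each end, rounded half up.
    k = int((len(ordered) + 1) / 4 + 0.5)
    q1 = ordered[k - 1]  # k-th score from the bottom (25th percentile)
    q3 = ordered[-k]     # k-th score from the top (75th percentile)
    return q3 - q1


retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
residence_changes = [1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4]

print(value_range(retirement_ages), iqr(retirement_ages))      # 25 5
print(value_range(residence_changes), iqr(residence_changes))  # 11 3
```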

9 Discuss in detail about Normal Distribution, Normal Curve, Z - Scores
and list out the properties of the normal curve.
Normal Curve
 A theoretical curve noted for its symmetrical bell-shaped form.
Important properties of the normal curve:
 The normal curve is a theoretical curve defined for a continuous variable,
in symmetrical bell-shaped form.
 Because the normal curve is symmetrical, its lower half is the mirror
image of its upper half.
 Being bell shaped, the normal curve peaks above a point midway along
the horizontal spread and then tapers off gradually in either direction
from the peak.
 The values of the mean, median and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.


FIGURE 2.12 Normal curve superimposed on the distribution of heights.

Different Normal Curves

Standard Normal Curve.


 Standard Normal Curve is the one normal curve for which a table is
actually available.
 Standard Normal Curve is the tabled normal curve for z scores, with a mean
of 0 and a standard deviation of 1.
 To verify (rather than prove) that the mean of a standard normal distribution
equals 0, replace X in the z score formula with μ, the mean of any
(nonstandard) normal distribution, and then solve for z:

$$ z = \frac{X - \mu}{\sigma} = \frac{\mu - \mu}{\sigma} = \frac{0}{\sigma} = 0 $$

 To verify that the standard deviation of the standard normal distribution
equals 1, replace X in the z score formula with μ + 1σ, the value
corresponding to one standard deviation above the mean for any
(nonstandard) normal distribution, and then solve for z:

$$ z = \frac{X - \mu}{\sigma} = \frac{\mu + 1\sigma - \mu}{\sigma} = \frac{\sigma}{\sigma} = 1 $$


 Although there is an infinite number of different normal curves, each with its
own mean and standard deviation, there is only one standard normal curve,
with a mean of 0 and a standard deviation of 1.

FIGURE 2.13 Converting three normal curves to the standard normal
curve.
 Figure 2.13 illustrates the emergence of the standard normal curve from
three different normal curves: that for the men’s heights, with a mean of 69
inches and a standard deviation of 3 inches; that for the useful lives of 100-
watt electric light bulbs, with a mean of 1200 hours and a standard
deviation of 120 hours; and that for the IQ scores of fourth graders, with a
mean of 105 points and a standard deviation of 15 points.
 Converting all original observations into z scores leaves the normal shape
intact but not the units of measurement. Shaded observations of 66 inches,
1080 hours, and 90 IQ points all reappear as a z score of –1.00.
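
The claim that standardizing always produces a mean of 0 and a standard deviation of 1 can be checked numerically. The heights in this sketch are hypothetical values invented for illustration, and `standardize` is a name chosen for the sketch; the three shaded observations from Figure 2.13 are converted at the end.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert every score to a z score using the data's own mean and SD."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(x - mu) / sigma for x in scores]


heights = [63, 66, 69, 72, 75]  # hypothetical heights in inches
z_scores = standardize(heights)

print(round(mean(z_scores), 6))    # mean of the z scores is 0 (up to rounding)
print(round(pstdev(z_scores), 6))  # standard deviation of the z scores is 1

# Shaded observations in Figure 2.13: each lies one SD below its mean.
print((66 - 69) / 3, (1080 - 1200) / 120, (90 - 105) / 15)  # -1.0 -1.0 -1.0
```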

10 Discuss in detail about z SCORES.


 A z score is a unit-free, standardized score that indicates how many
standard deviations a score is above or below the mean of its distribution.

$$ z = \frac{X - \mu}{\sigma} $$

 where X is the original score and μ and σ are the mean and the standard
deviation, respectively, for the normal distribution of the original scores.
 A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean;
2. a number indicating the size of its deviation from the mean in standard
deviation units.
Example
 A z score of 2.00 always signifies that the original score is exactly two
standard deviations above its mean.
 Similarly, a z score of –1.27 signifies that the original score is exactly 1.27
standard deviations below its mean.


 A z score of 0 signifies that the original score coincides with the mean.

Example 2.20
Express each of the following scores as a z score:
a. Margaret’s IQ of 135, given a mean of 100 and a standard deviation
of 15
b. a score of 470 on the SAT math test, given a mean of 500 and a
standard deviation of 100
c. a daily production of 2100 loaves of bread by a bakery, given a
mean of 2180 and a standard deviation of 50
d. Sam’s height of 69 inches, given a mean of 69 and a standard
deviation of 3
e. a thermometer-reading error of –3 degrees, given a mean of 0
degrees and a standard deviation of 2 degrees
Solution: (a) 2.33 (b) –0.30 (c) –1.60 (d) 0.00 (e) –1.50
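
The five conversions in Example 2.20 can be verified with a one-line z score function. The function name `z_score` and the dictionary layout are assumptions made for this sketch; the numbers are those given in the example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mu) / sigma


# (original score, mean, standard deviation) for parts (a) through (e)
cases = {
    "a": (135, 100, 15),
    "b": (470, 500, 100),
    "c": (2100, 2180, 50),
    "d": (69, 69, 3),
    "e": (-3, 0, 2),
}

for label, (x, mu, sigma) in cases.items():
    print(label, f"{z_score(x, mu, sigma):.2f}")
```

The sign of each result shows whether the score lies above or below its mean, and the magnitude shows how far in standard deviation units.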

Standard Normal Table


 The standard normal table consists of columns of z scores coordinated with
columns of proportions. In a typical problem, access to the table is gained
through a z score, such as –1.00.

TABLE 2.27 Proportions (Of Areas) Under The Standard Normal Curve For
Values Of Z

Using the Top Legend of the Table


 Table 2.27 shows an abbreviated version of the standard normal table. The
columns are arranged in sets of three, designated as A, B, and C in the
legend at the top of the table.


 When using the top legend, all entries refer to the upper half of the standard
normal curve. The entries in column A are z scores, beginning with 0.00 and
ending with 4.00.
 Given a z score of zero or more, columns B and C indicate how the z score
splits the area in the upper half of the normal curve.
 As suggested by the shading in the top legend, column B indicates the
proportion of area between the mean and the z score, and column C indicates
the proportion of area beyond the z score, in the upper tail of the standard
normal curve.

Using the Bottom Legend of the Table


 The columns are designated as A′, B′, and C′ in the legend at the bottom of
the table. When using the bottom legend, all entries refer to the lower half of
the standard normal curve.
 Imagine that the nonzero entries in column A′ are negative z scores,
beginning with –0.01 and ending with –4.00.
 Given a negative z score, columns B′ and C′ indicate how that z score splits
the lower half of the normal curve.
 As suggested by the shading in the bottom legend of the table, column B′
indicates the proportion of area between the mean and the negative z score,
and column C′ indicates the proportion of area beyond the negative z score,
in the lower tail of the standard normal curve.

Example 2.21
Using Table A in Appendix C, find the proportion of the total area
identified with the following statements:
(a) above a z score of 1.80
(b) between the mean and a z score of –0.43
(c) below a z score of –3.00
(d) between the mean and a z score of 1.65
(e) between z scores of 0 and –1.96
Solution : (a) .0359 (b) .1664 (c) .0013 (d) .4505 (e) .4750
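
The table lookups in Example 2.21 can be cross-checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the helper names `column_b` and `column_c` are illustrative only and simply mirror the column labels of the table.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def column_b(z):
    """Area between the mean and z (column B, or B' when z is negative)."""
    return abs(STANDARD_NORMAL.cdf(z) - 0.5)


def column_c(z):
    """Area beyond z, in the tail on the same side as z (column C or C')."""
    return 1.0 - STANDARD_NORMAL.cdf(abs(z))


# The five statements of Example 2.21, expressed with the two helpers.
answers = {
    "a": column_c(1.80),   # above a z score of 1.80
    "b": column_b(-0.43),  # between the mean and a z score of -0.43
    "c": column_c(-3.00),  # below a z score of -3.00
    "d": column_b(1.65),   # between the mean and a z score of 1.65
    "e": column_b(-1.96),  # between z scores of 0 and -1.96
}

for label, proportion in answers.items():
    print(f"({label}) {proportion:.4f}")
```

For any z score, `column_b(z) + column_c(z)` equals .5000, which matches the way each half of the table is organized.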


11. Explain in detail about finding proportions and finding scores.

FIGURE 2.14 Normal Curve Problems


 When using the standard normal table, the corresponding proportions in
columns B and C (or columns B′ and C′) always sum to .5000.
 Similarly, the total area under the normal curve always equals 1.0000, the
sum of the proportions in the lower and upper halves, that is, .5000 + .5000.
 Finally, although a z score can be either positive or negative, the proportions
of area under the curve are always positive or zero but never negative.

Steps for finding proportions


Finding proportion for One Score


 Find the proportion who are shorter than exactly 66 inches, given that the
distribution of heights approximates a normal curve with a mean of 69
inches and a standard deviation of 3 inches.
1. Sketch a normal curve and shade in the target area. Being less than the
mean of 69, 66 is located to the left of the mean.

FIGURE 2.15: Finding proportions.


2. Plan your solution according to the normal table.
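3. Convert X to z. Express 66 as a z score:

$$z = \frac{X - \mu}{\sigma} = \frac{66 - 69}{3} = -1.00$$

4. Find the target area. In the standard normal table, locate a z score of
–1.00 in column A′ and note the corresponding proportion of .1587 in
column C′.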
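
The four steps above can also be sketched in code. This is a minimal illustration, assuming Python's standard-library `statistics.NormalDist` in place of the printed table; the variable names are illustrative only.

```python
from statistics import NormalDist

# Heights: mean of 69 inches, standard deviation of 3 inches (from the example).
heights = NormalDist(mu=69, sigma=3)

x = 66
z = (x - heights.mean) / heights.stdev   # step 3: convert X to z
proportion_shorter = heights.cdf(x)      # step 4: area below X (column C')
proportion_taller = 1 - proportion_shorter

print(f"z = {z:.2f}")
print(f"proportion shorter than {x} inches = {proportion_shorter:.4f}")
print(f"proportion taller than {x} inches  = {proportion_taller:.4f}")
```

The cumulative proportion returned by `cdf` plays the role of column C′ here because 66 lies below the mean.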

Example 2.22
Assume that GRE scores approximate a normal curve with a mean of
500 and a standard deviation of 100.
(a) Sketch a normal curve and shade in the target area described by
each of the following statements:
(i) less than 400
(ii) more than 650
(iii) less than 700
(b) Plan solutions (in terms of columns B, C, B′, or C′ of the standard
normal table, as well as the fact that the proportion for either the
entire upper half or lower half always equals .5000) for the target
areas in part (a).
(c) Convert to z scores and find the proportions that correspond to the
target areas in part (a)


Finding Proportions between Two Scores


 Assume that the gestation periods for human fetuses approximate a normal
curve with a mean of 270 days (9 months) and a standard deviation of 15
days. What proportion of gestation periods will be between 245 and 255
days?
1. Sketch a normal curve and shade in the target area, as in the top panel of
Figure 2.16.

FIGURE 2.16: Finding proportions


2. Plan your solution according to the normal table. The basic idea is to
identify the target area with the difference between two overlapping areas
whose values can be read from column C′ of Table A. The larger area (less
than 255 days) contains two sectors: the target area (between 245 and 255
days) and a remainder (less than 245 days). The smaller area contains only
the remainder (less than 245 days). Subtracting the smaller area (less than
245 days) from the larger area (less than 255 days), therefore, eliminates
the common remainder (less than 245 days), leaving only the target area
(between 245 and 255 days).

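3. Convert X to z by expressing 255 as

$$z = \frac{255 - 270}{15} = \frac{-15}{15} = -1.00$$

and by expressing 245 as

$$z = \frac{245 - 270}{15} = \frac{-25}{15} = -1.67$$

4. Find the target area. In column C′ of the table, a z score of –1.00
corresponds to a proportion of .1587, and a z score of –1.67 corresponds to
a proportion of .0475. Subtracting the smaller area from the larger area
gives the target area: .1587 − .0475 = .1112.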
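
The subtraction of two overlapping areas can be sketched as follows, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

# Gestation periods: mean of 270 days, standard deviation of 15 days.
gestation = NormalDist(mu=270, sigma=15)

below_255 = gestation.cdf(255)   # larger area: target area plus remainder
below_245 = gestation.cdf(245)   # smaller area: remainder only

# Subtracting removes the common remainder and leaves the target area.
target_area = below_255 - below_245

print(f"area below 255 days = {below_255:.4f}")
print(f"area below 245 days = {below_245:.4f}")
print(f"area between 245 and 255 days = {target_area:.4f}")
```

The result can differ slightly in the third or fourth decimal place from a table-based answer, because the table requires the z score for 245 days to be rounded to two decimal places.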


Finding Proportions beyond Two Scores


 Assume that high school students’ IQ scores approximate a normal
distribution with a mean of 105 and a standard deviation of 15. What
proportion of IQs are more than 30 points either above or below the mean?
1. Sketch a normal curve and shade in the two target areas, as in the top
panel of Figure 2.17

Figure 2.17 Finding proportions


2. Plan your solution according to the normal table.

3. Convert X to z by expressing IQ scores of 135 and 75 as

$$z = \frac{135 - 105}{15} = 2.00 \qquad z = \frac{75 - 105}{15} = -2.00$$

4. Find the target area. In Table A, locate a z score of 2.00 in column A, and
note the corresponding proportion of .0228 in column C. Because of the
symmetry of the normal curve, you need not enter the table again to find
the proportion below z score of –2.00. Instead, merely double the above
proportion of .0228 to obtain .0456, which represents the proportion of
students with IQs more than 30 points either above or below the mean.
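
The symmetry argument in step 4 can be checked with a short sketch, again assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

# IQ scores: mean of 105, standard deviation of 15 (from the example).
iq = NormalDist(mu=105, sigma=15)

upper_tail = 1 - iq.cdf(135)   # more than 30 points above the mean
lower_tail = iq.cdf(75)        # more than 30 points below the mean
both_tails = upper_tail + lower_tail

print(f"upper tail = {upper_tail:.4f}")
print(f"lower tail = {lower_tail:.4f}")
print(f"both tails = {both_tails:.4f}")
```

Because the normal curve is symmetrical, the two tails are equal, so doubling either one gives the same total. Small differences from the table-based answer come from rounding in the table.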


Example 2.23:
Assume that SAT math scores approximate a normal curve with a
mean of 500 and a standard deviation of 100.
(a) Sketch a normal curve and shade in the target area(s) described
by each of the following statements:
(i) more than 570
(ii) less than 515
(iii) between 520 and 540
(iv) between 470 and 520
(v) more than 50 points above the mean
(vi) more than 100 points either above or below the mean
(vii) within 50 points either above or below the mean
(b) Plan solutions (in terms of columns B, C, B′, and C′) for the
target areas in part (a).
(c) Convert to z scores and find the target areas in part (a).

Solution:


12. Explain in detail about correlations and the types of relationships in
correlation.

CONTENTS:
 Correlation
 Types of relationship
o Positive Relationship
o Negative Relationship
o Little or no relationship
 Correlation Coefficient r
o Key properties of r
 Computational formula for Correlation Coefficient
 Computational Sequence
 Data and Computation
Correlation
 Two variables are related if pairs of scores show an orderliness that can
be depicted graphically with a scatter plot and numerically with a
correlation coefficient.
Table 2.8 Greeting Cards Sent and Received by five Friends

 Types of relationship:
1. Positive Relationship
2. Negative Relationship
3. Little or no relationship

To illustrate the types of relationships, refer to Table 2.8 (greeting cards
sent and received by friends).

Positive relationship:
 Two variables are positively related if pairs of scores tend to occupy
similar relative positions (high with high and low with low) in their
respective distributions.


 Trends among pairs of scores can be detected most easily by constructing a
list of paired scores in which the scores along one variable are arranged
from largest to smallest.
 In Table 2.9, the five pairs of scores are arranged from the largest (13)
to the smallest (1) number of cards sent.
 Because relatively low values are paired with relatively low values, and
relatively high values are paired with relatively high values, the
relationship is positive.
Table 2.9 Positive Relationship

Negative relationship:
 Two variables are negatively related if pairs of scores tend to occupy
dissimilar relative positions (high with low and vice versa) in their
respective distributions.
 Notice the pattern among the pairs in Table 2.10. Now there is a
pronounced tendency for pairs of scores to occupy dissimilar and
opposite relative positions in their respective distributions.
 Because relatively low values are paired with relatively high values, and
relatively high values are paired with relatively low values, the
relationship is negative.
Table 2.10 Negative Relationship

Little or No Relationship
 No regularity is apparent among the pairs of scores in Table 2.12.


 For instance, although both Andrea and John sent relatively few cards
(5 and 1, respectively), Andrea received relatively few cards (6) and
John received relatively many cards (14).
 Given this lack of regularity, little, if any, relationship exists between
the two variables.
Table 2.12 Little or No Relationship

Example 2.34:
Indicate whether the following statements suggest a positive or
negative relationship:
a. More densely populated areas have higher crime rates.
b. Schoolchildren who often watch TV perform more poorly on
academic achievement tests.
c. Heavier automobiles yield poorer gas mileage.
d. Better-educated people have higher incomes.
e. More anxious people voluntarily spend more time performing a
simple repetitive task.
Solution:
a. Positive. The crime rate is higher, square mile by square mile, in
densely populated cities than in sparsely populated rural areas.
b. Negative. As TV viewing increases, performance on academic
achievement tests tends to decline.
c. Negative. Increases in car weight are accompanied by decreases in
miles per gallon.
d. Positive. Increases in educational level—grade school, high school,
college—tend to be associated with increases in income.
e. Positive. Highly anxious people willingly spend more time performing a
simple repetitive task than do less anxious people.

CORRELATION COEFFICIENT R

 A correlation coefficient is a number between –1 and 1 that describes the
relationship between pairs of variables.

Key Properties of r
 Named in honor of the British scientist Karl Pearson, the Pearson
correlation coefficient, r, can equal any value between –1.00 and +1.00.


 Furthermore, the following two properties apply:
1. The sign of r indicates the type of linear relationship, whether positive
or negative.
A number with a plus sign (or no sign) indicates a positive
relationship, and a number with a minus sign indicates a negative
relationship.
2. The numerical value of r, without regard to sign, indicates the strength
of the linear relationship.
The more closely a value of r approaches either –1.00 or +1.00, the
stronger (more regular) the relationship.
The more closely the value of r approaches 0, the weaker (less regular)
the relationship.

 COMPUTATIONAL FORMULA FOR CORRELATION COEFFICIENT

Calculate a value for r by using the following computation formula:

$$r = \frac{SP_{xy}}{\sqrt{SS_x \, SS_y}}$$

Where
 SPxy is the sum of the products for each pair of deviation scores, defined
as

$$SP_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$

 SSx and SSy are the sums of the squared deviation scores for X and Y,
defined as

$$SS_x = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$

$$SS_y = \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$

 Notice in the formula for r that, since the terms in the denominator must
be positive, only the sum of the products, SPxy, determines whether the
value of r is positive or negative.
 Furthermore, the size of SPxy mirrors the strength of the relationship;
stronger relationships are associated with larger positive or negative
sums of products.


 Computational Sequence
 Assign a value to n (1), representing the number of pairs of scores.
 Sum all scores for X (2) and for Y (3).
 Find the product of each pair of X and Y scores (4), one at a time, then
add all of these products (5).
 Square each X score (6), one at a time, then add all squared X scores
(7).
 Square each Y score (8), one at a time, then add all squared Y scores
(9).
 Substitute numbers into formulas (10) and solve for SPxy, SSx, and SSy.
 Substitute into formula (11) and solve for r.
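
The computational sequence above can be sketched as a short function. This is an illustration under stated assumptions, not the original worked table: the greeting card values are those quoted in the text for Mike, Doris, Steve, and John, while Andrea's received count of 10 is inferred from the stated mean of 12 cards received. The function name `pearson_r` is illustrative only.

```python
from math import sqrt

# Greeting card data in the order Andrea, Mike, Doris, Steve, John.
# Andrea's received count (10) is inferred from the stated mean of 12.
cards_sent = [5, 7, 13, 9, 1]         # X
cards_received = [10, 12, 14, 18, 6]  # Y


def pearson_r(xs, ys):
    """Pearson r from the computation formulas for SPxy, SSx, and SSy."""
    n = len(xs)                                   # number of pairs of scores
    sum_x = sum(xs)                               # sum of X
    sum_y = sum(ys)                               # sum of Y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of XY products
    sum_x2 = sum(x * x for x in xs)               # sum of squared X scores
    sum_y2 = sum(y * y for y in ys)               # sum of squared Y scores

    sp_xy = sum_xy - (sum_x * sum_y) / n
    ss_x = sum_x2 - sum_x ** 2 / n
    ss_y = sum_y2 - sum_y ** 2 / n
    return sp_xy / sqrt(ss_x * ss_y)


r = pearson_r(cards_sent, cards_received)
print(f"r = {r:.2f}")
```

With these five pairs, SPxy is 64 and both SSx and SSy are 80, so r comes to .80, which agrees with the slope and r² values used later in these notes.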

 Data and Computation


Example 2.35:
Supply a verbal description for each of the following
correlations.
a. an r of –.84 between total mileage and automobile resale
value
b. an r of –.35 between the number of days absent from school
and performance on a math achievement test
c. an r of .03 between anxiety level and college GPA
d. an r of .56 between age of schoolchildren and reading
comprehension

Solution:
a. Cars with more total miles tend to have lower resale values.
b. Students with more absences from school tend to score lower on math
achievement tests.
c. Little or no relationship between anxiety level and college GPA.
d. Older school children tend to have better reading comprehension.

Example 2.36:
Couples who attend a clinic for first pregnancies are asked to estimate
(independently of each other) the ideal number of children. Given that
X and Y represent the estimates of females and males, respectively,
the results are as follows:

Calculate a value for r, using the computation formula.


Solution:


13. Explain in detail about Scatterplots and diagrammatically represent the
relationship between variables using scatterplot.

CONTENTS:
 Scatterplot
 Construction of Scatterplot
 Types of relationship in Scatterplot
o Positive Relationship
o Negative Relationship
o Little or no Relationship
o Strong or Weak Relationship
o Linear Relationship
o Perfect Relationship
o Curvilinear Relationship

Scatterplot
 A scatterplot is a graph containing a cluster of dots that represents all
pairs of scores.
Construction of Scatterplot
 To construct a scatterplot, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate a
dot within the scatterplot.
 For example, the pair of numbers for Mike, 7 and 12, define points along
the X and Y axes, respectively.
 Using these points to anchor lines perpendicular (at right angles) to each
axis, locate Mike’s dot where the two lines intersect.
 Repeat this process, with imaginary lines, for each of the four remaining
pairs of scores to create the scatterplot shown in Figure 2.14.


Figure 2.14 – Scatterplot for Greeting Card Exchange

 Figure 2.14 illustrates the basic idea of correlation and the construction
of a scatterplot.

Types of Relationship in Scatterplot


 Positive Relationship
 Negative Relationship
 Little or no Relationship
 Strong or Weak Relationship
 Linear Relationship
 Perfect Relationship
 Curvilinear Relationship

Positive Relationship
 A dot cluster that has a slope from the lower left to the upper right
reflects a positive relationship.
 Small values of one variable are paired with small values of the other
variable, and large values are paired with large values.
 In Figure 2.15, short people tend to be light, and tall people tend to be
heavy.


Figure 2.15 – Positive Relationship

Negative Relationship
 A dot cluster that has a slope from the upper left to the lower right
reflects a negative relationship.
 Small values of one variable tend to be paired with large values of the
other variable, and vice versa.
 Figure 2.16 reflects the negative relationship between life expectancy and
heavy smoking.

Figure 2.16 – Negative Relationship

Little or No Relationship
 A dot cluster that lacks any apparent slope reflects little or no
relationship.
 Small values of one variable are just as likely to be paired with small,
medium, or large values of the other variable.
 In Figure 2.17, dots are strewn about in an irregular fashion, suggesting
that there is little or no relationship between the height of young adults
and their life expectancies.


Figure 2.17 – Little or No Relationship

Strong or Weak Relationship


 The more closely the dot cluster approximates a straight line, the stronger
the relationship will be.
 The more scattered the dot cluster, the weaker the relationship will be.

Figure 2.18 – Three Positive Relationships

 Figure 2.18 shows a series of scatterplots, each representing a different
positive relationship between IQ scores for pairs of people whose
backgrounds reflect different degrees of genetic overlap, ranging from
minimum overlap between foster parents and foster children to maximum
overlap between identical twins.
 Notice that the dot cluster more closely approximates a straight line for
people with greater degrees of genetic overlap: for parents and children
in panel B of Figure 2.18 and even more so for identical twins in panel C.

Linear Relationship
 A relationship that can be described best with a straight line.

Perfect Relationship
 A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.


Curvilinear Relationship
 Sometimes a dot cluster approximates a bent or curved line, as in Figure
2.19, and therefore reflects a curvilinear relationship.

Figure 2.19 – Curvilinear Relationship

Example 2.37:
Critical reading and math scores on the SAT test for students A, B, C,
D, E, F, G, and H are shown in the following scatterplot:

a. Which student(s) scored about the same on both tests?
b. Which student(s) scored higher on the critical reading test than
on the math test?
c. Which student(s) will be eligible for an honors program that
requires minimum scores of 700 in critical reading and 500 in math?
d. Is there a negative relationship between the critical reading and
math scores?
Solution:
a. I, D, F
b. B, H, E
c. E, H
d. No. The relationship is positive.


14.Explain in detail about Regression Line.

CONTENTS:
 Regression Line
 Example
 Placement of Line
 Predictive Errors
 Total Predictive Errors

Regression Line
 The regression line is a straight line rather than a curved line because
of the linear relationship between two variables.

Example: Rough Prediction

Figure 2.20 – A rough prediction

 To obtain a slightly more precise prediction for Emma, refer to the
scatterplot for the original five friends shown in Figure 2.20.
 Notice that Emma’s plan to send 11 cards locates her along the X axis
between the 9 cards sent by Steve and the 13 sent by Doris.


 Using the dots for Steve and Doris as guides, construct two strings of
arrows, one beginning at 9 and ending at 18 for Steve and the other
beginning at 13 and ending at 14 for Doris.
 Focusing on the interval along the Y axis between the two strings of
arrows, we could predict that Emma’s return should be between 14 and
18 cards, the numbers received by Doris and Steve.

Prediction using Regression Line

Figure 2.21 - Prediction using the Regression Line

 All five dots contribute to the more precise prediction, illustrated in
Figure 2.21, that Emma will receive 15.20 cards.
 The solid line in Figure 2.21, designated as the regression line, guides
the string of arrows, beginning at 11, toward the predicted value of 15.20.
 The regression line is a straight line rather than a curved line because
of the linear relationship between cards sent and cards received.

Predictive Errors
 Figure 2.22 illustrates the predictive errors that would have occurred
if the regression line had been used to predict the number of cards
received by the five friends.


 Solid dots reflect the actual number of cards received, and open dots,
always located along the regression line, reflect the predicted number
of cards received

Figure 2.22 - Predictive Errors

Total Predictive Error


 It is desirable for the regression line to be placed in a position that
minimizes the total predictive error.

15.Describe about Least squares Regression Line.

CONTENTS:
 Least squares Regression Line
 Least Squares Regression Equation
 Computational Sequence
 Computations

Least Squares Regression Line

 The regression line is often referred to as the least squares regression
line because it is placed to minimize the total squared predictive error,
thereby providing a more favorable prognosis for the predictions.

Least Squares Regression Equation


 An equation pinpoints the exact least squares regression line for any
scatterplot.
 This equation minimizes the total of all squared predictive errors for
known Y scores in the original correlation analysis:

$$Y' = bX + a$$


o where Y′ represents the predicted value;
o X represents the known value;
o b and a represent numbers calculated from the original correlation
analysis

Finding Values of b and a

o The expression for b reads:

$$b = r\sqrt{\frac{SS_y}{SS_x}}$$

 where r represents the correlation between X and Y;
 SSy represents the sum of squares for all Y scores;
 SSx represents the sum of squares for all X scores

o The expression for a reads:

$$a = \bar{Y} - b\bar{X}$$

where Ȳ and X̄ refer to the sample means for all Y and X
scores, respectively, and b is defined by the preceding
expression.

Computational Sequence
 Determine values of SSx, SSy, and r (1) by referring to the original
correlation analysis.
 Substitute numbers into the formula (2) and solve for b.
 Assign values to X̄ and Ȳ (3) by referring to the original correlation
analysis.
 Substitute numbers into the formula (4) and solve for a.
 Substitute numbers for b and a in the least squares regression equation
(5).


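Computations

For the greeting card data (r = .80, SSx = 80, SSy = 80, X̄ = 7, Ȳ = 12):

$$b = r\sqrt{\frac{SS_y}{SS_x}} = .80\sqrt{\frac{80}{80}} = .80$$

$$a = \bar{Y} - b\bar{X} = 12 - (.80)(7) = 6.40$$

$$Y' = .80(X) + 6.40$$

where .80 and 6.40 represent the values computed for b and a,
respectively.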
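
The computational sequence for b and a can be sketched in code. This is a minimal illustration, assuming the greeting card values quoted in the text (Andrea's received count of 10 is inferred from the stated mean of 12); the function name `least_squares_line` is illustrative only.

```python
from math import sqrt

# Greeting card data in the order Andrea, Mike, Doris, Steve, John.
# Andrea's received count (10) is inferred from the stated mean of 12.
cards_sent = [5, 7, 13, 9, 1]         # X
cards_received = [10, 12, 14, 18, 6]  # Y


def least_squares_line(xs, ys):
    """Return (b, a) for the least squares equation Y' = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sp_xy / sqrt(ss_x * ss_y)

    b = r * sqrt(ss_y / ss_x)   # slope
    a = mean_y - b * mean_x     # Y intercept
    return b, a


b, a = least_squares_line(cards_sent, cards_received)
emma_predicted = b * 11 + a     # Emma plans to send 11 cards

print(f"Y' = {b:.2f}(X) + {a:.2f}")
print(f"predicted cards received by Emma = {emma_predicted:.2f}")
```

The prediction for Emma agrees with the value of 15.20 read from the regression line in Figure 2.21.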

Example 2.37:
Assume that an r of .30 describes the relationship between
educational level (highest grade completed) and estimated
number of hours spent reading each week. More specifically:

(a) Determine the least squares equation for predicting weekly
reading time from educational level.
(b) Faith’s education level is 15. What is her predicted reading
time?
(c) Keegan’s educational level is 11. What is his predicted
reading time?


Solution:

16.Explain in detail about Standard error of estimate.

CONTENTS:
 Standard Error of Estimate
 Finding the Standard Error of Estimate
 Definition Formula
 Computation Formula
 Computational Sequence
 Computation
 Importance of r

Standard Error of Estimate


 The standard error of estimate is a rough measure of the average amount
of error associated with the predictions.
Finding the Standard Error of Estimate
 The standard error of estimate, symbolized as sy|x, has the same form as
any sample standard deviation, that is, the square root of a sum of squares
term divided by its degrees of freedom.
 The symbol sy|x is read as “s sub y given x.”
Definition Formula
 The formula for sy|x reads:

$$s_{y|x} = \sqrt{\frac{SS_{y|x}}{n - 2}} = \sqrt{\frac{\sum (Y - Y')^2}{n - 2}}$$

where
SSy|x represents the sum of the squares for predictive errors,
Y − Y′, and the degrees of freedom term in the denominator,
n − 2, reflects the loss of two degrees of freedom.


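Computation Formula

$$s_{y|x} = \sqrt{\frac{SS_y(1 - r^2)}{n - 2}}$$

where SSy is the sum of the squares for Y scores,
r is the correlation coefficient, and
n is the number of pairs of scores.

Computational Sequence
 Assign values to SSy and r (1) by referring to previous work with the least
squares regression equation.
 Substitute numbers into the formula (2) and solve for sy|x.
Computation

For the greeting card data (SSy = 80, r = .80, n = 5):

$$s_{y|x} = \sqrt{\frac{80[1 - (.80)^2]}{5 - 2}} = \sqrt{\frac{28.80}{3}} = \sqrt{9.60} = 3.10$$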

Importance of r
 To appreciate the importance of r, substitute a few extreme values for r
in the expression for the sum of squares for predictive errors,
SSy|x = SSy(1 − r²).
 Substituting a value of 1 for r,

$$SS_{y|x} = SS_y[1 - (1)^2] = SS_y(0) = 0$$

when predictions are based on perfect relationships, the sum of squares
for predictive errors equals zero, and there is no predictive error.
 Substituting a value of 0 for r,

$$SS_{y|x} = SS_y[1 - (0)^2] = SS_y(1) = SS_y$$

when predictions are based on a nonexistent relationship, the sum
of squares for predictive errors equals SSy, and there is no reduction in
predictive error.
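
The definition and computation formulas for the standard error of estimate should give the same value. The sketch below checks this for the greeting card data, assuming the values quoted in the text (Andrea's received count of 10 is inferred from the stated mean of 12) and the least squares equation Y′ = .80(X) + 6.40; the variable names are illustrative only.

```python
from math import sqrt

# Greeting card data in the order Andrea, Mike, Doris, Steve, John.
# Andrea's received count (10) is inferred from the stated mean of 12.
cards_sent = [5, 7, 13, 9, 1]         # X
cards_received = [10, 12, 14, 18, 6]  # Y
b, a, r = 0.80, 6.40, 0.80            # values taken from the text
n = len(cards_sent)

# Definition formula: square the predictive errors Y - Y' and sum them.
ss_y_given_x = sum((y - (b * x + a)) ** 2
                   for x, y in zip(cards_sent, cards_received))
s_definition = sqrt(ss_y_given_x / (n - 2))

# Computation formula: uses only SSy, r, and n.
mean_y = sum(cards_received) / n
ss_y = sum((y - mean_y) ** 2 for y in cards_received)
s_computation = sqrt(ss_y * (1 - r ** 2) / (n - 2))

print(f"SSy|x = {ss_y_given_x:.2f}")
print(f"sy|x (definition formula)  = {s_definition:.2f}")
print(f"sy|x (computation formula) = {s_computation:.2f}")
```

Setting `r` to 1 in the computation formula gives a standard error of zero, and setting it to 0 leaves the full SSy in the numerator, which mirrors the two extreme cases described above.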


17. Explain in detail about Interpretation of r²

CONTENTS:
 Squared correlation coefficient, r²
 Two kinds of predictive errors
 Repetitive Prediction of the Mean
 Predictive Errors
 Error Variability (Sum of Squares)
 Proportion of Predicted Variability

Squared correlation coefficient, r²

 The squared correlation coefficient, r², provides a key interpretation of
the correlation coefficient and also a measure of predictive accuracy that
supplements the standard error of estimate, sy|x.
Two kinds of predictive errors
o errors due to the repetitive prediction of the mean
o errors due to the regression equation
Repetitive Prediction of the Mean
 When only the Y scores are known and no information about X is
available, statisticians recommend repetitive predictions of the mean, Ȳ,
for a variety of reasons, including the fact that, although the predictive
error for any individual might be quite large, the sum of all of the
resulting predictive errors always equals zero.
Predictive Errors

Figure 2.23 – Predictive Error using mean


 Figure 2.23 shows the predictive errors for all five friends when the
mean for all five friends, Ȳ, of 12 is always used to predict each of
their five Y scores.

Figure 2.24– Predictive Error using Least Square Equation

 Figure 2.24 shows the corresponding predictive errors for all five
friends when a series of different Y′ values, obtained from the least
squares equation, is used to predict each of their five Y scores.
 For example, Figure 2.23 shows the error for John when the mean for
all five friends, Ȳ, of 12 is used to predict his Y score of 6. Shown as a
broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6)
indicates that Ȳ overestimates John’s Y score by 6 cards.
 Figure 2.24 shows a smaller error of −1.20 for John when a Y′ value of
7.20 is used to predict the same Y score of 6.
 This Y′ value of 7.20 is obtained from the least squares equation,

$$Y' = .80(X) + 6.40 = .80(1) + 6.40 = 7.20$$

where X = 1 is the number of cards sent by John.
 Positive and negative errors indicate that Y scores are either above or
below their corresponding predicted scores.
 Overall, errors are smaller when customized predictions of Y′ from the
least squares equation are used rather than the repetitive prediction of Ȳ.


Error Variability (Sum of Squares)


 The sum of squares of any set of deviations, called errors, can be
calculated by first squaring each error, then summing all squared errors.
 The error variability for the repetitive prediction of the mean can be
designated as SSy, since each Y score is expressed as a squared deviation
from Ȳ and then summed, that is

$$SS_y = \sum (Y - \bar{Y})^2$$

 Using the errors for the five friends shown in Figure 2.23, this becomes

$$SS_y = (-2)^2 + (0)^2 + (2)^2 + (6)^2 + (-6)^2 = 80$$

 The error variability for the customized predictions from the least squares
equation can be designated as SSy|x, since each Y score is expressed as a
squared deviation from its corresponding Y′ and then summed, that is

$$SS_{y|x} = \sum (Y - Y')^2$$

Using the errors for the five friends shown in Figure 2.24, we obtain:

$$SS_{y|x} = (-0.4)^2 + (0)^2 + (-2.8)^2 + (4.4)^2 + (-1.2)^2 = 28.8$$

Proportion of Predicted Variability


 To obtain an SS measure of the actual gain in accuracy due to the least
squares predictions, subtract the residual variability from the total
variability, that is, subtract SSy|x from SSy, to obtain

$$SS_y - SS_{y|x} = 80 - 28.8 = 51.2$$

 To express this difference, 51.2, as a gain in accuracy relative to the
original error variability for the repetitive prediction of Ȳ, divide the
above difference by SSy, that is,

$$\frac{SS_y - SS_{y|x}}{SS_y} = \frac{80 - 28.8}{80} = \frac{51.2}{80} = .64$$

 This result, .64 or 64 percent, represents the proportion or percent gain
in predictive accuracy when the repetitive prediction of Ȳ is replaced by a
series of customized Y′ predictions based on the least squares equation.
 In other words, .64 or 64 percent represents the proportion or percent of
the total variability of SSy that is predictable from its relationship with
the X variable.
 The square of the correlation coefficient, r², always indicates the
proportion of total variability in one variable that is predictable from its
relationship with the other variable.


o Expressing the equation for r² in symbols,

$$r^2 = \frac{SS_{y'}}{SS_y} = \frac{SS_y - SS_{y|x}}{SS_y}$$

 where the sum of squares term, SSy′, is simply the variability explained by
or predictable from the regression equation, that is,

$$SS_{y'} = \sum (Y' - \bar{Y})^2 = SS_y - SS_{y|x}$$

 Accordingly, r² provides us with a straightforward measure of the worth of
our least squares predictive effort.
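
The interpretation of r² as a proportion of predicted variability can be verified numerically. The sketch below assumes the greeting card values quoted in the text (Andrea's received count of 10 is inferred from the stated mean of 12) and the least squares equation Y′ = .80(X) + 6.40; the variable names are illustrative only.

```python
# Greeting card data in the order Andrea, Mike, Doris, Steve, John.
# Andrea's received count (10) is inferred from the stated mean of 12.
cards_sent = [5, 7, 13, 9, 1]         # X
cards_received = [10, 12, 14, 18, 6]  # Y
b, a = 0.80, 6.40                     # least squares equation from the text

mean_y = sum(cards_received) / len(cards_received)
predicted = [b * x + a for x in cards_sent]   # customized Y' values

# Total variability: errors from repeatedly predicting the mean.
ss_y = sum((y - mean_y) ** 2 for y in cards_received)

# Residual variability: errors from the least squares predictions.
ss_y_given_x = sum((y - y_hat) ** 2
                   for y, y_hat in zip(cards_received, predicted))

# Predicted variability, computed directly from the Y' values.
ss_y_predicted = sum((y_hat - mean_y) ** 2 for y_hat in predicted)

r_squared = (ss_y - ss_y_given_x) / ss_y

print(f"SSy = {ss_y:.1f}, SSy|x = {ss_y_given_x:.1f}, SSy' = {ss_y_predicted:.1f}")
print(f"r squared = {r_squared:.2f}")
```

The directly computed SSy′ matches the difference SSy − SSy|x, and the resulting proportion equals the square of r = .80.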

Problem 2.38:
o Assume that an r of .30 describes the relationship between
educational level and estimated hours spent reading each week.
a. According to r², what percent of the variability in weekly
reading time is predictable from its relationship with
educational level?
b. What percent of variability in weekly reading time is not
predictable from this relationship?
c. Someone claims that 9 percent of each person’s estimated
reading time is predictable from the relationship. What is
wrong with this claim?

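Solution:
(a) 9 percent is predictable, since r² = (.30)² = .09.
(b) 91 percent is not predictable, since 1 − r² = .91.
(c) The 9 percent refers to the variability of all estimated reading
times taken together, not to each person's own reading time.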

Problem 2.39:
The correlation between the IQ scores of parents and children
is .50, and that between the IQ scores of foster parents and
foster children is .27.
a. Does this signify, therefore, that the relationship between
foster parents and foster children is about one-half as
strong as the relationship between parents and children?
b. Use r2 to compare the strengths of these two correlations.

Solution:
(a) No
(b) The r² of .25 for parents and children is about four times
greater than the r² of .07 for foster parents and foster children.
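
The comparison in Problem 2.39 can be sketched in a few lines. This is only an illustration of the arithmetic; the variable names are assumptions and not part of the original problem.

```python
# Correlations given in Problem 2.39.
r_parents = 0.50
r_foster = 0.27

# Proportion of variability predictable from each relationship.
r2_parents = r_parents ** 2
r2_foster = r_foster ** 2

# Comparing r values directly suggests "about half as strong" ...
ratio_of_r = r_foster / r_parents
# ... but comparing r squared values shows a much larger gap.
ratio_of_r2 = r2_parents / r2_foster

print(f"r squared (parents) = {r2_parents:.2f}")
print(f"r squared (foster)  = {r2_foster:.2f}")
print(f"ratio of r squared values = {ratio_of_r2:.1f}")
```

With unrounded values the ratio is roughly 3.4; rounding the foster r² to .07 before dividing gives the "about four times" stated in the solution.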

18. Explain in detail about Multiple regression equations.


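 A multiple regression equation is a least squares equation that contains
more than one predictor, or X, variable.
 For instance, a serious effort to predict college GPA might culminate in
an equation of the following general form:

$$Y' = b_1 X_1 + b_2 X_2 + b_3 X_3 + a$$

where Y′ represents predicted college GPA;
X1, X2, and X3 refer to high school GPA, IQ score, and SAT score,
respectively; b1, b2, and b3 are the least squares weights for the three
predictors; and a is a constant.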
 By capitalizing on the combined predictive power of several predictor
variables, these multiple regression equations supply more accurate
predictions for Y′ than could be obtained from a simple regression
equation.
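
To illustrate how a second predictor can improve accuracy, here is a minimal sketch in plain Python. The data are hypothetical (Y is constructed as exactly 1 + 2·X1 + 3·X2), and the helper names `solve_linear_system`, `fit_least_squares`, and `residual_ss` are illustrative, not from any library. The fit solves the normal equations by Gaussian elimination, which is adequate for a small, well-conditioned example like this one.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(predictors, y):
    """Return [a, b1, b2, ...] minimizing the sum of squared errors."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1 for the constant
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def residual_ss(predictors, y, coefs):
    """Sum of squared differences between Y and predicted Y'."""
    total = 0.0
    for p, yi in zip(predictors, y):
        y_pred = coefs[0] + sum(b * xv for b, xv in zip(coefs[1:], p))
        total += (yi - y_pred) ** 2
    return total


# Hypothetical data: Y = 1 + 2*X1 + 3*X2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

simple_rows = [[u] for u in x1]
multi_rows = [[u, v] for u, v in zip(x1, x2)]

coef_simple = fit_least_squares(simple_rows, y)
coef_multi = fit_least_squares(multi_rows, y)

ss_simple = residual_ss(simple_rows, y, coef_simple)
ss_multi = residual_ss(multi_rows, y, coef_multi)

print("simple regression residual SS:  ", round(ss_simple, 3))
print("multiple regression residual SS:", round(ss_multi, 3))
```

Because Y here depends on both predictors, the equation that uses only X1 leaves residual variability that the two-predictor equation removes. With real data the improvement is usually smaller and depends on how much new information each added predictor carries.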

19. Explain in detail about Regression toward the mean.


 Regression toward the mean refers to a tendency for scores, particularly
extreme scores, to shrink toward the mean.
 This tendency often appears among subsets of observations whose values
are extreme and at least partly due to chance.
 Regression toward the mean appears among subsets of extreme
observations for a wide variety of distributions.
Table 2.13 – Regression towards mean

 Table 2.13 lists the top 10 hitters in the major leagues during 2014
and shows how they fared during 2015.
 Notice that 7 of the top 10 batting averages regressed downward,
toward the .260s, the approximate mean for all hitters during 2015.
 Hitters among the top 10 in 2014, who were not among the top 10 in
2015, were replaced by other mostly above-average hitters, who also
were very lucky during 2015.
 Observed regression toward the mean occurs for individuals or
subsets of individuals, not for entire groups.
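
Regression toward the mean can be reproduced with a small simulation. This sketch does not use the batting data from Table 2.13; it assumes, purely for illustration, that each observed score is a stable skill component plus an independent chance component, both drawn from standard normal distributions. The seed and group sizes are arbitrary.

```python
import random

rng = random.Random(2014)  # fixed seed so the run is repeatable

N_PLAYERS = 500
# Assumed model: observed score = stable skill + season-specific luck.
skill = [rng.gauss(0.0, 1.0) for _ in range(N_PLAYERS)]
season_1 = [s + rng.gauss(0.0, 1.0) for s in skill]
season_2 = [s + rng.gauss(0.0, 1.0) for s in skill]

# Select the 10 players with the highest season 1 scores.
top_10 = sorted(range(N_PLAYERS), key=lambda i: season_1[i], reverse=True)[:10]

top_mean_1 = sum(season_1[i] for i in top_10) / len(top_10)
top_mean_2 = sum(season_2[i] for i in top_10) / len(top_10)
overall_mean_2 = sum(season_2) / N_PLAYERS

print(f"top 10, season 1 mean:  {top_mean_1:.2f}")
print(f"same 10, season 2 mean: {top_mean_2:.2f}")
print(f"all players, season 2:  {overall_mean_2:.2f}")
```

The selected players stay above average in the second season because they are genuinely skilled, but their mean moves toward the overall mean because the luck that helped place them in the top 10 does not carry over.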

The Regression Fallacy


 The regression fallacy is committed whenever regression toward the
mean is interpreted as a real, rather than a chance, effect.
 A classic example of the regression fallacy occurred in an Israeli Air
Force study of pilot training. Some trainees were praised after very
good landings, while others were reprimanded after very bad landings.
 On their next landings, praised trainees did more poorly and
reprimanded trainees did better.
 It was concluded, therefore, that praise hinders but a reprimand helps
performance.
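 This conclusion is suspect. It is reasonable to assume that, in addition
to skill, chance plays a role in landings, so a very good landing is likely
to be followed by a less good one, and a very bad landing by a better one,
regardless of the feedback given.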

Avoiding the Regression Fallacy


 The regression fallacy can be avoided by splitting the subset of
extreme observations into two groups.
 One group of trainees would continue to be praised after very good
landings and reprimanded after very poor landings.
 A second group of trainees would receive no feedback whatsoever after
very good and very bad landings.
 In effect, the second group would serve as a control for regression
toward the mean, since any shift toward the mean on their second
landings would be due to chance.
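
The role of the no-feedback control group can be sketched with a simulation. This is an illustrative model and not data from the pilot training study: landing quality is assumed to be skill plus chance, no trainee receives any feedback, and "very good" and "very bad" are taken as the top and bottom 10 percent of first landings.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

N_TRAINEES = 1000
# Assumed model: landing quality = stable skill + chance on that landing.
skill = [rng.gauss(0.0, 1.0) for _ in range(N_TRAINEES)]
first = [s + rng.gauss(0.0, 1.0) for s in skill]
second = [s + rng.gauss(0.0, 1.0) for s in skill]  # no feedback is given

# Rank trainees by first landing; take the extreme 10 percent at each end.
order = sorted(range(N_TRAINEES), key=lambda i: first[i])
very_bad = order[:100]
very_good = order[-100:]


def mean_change(group):
    """Average change from first to second landing for a group."""
    return sum(second[i] - first[i] for i in group) / len(group)


change_after_good = mean_change(very_good)
change_after_bad = mean_change(very_bad)

print(f"mean change after very good landings: {change_after_good:+.2f}")
print(f"mean change after very bad landings:  {change_after_bad:+.2f}")
```

Even without praise or reprimand, the simulated trainees decline after very good landings and improve after very bad ones. Any effect claimed for feedback would have to exceed this chance-only shift seen in the control group.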

BOOK EXERCISES

[Exercise problems and their solutions appear as figures in the original and are not reproduced here.]