School of Computer Engineering: Kalinga Institute of Industrial Technology Deemed To Be University Bhubaneswar-751024
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
In data analysis, nothing is more important than understanding the data
that needs to be analyzed. Understanding the data starts with understanding
the purpose of the analysis, because the purpose saves time and
dictates how to go about analyzing the data.
Exploratory data analysis (EDA) refers to the critical process of performing initial
investigations on data to discover patterns, to spot anomalies, to test hypotheses
and to check assumptions with the help of summary statistics and graphical
representations. EDA can be classified as univariate, bivariate,
and multivariate analysis.
So, the goal is to present data in a form that makes sense to people. Tools that
are used to do this include:
Graphs: bar charts, pie charts, histograms, scatter charts, and time series
graphs.
Numerical summary measures: counts, percentages, averages, and
measures of variability
Tables of summary measures: totals, averages, and counts, grouped by
categories
Univariate data and its analysis
This type of data consists of only one variable. A variable is a characteristic that
can be measured and that can assume different values. Height, age, income,
province or country of birth, grades obtained at school and type of housing are all
examples of variables.
The analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes.
It does not deal with causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist within it.
An example of univariate data is height (in cm):
Height 164 168 170 169 173 175 180 175 176
Suppose that the heights of nine students in a class are recorded. There is only
one variable, height, and it does not involve any cause or relationship.
The description of patterns found in this type of data can be made by drawing
conclusions using central tendency measures (mean, median and mode),
dispersion or spread of data (range, minimum, maximum, quartiles, variance
and standard deviation) and by using frequency distribution tables, histograms,
pie charts, frequency polygon and bar charts.
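The measures listed above can be computed for the height data with a few lines of Python. This is a minimal sketch using only the standard library; it assumes the nine heights are treated as the whole group of interest (so the population variance and standard deviation are used), and it relies on the default quartile convention of `statistics.quantiles`, which may differ slightly from other textbooks or tools.

```python
import statistics
from collections import Counter

# Heights (in cm) from the example above
heights = [164, 168, 170, 169, 173, 175, 180, 175, 176]

# Central tendency
mean_height = statistics.mean(heights)
median_height = statistics.median(heights)
mode_height = statistics.mode(heights)

# Dispersion (population formulas: divide by n)
height_range = max(heights) - min(heights)
q1, q2, q3 = statistics.quantiles(heights, n=4)  # default "exclusive" method
variance = statistics.pvariance(heights)
std_dev = statistics.pstdev(heights)

# Frequency distribution table: value -> number of occurrences
frequency = Counter(heights)

print(f"mean={mean_height:.2f} median={median_height} mode={mode_height}")
print(f"min={min(heights)} max={max(heights)} range={height_range}")
print(f"quartiles: Q1={q1} Q2={q2} Q3={q3}")
print(f"variance={variance:.2f} std dev={std_dev:.2f}")
for value in sorted(frequency):
    print(value, frequency[value])
```

The same `frequency` table is what a histogram or bar chart of the heights would be drawn from.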
Bivariate data and its analysis
Population
A KIIT administrator wants to analyse the final exam scores of all
graduating students to see if there is a trend. Since the administrator is
interested in applying the findings to all graduating students at KIIT University,
the whole population dataset is used.
Sample
KIIT wants to study political attitudes in students. The KIIT population is
around 30,000 undergraduate students. Because it is not practical to
collect data from all of them, one may use a sample of 300
undergraduate volunteers from different schools who meet the inclusion
criteria. This is the group that is expected to take part in the survey.
Illustrate variables and observations
Types of data
There are only a few possibilities for describing a categorical variable, all based on
counting:
Count the number of categories.
List the distinct category names.
Count the number of observations in each category and the resulting counts can
be reported as raw counts or as percentages of totals. For example, if there are
1000 observations, one can report that there are 560 males and 440 females, or
one can report that 56% of the observations are males and 44% are females.
Once the counts are available, it can be displayed graphically, usually in a column
chart or a pie chart.
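The counting steps above can be sketched in Python with `collections.Counter`. The list below is built only to reproduce the 560/440 split from the example; the variable names are illustrative.

```python
from collections import Counter

# 1000 observations matching the example: 560 males and 440 females
observations = ["male"] * 560 + ["female"] * 440

counts = Counter(observations)             # raw count per category
categories = sorted(counts)                # distinct category names
total = sum(counts.values())               # number of observations
percentages = {name: 100 * counts[name] / total for name in categories}

print("number of categories:", len(categories))
print("category names:", categories)
for name in categories:
    print(f"{name}: {counts[name]} ({percentages[name]:.1f}%)")
```

The `counts` or `percentages` values are what would be passed to a column chart or pie chart.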
Summarize the categorical variables
Definition:
k= the kth percentile. It may or may not be part of the data.
i= the index (ranking or position of a data value)
n= the total number of data
To calculate percentile:
1. Order the data from smallest to largest.
2. Calculate i = (k / 100) * (n + 1)
3. If i is an integer, then the kth percentile is the data value in the ith position in
the ordered set of data. If i is not an integer, then round i up and round i down to
the nearest integers. Average the two data values in these two positions in the
ordered data set.
Listed are 29 ages of Academy Award-winning best actors, in order from
smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64;
67; 69; 71; 72; 73; 74; 76; 77
Find the 70th percentile:
i = (k / 100) * (n + 1) = (70 / 100) * (29 + 1) = 21
21 is an integer, and the data value in the 21st position in the ordered data set is
64. The 70th percentile is 64 years.
Find the 83rd percentile:
i = (k / 100) * (n + 1) = (83 / 100) * (29 + 1) = 24.9 which is not an integer.
Round it down to 24 and up to 25. The age in the 24th position is 71 and the age
in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.
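The three-step procedure above can be sketched as a small Python function. This is an illustrative implementation of the (n + 1) rule as stated here, not the method used by any particular library (NumPy and spreadsheets use other interpolation rules by default). It assumes k is a whole number and uses integer arithmetic so that the "is i an integer" test is exact; the function name `percentile` is hypothetical.

```python
def percentile(values, k):
    """Return the kth percentile using i = (k / 100) * (n + 1)."""
    ordered = sorted(values)                      # step 1: order the data
    n = len(ordered)
    # step 2: i = k * (n + 1) / 100, split into whole part and remainder
    position, remainder = divmod(k * (n + 1), 100)
    if position < 1 or position > n or (remainder and position == n):
        raise ValueError("percentile position falls outside the data")
    if remainder == 0:                            # step 3a: i is an integer
        return ordered[position - 1]
    # step 3b: average the values at positions floor(i) and ceil(i)
    return (ordered[position - 1] + ordered[position]) / 2


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile(ages, 70))  # 64, as in the worked example
print(percentile(ages, 83))  # 71.5, as in the worked example
```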
The sample standard deviation, denoted by s, is the square root of the sample
variance.
Mean absolute deviation (MAD): It is the average distance between each data point
and the mean.
The heights of five dogs (at the shoulders) are: 600 mm, 470 mm, 170 mm, 430 mm and
300 mm. Find the range, the mean, the variance, and the standard deviation.
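The range is 600 − 170 = 430 mm, and the mean is (600 + 470 + 170 + 430 + 300)/5 = 394 mm.
To calculate the variance, take each difference from the mean, square it, and then average the results:
$\sigma^2 = \dfrac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = 21704$
The standard deviation is just the square root of the variance, so:
$\sigma = \sqrt{21704} = 147.32\ldots \approx 147$ mm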
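The calculation for the five shoulder heights can be checked with a short Python sketch. It uses the population formulas (dividing by 5), as in the worked example, and also computes the mean absolute deviation defined earlier; the variable names are illustrative.

```python
import math

# Shoulder heights in mm from the example above
heights_mm = [600, 470, 170, 430, 300]
n = len(heights_mm)

height_range = max(heights_mm) - min(heights_mm)
mean = sum(heights_mm) / n

# Differences of each height from the mean
deviations = [h - mean for h in heights_mm]

# Population variance: average of the squared differences
variance = sum(d ** 2 for d in deviations) / n
std_dev = math.sqrt(variance)

# Mean absolute deviation: average distance from the mean
mad = sum(abs(d) for d in deviations) / n

print(f"range={height_range} mm, mean={mean} mm")
print(f"variance={variance}, std dev={std_dev:.2f} mm, MAD={mad} mm")

# Heights within one standard deviation of the mean
within_one_sd = [h for h in heights_mm if abs(h - mean) <= std_dev]
print("within one standard deviation:", within_one_sd)
```

For sample data, the variance would instead divide by n − 1, and the sample standard deviation s is the square root of that value.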
The standard deviation is useful because it gives a "standard" way of knowing what
is normal, and what is extra large or extra small. With it, one can show which
heights are within one standard deviation (147 mm) of the mean and which are not.
Rottweilers are tall dogs and Dachshunds are a bit short, right?
Q1. Consider your travel time in minutes from the hostel to the library: 15, 29, 8,
42, 35, 21, 18, 42, 26. Calculate:
Mean
Median
Mode
Standard deviation
Variance
Range
Quartiles
Sample size
Q2. The following dollar amounts were the hourly collections from a Salvation
Army kettle at a local store one day in December: $12, $12, $12, $12, $12,
$12, $12, $12, $12, $12, $12, and $12. Determine the Interquartile Range for
the amount collected.
Normal Distribution
Histogram
Plot Histogram
Skewed to the Right: Data that are skewed to the right have a long tail that
extends to the right. In this situation, the mean and the median are typically
both greater than the mode.
Skewed to the Left: Data that are skewed to the left have a long tail that
extends to the left. In this situation, the mean and the median are typically
both less than the mode.
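A quick way to see the right-skew pattern is to compare the three measures on a small data set with a long right tail. The values below are made up purely for illustration and do not come from the course material.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

skew_mean = statistics.mean(right_skewed)
skew_median = statistics.median(right_skewed)
skew_mode = statistics.mode(right_skewed)

print(f"mode={skew_mode} median={skew_median} mean={skew_mean}")
# The long right tail pulls the mean furthest, then the median
print("mean > median > mode:", skew_mean > skew_median > skew_mode)
```

Mirroring the data (for example, negating every value) gives a left-skewed set in which the ordering is reversed.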
Using the data from the example above (12 13 54 56 25), determine kurtosis.
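One way to check a hand calculation is the sketch below. It assumes the moment-based population definition, kurtosis = m4 / m2², where m2 and m4 are the average squared and fourth-power deviations from the mean; other definitions (sample-corrected, or "excess" kurtosis, which subtracts 3) give different numbers, so the result should be compared against whichever formula the course uses.

```python
# Data from the example above
data = [12, 13, 54, 56, 25]
n = len(data)
data_mean = sum(data) / n

# Second and fourth central moments (population form: divide by n)
m2 = sum((x - data_mean) ** 2 for x in data) / n
m4 = sum((x - data_mean) ** 4 for x in data) / n

kurtosis = m4 / m2 ** 2          # assumed definition, see note above
excess_kurtosis = kurtosis - 3   # relative to a normal distribution

print(f"mean={data_mean}, m2={m2}, m4={m4}")
print(f"kurtosis={kurtosis:.4f}, excess kurtosis={excess_kurtosis:.4f}")
```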
Most real data sets have gaps in the data. There are two issues: how to
detect these missing values and what to do about them.
Missing values occur when no data value is stored for the variable in an
observation. Missing data are a common occurrence and can have a significant
effect on the conclusions that can be drawn from the data.
What to do about them:
One option is to simply ignore them. In that case, you need to be aware of
how the software deals with missing values.
Another option is to fill in missing values with the average of the
non-missing values, but this isn’t usually a very good option.
A third option is to examine the non-missing values in the row of a
missing value; these values might provide clues on what the missing value
should be.
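The detection step and the first two options can be sketched in plain Python. The data below is hypothetical, and `None` is assumed to be the marker for a missing value; real tools use their own markers (for example blank cells or NaN).

```python
import statistics

# Hypothetical ages; None marks a missing value
ages = [23, None, 31, 27, None, 35]

# Detect: positions of the missing values
missing_positions = [i for i, value in enumerate(ages) if value is None]

# Option 1: ignore the missing values and analyse what remains
observed = [value for value in ages if value is not None]
mean_age = statistics.mean(observed)

# Option 2: fill each missing value with the mean of the observed values
ages_filled = [mean_age if value is None else value for value in ages]

print("missing at positions:", missing_positions)
print("mean of observed values:", mean_age)
print("after mean imputation:", ages_filled)
```

Mean imputation keeps the mean unchanged but shrinks the spread of the data, which is one reason it is usually not a very good option. The third option depends on the other variables in each row and is not shown here.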
The middle table indicates that only 6% of the nondrinkers are heavy
smokers, whereas 31% of the heavy drinkers are heavy smokers.
Similarly, the bottom table indicates that 43.1% of the nonsmokers are
nondrinkers, whereas only 11.3% of the heavy smokers are nondrinkers.
In short, these tables indicate that smoking and drinking habits tend to go
with one another.
These tendencies are reinforced by the column charts of the two percentage
tables.
Correlation is a technique that can show whether and how strongly pairs of
variables are related. For example, height and weight are related; taller people
tend to be heavier than shorter people. The relationship isn't perfect. People of
the same height vary in weight. Nonetheless, the average weight of people 5'5''
is less than the average weight of people 5'6'', and their average weight is less
than that of people 5'7'', etc. Correlation can tell you just how much of the
variation in people's weights is related to their heights. The main result of a
correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to
+1.0. The closer r is to +1 or -1, the more closely the two variables are related.
Sales 215 325 185 332 406 522 412 614 544 421 445 408
Let us call the two sets of data "x" and "y" (in our case Temperature is x and Ice
Cream Sales is y)
Step 1: Find the mean of x, and the mean of y
Step 2: Subtract the mean of x from every x value (call them "a"), and do the
same for y (call them "b")
Step 3: Calculate a × b, a² and b² for every value
Step 4: Sum up a × b, sum up a² and sum up b²
Step 5: Divide the sum of a × b by the square root of [(sum of a²) × (sum of
b²)]
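The five steps can be sketched as a small Python function. Because only the sales row of the ice cream example is available here, the demonstration uses the S&P 500 and ABC Corp. prices listed later in this section; the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient computed with the five steps above."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # Step 1: mean of x and mean of y
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Step 2: deviations from the means ("a" and "b")
    a = [xi - mean_x for xi in x]
    b = [yi - mean_y for yi in y]
    # Steps 3 and 4: sums of a*b, a squared and b squared
    sum_ab = sum(ai * bi for ai, bi in zip(a, b))
    sum_a2 = sum(ai ** 2 for ai in a)
    sum_b2 = sum(bi ** 2 for bi in b)
    # Step 5: divide by the square root of (sum of a^2) * (sum of b^2)
    return sum_ab / math.sqrt(sum_a2 * sum_b2)


sp500 = [1692, 1978, 1884, 2151, 2519]
abc_corp = [68, 102, 110, 112, 154]

r = pearson_r(sp500, abc_corp)
print(f"r = {r:.3f}")
```

A value of r close to +1 means the two price series tend to rise and fall together.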
$\rho(X,Y) = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X \, \sigma_Y}$
Where:
ρ(X,Y) – the correlation between the variables X and Y
Cov(X,Y) – the covariance between the variables X and Y
σX – the standard deviation of the X-variable
σY – the standard deviation of the Y-variable
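$\mathrm{Cov}(X,Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$ (for a sample, divide by $n - 1$ instead of $n$)
Where:
Xi – the values of the X-variable
Yi – the values of the Y-variable
X̄ – the mean (average) of the X-variable
Ȳ – the mean (average) of the Y-variable
n – the number of data points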
The prices of ABC Corp. and the S&P 500 are as follows. Find the covariance.
Year S&P 500 ABC Corp
2013 1692 68
2014 1978 102
2015 1884 110
2016 2151 112
2017 2519 154
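A sketch of the covariance calculation for this table is given below. Whether to divide by n (population) or n − 1 (sample) depends on how the five years are viewed, so the hypothetical helper `covariance` takes a `sample` flag and both results are printed.

```python
def covariance(x, y, sample=True):
    """Average product of paired deviations from the means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sample covariance divides by n - 1, population covariance by n
    return total / (n - 1 if sample else n)


# Prices from the table above, 2013 to 2017
sp500_prices = [1692, 1978, 1884, 2151, 2519]
abc_prices = [68, 102, 110, 112, 154]

sample_cov = covariance(sp500_prices, abc_prices, sample=True)
population_cov = covariance(sp500_prices, abc_prices, sample=False)
print(f"sample covariance = {sample_cov:.2f}")
print(f"population covariance = {population_cov:.2f}")
```

Either way the covariance is positive, which indicates that the two prices tend to move in the same direction.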
Compare the salaries of the male and female employees in the sample below by
illustrating them with box plots.