Unit I - Part I Notes
UNIT I
EXPLORATORY DATA ANALYSIS
EDA Fundamentals
Introduction
➢ Data encompasses a collection of discrete objects, numbers,
words, events, facts, measurements, observations, or even
descriptions of things.
➢ Such data is collected and stored by every event or process
occurring in several disciplines, including biology, economics,
engineering, marketing, and others.
➢ Processing such data elicits useful information and processing
such information generates useful knowledge.
➢ Exploratory Data Analysis enables generating meaningful and
useful information from such data.
➢ Exploratory Data Analysis (EDA) is a process of examining the
available dataset to discover patterns, spot anomalies, test
hypotheses, and check assumptions using statistical measures.
➢ Primary aim of EDA is to examine what data can tell us before
actually going through formal modeling or hypothesis formulation.
Understanding data science
➢ Data science involves cross-disciplinary knowledge from
computer science, data, statistics, and mathematics.
➢ There are several phases of data analysis, including
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product and communication
➢ These phases are similar to the CRoss-Industry Standard
Process for Data Mining (CRISP-DM) framework used in data mining.
1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data the
organization needs to collect, curate, and store.
• For example, an application tracking the sleeping patterns of
patients suffering from dementia requires several types of
sensor data, such as sleep data, the patient's heart rate,
electro-dermal activity, and user activity patterns.
• All of these data points are required to correctly diagnose the
mental state of the person.
• Hence, these are mandatory requirements for the application.
• It is also necessary to categorize the data as numerical or
categorical and to define the format of storage and dissemination.
2. Data collection
• Data collected from several sources must be stored in the
correct format and transferred to the right information
technology personnel within a company.
• Data can be collected from several objects during several
events using different types of sensors and storage tools.
3. Data processing
• Preprocessing involves the process of pre-curating
(selecting and organizing) the dataset before actual analysis.
• Common tasks involve correctly exporting the dataset,
placing them under the right tables, structuring them, and
exporting them in the correct format.
4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness
check, duplicates check, error check, and missing value
check.
• This stage involves responsibilities such as matching the
correct record, finding inaccuracies in the dataset,
understanding the overall data quality, removing duplicate
items, and filling in the missing values.
• Data cleaning is dependent on the types of data under study.
• Hence, it is essential for data scientists or EDA experts to
comprehend different types of datasets.
• An example of data cleaning is using outlier detection
methods for quantitative data cleaning, as sketched below.
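The following is a minimal sketch of one common outlier detection
approach, the interquartile range (IQR) rule; the sample values and the
1.5 × IQR threshold are illustrative assumptions, not part of the notes.

import numpy as np

# Illustrative quantitative data (assumed values); 250.0 is an implausible weight
weights = np.array([61.0, 58.5, 63.2, 60.1, 250.0, 59.8, 62.4])

# IQR rule: flag points lying far outside the middle 50% of the data
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = weights[(weights < lower) | (weights > upper)]
cleaned = weights[(weights >= lower) & (weights <= upper)]
print(outliers)   # the 250.0 entry is flagged
print(cleaned)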
5. EDA
• Exploratory data analysis is the stage where the message
contained in the data is actually understood.
• Several types of data transformation techniques might be
required during the process of exploration.
6. Modeling and algorithm
• Generalized models or mathematical formulas represent or
exhibit relationships among different variables, such as
correlation or causation.
• These models or equations involve one or more variables that
depend on other variables to cause an event.
• For example, when buying pens, the total price of the pens (Total)
= the price of one pen (UnitPrice) * the number of pens bought
(Quantity). Hence, our model would be Total = UnitPrice *
Quantity. Here, the total price depends on the unit price and the
quantity; hence, the total price is referred to as the dependent
variable, while the unit price and the quantity are referred to as
independent variables.
• In general, a model always describes the relationship
between independent and dependent variables.
• Inferential statistics deals with quantifying relationships
between particular variables.
• The Judd model describing the relationship between the data, the
model, and the error still holds true: Data = Model + Error (a short
sketch of this idea follows below).
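The following is a minimal sketch of the pen-price model and the
Data = Model + Error decomposition; the unit price and the observed
totals are illustrative assumptions.

import numpy as np

# Model: Total = UnitPrice * Quantity
unit_price = 2.5                      # assumed price of one pen
quantity = np.array([1, 2, 3, 4, 5])  # number of pens bought
model_total = unit_price * quantity   # what the model predicts

# Observed totals (assumed values, e.g. with rounding or small discounts)
observed_total = np.array([2.5, 5.0, 7.4, 10.1, 12.3])

# Data = Model + Error  =>  Error = Data - Model
error = observed_total - model_total
print(error)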
7. Data Product
• Any computer software that uses data as inputs, produces
outputs, and provides feedback based on the output to control
the environment is referred to as a data product.
• A data product is generally based on a model developed
during data analysis
• Example: a recommendation model that inputs user purchase
history and recommends a related item that the user is highly
likely to buy.
8. Communication
• This stage deals with disseminating the results to end
stakeholders to use the result for business intelligence.
• One of the most notable steps in this stage is data
visualization.
• Visualization deals with information relay techniques such as
tables, charts, summary diagrams, and bar charts to show
the analyzed result.
Steps in EDA
➢ The four main steps of exploratory data analysis are as follows:
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results
1. Problem Definition
• It is essential to define the business problem to be solved before
trying to extract useful insight from the data.
• The problem definition works as the driving force for the
execution of a data analysis plan.
• The main tasks involved in problem definition are
o defining the main objective of the analysis
o defining the main deliverables
o outlining the main roles and responsibilities
o obtaining the current status of the data
o defining the timetable, and
o performing cost/benefit analysis
• Based on the problem definition, an execution plan can be created.
2. Data Preparation
• This step involves methods for preparing the dataset before actual
analysis.
• This step involves
o defining the sources of data
o defining data schemas and tables
o understanding the main characteristics of the data
o cleaning the dataset
o deleting non-relevant datasets
o transforming the data
o dividing the data into required chunks for analysis
3. Data analysis
• This is one of the most crucial steps, which deals with
descriptive statistics and analysis of the data.
• The main tasks involve
o summarizing the data
o finding hidden correlations and relationships among the data
o developing predictive models
o evaluating the models
o calculating the accuracies
➢ Some of the techniques used for data summarization, sketched
briefly after this list, are
o summary tables
o graphs
o descriptive statistics
o inferential statistics
o correlation statistics
o searching
o grouping
o mathematical models
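A minimal sketch of a few of these summarization techniques using
pandas follows; the DataFrame contents are illustrative assumptions.

import pandas as pd

# Illustrative patient data (assumed values)
df = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'weight': [61.0, 82.5, 58.2, 77.9, 63.4],
    'height': [1.62, 1.80, 1.58, 1.75, 1.66],
})

# Summary table with descriptive statistics
print(df.describe())

# Grouping: mean weight per gender
print(df.groupby('gender')['weight'].mean())

# Correlation between the numerical columns
print(df[['weight', 'height']].corr())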
4. Development and representation of the results
• This step involves presenting the dataset to the target
audience in the form of graphs, summary tables, maps, and
diagrams.
• This is also an essential step as the result analyzed from the
dataset should be interpretable by the business
stakeholders, which is one of the major goals of EDA.
• Commonly used graphical analysis techniques include
o scatter plots
o character plots
o histograms
o box plots
o residual plots
o mean plots
Making Sense of Data
➢ It is crucial to identify the type of data under analysis.
➢ Different disciplines store different kinds of data for different
purposes.
➢ Example: medical researchers store patients' data, universities store
students' and teachers' data, and real estate industries store house
and building datasets.
➢ A dataset contains many observations about a particular object.
➢ For instance, a dataset about patients in a hospital can contain many
observations.
➢ A patient can be described by a
o patient identifier (ID)
o name
o address
o weight
o date of birth
o email
o gender
➢ Each of these features that describes a patient is a variable.
➢ Each observation can have a specific value for each of these
variables.
➢ For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = [email protected]
Weight = 10
Gender = Female
➢ These datasets are stored in hospitals and are presented for
analysis.
➢ Most of this data is stored in some sort of database management
system in tables/schema.
Table for storing patient information
➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation describes variables
(PatientID, name, address, dob, email, gender, and weight).
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and
categorical data.
Numerical data
➢ This data has a sense of measurement involved in it
➢ For example, a person's age, height, weight, blood pressure, heart
rate, temperature, number of teeth, number of bones, and the
number of family members.
➢ This data is often referred to as quantitative data in statistics.
➢ The numerical dataset can be either discrete or continuous types.
Discrete data
o Discrete data is countable, and its values can be listed out; a
variable representing discrete data is called a discrete variable.
o For example, the Country variable can take a fixed set of values
such as Nepal, India, Norway, and Japan.
o The Rank variable of a student in a classroom can take values from
1, 2, 3, 4, 5, and so on.
Continuous data
o Continuous data is not countable but can be measured; it can take
any value within a given range.
o Examples: a person's height or weight, and the temperature of a city.
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of
the movies.
➢ This data is often referred to as qualitative datasets in statistics.
➢ Other examples of categorical data include a person's blood group
(A, B, AB, or O) and the types of drugs prescribed to a patient.
Measurement scales
➢ There are four different types of measurement scales in statistics:
nominal, ordinal, interval, and ratio.
➢ These scales are used more in academic and research settings.
➢ Understanding the measurement scale of the data is required to
determine which statistical analyses and visualizations can be
applied to it.
Nominal
➢ These are used for labeling variables without any quantitative
value. The scales are generally referred to as labels.
➢ These scales are mutually exclusive and do not carry any
numerical importance.
➢ Examples: gender, the languages a person speaks, and biological
species.
Ordinal
➢ Ordinal scales order the values of a variable, but the differences
between the values are not meaningful.
➢ A Likert scale (for example, Strongly Agree, Agree, Neutral,
Disagree, Strongly Disagree) is a common example of an ordinal
scale.
Ratio
➢ Ratio scales contain order, exact values, and absolute zero.
➢ They are used in descriptive and inferential statistics.
➢ These scales provide numerous possibilities for statistical
analysis.
➢ Mathematical operations, the measure of central tendencies, and
the measure of dispersion and coefficient of variation can also be
computed from such scales.
➢ Examples: energy, mass, length, duration, electrical energy, plane
angle, and volume.
NumPy
➢ NumPy is a Python library.
➢ NumPy is short for "Numerical Python".
➢ NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra,
Fourier transforms, and matrices.
# Importing numpy
import numpy as np
# Defining 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
# Array of ones
ones = np.ones((3,4))
print(ones)
# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)
# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)
# Full array
fullArray = np.full((2,2),7)
print(fullArray)
Broadcasting
Broadcasting is a mechanism that permits NumPy to operate with
arrays of different shapes when performing arithmetic operations.
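A minimal sketch of broadcasting follows; the array shapes and values
are illustrative assumptions.

import numpy as np

# A (3, 4) matrix and a length-4 row vector
matrix = np.arange(12).reshape(3, 4)
row = np.array([10, 20, 30, 40])

# Broadcasting stretches `row` across every row of `matrix`
print(matrix + row)   # result has shape (3, 4)

# A scalar is broadcast to every element
print(matrix * 2)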
NumPy mathematics
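A minimal sketch of common NumPy mathematical functions follows; the
arrays are illustrative assumptions.

import numpy as np

a = np.array([1, 4, 9, 16])
b = np.array([1, 2, 3, 4])

print(np.add(a, b))        # element-wise addition
print(np.subtract(a, b))   # element-wise subtraction
print(np.sqrt(a))          # element-wise square root
print(a.sum(), a.mean())   # aggregate statistics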
Pandas
➢ Pandas can clean messy datasets and make them readable and
relevant.
➢ Pandas is also able to delete rows that are not relevant or that
contain wrong values, such as empty or NULL values. This is
called cleaning the data.
import pandas as pd
# Column names for the UCI Adult dataset (assumed; not shown in the notes)
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_per_week', 'native_country', 'income']
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 names=columns)
df.head(10)
# Selects the first 10 rows
df.iloc[0:10]
SciPy
➢ SciPy is a scientific computation library that
uses NumPy underneath.
➢ SciPy stands for Scientific Python.
➢ It provides more utility functions for optimization, stats and signal
processing.
➢ Like NumPy, SciPy is open source so we can use it freely.
➢ SciPy provides optimized and additional routines that are frequently
used alongside NumPy in data science.
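A minimal sketch of SciPy's descriptive statistics follows; the sample
values and the tested population mean are illustrative assumptions.

import numpy as np
from scipy import stats

# Illustrative heart-rate sample (assumed values)
heart_rate = np.array([72, 75, 71, 80, 68, 77, 74])

# Descriptive statistics: n, min/max, mean, variance, skewness, kurtosis
print(stats.describe(heart_rate))

# One-sample t-test against an assumed population mean of 72
print(stats.ttest_1samp(heart_rate, popmean=72))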
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves
as a visualization utility.
➢ It provides a huge library of customizable plots, along with a
comprehensive set of backends.
➢ It can be utilized to create professional reporting applications,
interactive analytical applications, complex dashboard
applications, web/GUI applications, embedded views, and many
more.
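A minimal sketch of a basic Matplotlib plot follows; the data values are
illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed values)
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y, label='sin(x)')            # line plot
plt.xlabel('x')
plt.ylabel('y')
plt.title('A simple Matplotlib line plot')
plt.legend()
plt.show()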