DATA MINING: Types of Data
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering
AGENDA
01 ISSUES RELATED TO DATA
02 TYPES OF DATA
03 DATA QUALITY
21-09-2020
LEARNING OUTCOMES
01 Define attributes, precision, bias and accuracy.
02 Explain the importance of knowing the nature of input data.
03 Discuss issues in measurement and data collection.
04 Classify attributes as binary, discrete or continuous.
ISSUES IN DATA
I Types of Data
II Quality of Data
III Preprocessing
IV Data Analysis
Hi,
I have attached the data file that I mentioned earlier. Each line contains
the information for a single patient and consists of five fields. We want
to predict the last field using the other fields. I do not have time to
provide any more information about the data since I'll be out of station,
but hopefully that won't slow you down.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyse the data. The first few
rows are as follows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
TYPES OF DATA
01 ATTRIBUTES & MEASUREMENTS: We address the issue of describing data using attributes.
02 TYPES OF DATA SETS: We describe some of the most common data types.
TYPES OF DATA
● A data set is a collection of data objects.
What is an Attribute?
123 Apurva 18
124 Shivangi 17
125 Shreya 18
What is a Measurement?
Measurement scale: decimal number up to 1 decimal place
Type of an Attribute
DISTINCTNESS: NOMINAL
ORDER: ORDINAL
ADDITION: INTERVAL
MULTIPLICATION: RATIO
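The four properties are cumulative: each attribute type supports its own property plus every property listed before it. A minimal Python sketch of that relationship follows. The names `PROPERTIES`, `ATTRIBUTE_TYPES` and `supported_properties` are illustrative and do not come from the slides.

```python
# Properties of numbers, in the order the slide lists them.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Attribute types in the same order; each type adds one property
# on top of those supported by the previous type.
ATTRIBUTE_TYPES = ["nominal", "ordinal", "interval", "ratio"]


def supported_properties(attribute_type):
    """Return the properties that are meaningful for an attribute type."""
    level = ATTRIBUTE_TYPES.index(attribute_type.lower())
    return PROPERTIES[: level + 1]


for name in ATTRIBUTE_TYPES:
    print(name, "->", supported_properties(name))
```

For example, an ordinal attribute supports distinctness and order, but taking differences or ratios of its values is not meaningful.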
INTERVAL
Differences between measurements are meaningful, but there is no true zero.
Examples: temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-850)
RATIO
Differences between measurements are meaningful, and a true zero exists.
Examples: enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), survival time
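The practical consequence of a missing true zero can be checked numerically. The sketch below compares the same two temperatures in three units; the sample values 10 and 20 degrees Celsius are chosen only for illustration.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


# Two illustrative temperatures, in degrees Celsius.
t1_c, t2_c = 10.0, 20.0

# Interval scales: the ratio depends on the unit, so a claim such as
# "twice as hot" has no physical meaning in Celsius or Fahrenheit.
ratio_c = t2_c / t1_c
ratio_f = celsius_to_fahrenheit(t2_c) / celsius_to_fahrenheit(t1_c)

# Ratio scale: Kelvin has a true zero, so this ratio is meaningful.
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(ratio_c, ratio_f, ratio_k)
```

The three ratios disagree (2.0 in Celsius, 1.36 in Fahrenheit, about 1.035 in Kelvin), which is why ratios are reserved for attributes with a true zero.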
ORDINAL
Transformation: any order-preserving change of values
Example: socio-economic status ("low income", "middle income", "high income") can equally well be represented as {1, 2, 3}
INTERVAL
Transformation: new_value = a*old_value + b, where a and b are constants
Example: to map the scale (0 – 800) to another interval, say (1000 – 1800): a = 1, b = 1000, so 400 maps to x = 400 + 1000 = 1400
Example: to map the scale (200 – 800) to (1000 – 1800): a = 800/600, b = 1000 - 200*(800/600) ≈ 733.3, so 400 maps to x = (800/600)*400 + 733.3 ≈ 1266.7
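The constants a and b follow from requiring that the endpoints of the old interval land on the endpoints of the new one. A minimal sketch, assuming the helper names `fit_linear_map` and `transform` (not from the slides):

```python
def fit_linear_map(old_lo, old_hi, new_lo, new_hi):
    """Return (a, b) so that new = a * old + b maps [old_lo, old_hi] onto [new_lo, new_hi]."""
    a = (new_hi - new_lo) / (old_hi - old_lo)  # ratio of the interval widths
    b = new_lo - a * old_lo                    # shift so old_lo lands on new_lo
    return a, b


def transform(value, a, b):
    return a * value + b


# Scale (0 - 800) mapped to (1000 - 1800), as in the first slide example.
a, b = fit_linear_map(0, 800, 1000, 1800)
print(a, b, transform(400, a, b))  # a = 1.0, b = 1000.0, and 400 maps to 1400.0
```

Whatever the source interval, a quick sanity check is that `transform(old_lo, a, b)` returns `new_lo` and `transform(old_hi, a, b)` returns `new_hi`.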
Represented as integers
Gender
Roll No
Name
Blood type
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important
Object: Student
Attribute: Records if student took a particular course at a university
Attribute value: 1/0
Student  1  2  3  4  5  6   (columns are courses)
A        1  0  0  1  1  0
B        1  1  0  0  1  0
C        0  1  1  0  1  0
D        1  0  0  1  0  1
E        1  1  0  1  0  0
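Because only the 1s carry information, an asymmetric binary table can be stored as the set of non-zero entries for each object. The sketch below does this for the table above, assuming that rows A to E are students and columns 1 to 6 are courses. The variable names are illustrative.

```python
# Assumption: rows are students, columns are course numbers 1..6.
courses = [1, 2, 3, 4, 5, 6]
took_course = {
    "A": [1, 0, 0, 1, 1, 0],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [0, 1, 1, 0, 1, 0],
    "D": [1, 0, 0, 1, 0, 1],
    "E": [1, 1, 0, 1, 0, 0],
}

# Keep only presence: the courses each student actually took.
courses_taken = {
    student: {course for course, flag in zip(courses, flags) if flag}
    for student, flags in took_course.items()
}

# Students are compared on shared presences, not on shared absences.
shared_a_d = courses_taken["A"] & courses_taken["D"]
print(courses_taken)
print(shared_a_d)
```

Students A and D share courses 1 and 4. The courses neither of them took say little about how similar they are, which is the point of treating the attributes as asymmetric.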
Data set
Record-based Data Set
Transaction or Market Basket Data
•Variation of record data
•Each record involves a set of items.
•Can be viewed as a set of records whose attributes are asymmetric
•Most often attributes are binary
•Can be discrete or continuous
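The view of transactions as records with asymmetric binary attributes can be sketched as follows. The item names are hypothetical and used only for illustration.

```python
# Hypothetical market basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
]

# One binary attribute per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))

# Record view: 1 if the item is in the transaction, otherwise 0.
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for row in matrix:
    print(row)
```

Storing a purchase count instead of 1/0 would give the discrete or continuous variant mentioned in the last bullet.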
Data with relationships among objects
Data with objects that are graphs
Data with relationships among objects
•Relationships among objects convey a lot of information
•Data represented as a graph
•Objects: nodes
•Relationships: links
Data with objects that are graphs
•Objects contain sub-objects
•Represented as graphs
•Substructure mining
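A minimal sketch of the "objects as nodes, relationships as links" representation, using an adjacency list. The page names are hypothetical, and treating links as undirected is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical relationships between objects, given as (source, target) links.
links = [
    ("page_a", "page_b"),
    ("page_a", "page_c"),
    ("page_b", "page_c"),
]

# Adjacency list: each node maps to the set of nodes it is linked to.
graph = defaultdict(set)
for source, target in links:
    graph[source].add(target)
    graph[target].add(source)  # assumption: links are undirected here

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```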
Ordered Data
Ordered Data Set
Sequential Data
•Also referred to as temporal data
•Each record has a time associated with it
•E.g.: sequential transaction data
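Sequential transaction data can be sketched as ordinary transactions with a time attached, ordered by that time for each customer. The customers, dates and items below are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (time, customer, items purchased).
records = [
    (date(2020, 9, 3), "C1", {"milk"}),
    (date(2020, 9, 1), "C1", {"bread", "milk"}),
    (date(2020, 9, 2), "C2", {"eggs"}),
]

# Build one time-ordered sequence of transactions per customer.
sequences = defaultdict(list)
for when, customer, items in sorted(records, key=lambda r: r[0]):
    sequences[customer].append((when, items))

for customer, sequence in sequences.items():
    print(customer, sequence)
```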
Sequence Data
Spatial Data
Positions or areas
E.g.: weather data
Spatial autocorrelation: objects that are physically close tend to be similar in other ways as well
Data Quality
Data mining is often applied to data that was collected for another
purpose, or for future but unspecified applications.
Data mining therefore uses algorithms that can tolerate poor data quality.
Outliers
Missing Values
•Eliminate data objects or attributes
•Estimate missing values
•Ignore the missing value during analysis
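The three strategies can be sketched on a small hypothetical table in which `None` marks a missing value. Using the mean of the observed values as the estimate is an assumption made for this example. The slides do not prescribe a particular estimator.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 18, "score": 7.5},
    {"age": None, "score": 6.0},
    {"age": 17, "score": None},
    {"age": 19, "score": 9.0},
]


def column_mean(rows, key):
    """Mean of the observed (non-missing) values of one attribute."""
    observed = [r[key] for r in rows if r[key] is not None]
    return sum(observed) / len(observed)


# 1. Eliminate data objects that have any missing value.
complete_rows = [r for r in rows if None not in r.values()]

# 2. Estimate missing values (assumption: use the attribute mean).
filled_rows = [
    {k: (column_mean(rows, k) if v is None else v) for k, v in r.items()}
    for r in rows
]

# 3. Ignore the missing value during analysis: compute the statistic
#    from the observed values only and leave the data unchanged.
mean_score = column_mean(rows, "score")

print(len(complete_rows), filled_rows, mean_score)
```

Elimination keeps only two of the four objects here, which shows how quickly it can discard data when missing values are spread across many records.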
Inconsistent Values
Duplicate Values
A data set might include data objects that are duplicates of each other.
If two objects actually represent a single object, the values of
corresponding attributes may differ, and these inconsistent values
must be resolved.
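A minimal sketch of detecting duplicates whose attribute values disagree. It assumes that a key attribute (here a hypothetical `record_id`) identifies records that represent the same object. All values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records; two of them share the same record_id.
records = [
    {"record_id": 1, "city": "Panaji", "age": 18},
    {"record_id": 2, "city": "Margao", "age": 17},
    {"record_id": 1, "city": "Panaji", "age": 19},
]

# Group records that claim to describe the same object.
by_id = defaultdict(list)
for record in records:
    by_id[record["record_id"]].append(record)

# For each duplicated object, list the attributes whose values differ.
conflicts = {}
for record_id, group in by_id.items():
    if len(group) > 1:
        conflicts[record_id] = sorted(
            attr for attr in group[0] if len({r[attr] for r in group}) > 1
        )

print(conflicts)
```

The output flags `age` for record 1. Deciding which value to keep is the resolution step and depends on the application.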
Issues Related to Applications
Timeliness
•Some data starts to age as soon as it is collected
•If the data is out of date, so are the models and patterns based on it.
THANKS