CH2 Data

Uploaded by

Hunzila Nisar
Copyright
© © All Rights Reserved
ARIN2137

KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 2 :

Outline
1. Definition of Dataset, Attribute
2. Types of Attribute
3. Types of Dataset
4. Issues related to Dataset

DATA SOURCE

Identify data source → Perform data pre-processing → Perform data mining → Pattern evaluation → Applying knowledge

Where does data come from? Primary and secondary sources:
• Transactional database ~ IMTIAZ, Hospital
• Data warehouse
• Survey and observational (behavioral, attitudes, opinions)
• Experimental (scientific data, e.g. DNA)
• Server log
• Published sources of data (eg: ABI/INFORM data for business data, UCI Machine Learning database for KDD)

Data can be in fixed row & column format, image, audio, video!

Why we need data: to provide the necessary input, measure performance, assist in formulating solutions and satisfy our curiosity.

DATA TYPE (GENERAL)
• Structured
– Well-defined fields; most business datasets
• Semi-structured
– Electronic images of business documents, medical reports, executive summaries, repair manuals
• Unstructured
– Video recorded by a surveillance camera

Tools for these data?

dataset

• Data set: collection of data records

• Other names for a record: data object, point, event, case, sample, observation, entity

• Described by “attributes” ~ capture the basic characteristics of the data

dataset
• A data set is a file consisting of records (or objects, patterns, cases, samples) in rows and attributes (or fields, dimensions, variables) in columns.

File Name: Student.xls

Roll No.   Name           Year   CGPA
64752      Anas Kareem    2      3.6
67984      Hajra Shahid   3      3.4
74571      …              ..     ..

Each row is a record; each column is an attribute.

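The record/attribute layout above can be sketched in a few lines of Python. This is only an illustration, not how Student.xls is actually stored: the variable names are assumptions, and only the two complete rows from the slide are used because the third row is elided there.

```python
# Each record is one row; each attribute is one column.
# Only the two complete rows from the slide are included.
attributes = ["Roll No.", "Name", "Year", "CGPA"]
records = [
    (64752, "Anas Kareem", 2, 3.6),
    (67984, "Hajra Shahid", 3, 3.4),
]

# Turn each record into an attribute -> value mapping.
dataset = [dict(zip(attributes, row)) for row in records]

# Reading one record (a row).
first_record = dataset[0]

# Reading one attribute (a column) across all records.
cgpa_column = [record["CGPA"] for record in dataset]

print(first_record["Name"])
print(cgpa_column)
```

Indexing by position picks a record, while indexing by attribute name picks a characteristic of that record.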
DATAset : attribute
• Is a property or characteristic of a record that may vary, either from one object to another or from one time to another.

What attributes can describe this aeroplane? Grasshopper?

DATAset : attribute & record

[Figure: a data set with its records and attributes labelled]

DATAset : types of attribute

Discrete [typically Nominal and Ordinal]
• Has only a finite or countably infinite set of values
• Often represented as integer variables
• Binary attributes are a special case of discrete attributes
• Eg: zip codes, counts, or the set of words in a collection of documents

Continuous [typically Interval and Ratio]
• Has real numbers as attribute values
• Practically, real values can only be measured and represented using a finite number of digits
• Typically represented as floating-point variables
• Eg: temperature, height, weight

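As a rough illustration of the discrete/continuous split, the sketch below guesses the kind of a column from how its values are represented. The rule (floats are continuous, everything else is discrete) is a simplifying assumption that mirrors the "integer vs floating-point" remark above, and the column names and values are made up.

```python
def attribute_kind(values):
    """Guess whether a column is discrete or continuous from its Python types.

    Heuristic assumption: a column holding floats is continuous;
    bools, ints and strings are treated as discrete.
    """
    if any(isinstance(v, float) for v in values):
        return "continuous"
    return "discrete"


# Hypothetical columns, chosen to match the examples on the slide.
columns = {
    "zip_code": ["54000", "75500", "54000"],  # labels, not quantities
    "visit_count": [3, 0, 7],                 # counts
    "is_member": [True, False, True],         # binary, a special case of discrete
    "height_cm": [172.5, 160.2, 181.0],       # measured real values
}

kinds = {name: attribute_kind(values) for name, values in columns.items()}
print(kinds)
```

A representation-based guess like this can be wrong (a whole-number height stored as an integer would be labelled discrete), so in practice the attribute type is decided from what the attribute means.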
DATAset : types of attribute

[Figure: types of attribute; one group is labelled NUMERIC]

types of dataset

• Record data
• Graph-based data
• Ordered data

Record data
• Collection of records, each of which consists of a fixed set of attributes
• Stored in a flat file or relational database
• Types: Market-Basket Data (Transaction data), Data Matrix, Sparse Data Matrix
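
To make the three record-data types concrete, here is a minimal sketch that starts from market-basket transactions, builds a data matrix with one column per item, and then keeps only the non-zero cells as a sparse form. The transactions and item names are invented for illustration.

```python
# Hypothetical market-basket transactions: each record is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "eggs", "tea"},
    {"milk", "tea"},
]

# Fix the attribute set: one column per distinct item, in sorted order.
items = sorted(set().union(*transactions))

# Data matrix: 1 if the item appears in the transaction, else 0.
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

# Sparse form: store only the non-zero cells as (row, column) pairs.
sparse = [
    (r, c)
    for r, row in enumerate(matrix)
    for c, value in enumerate(row)
    if value
]

print(items)
print(matrix)
print(sparse)
```

With many items and small baskets most cells are zero, which is why the sparse form is usually preferred for transaction data.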

Graph-based data
• Data is represented in the form of a graph ~ e.g. relationships between objects, links between web pages
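
One common way to hold such data is an adjacency list. The sketch below assumes a tiny, made-up web site where each page maps to the pages it links to, and counts outgoing and incoming links.

```python
# Hypothetical pages and hyperlinks: page -> pages it links to.
links = {
    "home.html": ["courses.html", "contact.html"],
    "courses.html": ["home.html"],
    "contact.html": [],
}

# Out-degree: how many links leave each page.
out_degree = {page: len(targets) for page, targets in links.items()}

# In-degree: how many pages link to each page.
in_degree = {page: 0 for page in links}
for targets in links.values():
    for target in targets:
        in_degree[target] += 1

print(out_degree)
print(in_degree)
```

Here the relationships between objects carry the information, not a fixed set of attributes per record.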

Ordered data
• The attributes have relationships that involve order in time or space.
• Examples:
– Sequential data/temporal data: has a time associated with it.
– Sequence data: consists of a data set that is a sequence of individual entities (e.g. a sequence of words or letters). No time stamps.
– Time series data: a special type of sequential data (each record is a time series, i.e. a series of measurements taken over time).
– Spatial data: such as positions or areas.
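
The difference between sequence data and time series data can be sketched as follows. All values are invented; the point is only that ordered data lets neighbouring elements be compared.

```python
from datetime import date

# Sequence data: order matters, but there are no time stamps.
sequence = ["data", "mining", "finds", "patterns"]

# Time series: measurements of one quantity taken over time (made-up values).
temperature_series = [
    (date(2024, 1, 1), 18.5),
    (date(2024, 1, 2), 19.0),
    (date(2024, 1, 3), 17.2),
]

# Because the elements are ordered, neighbours can be paired up.
bigrams = list(zip(sequence, sequence[1:]))

# Change between consecutive measurements, rounded to one decimal place.
daily_change = [
    round(later[1] - earlier[1], 1)
    for earlier, later in zip(temperature_series, temperature_series[1:])
]

print(bigrams)
print(daily_change)
```

Shuffling either list would destroy the information these two computations rely on, which is what separates ordered data from plain record data.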

Data in reality!
• There is a lot of data. However, it is far from perfect!

• Data in the real world is dirty and of low quality:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   • e.g., occupation=“”
– noisy: containing errors or outliers
   • e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
   • e.g., Age=“42” Birthday=“03/07/1997”

To ‘clean’ dirty data, we perform data preprocessing:

Input data → Data Preprocessing → Data Mining → Post Processing → Information
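
The three kinds of dirty data can be detected with simple checks, sketched below. The three flagged values (empty occupation, salary of -10, age 42 with a 1997 birthday) come from the slide; every other value, the field names, and the reference date are assumptions made for the example. The slide's date 03/07/1997 is read here as 3 July 1997; reading it as 7 March gives the same verdict.

```python
from datetime import date

# Assumed "today" used to check age against birthday.
REFERENCE_DATE = date(2024, 1, 1)

# One record per problem; only the flagged values come from the slide.
records = [
    {"occupation": "", "salary": 3000, "age": 30, "birthday": date(1993, 6, 15)},
    {"occupation": "nurse", "salary": -10, "age": 41, "birthday": date(1982, 2, 1)},
    {"occupation": "clerk", "salary": 2500, "age": 42, "birthday": date(1997, 7, 3)},
]


def age_on(birthday, today):
    """Return the number of full years between birthday and today."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday


def problems(record):
    """List the data-quality problems found in a single record."""
    found = []
    if not record["occupation"]:
        found.append("incomplete")    # missing attribute value
    if record["salary"] < 0:
        found.append("noisy")         # impossible value
    if age_on(record["birthday"], REFERENCE_DATE) != record["age"]:
        found.append("inconsistent")  # age disagrees with birthday
    return found


report = [problems(r) for r in records]
print(report)
```

Detection is only the first half of preprocessing; deciding whether to fill, correct, or drop a flagged record depends on the task.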
UCI Machine Learning Repository

http://archive.ics.uci.edu/ml/datasets.html