Chapter 2-Getting To Know Your Data

This document discusses getting to know your data by understanding the types of attributes, values, and distributions in your dataset. It covers identifying discrete and continuous attributes, visualizing the data to understand patterns and outliers, and measuring similarity between data objects. Basic statistical descriptions like mean, median, mode, range and standard deviation are also important for understanding properties of the data. Different techniques for visualizing data are then presented, including pixel-oriented, geometric projection, icon-based, and hierarchical techniques.

Uploaded by

Yrga Weldegiwergs

Chapter 2

Getting to Know Your Data

 Knowledge about your data is useful for data preprocessing
 You will want to know the following:
 What are the types of attributes or fields that make up your
data?
 What kind of values does each attribute have?

 Which attributes are discrete, and which are continuous-valued?
 What do the data look like?
 How are the values distributed?

 Are there ways we can visualize the data to get a better sense
of it all?
 Can we spot any outliers?
 Can we measure the similarity of some data objects with
respect to others?
2.1 Data Objects and Attribute Types
 Data sets are made up of data objects.
 A data object represents an entity
 In a sales database
 The objects may be customers, store items, and sales;
 In a medical database
 The objects may be patients;

 In a university database
 The objects may be students, professors, and courses.

 Data objects are typically described by attributes.

 Data objects can also be referred to as samples, examples,
instances, data points, or objects.
2.1.1 What Is an Attribute?
 An attribute is a data field, representing a characteristic or
feature of a data object.
 The nouns attribute, dimension, feature, and variable are often
used interchangeably in the literature.
 Observed values for a given attribute are known as observations.
 A set of attributes used to describe a given object is called an
attribute vector (or feature vector).
 The distribution of data involving one attribute (or variable) is
called univariate.
 A bivariate distribution involves two attributes, and so on.

 Nominal Attributes
 Nominal means “relating to names.”
 The values of a nominal attribute are symbols or names of things.
 Each value represents some kind of category, code, or state, and
so nominal attributes are also referred to as categorical.
 The values do not have any meaningful order.
 Suppose that hair_color and marital_status are two attributes
describing person objects.
 In our application, possible values for hair_color are black,
brown, blond, red, auburn, gray, and white.
 The attribute marital_status can take on the values single,
married, divorced, and widowed.
 Binary Attributes
 A binary attribute is a nominal attribute with only two categories
or states: 0 or 1, where 0 typically means that the attribute is
absent, and 1 means that it is present.
 Binary attributes are referred to as Boolean if the two states
correspond to true and false.
 Ordinal Attributes
 An ordinal attribute is an attribute with possible values that
have a meaningful order or ranking among them, but the
magnitude between successive values is not known.
 Size: small, medium, and large
 Grade: A+, A, A-, B+
 Customer satisfaction: very dissatisfied, somewhat dissatisfied,
neutral, satisfied, and very satisfied.
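
The difference between nominal, binary, and ordinal attributes can be sketched in a few lines of Python. This is only an illustration: the `person` record, the `smoker` attribute, and the helper names `rank` and `is_valid_binary` are assumptions made for the example, while the hair_color and size values come from the slides above.

```python
# Nominal attribute: a set of labels with no order (values from the hair_color example).
HAIR_COLORS = {"black", "brown", "blond", "red", "auburn", "gray", "white"}

# Ordinal attribute: the list position encodes the rank (values from the size example).
SIZE_LEVELS = ["small", "medium", "large"]


def rank(value, levels):
    """Return the 0-based rank of an ordinal value within its ordered levels."""
    return levels.index(value)


def is_valid_binary(value):
    """A binary attribute takes only the states 0 (absent) or 1 (present)."""
    return value in (0, 1)


# Hypothetical person object; "smoker" is an assumed binary attribute.
person = {"hair_color": "brown", "smoker": 0, "size": "medium"}

# Nominal: only membership and equality tests are meaningful.
print(person["hair_color"] in HAIR_COLORS)
# Binary: the value must be one of the two states.
print(is_valid_binary(person["smoker"]))
# Ordinal: ranks can be compared, but their differences carry no magnitude.
print(rank("small", SIZE_LEVELS) < rank(person["size"], SIZE_LEVELS))
```

The ranks support comparisons such as "small comes before large", but subtracting them would suggest a magnitude that an ordinal attribute does not have.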

 Numeric Attributes
 A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values.
 Numeric attributes can be interval-scaled or ratio-scaled.

 Discrete versus Continuous Attributes

 A discrete attribute has a finite or countably infinite set of values,
which may or may not be represented as integers.
 An attribute is countably infinite if the set of possible values is
infinite but the values can be put in a one-to-one correspondence
with natural numbers.
 If an attribute is not discrete, it is continuous.
2.2 Basic Statistical Descriptions of Data
 Basic statistical descriptions can be used to identify properties of
the data and highlight which data values should be treated as
noise or outliers.
 Three areas of basic statistical descriptions
 Measures of central tendency (Mean, Median, Mode)
 The dispersion of the data (Range, Variance, Standard Deviation)
 Graphic displays to visually inspect our data (bar charts, pie
charts, and line graphs)
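
As a minimal sketch, the measures of central tendency and dispersion listed above can be computed with Python's standard `statistics` module. The `sample` values and the `describe` helper are assumptions made for illustration, and the population forms of variance and standard deviation are an arbitrary choice (the sample forms are `statistics.variance` and `statistics.stdev`).

```python
import statistics


def describe(values):
    """Summarize one numeric attribute with central-tendency and dispersion measures."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),      # every most-frequent value
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),   # population variance
        "std_dev": statistics.pstdev(values),       # population standard deviation
    }


# Illustrative values only; they are not taken from the slides.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = describe(sample)
for name, value in summary.items():
    print(name, value)
```

In this sample the value 110 lies far from the rest, which pulls the mean above the median; gaps of this kind are one hint that a value may be an outlier.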
2.3 Data Visualization
 How can we convey data to users effectively?
 Data visualization aims to communicate data clearly and
effectively through graphical representation.
 Data visualization has been used extensively in many
applications, for example, at work for
 reporting,
 managing business operations, and
 tracking progress of tasks.

 More popularly, we can take advantage of visualization
techniques to discover data relationships that are otherwise
not easily observable by looking at the raw data.
 We start with multidimensional data such as those stored
in relational databases.
 There are several representative approaches, including
 pixel-oriented techniques,
 geometric projection techniques,
 icon-based techniques, and
 hierarchical and graph-based techniques.
2.3.1 Pixel-Oriented Visualization Techniques

 A simple way to visualize the value of a dimension is to use a
pixel where the color of the pixel reflects the dimension’s value.
 For a data set of m dimensions, pixel-oriented techniques
create m windows on the screen, one for each dimension.
 The m dimension values of a record are mapped to m pixels at
the corresponding positions in the windows.
 AllElectronics maintains a customer information table, which
consists of four dimensions: income, credit_limit,
transaction_volume, and age.
 Can we analyze the correlation between income and the other
attributes by visualization?

 The pixel colors are chosen so that the smaller the value, the lighter the shading.
 Using pixel-based visualization, we can easily observe the following: credit limit
increases as income increases; customers whose income is in the middle range
are more likely to purchase more from AllElectronics; but there is no clear
correlation between income and age.
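
The pixel mapping described above can be sketched as follows. This is a simplified illustration, not the layout used in the original figure: each window is reduced to a one-dimensional strip of gray levels, records are ordered by one chosen dimension, and the `pixel_windows` helper and the three sample records are assumptions.

```python
def pixel_windows(records, dimensions, sort_by):
    """Build one strip of gray levels per dimension, with records ordered by `sort_by`."""
    ordered = sorted(records, key=lambda r: r[sort_by])
    windows = {}
    for dim in dimensions:
        column = [r[dim] for r in ordered]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        # 255 is white (lightest) for the smallest value, 0 is black for the largest.
        windows[dim] = [round(255 * (1 - (v - lo) / span)) for v in column]
    return windows


# Hypothetical customer records with two of the four dimensions.
records = [
    {"income": 10, "age": 50},
    {"income": 30, "age": 20},
    {"income": 20, "age": 35},
]
windows = pixel_windows(records, ["income", "age"], sort_by="income")
print(windows["income"])
print(windows["age"])
```

Because every window uses the same record order, the pixel at a given position always belongs to the same customer, so shading patterns can be compared across windows.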
2.3.2 Geometric Projection Visualization Techniques
 A drawback of pixel-oriented visualization techniques:
 They cannot help us much in understanding the data distribution in a
multidimensional space (for example, whether a region is dense or sparse).
 The central challenge that geometric projection techniques
try to address is how to visualize a high-dimensional space
on a 2-D display.
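
One simple way to put a high-dimensional data set on a 2-D display is to look at every pair of dimensions separately, as a scatter-plot matrix does. The slides do not name a specific technique here, so the sketch below is an assumption chosen for illustration, and the `axis_projections` helper and sample points are hypothetical.

```python
from itertools import combinations


def axis_projections(points, names):
    """Return every 2-D axis-aligned projection of n-dimensional points."""
    projections = {}
    for i, j in combinations(range(len(names)), 2):
        # Keep only coordinates i and j of each point.
        projections[(names[i], names[j])] = [(p[i], p[j]) for p in points]
    return projections


# Hypothetical 3-D points; each tuple is one data object.
names = ["income", "credit_limit", "age"]
points = [(40, 5, 30), (60, 8, 45), (80, 12, 28)]
panels = axis_projections(points, names)
for pair, coords in panels.items():
    print(pair, coords)
```

For n dimensions this produces n(n-1)/2 panels, so the approach becomes crowded as the dimensionality grows.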
2.3.3 Icon-Based Visualization Techniques
 Use small icons to represent multidimensional data values.
 We look at two popular icon-based techniques: Chernoff faces and stick
figures.
1) Chernoff faces
 They display multidimensional data of up to 18 variables (or dimensions) as
a cartoon human face.
 They help reveal trends in the data.
 Components of the face, such as the eyes, ears, mouth, and nose, represent
values of the dimensions by their shape, size, placement, and orientation.
 For example, dimensions can be mapped to the following facial
characteristics: eye size, eye spacing, nose length, nose width, mouth
curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye
eccentricity, and head eccentricity.
 Make use of the ability of the human mind to recognize small differences in
facial characteristics and to assimilate many facial characteristics at once.
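
The mapping from dimensions to facial characteristics can be sketched as a lookup that scales each value to the range [0, 1]. This is only an illustration of the idea: the assignment order, the `to_face` helper, the `bounds` argument, and the sample record are assumptions, and no drawing is performed.

```python
# Facial characteristics listed in the slides, used in this order by assumption.
FACIAL_FEATURES = [
    "eye_size", "eye_spacing", "nose_length", "nose_width",
    "mouth_curvature", "mouth_width", "mouth_openness", "pupil_size",
    "eyebrow_slant", "eye_eccentricity", "head_eccentricity",
]


def to_face(record, bounds):
    """Map each dimension of a record to one facial feature, scaled to [0, 1]."""
    if len(record) > len(FACIAL_FEATURES):
        raise ValueError("more dimensions than facial features in this sketch")
    face = {}
    for feature, (name, value) in zip(FACIAL_FEATURES, record.items()):
        lo, hi = bounds[name]
        # A constant dimension gets a neutral mid-range setting.
        face[feature] = (value - lo) / (hi - lo) if hi > lo else 0.5
    return face


# Hypothetical record and value ranges.
face = to_face({"income": 50, "age": 30}, {"income": (0, 100), "age": (20, 40)})
print(face)
```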
2) Stick figures
 The stick figure visualization technique maps
multidimensional data to five-piece stick figures, where
each figure has four limbs and a body.
 Two dimensions are mapped to the display (x and y) axes
and the remaining dimensions are mapped to the angle
and/or length of the limbs.
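
A rough sketch of the stick-figure mapping follows. It covers only limb angles (not lengths), scales each limb dimension linearly to an angle between 0 and pi radians, and does not draw anything; the `stick_figure` helper, the `bounds` argument, and the sample record are assumptions made for illustration.

```python
import math


def stick_figure(record, xy_dims, limb_dims, bounds):
    """Place a figure at (x, y) and turn up to four limb dimensions into angles."""
    x, y = record[xy_dims[0]], record[xy_dims[1]]
    angles = []
    for dim in limb_dims[:4]:  # a figure has four limbs
        lo, hi = bounds[dim]
        frac = (record[dim] - lo) / (hi - lo) if hi > lo else 0.5
        angles.append(frac * math.pi)  # angle in radians, from 0 to pi
    return {"x": x, "y": y, "limb_angles": angles}


# Hypothetical record: two dimensions give the position, two drive limb angles.
record = {"age": 35, "income": 60, "credit_limit": 5, "transaction_volume": 10}
bounds = {"credit_limit": (0, 10), "transaction_volume": (0, 10)}
figure = stick_figure(record, ("age", "income"),
                      ["credit_limit", "transaction_volume"], bounds)
print(figure)
```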
2.3.4 Hierarchical Visualization Techniques
 The visualization techniques discussed so far focus on
visualizing multiple dimensions simultaneously.
 However, for a large data set of high dimensionality, it would
be difficult to visualize all dimensions at the same time.
 Hierarchical visualization techniques partition all dimensions
into subsets (i.e., subspaces).
 The subspaces are visualized in a hierarchical manner.

1) Worlds-within-Worlds
 Suppose we want to visualize a 6-D data set, where the
dimensions are F, X1, …, X5. We want to observe how
dimension F changes with respect to the other dimensions.
 We can first fix the values of dimensions X3, X4, X5 to some
selected values, say, c3, c4, c5.
 We can then visualize F, X1, X2 using a 3-D plot, called a
world.
 The position of the origin of the inner world is located at the
point (c3, c4, c5) in the outer world, which is another 3-D plot
using dimensions X3, X4, X5.
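
The idea of fixing three dimensions and sampling the remaining ones can be sketched as below. The function `example_f` is a made-up stand-in for the unknown relationship between F and X1, …, X5, and the `inner_world` helper and grid values are likewise assumptions; the sketch produces the points of the inner world rather than a plot.

```python
def inner_world(f, fixed, x1_values, x2_values):
    """Sample F over (X1, X2) while X3, X4, X5 stay fixed at c3, c4, c5."""
    c3, c4, c5 = fixed
    surface = [
        (x1, x2, f(x1, x2, c3, c4, c5))
        for x1 in x1_values
        for x2 in x2_values
    ]
    # The inner world's origin sits at (c3, c4, c5) in the outer world.
    return {"origin_in_outer_world": fixed, "surface": surface}


def example_f(x1, x2, x3, x4, x5):
    """Hypothetical dependency of F on the other five dimensions."""
    return x1 + 2 * x2 + x3 * x4 - x5


world = inner_world(example_f, (1, 2, 3), [0, 1], [0, 1])
print(world["origin_in_outer_world"])
print(world["surface"])
```

Choosing a different point (c3, c4, c5) in the outer world yields a different inner world, which is how the remaining dimensions are explored.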

“Worlds-within-Worlds” (also known as n-Vision).

2) Tree-maps
 Display hierarchical data as a set of nested rectangles.
 The figure below shows a tree-map visualizing Google news
stories.
 All news stories are organized into seven categories, each
shown in a large rectangle of a unique color.
 Within each category (i.e., each rectangle at the top level), the
news stories are further partitioned into smaller subcategories.
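
Nested rectangles of this kind can be computed with a simple "slice-and-dice" layout, which splits a rectangle among its children in proportion to their sizes and alternates the split direction at each level. The slides do not say which layout the Google news tree-map uses, so this is an assumed technique for illustration; the `slice_and_dice` helper and the small category tree are hypothetical, and child sizes are assumed to be positive.

```python
def slice_and_dice(node, x, y, w, h, depth=0):
    """Lay out a (label, size, children) tree as nested rectangles."""
    label, size, children = node
    rects = [(label, x, y, w, h)]
    total = sum(child[1] for child in children)
    offset = 0.0
    for child in children:
        share = child[1] / total  # fraction of the parent area for this child
        if depth % 2 == 0:
            # Even depth: place children side by side along the width.
            rects += slice_and_dice(child, x + offset * w, y, w * share, h, depth + 1)
        else:
            # Odd depth: stack children along the height.
            rects += slice_and_dice(child, x, y + offset * h, w, h * share, depth + 1)
        offset += share
    return rects


# Hypothetical hierarchy: one root with two categories.
tree = ("news", 4, [("world", 1, []), ("sports", 3, [])])
rects = slice_and_dice(tree, 0, 0, 100, 50)
for rect in rects:
    print(rect)
```

Each tuple is (label, x, y, width, height); a category with three times the size of another receives three times the area.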

Use of tree-maps to visualize Google news headline stories
