
Chapter 2-Getting To Know Your Data

This document discusses getting to know your data by understanding the types of attributes, values, and distributions in your dataset. It covers identifying discrete and continuous attributes, visualizing the data to understand patterns and outliers, and measuring similarity between data objects. Basic statistical descriptions like mean, median, mode, range and standard deviation are also important for understanding properties of the data. Different techniques for visualizing data are then presented, including pixel-oriented, geometric projection, icon-based, and hierarchical techniques.

Chapter 2

Getting to Know Your Data


 Knowledge about your data is useful for data preprocessing
 You will want to know the following:
 What are the types of attributes or fields that make up your
data?
 What kind of values does each attribute have?

 Which attributes are discrete, and which are continuous-valued?
 What do the data look like?
 How are the values distributed?

 Are there ways we can visualize the data to get a better sense
of it all?
 Can we spot any outliers?
 Can we measure the similarity of some data objects with
respect to others?
2.1 Data Objects and Attribute Types
 Data sets are made up of data objects.
 A data object represents an entity:
 In a sales database, the objects may be customers, store items,
and sales.
 In a medical database, the objects may be patients.
 In a university database, the objects may be students, professors,
and courses.

 Data objects are typically described by attributes.


 Data objects can also be referred to as samples, examples,
instances, data points, or objects.
2.1.1 What Is an Attribute?
 An attribute is a data field, representing a characteristic or
feature of a data object.
 The nouns attribute, dimension, feature, and variable are often
used interchangeably in the literature.
 Observed values for a given attribute are known as observations.
 A set of attributes used to describe a given object is called an
attribute vector (or feature vector).
 The distribution of data involving one attribute (or variable) is
called uni-variate.
 A bi-variate distribution involves two attributes, and so on.

 Nominal Attributes
 Nominal means “relating to names.”
 The values of a nominal attribute are symbols or names of things.
 Each value represents some kind of category, code, or state, and
so nominal attributes are also referred to as categorical.
 The values do not have any meaningful order.
 Suppose that hair_color and marital_status are two attributes
describing person objects.
 In our application, possible values for hair_color are black,
brown, blond, red, auburn, gray, and white.
 The attribute marital_status can take on the values single,
married, divorced, and widowed.
 Binary Attributes
 A binary attribute is a nominal attribute with only two categories
or states: 0 or 1, where 0 typically means that the attribute is
absent, and 1 means that it is present.
 Binary attributes are referred to as Boolean if the two states
correspond to true and false.
 Ordinal Attributes
 An ordinal attribute is an attribute with possible values that
have a meaningful order or ranking among them, but the
magnitude between successive values is not known.
 Size: small, medium, and large
 Grade: A+, A, A-, B+
 Customer satisfaction: very dissatisfied, somewhat dissatisfied,
neutral, satisfied, and very satisfied.
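As a rough illustration of what "meaningful order, unknown magnitude" implies in practice, here is a minimal Python sketch. The rank table, the helper names (is_smaller, ordinal_median), and the sample values are assumptions made for illustration only; they do not come from the slides.

```python
# Hypothetical encoding of the ordinal attribute "size".
# Only the ORDER of the ranks is meaningful; the spacing between them is not.
SIZE_ORDER = ["small", "medium", "large"]
SIZE_RANK = {value: rank for rank, value in enumerate(SIZE_ORDER)}


def is_smaller(a: str, b: str) -> bool:
    """Order comparisons are valid for ordinal values."""
    return SIZE_RANK[a] < SIZE_RANK[b]


def ordinal_median(values):
    """Median by rank; for an even count this returns the lower middle value."""
    ranked = sorted(values, key=SIZE_RANK.__getitem__)
    return ranked[(len(ranked) - 1) // 2]


# Invented sample data.
sizes = ["large", "small", "medium", "small", "large"]
print(is_smaller("small", "large"))
print(ordinal_median(sizes))
```

Comparisons and a rank-based median make sense here, whereas a difference such as "large minus small" has no defined meaning, because the gap between successive values is unknown.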

 Numeric Attributes
 A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values.
 Numeric attributes can be interval-scaled or ratio-scaled.

 Discrete versus Continuous Attributes


 A discrete attribute has a finite or countably infinite set of values,
which may or may not be represented as integers.
 An attribute is countably infinite if the set of possible values is
infinite but the values can be put in a one-to-one correspondence
with natural numbers.
 If an attribute is not discrete, it is continuous; continuous attributes
are typically represented as real (floating-point) values.
2.2 Basic Statistical Descriptions of Data
 Basic statistical descriptions can be used to identify properties of
the data and highlight which data values should be treated as
noise or outliers.
 There are three areas of basic statistical descriptions:
 Measures of central tendency (mean, median, mode)
 Measures of the dispersion of the data (range, variance, standard
deviation)
 Graphic displays for visually inspecting the data (bar charts, pie
charts, and line graphs)
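The measures listed above can be sketched with Python's standard statistics module. The sample values are invented for illustration. The "more than two standard deviations from the mean" rule for flagging outliers is only one simple heuristic, assumed here; the slides do not prescribe it.

```python
import statistics

# Hypothetical sample values (for illustration only), in arbitrary units.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)    # all most-frequent values

# Measures of dispersion (population variance and standard deviation).
value_range = max(values) - min(values)
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

# Assumed heuristic: flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * std_dev]

print(mean, median, modes, value_range, round(std_dev, 2), outliers)
```

Note how the single large value pulls the mean above the median; this is why the median is often preferred for skewed data.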
2.3 Data Visualization
 How can we convey data to users effectively?
 Data visualization aims to communicate data clearly and
effectively through graphical representation.
 Data visualization has been used extensively in many
applications, for example:
 Reporting at work,
 Managing business operations, and
 Tracking progress of tasks.
 More popularly, we can take advantage of visualization
techniques to discover data relationships that are otherwise
not easily observable by looking at the raw data.
 We start with multidimensional data such as those stored
in relational databases.
 There are several representative approaches, including
 pixel-oriented techniques,
 geometric projection techniques,
 icon-based techniques, and
 hierarchical and graph-based techniques.
2.3.1 Pixel-Oriented Visualization Techniques

 A simple way to visualize the value of a dimension is to use a
pixel where the color of the pixel reflects the dimension’s value.
 For a data set of m dimensions, pixel-oriented techniques
create m windows on the screen, one for each dimension.
 The m dimension values of a record are mapped to m pixels at
the corresponding positions in the windows.
 AllElectronics maintains a customer information table, which
consists of four dimensions: income, credit_limit,
transaction_volume, and age.
 Can we analyze the correlation between income and the other
attributes by visualization?

 To answer this, the customers can be sorted in income-ascending order, and the
same order is used to lay out the pixels in each of the four windows.
 The pixel colors are chosen so that the smaller the value, the lighter the shading.
 Using pixel-based visualization, we can easily observe the following: credit limit
increases as income increases; customers whose income is in the middle range
are more likely to purchase more from AllElectronics; but there is no clear
correlation between income and age.
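The mapping from records to pixels can be sketched in plain Python. This is a minimal sketch under stated assumptions: the customer values are invented, the helper names (gray_level, pixel_windows) are hypothetical, and brightness is encoded as 0 (darkest) to 255 (lightest) so that smaller values get lighter pixels. An actual display would hand each window to an image library.

```python
# Hypothetical customer records (values invented for illustration).
customers = [
    {"income": 52, "credit_limit": 9, "transaction_volume": 40, "age": 61},
    {"income": 20, "credit_limit": 3, "transaction_volume": 12, "age": 35},
    {"income": 95, "credit_limit": 20, "transaction_volume": 15, "age": 28},
    {"income": 70, "credit_limit": 14, "transaction_volume": 33, "age": 47},
]
DIMENSIONS = ["income", "credit_limit", "transaction_volume", "age"]


def gray_level(value, low, high):
    """Map a value to a 0..255 brightness; smaller values are lighter."""
    if high == low:
        return 255
    return round(255 * (high - value) / (high - low))


def pixel_windows(records, sort_key, width):
    """Build one window (a list of pixel rows) per dimension.

    All windows share the same record order, so the pixel at a given
    position always belongs to the same record in every window.
    """
    ordered = sorted(records, key=lambda r: r[sort_key])
    windows = {}
    for dim in DIMENSIONS:
        column = [r[dim] for r in ordered]
        low, high = min(column), max(column)
        pixels = [gray_level(v, low, high) for v in column]
        windows[dim] = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return windows


windows = pixel_windows(customers, sort_key="income", width=2)
for dim, rows in windows.items():
    print(dim, rows)
```

Because every window uses the income-sorted order, a window whose shading darkens steadily in step with the income window suggests a positive correlation with income, while an irregular pattern suggests none.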
2.3.2 Geometric Projection Visualization Techniques
 Drawback of pixel-oriented visualization techniques:
 They cannot help us much in understanding the distribution of data
in a multidimensional space (for example, whether a region is
dense or sparse).
 The central challenge the geometric projection
techniques try to address is how to visualize a
high-dimensional space on a 2-D display.
2.3.3 Icon-Based Visualization Techniques
 Icon-based techniques use small icons to represent multidimensional data values.
 We look at two popular icon-based techniques: Chernoff faces and stick
figures.
1) Chernoff faces
 They display multidimensional data of up to 18 variables (or dimensions) as
a cartoon human face.
 They help reveal trends in the data.
 Components of the face, such as the eyes, ears, mouth, and nose, represent
values of the dimensions by their shape, size, placement, and orientation.
 For example, dimensions can be mapped to the following facial
characteristics: eye size, eye spacing, nose length, nose width, mouth
curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye
eccentricity, and head eccentricity.
 They make use of the ability of the human mind to recognize small differences in
facial characteristics and to assimilate many facial characteristics at once.
2) Stick figures
 The stick figure visualization technique maps
multidimensional data to five-piece stick figures, where
each figure has four limbs and a body.
 Two dimensions are mapped to the display (x and y) axes
and the remaining dimensions are mapped to the angle
and/or length of the limbs.
2.3.4 Hierarchical Visualization Techniques
 The visualization techniques discussed so far focus on
visualizing multiple dimensions simultaneously.
 However, for a large data set of high dimensionality, it would
be difficult to visualize all dimensions at the same time.
 Hierarchical visualization techniques partition all dimensions
into subsets (i.e., subspaces).
 The subspaces are visualized in a hierarchical manner.

1) Worlds-within-Worlds
 Suppose we want to visualize a 6-D data set, where the
dimensions are F, X1, …, X5. We want to observe how
dimension F changes with respect to the other dimensions.
 We can first fix the values of dimensions X3, X4, X5 to some
selected values, say, c3, c4, c5.
 We can then visualize F, X1, X2 using a 3-D plot, called a
world.
 The position of the origin of the inner world is located at the
point (c3, c4, c5) in the outer world, which is another 3-D plot
using dimensions X3, X4, X5.
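The idea of fixing three dimensions and plotting the remaining three can be sketched as follows. This is only an illustration: the function f is a hypothetical stand-in for the dimension F, and the helper name inner_world, the grid, and the fixed values are all assumptions. The result is the list of (X1, X2, F) points that the inner 3-D plot would draw.

```python
def f(x1, x2, x3, x4, x5):
    """Hypothetical function standing in for the dimension F."""
    return x1 * x2 + x3 - x4 + 2 * x5


def inner_world(func, fixed, grid):
    """Sample F over (X1, X2) while X3, X4, X5 stay fixed at (c3, c4, c5)."""
    c3, c4, c5 = fixed
    return [(x1, x2, func(x1, x2, c3, c4, c5)) for x1 in grid for x2 in grid]


# The inner world's origin sits at this point of the outer (X3, X4, X5) world.
origin_in_outer_world = (1.0, 2.0, 0.5)    # assumed values for (c3, c4, c5)
surface = inner_world(f, origin_in_outer_world, grid=[0, 1, 2])

for point in surface:
    print(point)
```

Choosing a different (c3, c4, c5) corresponds to moving the inner world's origin within the outer world and recomputing the surface.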

“Worlds-within-Worlds” (also known as n-Vision).


2) Tree-maps
 Tree-maps display hierarchical data as a set of nested rectangles.
 The figure below shows a tree-map visualizing Google news
stories.
 All news stories are organized into seven categories, each
shown in a large rectangle of a unique color.
 Within each category (i.e., each rectangle at the top level), the
news stories are further partitioned into smaller subcategories.
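How nested rectangles can be derived from a hierarchy is sketched below. The slides do not name a layout algorithm; this sketch assumes the simple "slice-and-dice" scheme, in which each node's rectangle is divided among its children in proportion to their sizes and the split direction alternates per level. The category names and sizes are invented.

```python
def size(node):
    """Size of a node: its own size for a leaf, else the sum of its children."""
    children = node.get("children")
    if children:
        return sum(size(child) for child in children)
    return node["size"]


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Assign a rectangle (x, y, width, height) to every node in the tree."""
    if out is None:
        out = {}
    out[node["name"]] = (x, y, w, h)
    children = node.get("children", [])
    total = sum(size(child) for child in children)
    offset = 0.0
    for child in children:
        share = size(child) / total
        if depth % 2 == 0:
            # Even levels: place children side by side along the x axis.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd levels: stack children along the y axis.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical two-level hierarchy of news categories (sizes = story counts).
news = {
    "name": "news",
    "children": [
        {"name": "world", "children": [
            {"name": "europe", "size": 3},
            {"name": "asia", "size": 1},
        ]},
        {"name": "sports", "size": 4},
    ],
}

rects = slice_and_dice(news, x=0, y=0, w=8, h=4)
for name, rect in rects.items():
    print(name, rect)
```

Each rectangle's area is proportional to the node's size, and the leaf rectangles together tile the root rectangle exactly.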

Use of tree-maps to visualize Google news headline stories
