DSM050 Data Visualisation Topic3

The document provides an overview of key concepts in quantitative relationships, variables, and types of data for data visualization. It discusses primary and secondary data, different types of variables (nominal, ordinal, interval, ratio), quantitative stories and relationships between variables, and tidy data formatting. It also outlines different types of analysis (univariate, bivariate, multivariate) and provides examples of analyzing and visualizing different variable types.

Uploaded by

LICHEN YU
Copyright
© All Rights Reserved

DSM050 Data Visualisation

Topic 3: Quantitative relationships, variables and types


Topic Learning Outcomes
By the end of this topic, you should be able to:
1. critically evaluate the use of primary and secondary data in visualisation
2. identify the variables in a data visualisation
3. describe the relationships between variables in a data visualisation and
explain how they create a quantitative story
4. categorise variables into different types
5. manipulate real-world data to produce tidy data, i.e. data in a form
suitable for analysis
6. apply appropriate mathematical operations to different variable types
7. design and implement a critically informed survey suitable for generating
valid and unbiased data

DSM050 Data Visualisation - JLChew 2


Key concepts
This topic includes:
• primary and secondary data
• data variables
• quantitative stories and relationships
• variable types: nominal, ordinal, interval and ratio
• tidy data
• survey methodology



Variables and Quantitative Stories
Lesson 1



Data Semantics
• A dataset is a collection of values.
• Values are usually either numbers (quantitative) or strings
(qualitative).
• Every value belongs to a variable and an observation.
• A variable contains all values that measure the same underlying
attribute across units.
• An observation contains all values measured on the same unit across
attributes.
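
The definitions above can be sketched in a few lines of Python. This is only an illustration: the toy dataset, the field names and the two helper functions are assumptions made for the example, not part of the course material.

```python
# Hypothetical toy dataset: each dict is one observation (one unit),
# and each key is one variable (one attribute measured across units).
dataset = [
    {"name": "Ana", "height_cm": 162.0, "blood_type": "A"},
    {"name": "Ben", "height_cm": 178.5, "blood_type": "O"},
    {"name": "Cho", "height_cm": 170.2, "blood_type": "B"},
]


def variable(rows, name):
    """Return all values that measure the same attribute across units."""
    return [row[name] for row in rows]


def observation(rows, index):
    """Return all values measured on the same unit across attributes."""
    return rows[index]


heights = variable(dataset, "height_cm")       # quantitative values (numbers)
blood_types = variable(dataset, "blood_type")  # qualitative values (strings)
second_unit = observation(dataset, 1)          # every value recorded for one unit
```

Every value in the dataset belongs to exactly one variable and one observation, so three observations of three variables give nine values.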


Data Structures

Structured
• Databases
• Files with delimiter (e.g. comma or tab)

Semi-structured
• Extensible Markup Language (XML)
• JavaScript Object Notation (JSON)
• Email
• Web pages

Unstructured
• Audio
• Video
• Image
• Natural language
• Documents



Types of Analysis
Univariate (exploring individual features)
• Histogram
• Value Count Plot
• Scatterplot
• Boxplot
• etc.

Bivariate (exploring relationships between features)
• Scatterplot
• Bar Graph
• Scatterplot Matrix
• Correlation Matrix
• etc.

Multivariate (exploring relationships between features)
• Scatterplot Matrix
• Correlation Matrix
• 3D Scatterplot
• 3D Surface plot
• etc.



Univariate Analysis – Histogram

• Count the number of occurrences for each bin


• Primarily for continuous numerical data
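
The counting step behind a histogram can be sketched without any plotting library. The function name, the sample heights and the bin edges below are assumptions chosen for illustration. Each bin is half-open, except the last one, which also includes its right edge.

```python
def histogram_counts(values, bin_edges):
    """Count the values in each bin [edge_i, edge_{i+1}).

    The last bin also includes its right edge. Values outside the
    edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts


# Hypothetical continuous measurements (heights in cm)
heights_cm = [150.5, 162.0, 163.4, 170.2, 171.9, 178.5, 180.0]
edges = [150, 160, 170, 180]
counts = histogram_counts(heights_cm, edges)  # one count per bin
```

Drawing the histogram then amounts to plotting one bar per bin, with the bar height given by the count.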

Univariate Analysis – Value Count Plot

• Similar to a histogram, but counts the occurrences of each category rather than each bin
• Primarily for categorical data
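
For categorical data the counting step needs no bins: each distinct value is its own category. A minimal sketch using the standard library, with made-up blood-type values as the assumed input:

```python
from collections import Counter

# Hypothetical categorical observations
blood_types = ["A", "O", "B", "O", "AB", "O", "A"]

value_counts = Counter(blood_types)        # category -> number of occurrences
most_common = value_counts.most_common(1)  # the single most frequent category
```

A value count plot draws one bar per category, with the bar height taken from `value_counts`.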

Univariate Analysis – Scatterplot

• Show the spread of values


Univariate Analysis – Boxplot

• Show the distribution of a set of data
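
A boxplot is usually drawn from a five-number summary. The sketch below computes one with the standard library. The 1.5 × IQR rule for flagging outliers is a common convention assumed here rather than something stated in the slides, and the data values are made up.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": q2, "q3": q3, "max": ordered[-1]}


def iqr_outliers(values):
    """Values beyond 1.5 * IQR from the quartiles (assumed convention)."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - 1.5 * iqr, s["q3"] + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
```

The box spans `q1` to `q3`, the line inside it marks the median, and points returned by `iqr_outliers` are typically drawn individually.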


Bivariate Analysis – Scatterplot

• Show the relationship between 2 features


Bivariate Analysis – Bar Graph

• Show the mean of every feature


Bivariate Analysis – Scatterplot Matrix

• Takes all the numerical features and plots them against one another in
a matrix of scatterplots

Bivariate Analysis – Correlation Matrix

• Takes all the numerical features and calculates the pairwise correlation
coefficients between the features
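
A correlation matrix holds one coefficient per pair of numerical features. The sketch below assumes the Pearson coefficient, which is a common default, though the slides do not name a specific measure. The feature names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of features."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numerical features (one list per variable)
columns = {
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50, 60, 70, 80],
    "lateness_min": [8, 6, 4, 2],
}
matrix = correlation_matrix(columns)
```

The matrix is symmetric with ones on the diagonal. Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.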

Multivariate Analysis – 3D Scatterplot



Multivariate Analysis – 3D Surface Plot



Different types of data variables
Lesson 2



Data Types in Computer Science vs Statistics
The term "data type" has different meanings in Computer Science and Statistics.

Computer Science
• The term variable refers to a placeholder for storing state/items of data.
• The data types are:
  ❑ Character
  ❑ String
  ❑ Boolean
  ❑ Integer
  ❑ Float
  ❑ Enumerated type
  ❑ List
  ❑ etc.

Statistics
• The term variable refers to an aspect/attribute that can take on different values.
• The scales of measurement are:
  ❑ Nominal
  ❑ Ordinal
  ❑ Interval
  ❑ Ratio
• A variable can be classified as either numerical/quantitative or
  categorical/qualitative.

Data Types – Scales of Measurement
Numerical or Quantitative
• Deals with numbers and things that can be measured or counted objectively.

Categorical or Qualitative
• Deals with characteristics, descriptors or dimensions that cannot be easily
  measured, but can be observed subjectively.
• Consists of values from a discrete and finite set.
• Can also take on numerical values, but these numbers don’t have mathematical
  meaning.

Data Types – Scales of Measurement
Numerical or Quantitative
• Ratio: differences between measurements are meaningful and a true zero exists
• Interval: differences between measurements are meaningful, but there is no true zero

Categorical or Qualitative (represents discrete units with no quantitative value)
• Ordinal: units are ordered
• Nominal: units have no order



Data Types
Numerical or Quantitative

Discrete
• Counted
• Values are distinct & separate
• Digital

Continuous
• Measured
• Values can be reduced to finer levels
• Analog



Examples of Nominal data types
• Race
• Gender
• Blood type
• Country
• Political party



Examples of Ordinal data types
• Age group (e.g. 18 – 20, 21 – 25)
• Income level
• Education level
• Satisfaction rating



Examples of Interval data types
• Time
• Date
• pH (0 to 14)
• Temperature (Celsius)
• Temperature (Fahrenheit)



Examples of Ratio data types
• Concentration
• Weight
• Length
• Temperature (Kelvin)



Mathematical Properties

S. S. Stevens (1946). ‘On the Theory of Scales of Measurement’. In: Science 103.2684, pp. 677–680. ISSN: 0036-8075.
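
The table from this slide did not survive extraction. As a rough sketch of the idea in Stevens (1946), the code below assumes a commonly cited summary: each scale permits the summary statistics of the scales below it plus one more. The `PERMISSIBLE` mapping, the `summarise` helper and the sample values are illustrative assumptions, not the slide's content.

```python
import statistics

# Assumed summary of which statistics are meaningful on each scale
PERMISSIBLE = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean", "geometric_mean"},
}


def summarise(values, scale, statistic):
    """Compute a statistic only if it is meaningful for the given scale."""
    if statistic not in PERMISSIBLE[scale]:
        raise ValueError(f"{statistic} is not meaningful for {scale} data")
    return getattr(statistics, statistic)(values)


blood_type_mode = summarise(["A", "O", "O", "B"], "nominal", "mode")
rating_median = summarise([1, 2, 2, 4, 5], "ordinal", "median")
celsius_mean = summarise([20.0, 22.0, 24.0], "interval", "mean")

# Ratios need a true zero: 20 °C is not "twice as hot" as 10 °C.
# Converting to Kelvin (a ratio scale) gives the meaningful ratio.
kelvin_ratio = (20.0 + 273.15) / (10.0 + 273.15)
```

Asking for the mean of a nominal variable, for example `summarise(["A", "B"], "nominal", "mean")`, raises a `ValueError` in this sketch.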

Data pre-processing and cleaning
Lesson 3



Data Cleaning

Process of detecting and correcting (or removing) corrupt/inaccurate data

• Dealing with outliers
• Data parsing
• Dealing with missing values
• Reshaping into Tidy Data format

https://en.wikipedia.org/wiki/Data_cleansing

How to deal with missing values?
• Remove rows with any missing values
• Remove columns with any missing values
• Impute missing values
  ❑ Zero
  ❑ Mean
  ❑ Median
  ❑ Forward fill
  ❑ Backward fill
  ❑ Interpolate
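
The imputation options listed above can be sketched in plain Python, using `None` as the assumed marker for a missing value. The helper names and the sample readings are hypothetical. Data-analysis libraries offer equivalent operations, but this version only needs the standard library.

```python
from statistics import mean, median


def fill_constant(values, constant):
    """Replace every missing value with one constant (zero, mean, median...)."""
    return [constant if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward into each gap."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Pull the next observed value backward into each gap."""
    return forward_fill(values[::-1])[::-1]


def interpolate(values):
    """Linearly interpolate interior gaps; leading/trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical series with gaps
readings = [10.0, None, None, 16.0, None, 20.0]
observed = [v for v in readings if v is not None]  # the "remove" option

zero_filled = fill_constant(readings, 0.0)
mean_filled = fill_constant(readings, mean(observed))
median_filled = fill_constant(readings, median(observed))
ffilled = forward_fill(readings)
bfilled = backward_fill(readings)
interpolated = interpolate(readings)
```

Forward fill, backward fill and interpolation assume the observations are ordered (for example a time series). Zero, mean and median imputation do not depend on order.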



Dealing with missing values

https://livebook.manning.com/#!/book/real-world-machine-learning/chapter-2/106

Tidy Data
A standard way of mapping the meaning of a dataset
to its structure.
• Each variable forms a column.
• Each column must have the same unit of measurement.
• Each observation forms a row.
• There can be multiple observed variables per row.
• Each type of observational unit forms a table.

Wickham, H. (2014). ‘Tidy Data’. In: Journal of Statistical Software 59.10. ISSN: 1548-7660. DOI: 10.18637/jss.v059.i10

Common Problems with Messy Data
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
• Multiple types of observational units are stored in the same table.
• A single observational unit is stored in multiple tables.
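
The first problem in the list, column headers that are really values, is typically fixed by reshaping the table from wide to long. The sketch below uses a small made-up table: the dwelling types, quintile columns and expenditure figures are invented for illustration and are not taken from the dataset referenced in this topic. The `melt` helper is a hypothetical stand-in for what data-analysis libraries provide.

```python
# Hypothetical messy table: the income quintile is a variable, but its
# values appear as column headers. All figures are made up.
messy = [
    {"dwelling": "HDB 4-room", "quintile_1": 2500, "quintile_2": 3200},
    {"dwelling": "Condominium", "quintile_1": 4100, "quintile_2": 5600},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn each non-identifier column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column in id_vars:
                continue
            record = {key: row[key] for key in id_vars}
            record[var_name] = column
            record[value_name] = value
            tidy.append(record)
    return tidy


tidy = melt(messy, ["dwelling"], "income_quintile", "avg_monthly_expenditure")
```

After reshaping, each variable (dwelling, income quintile, expenditure) forms a column and each observation forms a row.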



Example of Messy Data
• Avg Monthly Household Expenditure, by Income Quintile, and Type of Dwelling
• Which are the independent and dependent variables?

https://data.gov.sg/dataset/average-monthly-household-expenditure-by-income-quintile-and-type-of-dwelling-quinquennial

Example of Tidy Data

https://data.gov.sg/dataset/average-monthly-household-expenditure-by-income-quintile-and-type-of-dwelling-quinquennial

Survey Design
Lesson 4



Surveys
A systematic method for gathering information
(primary data) to construct quantitative descriptors.

• The success of survey research depends on how closely the
respondents’ answers match how people think and act in reality.



Types of Surveys - Study Design
Design (planning / implementing a study)
• Sample survey or experiment?
• How to choose subjects for the study, and how many?
• What questions to ask to find answers to the research questions?

Descriptive (graphical / numerical methods for summarising data)
• Describe phenomena and summarise them.
• Graphs, tables and numerical summaries are examples of descriptive statistics.

Inferential (making predictions based on the data)
• Inferential statistics uses methods for making predictions about a
  population based on data from a sample.
• Measure associations, e.g. income and quality of life.



Survey Designs
Cross-sectional
• Data collected at one point in time from a sample selected to represent a
  larger population.

Longitudinal
• Trend – surveys of samples from the same population at different times.
• Cohort – study of the same specific population (cohort) over time, though
  the individuals sampled may be different.
• Panel – data collected at various times from the same sample of
  respondents.



Population vs Sample

Population
• The entire group that we want to draw conclusions about.
• A population doesn’t always refer to people; it can consist of all the
  elements in a set of data.

Sample
• The specific group from which data is collected.
• Consists of one or more observations from the population.
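
The relationship between a population and a sample can be sketched with simple random sampling. The population below is simulated, with a mean of 170 and a spread of 10 picked arbitrarily for illustration, and the fixed seed is only there to make the example repeatable.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the example is repeatable

# Hypothetical population: 10,000 simulated heights in cm
population = [random.gauss(170, 10) for _ in range(10_000)]

# Simple random sample of 100 observations, drawn without replacement
sample = random.sample(population, 100)

population_mean = mean(population)  # usually unknown in practice
sample_mean = mean(sample)          # the estimate computed from collected data
```

In a real survey only the sample is observed. The sample mean is used to draw conclusions about the population mean, and the two generally differ by some sampling error.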

https://en.wikipedia.org/wiki/Sampling_(statistics)

Concepts, Variables & Data
Concepts
• An abstract idea in the domain of enquiry.

Variables
• A quantity or quality that varies across people or situations.
• A quantitative variable is a quantity that is measured by assigning a
  number to each individual.
• A categorical variable is a quality that is measured by assigning a
  category label to each individual.



Concepts → Variables → Data
Define all relevant concepts in the study
• Use standard meanings in the field whenever possible.
• Simplifies relating results from different studies.
• Example: Age – years since birth

Operationalise concepts
• Show how data related to each concept will be generated.
• Specify procedures that will be used to classify or measure.
• Example: years since birth
  ❑ Absolute number – ask people how old they are
  ❑ Date of birth
• Example: (approximately) equal interval range
  ❑ Select age category from a list, e.g. 18 – 20, 21 – 25, etc.
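
The age example above can be operationalised in code in both ways: as whole years derived from a date of birth, or as a category chosen from a list. The function names are hypothetical, and the third age band is an assumption added for the sketch, since the slide only lists 18 – 20 and 21 – 25 before "etc.".

```python
from datetime import date


def age_in_years(date_of_birth, on_date):
    """Whole years since birth on a given reference date."""
    years = on_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred this year
    if (on_date.month, on_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# The first two bands follow the slide; the third is a hypothetical continuation
AGE_BANDS = [(18, 20), (21, 25), (26, 30)]


def age_category(age):
    """Map an age in years to a category label, or 'other' if no band fits."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"


age = age_in_years(date(2000, 6, 15), date(2024, 6, 14))  # day before birthday
category = age_category(age)
```

Age in years is a ratio variable, while the age category is ordinal, so the choice of procedure determines which mathematical operations are meaningful later.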

How to Write Questions
Give the questions a clear purpose
• Ask only 1 thing per question.

Use concise, familiar language
• Be specific and clear.

Phrase questions impartially
• Approach the question with neutrality.

Be clear about the type of answer you are looking for
• Indicate clearly whether you want facts or feelings and phrase questions to match.

https://evals.stanford.edu/end-term-feedback/how-write-questions

Types of Questions
Open-ended
• Questions for which the respondent is asked to provide his/her own answers.
• Require complex data analysis.
• Tend to provide more qualitative data.

Closed-ended
• Questions for which the respondent is asked to select an answer from a
  list provided by the researcher.
• Popular as they provide greater uniformity of responses and are easily
  processed.
• Not all possible answers can be anticipated in a list.
• A long list increases the cognitive load for the respondent.



Ranking vs Ratings
Rating
• Asks respondents to assign a value or score to a particular question. The
  value or score comes from a pre-defined scale or set of categories.
• Tends to be easier to analyse.
• Allows respondents to give different questions or items the same value.
• A narrow distribution of ratings would not provide a complete analysis or
  understanding.
• Tends to yield lower-quality data that might also vary over time.

Ranking
• Asks respondents to order items by importance.
• Tends to provide more reliable data, but might be more difficult to analyse.
• Guarantees that each ranked item has a unique value, but forces
  respondents to differentiate between items that may have been equivalent.
• Tends to increase the cognitive load for the respondent.
