DSM050 Data Visualisation Topic3
Structured
• Databases
• Files with delimiter (e.g. comma or tab)
Semi-structured
• Extensible Markup Language (XML)
• JavaScript Object Notation (JSON)
• Email
• Web pages
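As a rough illustration of the difference, the sketch below reads the same kind of record from a comma-delimited file and from JSON using only the Python standard library. The names, values and email address are made up for the example; they are not from the course material.

```python
import csv
import io
import json

# Delimited text: every row has the same fixed set of columns.
csv_text = "name,age\nAda,36\nLin,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: fields can be nested and can vary from record to record.
json_text = '{"name": "Ada", "age": 36, "emails": ["ada@example.com"]}'
record = json.loads(json_text)

print(rows[0]["name"], rows[0]["age"])  # CSV values are read as strings
print(record["emails"][0])              # nested list inside the record
```

Note that the delimited file yields every value as a string, whereas JSON keeps the number and the nested list as they were written.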
• Similar to histogram
• Primarily for categorical data
DSM050 Data Visualisation - JLChew 10
Univariate Analysis – Scatterplot
• Takes all the numerical features and plots them against one another in
a matrix of scatterplots
Bivariate Analysis – Correlation Matrix
• In programming, the term variable refers to a placeholder for storing state/items of data.
• The data types are:
◦ Character
◦ Float
• In statistics, the term variable refers to an aspect/attribute that can take on different values.
• The scales of measurement are:
◦ Nominal
◦ Ordinal
◦ Interval
◦ Ratio
S. S. Stevens (1946). ‘On the Theory of Scales of Measurement’. In: Science 103.2684, pp. 677–680. ISSN: 0036-8075.
Data pre-processing and cleaning
Lesson 3
https://en.wikipedia.org/wiki/Data_cleansing
How to deal with missing values?
• Remove rows with any missing values
• Remove columns with any missing values
• Impute missing values
◦ Zero
◦ Mean
◦ Median
◦ Forward fill
◦ Backward fill
◦ Interpolate
https://livebook.manning.com/#!/book/real-world-machine-learning/chapter-2/106
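The imputation options listed above can be sketched in plain Python as follows. This is a minimal illustration, assuming missing values are represented by `None` and that interpolation means linear interpolation between the nearest known neighbours. The function names and the `readings` list are hypothetical.

```python
from statistics import mean, median

def fill_constant(values, fill):
    """Replace every missing value (None) with one constant."""
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward into the gaps."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(values):
    """Carry the next observed value backward into the gaps."""
    return forward_fill(values[::-1])[::-1]

def interpolate(values):
    """Linear interpolation between known neighbours.
    Leading or trailing gaps are left as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

readings = [10.0, None, None, 16.0, None, 20.0]  # made-up series
observed = [v for v in readings if v is not None]

zero_filled = fill_constant(readings, 0.0)
mean_filled = fill_constant(readings, mean(observed))
median_filled = fill_constant(readings, median(observed))
ffilled = forward_fill(readings)
bfilled = backward_fill(readings)
interpolated = interpolate(readings)
```

Which option is appropriate depends on the data: forward fill, backward fill and interpolation only make sense when the rows have a meaningful order, such as a time series.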
Tidy Data
A standard way of mapping the meaning of a dataset
to its structure.
• Each variable forms a column.
• Each column must have the same unit of measurement.
• Each observation forms a row.
• There can be multiple observed variables per row.
• Each type of observational unit forms a table.
Wickham, H. (2014). ‘Tidy Data’. In: Journal of Statistical Software 59.10. ISSN: 1548-7660. DOI: 10.18637/jss.v059.i10
Common Problems with Messy Data
• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Tables are stored in both rows and columns.
• Multiple types of observational units are stored in the same table.
• A single observational unit is stored in multiple tables.
https://data.gov.sg/dataset/average-monthly-household-expenditure-by-income-quintile-and-type-of-dwelling-quinquennial
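The first problem, column headers that are values rather than variable names, can be fixed by reshaping the table from wide to long form. The sketch below does this in plain Python. The table layout, column names and figures are invented for illustration and are not taken from the data.gov.sg dataset; `melt` is a hypothetical helper name.

```python
# Hypothetical messy table: the income-quintile headers are values,
# not variable names. All figures are made up.
messy = [
    {"dwelling": "Type A", "1st quintile": 2100, "2nd quintile": 3000},
    {"dwelling": "Type B", "1st quintile": 3500, "2nd quintile": 4200},
]

def melt(rows, id_col, var_name, value_name):
    """Turn each non-identifier column into its own row (wide to long)."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy.append({id_col: row[id_col], var_name: key, value_name: value})
    return tidy

tidy = melt(messy, "dwelling", "income_quintile", "expenditure")
for row in tidy:
    print(row)
```

After reshaping, each variable (dwelling, income quintile, expenditure) forms a column and each observation forms a row.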
Example of Tidy Data
https://data.gov.sg/dataset/average-monthly-household-expenditure-by-income-quintile-and-type-of-dwelling-quinquennial
Survey Design
Lesson 4
Population
• The entire group that we want to draw conclusions about.
• A population doesn’t always refer to people. It can be all the elements of any set of data.
Sample
• A specific group from which data was collected.
• Consists of one or more observations from the population.
https://en.wikipedia.org/wiki/Sampling_(statistics)
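The relationship between a population and a sample can be sketched with a simple random sample drawn without replacement. The population of 1,000 numbered units, the sample size of 50 and the seed are all arbitrary assumptions for the example.

```python
import random

# Hypothetical population: 1,000 units identified by number.
population = list(range(1, 1001))

# Seeded generator so the draw is reproducible.
rng = random.Random(42)

# Simple random sample of 50 units, drawn without replacement.
sample = rng.sample(population, k=50)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

The sample mean is only an estimate of the population mean, and it changes with each draw.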
Concepts, Variables & Data
Operationalise concepts
• Show how data related to each concept will be generated.
• Specify procedures that will be used to classify or measure.
• Years since birth
◦ Absolute number: ask people how old they are
◦ Date of birth
• (Approximately) equal interval range
◦ Select age category from a list, e.g. 18–20, 21–25, etc.
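As a sketch of how these measurement choices relate, the code below derives age in years from a date of birth and then maps it to an age category. The first two bands come from the slide; the third band, the function names and the example dates are assumptions added for illustration.

```python
from datetime import date

def age_in_years(date_of_birth, today):
    """Whole years since birth on the given date."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

# First two bands are from the slide; the third is a hypothetical continuation.
AGE_BANDS = [(18, 20), (21, 25), (26, 30)]

def age_band(age):
    """Return the label of the band containing the age, or None if no band fits."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None

dob = date(2000, 6, 15)         # hypothetical respondent
survey_date = date(2024, 3, 1)  # hypothetical survey date
age = age_in_years(dob, survey_date)
band = age_band(age)
print(age, band)
```

Collecting the date of birth keeps the most detail, since both the exact age and the category can be derived from it, while a category chosen from a list cannot be converted back to an exact age.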
How to Write Questions
Give the questions a clear purpose
• Ask only 1 thing per question.
Be clear about the type of answer you are looking for
• Indicate clearly whether you want facts or feelings and phrase questions to match.
https://evals.stanford.edu/end-term-feedback/how-write-questions
Types of Questions
Open-ended
• Questions for which the respondent is asked to provide his/her own answers.
• Require complex data analysis.
• Tend to provide more qualitative data.
• Tends to yield lower-quality data that might also vary over time.
• Tends to increase the cognitive load for the respondent.