Data Handling: Import, Cleaning and Visualisation
Lecture 3:
A Brief Introduction to Data and Data Processing
Dr. Aurélien Sallin
Recap and warm-up
Basic programming concepts
· Values, variables
· Vectors
· Matrices
· Loops
· Logical statements
· Control statements
· Functions
Three tutorials
· Compute the mean with your own function
· Evolution in action: fast and slow sloths -> exercise session
· Append and lists -> exercise session
Warm-up
# Vectors
some_numbers <- c(30, 50, 60)
some_numbers[c(2,3)]
some_numbers > 3
some_numbers * 5
Warm-up
What is total_sum?
numbers <- 1:4
total_sum <- 0
n <- length(numbers)
# start loop
for (i in 1:n) {
  if (i %% 2 == 0) {
    total_sum <- total_sum + numbers[i]
  } else {
    total_sum <- total_sum + 2 * numbers[i]
  }
}
Don’t forget…
Data Processing
The binary system
Microprocessors can only represent two signs (states):
· ‘Off’ = 0
· ‘On’ = 1
The binary counting frame
· Only two signs: 0, 1.
· Base 2.
· Columns: $2^0 = 1$, $2^1 = 2$, $2^2 = 4$, and so forth.
The binary counting frame
What is the decimal number 139 in the binary counting frame?
· Solution:
$(1 \times 2^7) + (1 \times 2^3) + (1 \times 2^1) + (1 \times 2^0) = 139$.
· More precisely:
$(1 \times 2^7) + (0 \times 2^6) + (0 \times 2^5) + (0 \times 2^4) + (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0) = 139$.
· That is, the number 139 in the decimal system corresponds to 10001011 in the binary system.
Conversion between binary and decimal
Number   128   64   32   16    8    4    2    1
    0 =    0    0    0    0    0    0    0    0
    1 =    0    0    0    0    0    0    0    1
    2 =    0    0    0    0    0    0    1    0
    3 =    0    0    0    0    0    0    1    1
  139 =    1    0    0    0    1    0    1    1
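The conversion can also be tried out in R. The following is a minimal sketch: `to_binary()` is a helper written for this illustration (not a built-in function) that repeatedly divides by 2 and collects the remainders, while `strtoi()` is base R and converts in the other direction.

```r
# Illustrative helper (not part of base R): convert a non-negative whole
# number to a string of binary digits by repeated division by 2.
to_binary <- function(n) {
  if (n == 0) {
    return("0")
  }
  digits <- c()
  while (n > 0) {
    digits <- c(n %% 2, digits)  # the remainder is the next digit, right to left
    n <- n %/% 2                 # integer division moves on to the next column
  }
  paste(digits, collapse = "")
}

binary_139 <- to_binary(139)
print(binary_139)
## [1] "10001011"

# Back to decimal with base R
decimal_139 <- strtoi("10001011", base = 2)
print(decimal_139)
## [1] 139
```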
The binary counting frame
· Sufficient to represent all natural numbers in the decimal system.
· Representing fractions is tricky.
- e.g. 1/3 = 0.333… cannot be written with finitely many binary digits: it is an infinite sequence of 0s and 1s.
- Solution: ‘floating-point numbers’, which store a close approximation (not 100% accurate).
Floating point numbers: a strange phenomenon
# Subtracting two floating-point numbers
x <- 0.3 - 0.2
y <- 0.1
# Check if they are equal
result <- x == y
print(x)
## [1] 0.1
print(y)
## [1] 0.1
print(result)
## [1] FALSE
Floating point numbers: a strange phenomenon
print(format(x, digits = 20)) # prints a more precise value of x
## [1] "0.099999999999999977796"
print(format(y, digits = 20)) # prints a more precise value of y
## [1] "0.10000000000000000555"
tolerance <- 1e-9
equal <- abs(x - y) < tolerance
print(equal)
## [1] TRUE
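Base R already offers a tolerance-based comparison, so the tolerance does not have to be coded by hand. A minimal sketch using `all.equal()`, which by default compares numbers up to a tolerance of about 1.5e-8:

```r
x <- 0.3 - 0.2
y <- 0.1

# Exact comparison fails because of the tiny representation error
exact_equal <- x == y
print(exact_equal)
## [1] FALSE

# all.equal() compares up to a small tolerance; wrap it in isTRUE()
# because it returns a descriptive string (not FALSE) when values differ
nearly_equal <- isTRUE(all.equal(x, y))
print(nearly_equal)
## [1] TRUE
```

Wrapping the call in `isTRUE()` makes it safe to use inside `if` statements.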
Decimal numbers in a computer
If computers only understand 0 and 1, how can they express decimal numbers
like 139?
· Standards define how symbols, colors, etc. are shown on the screen.
· This facilitates interaction with a computer (our keyboards do not consist only of a 0/1 switch).
What time is it?
The hexadecimal system
· Binary numbers can become quite long rather quickly.
· In computer science, binary numbers are therefore often written in the hexadecimal system.
The hexadecimal system
· 16 symbols:
- 0-9 (used like in the decimal system)…
- and A-F (for the numbers 10 to 15).
· 16 symbols >>> base 16: each digit represents an increasing power of 16 ($16^0$, $16^1$, etc.).
The hexadecimal system
What is the decimal number 139 expressed in the hexadecimal system?
· Solution:
$(8 \times 16^1) + (11 \times 16^0) = 139$.
· More precisely:
$(8 \times 16^1) + (\mathrm{B} \times 16^0) = 8\mathrm{B} = 139$.
· Hence: 10001011 (in binary) = 8B (in hexadecimal) = 139 in decimal.
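These equivalences can be checked in R. A minimal sketch using only base R: `sprintf()` with the `%X` format prints an integer in hexadecimal, `strtoi()` reads a string in a given base, and the `0x` prefix lets you type hexadecimal numbers directly.

```r
# Decimal to hexadecimal
hex_139 <- sprintf("%X", 139L)
print(hex_139)
## [1] "8B"

# Hexadecimal and binary back to decimal
from_hex <- strtoi("8B", base = 16L)
from_binary <- strtoi("10001011", base = 2)
print(c(from_hex, from_binary))
## [1] 139 139

# R also understands hexadecimal literals
print(0x8B == 139)
## [1] TRUE
```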
The hexadecimal system
Advantages (when working with binary numbers)
1. Shorter than raw binary representation
2. Much easier to translate back and forth between binary and hexadecimal
than between binary and decimal.
WHY?
😆
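One way to see why: since $16 = 2^4$, each hexadecimal digit stands for exactly four binary digits, so a binary number can be translated group by group without any further arithmetic. A minimal sketch in base R (the variable names are chosen for illustration):

```r
binary <- "10001011"

# Cut the binary string into groups of four digits
starts <- seq(1, nchar(binary), by = 4)
groups <- substring(binary, starts, starts + 3)
print(groups)
## [1] "1000" "1011"

# Translate each group separately: 1000 is 8, 1011 is 11 (written B)
hex_digits <- sprintf("%X", strtoi(groups, base = 2))
print(hex_digits)
## [1] "8" "B"

hex_number <- paste(hex_digits, collapse = "")
print(hex_number)
## [1] "8B"
```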
Character Encoding
Computers and text
How can a computer understand text if it only understands 0s and 1s?
A modified version of South Korean Dubeolsik (two-set type) for old hangul letters. (Illustration by Yes0song 2010, Creative Commons Attribution-Share Alike 3.0
Unported)
· Standards define how 0s and 1s correspond to specific letters/characters of
different human languages.
· These standards are usually called character encodings.
· They are based on coded character sets, which map a unique number (ultimately stored as a binary value) to each character in the set.
· For example, ASCII (American Standard Code for Information Interchange),
now largely superseded by UTF-8 (an encoding of Unicode).
ASCII logo. (public domain).
ASCII Table
Binary      Hexadecimal  Decimal  Character
0011 1111   3F           63       ?
0100 0001   41           65       A
0110 0010   62           98       b
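The rows of this table can be reproduced in R. A minimal sketch using base R functions: `utf8ToInt()` returns the code of a character, `sprintf("%X", ...)` prints it in hexadecimal, and `intToBits()` gives its bits (lowest bit first, hence the `rev()`). The name `ascii_table` is chosen for illustration.

```r
chars <- c("?", "A", "b")

# Decimal code of each character
codes <- sapply(chars, utf8ToInt)

# Binary representation: take the lowest 8 bits and put the highest first
to_bits <- function(code) {
  paste(rev(as.integer(intToBits(code))[1:8]), collapse = "")
}

ascii_table <- data.frame(
  binary      = sapply(codes, to_bits),
  hexadecimal = sprintf("%X", codes),
  decimal     = codes,
  character   = chars,
  row.names   = NULL
)
print(ascii_table)

# And back from a code to a character
print(intToUtf8(98))
## [1] "b"
```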
Character encodings: why should we care?
· In practice, Data Science means handling digital data of all formats and
shapes.
- Diverse sources.
- Different standards.
- Different languages (Japanese vs English).
- Reading and storing data.
· At the lowest level, this means understanding/handling encodings.
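A small illustration of why this matters: the same text is stored as different bytes under different encodings, and reading bytes with the wrong encoding garbles the text. The sketch below uses the base R functions `charToRaw()` and `iconv()`; the example word and variable names are chosen for illustration, and the `\u00fc` escape (the letter ü) keeps the script itself pure ASCII.

```r
city <- "Z\u00fcrich"  # "Zürich", stored by R as UTF-8

# In UTF-8, the letter ü takes two bytes (c3 bc)
utf8_bytes <- charToRaw(city)
print(utf8_bytes)

# In Latin-1, the same letter takes a single byte (fc)
latin1_city <- iconv(city, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(latin1_city)
print(latin1_bytes)

# Wrongly treating the UTF-8 bytes as Latin-1 produces garbled text
garbled <- iconv(city, from = "latin1", to = "UTF-8")
print(garbled)
```

This is the typical source of strange characters when importing files whose encoding differs from what the reading program assumes.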
Computer Code and Text-Files
Putting the pieces together…
Two core themes of this course:
1. How can data be stored digitally and be read by/imported to a computer?
2. How can we give instructions to a computer by writing computer code?
In both of these domains we mainly work with one simple type of document:
text files.
Text-files
· A collection of characters stored in a designated part of the computer
memory/hard drive.
· An easy-to-read representation of the underlying information (0s and 1s)!
· Common device to store data:
- Structured data (tables)
- Semi-structured data (websites)
- Unstructured data (plain text)
· Typical device to store computer code.
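To see that a text file is just an easy-to-read view of stored bytes, one can write a small file and read it back in two ways. A minimal sketch in base R, using a temporary file so that nothing on your computer is overwritten (the variable names are illustrative):

```r
path <- tempfile(fileext = ".txt")

# Write one line of text to the file
writeLines("Hi!", path)

# Read it back as text
text <- readLines(path)
print(text)
## [1] "Hi!"

# Read the very same file as raw bytes (shown in hexadecimal):
# 48 = "H", 69 = "i", 21 = "!", followed by the line-ending byte(s)
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

unlink(path)  # remove the temporary file
```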
Text editors: RStudio, Atom, VS Code, Sublime Text
Install RStudio from here!
Install Atom from here!
Install VS Code from here!
Install Sublime Text from here!
Data Processing Basics
The ‘blackbox’ of data processing.
Components of a standard computing environment
Basic components of a standard computing environment.
Central Processing Unit
· R runs on one CPU core by default.
· Virtually all modern CPUs have multiple cores.
· Advanced: explore parallelization with the packages plyr, doParallel, and future.
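As a first impression of what parallelization looks like, here is a minimal sketch. Note that it uses the `parallel` package that ships with R, rather than the packages named above, and it assumes that the machine has at least two cores. The function `slow_square()` is a made-up example task.

```r
library(parallel)  # ships with R; swapped in here for illustration

# How many cores can R see? (may be NA on some systems)
n_cores <- detectCores()
print(n_cores)

# A deliberately slow toy task
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Start two worker processes (assumption: at least two cores are available)
cl <- makeCluster(2)
squares <- parLapply(cl, 1:4, slow_square)  # tasks are spread over the workers
stopCluster(cl)                             # always shut the workers down again

print(unlist(squares))
## [1]  1  4  9 16
```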
Random Access Memory
large_matrix <- matrix(1, nrow=1e8, ncol=1e8)
## Error in matrix(1, nrow = 1e+08, ncol = 1e+08): vector is too large
· Try to create a matrix with $10^8 \times 10^8$ elements.
· Assuming each number is stored using 8 bytes, this matrix would require
$8 \times 10^{16}$ bytes of RAM (more on bytes in the next lecture).
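The arithmetic can be checked in R, and the assumption of 8 bytes per number can be verified on a matrix small enough to fit into memory. A minimal sketch (variable names are illustrative; `object.size()` is base R):

```r
# Memory needed for the large matrix
n_elements <- 1e8 * 1e8          # 1e16 elements
bytes_needed <- n_elements * 8   # 8 bytes per number
print(bytes_needed)
## [1] 8e+16
print(bytes_needed / 1e9)        # the same in gigabytes
## [1] 8e+07

# Check the 8-bytes assumption on a small matrix with 1e6 elements:
# it occupies about 8e6 bytes, plus a small fixed overhead
small_matrix <- matrix(1, nrow = 1000, ncol = 1000)
small_size <- object.size(small_matrix)
print(format(small_size, units = "MB"))
```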
Mass storage: hard drive
Network: Internet, cloud, etc.
Putting the pieces together…
Recall the initial example (survey) of this course.
1. Access a website (over the Internet) and use the keyboard to enter data into it
(a Google Sheet in this case).
2. An R program accesses the data in the Google Sheet (again over the Internet),
downloads the data, and loads it into RAM.
3. Data processing: produce output (in the form of statistics/plots) and show it on
the screen.
5468616E6B7320616E642073656520796F75206E657874207765656B21
🤓
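The hexadecimal string above can be read as a sequence of bytes, two hexadecimal digits per byte, where each byte is an ASCII code. A minimal sketch for decoding it in base R (variable names are illustrative); run it to read the message:

```r
hex_string <- "5468616E6B7320616E642073656520796F75206E657874207765656B21"

# Cut the string into pairs of hexadecimal digits (one pair = one byte)
starts <- seq(1, nchar(hex_string), by = 2)
hex_pairs <- substring(hex_string, starts, starts + 1)

# Convert each pair to its decimal code, then the codes to characters
codes <- strtoi(hex_pairs, base = 16L)
decoded <- rawToChar(as.raw(codes))
print(decoded)
```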
Q&A