Rmarkdown

Machine learning

Uploaded by

cristy alejandra medina armijo

Managing and Understanding Data

Write your first and last names

September 10, 2018

Contents
R data structures
    Vectors
Exploring and understanding data
    Exploring the structure of data
    Show some registers
    Exploring numeric variables
    Table with information about mileage and price
    Some descriptive graphics
    Visualizing numeric variables - boxplots
    Measuring spread - quartiles and the five-number summary
    Measuring spread - variance and standard deviation
    Addenda
References
• A first look at a dynamic report, showing some of its features.
• The content consists of different excerpts from a book.
• A CSV file is read.

2018-09-10
By the end of these notes, you will understand:
• The basic R data structures and how to use them to store and extract data
• How to get data into R from a variety of source formats
• Common methods for understanding and visualizing complex data

R data structures
The R data structures used most frequently in machine learning are vectors, factors, lists, arrays, and data
frames.
To find out more about machine learning see (Andrieu et al. 2003; Goldberg and Holland 1988).

Vectors

The fundamental R data structure is the vector, which stores an ordered set of values called elements. A
vector can contain any number of elements. However, all the elements must be of the same type; for instance,
a vector cannot contain both numbers and text.

There are several vector types commonly used in machine learning: integer (numbers without decimals),
numeric (numbers with decimals), character (text data), or logical (TRUE or FALSE values). There are
also two special values: NULL, which is used to indicate the absence of any value, and NA, which indicates a
missing value.
...
...
...
Create vectors of data for three medical patients:
# create vectors of data for three medical patients
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

Access the second element of the body temperature vector:

# access the second element in body temperature vector
temperature[2]

## [1] 98.6
Several items can be accessed at once, for example the items in the range 2 to 3:
## examples of accessing items in vector
# include items in the range 2 to 3
temperature[2:3]

## [1] 98.6 101.4

Exclude item 2 using the minus sign:
# exclude item 2 using the minus sign
temperature[-2]

## [1] 98.1 101.4

Use a logical vector to indicate whether to include each item:
# use a vector to indicate whether to include item
temperature[c(TRUE, TRUE, FALSE)]

## [1] 98.1 98.6
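
The same-type rule and the special values NA and NULL described earlier can be checked directly in the console. The following is a minimal sketch; the names mixed and temps are illustrative and do not come from the original report.

```r
# mixing numbers and text does not fail: R coerces every element to character
mixed <- c(98.1, "John Doe")
class(mixed)

## [1] "character"

# NA marks a missing value and still occupies a position in the vector
temps <- c(98.1, NA, 101.4)
length(temps)

## [1] 3

# most summary functions return NA unless missing values are removed
mean(temps)

## [1] NA

mean(temps, na.rm = TRUE)

## [1] 99.75

# NULL is the absence of a value and occupies no position
length(c(98.1, NULL, 101.4))

## [1] 2
```

Because coercion happens silently, it is worth checking class() after building a vector from mixed sources.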

Exploring and understanding data

After collecting data and loading it into R data structures, the next step in the machine learning process
involves examining the data in detail. It is during this step that you will begin to explore the data’s features
and examples, and realize the peculiarities that make your data unique. The better you understand your data,
the better you will be able to match a machine learning model to your learning problem. The best way to
understand the process of data exploration is by example. In this section, we will explore the [Link]
dataset, which contains actual data about used cars recently advertised for sale on a popular U.S. website.
...
...
...

Since the dataset is stored in CSV form, we can use the read.csv() function to load the data into an R
data frame:
##### Exploring and understanding data --------------------

## data exploration example using used car data

# file1 is assumed to hold the path to the CSV file (it is not defined in the code shown here)
usedcars <- read.csv(file1, stringsAsFactors = FALSE)
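
To see what the stringsAsFactors argument changes, the sketch below writes a tiny CSV file to a temporary location and reads it back both ways. The three data rows are copied from the used car listing shown later in these notes; the names csv_path, cars_chr and cars_fct are illustrative assumptions, not part of the original report.

```r
# write a three-row CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,model,price,mileage,color,transmission",
             "2011,SEL,21992,7413,Yellow,AUTO",
             "2011,SEL,20995,10926,Gray,AUTO",
             "2012,SE,17500,8367,White,AUTO"),
           csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors
cars_chr <- read.csv(csv_path, stringsAsFactors = FALSE)
class(cars_chr$model)

## [1] "character"

# stringsAsFactors = TRUE converts text columns to factors
cars_fct <- read.csv(csv_path, stringsAsFactors = TRUE)
class(cars_fct$model)

## [1] "factor"

levels(cars_fct$model)

## [1] "SE"  "SEL"
```

Reading text as character first and converting selected columns with factor() afterwards, as the graphics code in these notes does for transmission, keeps the choice explicit for each column.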

Exploring the structure of data

One of the first questions to ask in your investigation should be about how data is organized. If you are
fortunate, your source will provide a data dictionary, a document that describes the data’s features. In our
case, the used car data does not come with this documentation, so we’ll need to create our own.
# get structure of used car data
str(usedcars)

## 'data.frame':    150 obs. of  6 variables:
##  $ year        : int  2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
##  $ model       : chr  "SEL" "SEL" "SEL" "SEL" ...
##  $ price       : int  21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
##  $ mileage     : int  7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
##  $ color       : chr  "Yellow" "Gray" "Silver" "Gray" ...
##  $ transmission: chr  "AUTO" "AUTO" "AUTO" "AUTO" ...

Show some registers

# Table of 6 first registers
library(knitr)  # provides kable()
kable(head(usedcars), caption = "6 first registers of data")

Table 1: 6 first registers of data

year  model  price  mileage  color   transmission
----  -----  -----  -------  ------  ------------
2011  SEL    21992     7413  Yellow  AUTO
2011  SEL    20995    10926  Gray    AUTO
2011  SEL    19995     7351  Silver  AUTO
2011  SEL    17809    11613  Gray    AUTO
2012  SE     17500     8367  White   AUTO
2010  SEL    17495    25125  Silver  AUTO

Exploring numeric variables

## Exploring numeric variables -----

# summarize numeric variables

summary(usedcars$year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2000    2008    2009    2009    2010    2012

summary(usedcars[c("price", "mileage")])

## price mileage
## Min. : 3800 Min. : 4867
## 1st Qu.:10995 1st Qu.: 27200
## Median :13592 Median : 36385
## Mean :12962 Mean : 44261
## 3rd Qu.:14904 3rd Qu.: 55125
## Max. :21992 Max. :151479
# calculate the mean income
(36000 + 44000 + 56000) / 3

## [1] 45333.33
mean(c(36000, 44000, 56000))

## [1] 45333.33
# the median income
median(c(36000, 44000, 56000))

## [1] 44000
# the min/max of used car prices
range(usedcars$price)

## [1] 3800 21992

# the difference of the range
diff(range(usedcars$price))

## [1] 18192
# IQR for used car prices
IQR(usedcars$price)

## [1] 3909.5
# use quantile to calculate five-number summary
quantile(usedcars$price)

##      0%     25%     50%     75%    100%
##  3800.0 10995.0 13591.5 14904.5 21992.0
# the 1st and 99th percentiles
quantile(usedcars$price, probs = c(0.01, 0.99))

##       1%      99%
##  5428.69 20505.00
# quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))

##      0%     20%     40%     60%     80%    100%
##  3800.0 10759.4 12993.8 13992.0 14999.0 21992.0

Table with information about mileage and price

mileage <- summary(usedcars$mileage)
price <- summary(usedcars$price)
kable(rbind(mileage, price), caption = "Descriptive statistics: mileage and price")

Table 2: Descriptive statistics: mileage and price

          Min.   1st Qu.   Median      Mean   3rd Qu.    Max.
-------  -----  --------  -------  --------  --------  ------
mileage   4867  27200.25  36385.0  44260.65   55124.5  151479
price     3800  10995.00  13591.5  12961.93   14904.5   21992

Some descriptive graphics

par(mfrow=c(2,2))
hist(usedcars$mileage, xlab="Mileage", main="Histogram of mileage",col="grey85")
hist(usedcars$price, xlab="Price", main="Histogram of price",col="grey85")
usedcars$transmission <- factor(usedcars$transmission)
plot(usedcars$mileage, usedcars$price, pch=16,
col=usedcars$transmission,xlab="Mileage", ylab="Price")
legend("topright", pch=16, c("AUTO","MANUAL"), col=1:2, cex=0.5)

Visualizing numeric variables - boxplots

# boxplot of used car prices and mileage
boxplot(usedcars$price, main = "Boxplot of Used Car Prices", ylab = "Price ($)")
boxplot(usedcars$mileage, main = "Boxplot of Used Car Mileage",
        ylab = "Odometer (mi.)")

# histograms of used car prices and mileage
hist(usedcars$price, main = "Histogram of Used Car Prices",
     xlab = "Price ($)")
hist(usedcars$mileage, main = "Histogram of Used Car Mileage",
     xlab = "Odometer (mi.)")
# variance and standard deviation of the used car data
var(usedcars$price)

## [1] 9749892
sd(usedcars$price)

## [1] 3122.482
var(usedcars$mileage)

## [1] 728033954
sd(usedcars$mileage)

## [1] 26982.1

Figure 1: Descriptive graphics (histogram of mileage, histogram of price, and a scatter plot of price against mileage with points colored by transmission: AUTO or MANUAL)

Figure 2: Boxplot of prices (title: Boxplot of Used Car Prices; y-axis: Price ($))

Figure 3: Boxplot of Mileage (title: Boxplot of Used Car Mileage; y-axis: Odometer (mi.))

Figure 4: Histogram of Used Car Prices (x-axis: Price ($); y-axis: Frequency)

Figure 5: Histogram of mileage (title: Histogram of Used Car Mileage; x-axis: Odometer (mi.); y-axis: Frequency)

Measuring spread - quartiles and the five-number summary

The five-number summary is a set of five statistics that roughly depict the spread of a dataset. All five of
the statistics are included in the output of the summary() function. Written in order, they are:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
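
The relationship between the five-number summary, summary() and the IQR can be illustrated on a small vector. This is a sketch only: it uses the six prices listed in Table 1 rather than the full dataset, relies on the default quantile() method, and the names prices and five are illustrative.

```r
# the six prices listed in Table 1 of these notes
prices <- c(21992, 20995, 19995, 17809, 17500, 17495)

# quantile() with its default probabilities returns min, Q1, median, Q3, max
five <- quantile(prices)
five

# summary() reports the same five values and adds the mean
summary(prices)

# the IQR is the distance between the third and first quartiles
unname(five["75%"] - five["25%"])
IQR(prices)
```

The last two expressions print the same number, which is why the IQR is often read straight off the summary() output.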

Measuring spread - variance and standard deviation

In order to calculate the standard deviation, we must first obtain the variance, which is defined as the
average of the squared differences between each value and the mean value. In mathematical notation, the
variance of a set of n values of x is defined by the following formula. The Greek letter mu (µ) (similar in
appearance to an m) denotes the mean of the values, and the variance itself is denoted by the Greek letter
sigma (σ) squared (similar to a b turned sideways):

$$\mathrm{Var}(X) = \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

The standard deviation is the square root of the variance, and is denoted by sigma as shown in the following
formula:
$$\mathrm{StdDev}(X) = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

Note. For more details on using mathematical expressions in LaTeX (R Markdown), see [Link]
com/learn/Mathematical_expressions.
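
One practical point when comparing the formulas with R output: the formulas divide by n, whereas R's built-in var() and sd() divide by n - 1 (the sample versions). The sketch below computes both on the three incomes used earlier in these notes; the names x, n, mu, pop_var and pop_sd are illustrative.

```r
# incomes reused from the mean and median examples
x <- c(36000, 44000, 56000)
n <- length(x)
mu <- mean(x)

# variance and standard deviation dividing by n, as in the formulas
pop_var <- sum((x - mu)^2) / n
pop_sd <- sqrt(pop_var)

# var() and sd() divide by n - 1, so rescale them before comparing
all.equal(pop_var, var(x) * (n - 1) / n)

## [1] TRUE

all.equal(pop_sd, sd(x) * sqrt((n - 1) / n))

## [1] TRUE
```

With 150 observations, as in the used car data, the two versions of the variance differ only by a factor of 149/150, so the values printed by var() and sd() are close to what the formulas would give.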

Addenda

All of these methods should be used to analyze data and to solve problems such as the ozone layer (López Zavala
and others 1986) or socioeconomic problems such as precarious work (Beck 2000).
The main goal is to accomplish long-term growth, as stated in Doppelhofer, Miller, and others (2004).

References
Andrieu, Christophe, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. 2003. “An Introduction to
MCMC for Machine Learning.” Machine Learning 50 (1-2). Springer: 5–43.
Beck, Ulrich. 2000. Un Nuevo Mundo Feliz: La Precariedad Del Trabajo En La Era de La Globalización.
Doppelhofer, Gernot, Ronald I Miller, and others. 2004. “Determinants of Long-Term Growth: A Bayesian
Averaging of Classical Estimates (BACE) Approach.” The American Economic Review 94 (4). American
Economic Association: 813–35.
Goldberg, David E, and John H Holland. 1988. “Genetic Algorithms and Machine Learning.” Machine
Learning 3 (2). Springer: 95–99.
López Zavala, A, and others. 1986. “Capa de Ozono.” In Congreso Nacional de Ingeniería Sanitaria Y
Ambiental, 5, 304–8. SMISAAC.
