Unit 3

The document discusses reading data into R from various sources, including locally stored files, web URLs, databases, and APIs. It covers using functions like read_csv(), read_tsv(), and read_excel() to import tabular data files in different formats. For databases, it describes connecting to SQLite and retrieving table names. The document also defines tidy data as having variables as columns, observations as rows, and each type of observational unit in its own table. This standard structure aids in data analysis and sharing results.

Uploaded by

liman69609

Reading in data locally and from the web

 Reading data is the gateway for any data analysis.

 Data can be read from a local device or from the web.
 In R, “reading” or “loading” is the process of converting data (stored as plain text, a
database, HTML, etc.) into an object (e.g., a data frame) that R can work with.
 There are many ways to store data as well as many ways to read them.
 Different functions are available in R to import data from various file formats.
 When loading a data set into R, we need to tell R where the file lives. The file could live
on our computer (local) or somewhere on the internet (remote).
 The place where the file lives on the computer is called the “path.”
 There are two kinds of paths: relative paths and absolute paths.
 A relative path gives the location of the file with respect to the current working directory.
 An absolute path gives the location of the file with respect to the root of the computer’s file system.
 As per the figure,
o We are working in a file named worksheet_02.ipynb .
o If we want to read the .csv file named happiness_report.csv into R, we could do this
using either a relative or an absolute path.

Reading happiness_report.csv using a relative path:

happy_data <- read_csv("data/happiness_report.csv")

Reading happiness_report.csv using an absolute path:

happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
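The relationship between the two kinds of path can be sketched as follows. This is a minimal illustration, not part of the original notes: it builds a throwaway folder under tempdir(), fills it with made-up values, and uses base R's read.csv() so that no extra package is needed. The folder and file names mirror the example above; everything else is an assumption.

```r
# Build a throwaway project folder so the example does not touch real files.
project_dir <- file.path(tempdir(), "worksheet_02")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Made-up contents, used only for illustration.
writeLines(c("country,score", "A,7.1", "B,6.4"),
           file.path(project_dir, "data", "happiness_report.csv"))

# Make the project folder the working directory, remembering the old one.
old_wd <- setwd(project_dir)

# A relative path is resolved against the working directory ...
relative_path <- file.path("data", "happiness_report.csv")
# ... and normalizePath() expands it to the equivalent absolute path.
absolute_path <- normalizePath(relative_path)

from_relative <- read.csv(relative_path)
from_absolute <- read.csv(absolute_path)

# Both paths point at the same file, so the data frames match.
same_data <- identical(from_relative, from_absolute)

setwd(old_wd)  # restore the original working directory
```

A relative path keeps working when the whole project folder is moved or shared, whereas an absolute path only works on a machine with the same directory layout.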

 In case of remote files, a Uniform Resource Locator (URL) (web address) indicates the
location of a file/resource.
Reading tabular data from a plain text file into R
 Use read_csv() to read comma-separated values (.csv) files:
data <- read_csv("data/xyz.csv")

Here the data file is named “xyz.csv” and is stored in the “data” folder.

 Use read_tsv() to read tab-separated values (.tsv) files:

data <- read_tsv("data/xyz.tsv")
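As a small sketch of how these functions relate, the example below writes the same made-up table to a temporary .csv and .tsv file and reads each one back. The file contents are hypothetical; read_delim() with an explicit delim argument is shown as the general form of which read_csv() and read_tsv() are special cases.

```r
library(readr)

# Write the same small table in two formats to temporary files
# (hypothetical data, used only for this illustration).
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("name,score", "A,7.1", "B,6.4"), csv_path)
writeLines(c("name\tscore", "A\t7.1", "B\t6.4"), tsv_path)

# read_csv() expects commas, read_tsv() expects tabs.
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_tsv <- read_tsv(tsv_path, show_col_types = FALSE)

# read_delim() takes the delimiter as an argument.
from_delim <- read_delim(tsv_path, delim = "\t", show_col_types = FALSE)
```

All three calls return a data frame with the columns name and score.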

Reading tabular data directly from a URL

 The read_csv(), read_tsv(), and read_delim() functions can also read data directly from
a Uniform Resource Locator (URL) that points to a tabular data file.
url <- "https://xxx.com/data/xyz.csv"
data <- read_csv(url)

Reading tabular data from a Microsoft Excel file

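 The read_excel() function from the readxl package reads Excel files (.xls and .xlsx).
readxl is installed with the tidyverse but has to be loaded separately with library(readxl).
data <- read_excel("data/xyz.xlsx")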
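An Excel workbook can hold several sheets, so it is often useful to list them before reading. The sketch below assumes the readxl package is installed and uses the sample workbook that ships with it (located through readxl_example()), rather than the placeholder file xyz.xlsx from the notes.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl.
xlsx_path <- readxl_example("datasets.xlsx")

# List the names of all worksheets in the workbook.
sheet_names <- excel_sheets(xlsx_path)

# read_excel() reads the first sheet unless told otherwise.
first_sheet <- read_excel(xlsx_path)

# The sheet argument selects another sheet, here by name.
named_sheet <- read_excel(xlsx_path, sheet = sheet_names[2])
```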

Reading data from a database

 A relational database is a common form of data storage for large data sets or for projects
with multiple users.
 There are many relational database management systems, such as SQLite, MySQL,
PostgreSQL, Oracle and many more.
 Reading data from an SQLite database
o An SQLite database is self-contained and is usually stored and accessed locally.
o Data is usually stored in a file with a .db extension.
o To read data into R from a database, we first need to connect to the database.
o The dbConnect() function from the DBI (database interface) package opens the
connection:
conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/xyz.db")
o Relational databases may have many tables. In order to retrieve data from a
database, we need to know the name of the table in which the data is stored.
o We can get the names of all the tables in the database using
the dbListTables() function:
tables <- dbListTables(conn_lang_data)
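The steps above can be sketched end to end as follows. This is a minimal illustration that assumes the DBI and RSQLite packages are installed; it uses an in-memory database and a made-up table named scores instead of the file data/xyz.db, so that it runs without any existing database.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database (an assumption made for
# this sketch; a real project would pass the path to a .db file instead).
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a small hypothetical table so there is something to read.
dbWriteTable(conn, "scores",
             data.frame(name = c("A", "B"), score = c(7.1, 6.4)))

# List the tables, then read one of them into a data frame.
tables <- dbListTables(conn)
scores <- dbReadTable(conn, "scores")

# A SQL query can retrieve only part of a table.
high <- dbGetQuery(conn, "SELECT name FROM scores WHERE score > 7")

# Close the connection when finished.
dbDisconnect(conn)
```

Closing the connection with dbDisconnect() releases the database once the data has been read into R.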

Obtaining data from the web using an API

 Data stored as plain text, such as comma- or tab-separated files, can be read from a
web URL using one of the read_* functions from the tidyverse.
 Many websites now provide an Application Programming Interface (API), which offers a
programmatic way to read a data set.
 This allows the website owner to control who has access to the data, what portion of the
data they have access to, and how much data they can access.
 We can also collect data programmatically in the form of Hypertext Markup Language
(HTML) and Cascading Style Sheets (CSS) code, and process it to extract useful
information (web scraping).
 HTML provides the basic structure of a site and CSS helps style the content.
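Extracting information from HTML with the help of CSS selectors might look like the sketch below. It assumes the rvest package, which the notes do not name, and parses a small made-up HTML fragment offline; for a live page, read_html() with a URL would take the place of minimal_html(). The class names city and title are hypothetical.

```r
library(rvest)

# A tiny made-up page; on a real site this would come from read_html(url).
page <- minimal_html('
  <h1 class="title">City listings</h1>
  <ul>
    <li class="city">Vancouver</li>
    <li class="city">Toronto</li>
  </ul>
')

# CSS selectors pick out elements by tag and class.
cities  <- html_text2(html_elements(page, "li.city"))
heading <- html_text2(html_element(page, "h1.title"))
```

Here cities is a character vector with one entry per matching list item, and heading holds the text of the single h1 element.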
What is Tidy Data?
 In a data science project, tidying the data is a necessary step after importing it in order to
communicate results.

 Tidy datasets provide a standardized way to link the structure of a dataset (its physical
layout) with its semantics (its meaning).
o Structure is the form and shape of the data. In statistics, most datasets are rectangular
data tables (data frames) made up of rows and columns.
o Semantics is the meaning of the dataset. A dataset is a collection of values,
either quantitative or qualitative. These values are organized in two ways:
by variable and by observation.
 Variables: all values that measure the same underlying attribute across units
 Observations: all values measured on the same unit across attributes
o The 3 rules of tidy data help simplify the concept and make it more intuitive.

 Each variable is a column

 Each observation is a row
 Each type of observational unit is a table

Messy Data
 Messy data is any kind of data that does not follow the above framework.
 To narrow it down, Wickham’s paper on tidy data lists five common problems of messy data:
o Column headers are values, not variable names.
o Multiple variables are stored in one column.
o Variables are stored in both rows and columns.
o Multiple types of observational units are stored in the same table.
o A single observational unit is stored in multiple tables.
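The first problem in the list, column headers that are values rather than variable names, can be illustrated with a short sketch. It assumes the tidyr package and a made-up table in which the years 2019 and 2020 appear as column headers; pivot_longer() moves them into a year column so that each variable is a column and each observation is a row.

```r
library(tidyr)

# Messy: the column headers "2019" and "2020" are values of a year variable.
# (Hypothetical numbers, used only for illustration.)
messy <- data.frame(
  country = c("A", "B"),
  `2019`  = c(7.1, 6.4),
  `2020`  = c(7.3, 6.2),
  check.names = FALSE
)

# Tidy: one row per country-year observation, one column per variable.
tidy <- pivot_longer(
  messy,
  cols      = c("2019", "2020"),
  names_to  = "year",
  values_to = "score"
)
```

The tidy table has four rows (two countries times two years) and the columns country, year, and score.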

Why is Tidy Data important?

 If the data set is in a standardized framework, we spend less time on data cleaning and
wrangling and have more time to focus on answering the problem.
 It is good practice to keep the data in a format that makes the analysis reproducible and easy for
others to understand.
 Another, more technical, reason is that tidy data fits naturally with the tools available
in R. Since R works with vectors of values (R functions are vectorized by nature),
we are able to apply those tools directly to the columns of a tidy data set.
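A short base R sketch of the last point: because each column of a tidy data frame is a vector, a single expression operates on every observation at once. The data values and the column name score_pct are made up for illustration.

```r
# Each column of a tidy data frame is a vector (hypothetical values).
scores <- data.frame(
  country = c("A", "B", "C"),
  score   = c(7.1, 6.4, 5.9)
)

# Arithmetic is applied element by element to the whole column, no loop needed.
scores$score_pct <- scores$score / 10 * 100

# Comparisons are vectorized too: one logical value per observation.
above_mean <- scores$score > mean(scores$score)
```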
