Unit 3

The document discusses reading data into R from various sources, including locally stored files, web URLs, databases, and APIs. It covers using functions like read_csv(), read_tsv(), and read_excel() to import tabular data files in different formats. For databases, it describes connecting to SQLite and retrieving table names. The document also defines tidy data as having variables as columns, observations as rows, and each type of observational unit in its own table. This standard structure aids in data analysis and sharing results.

Uploaded by

liman69609

Reading in data locally and from the web

 Reading data is the gateway for any data analysis.


 Data can be read from a local device or from the web.
 In R, “reading” or “loading” is the process of converting data (stored as plain text, a
database, HTML, etc.) into an object (e.g., a data frame).
 There are many ways to store data, and correspondingly many ways to read it.
 Different functions are available in R to import data from various file formats.
 When loading a data set into R, we need to tell R where the file lives. The file could live
on your computer (local) or somewhere on the internet (remote).
 The place where the file lives on your computer is called the “path.”
 There are two kinds of paths: relative paths and absolute paths.
 A relative path is where the file is with respect to the current working directory, that is,
where we currently are on the computer.
 An absolute path is where the file is with respect to the root (base) folder of the
computer’s file system.
 As per the figure,
o We are working in a file named worksheet_02.ipynb.
o If we want to read the .csv file named happiness_report.csv into R, we could do this
using either a relative or an absolute path.

Reading happiness_report.csv using a relative path


happy_data <- read_csv("data/happiness_report.csv")

Reading happiness_report.csv using an absolute path


happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")

 In case of remote files, a Uniform Resource Locator (URL) (web address) indicates the
location of a file/resource.
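
The difference between the two kinds of path can be checked with a small experiment. The sketch below is only an illustration: the folder worksheet_demo and the file happiness_demo.csv are made-up names created in a temporary location, and it assumes the readr package is installed. It reads the same file once through a relative path and once through the absolute path that R resolves it to.

```r
library(readr)

# Hypothetical set-up: a throwaway project folder with a data/ subfolder.
project_dir <- file.path(tempdir(), "worksheet_demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(country = c("A", "B"), score = c(7.1, 6.4)),
  file.path(project_dir, "data", "happiness_demo.csv")
)

# Make the project folder the working directory; setwd() returns the old one.
old_wd <- setwd(project_dir)

relative_path <- "data/happiness_demo.csv"     # resolved against getwd()
absolute_path <- normalizePath(relative_path)  # full path from the file system root

from_relative <- read_csv(relative_path, show_col_types = FALSE)
from_absolute <- read_csv(absolute_path, show_col_types = FALSE)

setwd(old_wd)  # restore the original working directory
```

Both calls return the same data. The relative path only works while the working directory is the project folder, whereas the absolute path works from anywhere on that one computer.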

Reading tabular data from a plain text file into R

 read_csv() reads comma-separated values (.csv) files.
data <- read_csv("data/xyz.csv")

Here the data file is named “xyz.csv” and is stored in the “data” folder.

 read_tsv() reads tab-separated values (.tsv) files.


data <- read_tsv("data/xyz.tsv")
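
To see how the delimiter determines which function to use, here is a minimal sketch that writes two tiny files to temporary locations and reads them back. The file contents and variable names are invented for the illustration, and the readr package is assumed to be installed. read_delim() is the general form; read_csv() and read_tsv() are special cases of it.

```r
library(readr)

# Two hypothetical files holding the same table with different delimiters.
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name,score", "a,1", "b,2"), csv_file)
writeLines(c("name\tscore", "a\t1", "b\t2"), tsv_file)

csv_data <- read_csv(csv_file, show_col_types = FALSE)   # comma-separated
tsv_data <- read_tsv(tsv_file, show_col_types = FALSE)   # tab-separated

# read_delim() takes the delimiter explicitly, so it covers both cases.
delim_data <- read_delim(tsv_file, delim = "\t", show_col_types = FALSE)
```

All three objects are data frames with the columns name and score.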

Reading tabular data directly from a URL


 The read_csv(), read_tsv(), and read_delim() functions can also read tabular data
directly from a Uniform Resource Locator (URL).
url <- "https://xxx.com/data/xyz.csv"
data <- read_csv(url)

Reading tabular data from a Microsoft Excel file


library(readxl)  # read_excel() is provided by the readxl package
data <- read_excel("data/xyz.xlsx")
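
An Excel workbook can hold several sheets, so it is often useful to list them before reading. The following sketch assumes the readxl package is installed and uses datasets.xlsx, an example workbook that ships with readxl, in place of a real project file.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package.
path <- readxl_example("datasets.xlsx")

sheets <- excel_sheets(path)                          # names of all sheets
first_sheet <- read_excel(path)                       # reads the first sheet by default
mtcars_sheet <- read_excel(path, sheet = "mtcars")    # reads a sheet by name
```

The sheet argument also accepts a position, such as sheet = 2.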

Reading data from a database


 A relational database is a common form of data storage for large data sets or for
projects with multiple users.
 There are many relational database management systems, such as SQLite, MySQL,
PostgreSQL, Oracle and many more.
 Reading data from an SQLite database
o An SQLite database is self-contained and is usually stored and accessed locally.
o The data is usually stored in a file with a .db extension.
o To read data into R from a database, we first need to connect to the database.
o The dbConnect() function from the DBI (database interface) package opens the
connection.
library(DBI)
conn <- dbConnect(RSQLite::SQLite(), "data/xyz.db")
o Relational databases may have many tables. In order to retrieve data from a
database, we need to know the name of the table in which the data is stored.
o We can get the names of all the tables in the database using
the dbListTables() function:
tables <- dbListTables(conn)
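
The full round trip (connect, list tables, read a table, disconnect) can be sketched as follows. This is an illustration under stated assumptions: the DBI and RSQLite packages are installed, and the database file, the table name languages, and its contents are all invented here so that the example runs without an existing .db file.

```r
library(DBI)

# Hypothetical database: created in a temporary file for the demonstration.
db_file <- tempfile(fileext = ".db")
conn <- dbConnect(RSQLite::SQLite(), db_file)

# Put one small table into the database so there is something to read.
dbWriteTable(conn, "languages",
             data.frame(language = c("English", "French"),
                        speakers = c(100L, 50L)))

tables <- dbListTables(conn)                  # names of all tables
lang_data <- dbReadTable(conn, "languages")   # whole table as a data frame

# A SQL query retrieves only the rows and columns that are needed.
common <- dbGetQuery(conn,
                     "SELECT language FROM languages WHERE speakers > 60")

dbDisconnect(conn)  # close the connection when finished
```

dbReadTable() loads an entire table, while dbGetQuery() lets the database do the filtering, which matters when the table is too large to load in full.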

Obtaining data from the web using API


 So far, data stored as plain text in a spreadsheet-like format (comma- or tab-separated
files) could be read from a web URL using one of the read_* functions from the tidyverse.
 Many websites now provide an Application Programming Interface (API), which offers a
programmatic way to request a data set.
 An API allows the website owner to control who has access to the data, what portion of
the data they have access to, and how much data they can access.
 We can also collect data programmatically in the form of Hypertext Markup Language
(HTML) and Cascading Style Sheets (CSS) code, as it appears on a web page, and process
it to extract useful information.
 HTML provides the basic structure of a site, and CSS styles the content.
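
As a rough illustration of processing HTML to extract information, the sketch below parses a small hand-written HTML fragment rather than a live website. The notes do not name a package for this step; rvest (together with readr for number parsing) is an assumption here, and the fragment, the class names price and note, and the values are invented. CSS class selectors such as .price pick out the elements of interest.

```r
library(rvest)

# Hypothetical page content; a real page would be fetched with read_html(url).
page <- minimal_html('
  <ul>
    <li class="price">$800</li>
    <li class="price">$950</li>
    <li class="note">pets ok</li>
  </ul>
')

# Select every element with the CSS class "price" and take its text.
price_text <- html_text2(html_elements(page, ".price"))

# Strip the currency symbol and convert the text to numbers.
prices <- readr::parse_number(price_text)
```

The element with class note is ignored because it does not match the selector.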

What is Tidy Data?

 In a data science project, tidying data is a necessary step after importing it, so that the
data can be analysed and the results communicated.

 Tidy datasets provide a standardized way to link the structure of a dataset (its physical
layout) with its semantics (its meaning).
o Structure is the form and shape of the data. In statistics, most datasets are rectangular
data tables (data frames) made up of rows and columns.
o Semantics is the meaning of the dataset. A dataset is a collection of values,
either quantitative or qualitative. These values are organized in 2 ways: by
variable and by observation.
 Variables: all values that measure the same underlying attribute across units
 Observations: all values measured on the same unit across attributes
o The 3 rules of tidy data help simplify the concept and make it more intuitive.

 Each variable is a column
 Each observation is a row
 Each type of observational unit is a table

Messy Data
 Messy data is any kind of data that does not follow the above framework.
 To narrow it down, Hadley Wickham’s paper “Tidy Data” lists 5 common problems of messy data:
o Column headers are values, not variable names.
o Multiple variables are stored in one column.
o Variables are stored in both rows and columns.
o Multiple types of observational units are stored in the same table.
o A single observational unit is stored in multiple tables.
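
Two of the problems listed above can be illustrated with a short sketch. It assumes the tidyr package is installed; the tables, column names, and values are invented for the example. The first table has column headers that are values (years), and the second stores two variables (sex and age group) in one column.

```r
library(tidyr)

# Problem: column headers are values (years), not variable names.
messy_years <- data.frame(
  country = c("A", "B"),
  `2019` = c(7.1, 6.4),
  `2020` = c(7.3, 6.2),
  check.names = FALSE   # keep "2019" and "2020" as column names
)

# Fix: gather the year columns into a "year" variable and a "score" variable.
tidy_years <- pivot_longer(messy_years,
                           cols = c("2019", "2020"),
                           names_to = "year",
                           values_to = "score")

# Problem: multiple variables are stored in one column.
messy_groups <- data.frame(id = 1:2, sex_age = c("m_014", "f_1524"))

# Fix: split the column at the underscore into two variables.
tidy_groups <- separate(messy_groups, sex_age,
                        into = c("sex", "age_group"), sep = "_")
```

After tidying, tidy_years has one row per country and year, and tidy_groups has one column per variable.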

Why is Tidy Data important?


 If the data set follows a standardized framework, we spend less time on data cleaning and
wrangling and more time answering the problem at hand.
 It is good practice to keep data in a format that makes the analysis reproducible and easy for
others to understand.
 A more technical reason is that tidy data is well matched to the tools R provides. R works
with vectors of values (many R functions are vectorized), and in a tidy data set each column
is a vector holding one variable, so tidy data can be passed to those tools naturally.
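
A short sketch of how tidy data fits vectorized tools: because each column is a vector holding one variable, a whole column can be transformed or summarised in a single call. The dplyr package is assumed to be installed, and the table and its values are invented for the illustration.

```r
library(dplyr)

# A small tidy table: one row per observation, one column per variable.
tidy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2019, 2020, 2019, 2020),
  score = c(7.1, 7.3, 6.4, 6.2)
)

# Vectorized arithmetic: the multiplication applies to every row at once.
tidy <- mutate(tidy, score_pct = score * 10)

# Grouped summary: the mean is computed per country over the score column.
by_country <- summarise(group_by(tidy, country), mean_score = mean(score))
```

No explicit loop is needed in either step, which is the practical benefit of keeping each variable in its own column.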
