
Coursera Notes

Uploaded by

christineanne.28

MODULE 1

INTRODUCTION TO DATA ANALYSIS WITH R

Value derived from data depends on
● Accuracy of data
● Accessibility of data when we need it
Data Asset eXchange (DAX)
● Curated free and open datasets under open data licenses
● Provides real-world data
● Vetted data
● Ready to use in the enterprise
● Part of developer.ibm.com
● ibm.biz/data-exchange
WHY DATA ANALYSIS
● Data is everywhere
● Helps us answer questions from data
● Plays an important role in
  ○ Discovering useful information
  ○ Answering questions
  ○ Predicting the future or the unknown
The Problem
● Can you predict the likelihood of a flight delay?
● Data analysis using R libraries for
  ○ Data Cleaning
  ○ Exploratory Analysis
  ○ Model Development
  ○ Model Evaluation

R PACKAGES FOR DATA SCIENCE

● An R package bundles together code, data, documentation, and tests
● Tidyverse Library: collection of essential R packages for data science
Four Steps of Data Analysis
● Data Wrangling and Transformation
  ○ Includes: dplyr and tidyr packages
  ○ Combine different functions using the pipe operator
● Data Import and Management
  ○ Includes: readr package
  ○ Solves the problem of parsing a flat file (.csv file)
● Functional Programming
  ○ Includes: purrr package
  ○ Provides statistics for the dataset (calculating mean value for each column)
● Data Visualization and Exploration
  ○ Includes: ggplot2 package
  ○ Produces charts and visualizations (box plots, density plots, violin plots, tile plots, and time series plots)
Basic Syntax of the dplyr package
● select(): select variables by their names
● filter(): filter observations based on values
● summarize(): compute summary statistics
● arrange(): reorder the rows
● mutate(): create new variables
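The five dplyr verbs listed above can be chained with the pipe operator. The following is a minimal sketch, not the course's own code: the `flights` data frame and its column names (`carrier`, `dep_delay`, `distance`) are hypothetical stand-ins for the airline dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the airline dataset in the notes
flights <- data.frame(
  carrier   = c("AA", "AA", "DL", "DL", "UA"),
  dep_delay = c(5, 30, -2, 12, 45),
  distance  = c(500, 1200, 800, 300, 1500)
)

result <- flights %>%
  select(carrier, dep_delay) %>%            # keep two columns by name
  filter(dep_delay > 0) %>%                 # keep only rows with a positive delay
  mutate(delay_hours = dep_delay / 60) %>%  # add a derived column
  arrange(desc(dep_delay))                  # sort with the longest delay first

# summarize() collapses the data frame to a single row of statistics
avg_delay <- flights %>%
  summarize(mean_delay = mean(dep_delay))
```

Each verb takes a data frame and returns a new one, which is why the steps compose cleanly with `%>%`.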

UNDERSTANDING THE DATA

● Dataset: Airline Performance
  ○ From Data Asset eXchange
  ○ In .csv format
● Variables in the Dataset
  ○ Performing statistical analysis on selected columns from the original dataset
Download and Extract Data
● Each row is one data point (observation)
● Many properties are associated with each point
● .csv Data Format: properties are separated from each other by commas

IMPORTING & EXPORTING DATA IN R

Importing Data
● Process of loading and reading data into R from various resources
● Important Factors:
  ○ Format of the file (.csv, .json, .xlsx, .hdf)
  ○ File path of the dataset (computer/online)
Export to Different Formats in R
Load the Package in R
● Install the package
● Load the tidyverse library
  ○ Automatically loads the readr package
Import .csv Files (readr package)
● Includes the read_csv() function
● Tibble: the tidyverse data frame that read_csv() returns when it reads a .csv file
● Pass the location of the data you want to use to the read_csv() function as a filename
Help Page
● Open it by adding a question mark before the function name
● Documentation: includes arguments for the function and examples of their use
Read the Dataset from a URL
● Define a variable that contains the URL path to the file
● Download the file locally using the download.file() function
  ○ First Argument: URL variable
  ○ Second Argument: local name for downloaded file
● Unzip the content using the untar() function
● Read the data from the local file using the read_csv() function
Print the Data in R
● head() function: shows the first 6 rows of a data frame by default
● tail() function: shows the last 6 rows of a data frame by default
● Export it to a new .csv file (optional)
  ○ Use the write_csv() function

ANALYZING DATA IN R

Basic Insights from the Data
● Understand your data before you begin any analysis
● Check:
  ○ Variable Data Types
  ○ Data Distribution
● Identify potential issues with data
Basic Insights of a Dataset
● Known data types in tidyverse
  ○ Character, date, double, integer, and logical
● glimpse() function: determines types of variables in your dataset
● Shows number of rows and columns in the dataset
● Importance of Checking Data Types
  ○ Potential information and type mismatch
  ○ Compatibility with tidyverse functions
dplyr::summarize(), group_by()
● Return a statistical summary of the data
  ○ Statistical Metrics: reveal mathematical issues (extreme outliers and large deviations)
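The import, inspection, summary, and export functions named in these Module 1 notes can be sketched together as one short workflow. This is an illustration under stated assumptions rather than the course's code: the sample data, the column names `carrier` and `arr_delay`, and the temporary file paths are all hypothetical, and the file is created locally so that no download is needed.

```r
library(readr)
library(dplyr)

# Hypothetical sample data written to a temporary .csv so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    carrier   = c("AA", "AA", "DL", "DL", "UA", "UA", "UA"),
    arr_delay = c(10, 20, 5, 15, 0, 30, 60)
  ),
  csv_path
)

# read_csv() takes the file location and returns a tibble
flights <- read_csv(csv_path, show_col_types = FALSE)

head(flights)     # first 6 rows
tail(flights)     # last 6 rows
glimpse(flights)  # number of rows and columns, plus each column's type

# Statistical summary per carrier
delay_summary <- flights %>%
  group_by(carrier) %>%
  summarize(mean_delay = mean(arr_delay), sd_delay = sd(arr_delay))

# Optional: export the summary to a new .csv file
out_path <- tempfile(fileext = ".csv")
write_csv(delay_summary, out_path)
```

For a remote archive, the download.file() and untar() steps from the notes would come before read_csv(); they are left out here to avoid network access. Typing `?read_csv` in the console opens the help page for the function.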
MODULE 2

DATA WRANGLING

PRE-PROCESSING DATA IN R

Data Pre-Processing
● Converting or mapping data from the initial raw form into another format
● Data cleaning or data wrangling
Simple Data Frame Operations
● Perform data frame operations along columns, with each row of the column representing a sample

DEALING WITH MISSING VALUES IN R

Missing Values
● Missing values occur when no data value is stored for a variable in an observation
● Represented as “?”, “N/A”, 0 or just a blank cell
How to Deal with Missing Data
● Check with the data collection source
● Drop the missing values
  ○ Drop the variable
  ○ Drop the data entry
● Replace the missing values
  ○ Replace it with an average (similar data points)
  ○ Replace it with zero (frequency)
  ○ Replace it based on other functions
● Leave it as missing data
How to Check Missing Values in R
● Use is.na() (for example, inside sum()) to count the number of missing values in columns
Drop Rows
● Hyphen “-” operator: complement of a set in R
How to Drop Missing Values in R
● Use drop_na()
● Specify the column names to check; rows with missing values in those columns are dropped
How to Replace Missing Values in R
● Use replace_na()

DATA FORMATTING IN R

Data Formatting
● Data collected from different places and stored in different formats
● Bringing data into a common standard of expression to make meaningful comparisons
● Coherence: in statistics, an indication of the quality of the information within a single dataset
Reformat an Entire Column
● -
Incorrect Data Types
● -

DATA NORMALIZATION IN R

Data Normalization
● -
Methods of Normalizing Data
● -
Simple Feature Scaling in R
● -
Min-Max in R
● -
Z-score in R
● -

BINNING IN R

Data Normalization
● -
Methods of Normalizing Data
● -
Simple Feature Scaling in R
● -
Min-Max in R
● -
Z-score in R
● -
Simple Feature Scaling in R
● -
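The missing-value functions named in the Module 2 notes, is.na(), drop_na(), and replace_na(), can be sketched as follows. The data frame and its column names are hypothetical, and replacing with the column mean is just one of the replacement strategies the notes list. The notes do not show how the “-” operator is used, so the last step is an assumption: it shows one common use, inside select(), where it keeps the complement of the named columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values (NA) in two columns
flights <- data.frame(
  carrier   = c("AA", "DL", "UA", "AA"),
  arr_delay = c(10, NA, 30, NA),
  distance  = c(500, 800, NA, 1200)
)

# Count missing values per column: is.na() flags them, colSums() adds them up
na_counts <- colSums(is.na(flights))

# Drop rows that have a missing value in the named column only
dropped <- flights %>% drop_na(arr_delay)

# Replace missing delays with the mean of the observed delays
mean_delay <- mean(flights$arr_delay, na.rm = TRUE)
replaced <- flights %>% replace_na(list(arr_delay = mean_delay))

# Assumed usage of "-": in select() it removes the named column
# and keeps all the others (this drops a column, not rows)
without_distance <- flights %>% select(-distance)
```

Calling drop_na() with no column names would instead drop every row that has a missing value in any column.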
