MODULE 1

INTRODUCTION TO DATA ANALYSIS WITH R
Value derived from data depends on
● Accuracy of data
● Accessibility of data when we need it
Data Asset eXchange (DAX)
● Curated free and open datasets under open data licenses
● Provides real-world data
● Vetted data
● Ready to use in the enterprise
● Part of developer.ibm.com
● ibm.biz/data-exchange

WHY DATA ANALYSIS
● Data is everywhere
● Helps us answer questions from data
● Plays an important role in
○ Discovering useful information
○ Answering questions
○ Predicting the future or the unknown
The Problem
● Can you predict the likelihood of a flight delay?
● Data Analysis using R libraries for
○ Data Cleaning
○ Exploratory Analysis
○ Model Development
○ Model Evaluation

R PACKAGES FOR DATA SCIENCE
● A package bundles together code, data, documentation, and tests
● Tidyverse Library: collection of essential R packages for data science
Four Steps of Data Analysis
● Data Wrangling and Transformation
○ Includes: dplyr and tidyr packages
○ Combine different functions using the pipe operator
● Data Import and Management
○ Includes: readr package
○ Solves the problem of parsing a flat file (.csv file)
● Functional Programming
○ Includes: purrr package
○ Provides statistics for the dataset (calculating mean value for each column)
● Data Visualization and Exploration
○ Includes: ggplot2 package
○ Produces charts and visualizations (box plots, density plots, violin plots, tile plots, and time series plots)
Basic Syntax of the dplyr package
● select(): select variables by their names
● filter(): filter observations based on values
● summarize(): compute summary statistics
● arrange(): reorder the rows
● mutate(): create new variables
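The dplyr verbs listed above can be chained with the pipe operator. The following is a minimal sketch, assuming the readr and dplyr packages (both part of the tidyverse) are installed. The `flights` tibble and its columns (`carrier`, `dep_delay`, `distance`) are made-up illustration data, not the course's airline dataset.

```r
library(readr)  # read_csv()
library(dplyr)  # select(), filter(), mutate(), arrange(), summarize(), %>%

# Hypothetical inline CSV text standing in for a real flat file
csv_text <- "carrier,dep_delay,distance
AA,12,500
AA,-3,750
DL,45,300
DL,0,1200
UA,30,900
"

# I() marks the string as literal data rather than a file path (readr >= 2.0)
flights <- read_csv(I(csv_text), show_col_types = FALSE)

delayed <- flights %>%
  select(carrier, dep_delay, distance) %>%                  # keep variables by name
  filter(dep_delay > 0) %>%                                 # keep only delayed flights
  mutate(delay_per_100mi = dep_delay / distance * 100) %>%  # create a new variable
  arrange(desc(dep_delay))                                  # largest delay first

# Collapse all rows into one summary statistic
overall <- flights %>%
  summarize(mean_delay = mean(dep_delay))

print(delayed)
print(overall)
```

Each verb takes a data frame as its first argument and returns a new one, which is what lets the pipe pass the result of one step into the next.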
UNDERSTANDING THE DATA
● Dataset: Airline Performance
○ From Data Asset eXchange
○ In (.csv) format
● Variables in the Dataset
○ Performing statistical analysis on selected columns from the original data set
Download and Extract Data
● Each row is one data point (observation)
● Many properties associated with each point
● .csv Data Format: properties are separated from each other by commas
Load the Package in R
● Install the package
● Load the tidyverse library
○ Automatically loads the readr package
Import .csv Files (readr package)
● Includes the read_csv() function
● Tibble: the data frame type that read_csv() reads a .csv file into
● Pass the location of the data you want to use to the read_csv() function as a filename
Help Page
● Add a question mark before the function name
● Documentation: includes arguments for the function and examples of their use
Read the Dataset from a URL
● Define a variable that contains the URL path to the file
● Download the file locally using the R download.file() function
○ First Argument: URL variable
○ Second Argument: local name for downloaded file
● Unzip the content using the untar() function
● Read the data from the local file using the read_csv() function
Print the Data in R
● head() function: shows the first 6 rows of the data frame
● tail() function: shows the bottom 6 rows of the data frame
● Export it to a new .csv file (optional)
○ Use the write_csv() function

IMPORTING & EXPORTING DATA IN R
Importing Data
● Process of loading and reading data into R from various resources
● Important Factors:
○ Format of the File (.csv, .json, .xlsx, .hdf)
○ File Path of the Dataset (computer/online)
Export to Different Formats in R

ANALYZING DATA IN R
Basic Insights from the Data
● Understand your data before you begin any analysis
● Check:
○ Variable Data Types
○ Data Distribution
● Identify potential issues with data
Basic Insights of a Dataset
● Known data types in tidyverse
○ Character, date, double, integer, and logical
● glimpse() function: determines types of variables in your dataset
● Shows number of rows and columns in the dataset
● Importance of Checking Data Types
○ Potential information and type mismatch
○ Compatibility with tidyverse functions
dplyr::summarize(), group_by()
● Return a statistical summary of the data
○ Statistical Metrics: reveal mathematical issues (extreme outliers and large deviations)

MODULE 2
DATA WRANGLING
How to Replace Missing Values in R
● Use replace_na()
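A minimal sketch of replace_na(), with is.na() and drop_na() alongside for contrast. It assumes the dplyr, tidyr, and tibble packages are installed. The `flights` tibble and the choice of the column mean as the fill value are illustrative assumptions, not the course dataset or a prescribed method.

```r
library(dplyr)   # mutate(), %>%
library(tidyr)   # replace_na(), drop_na()
library(tibble)  # tibble()

# Hypothetical data with missing delays (NA)
flights <- tibble(
  carrier   = c("AA", "DL", "UA", "AA"),
  dep_delay = c(10, NA, 20, NA)
)

# Count missing values per column
missing_per_column <- colSums(is.na(flights))

# Mean of the observed values, ignoring NA
mean_delay <- mean(flights$dep_delay, na.rm = TRUE)

# Option 1: replace each NA in dep_delay with the observed mean
filled <- flights %>%
  mutate(dep_delay = replace_na(dep_delay, mean_delay))

# Option 2: drop the rows where dep_delay is missing
dropped <- flights %>%
  drop_na(dep_delay)

print(missing_per_column)
print(filled)
print(dropped)
```

Replacing keeps every row at the cost of inventing values, while dropping keeps only observed values at the cost of losing rows.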
PRE-PROCESSING DATA IN R
Data Pre-Processing
● Converting or mapping data from the initial raw form into another format
● Also called data cleaning or data wrangling
Simple Data Frame Operations
● Perform data frame operations along columns, with each row of the column representing a sample

DEALING WITH MISSING VALUES IN R
Missing Values
● Missing values occur when no data value is stored for a variable in an observation
● Represented as “?”, “N/A”, 0 or just a blank cell
How to Deal with Missing Data
● Check with the data collection source
● Drop the missing values
○ Drop the variable
○ Drop the data entry
● Replace the missing values
○ Replace it with an average (similar data points)
○ Replace it with zero (frequency)
○ Replace it based on other functions
● Leave it as missing data
How to Check Missing Values in R
● Use is.na() to count the number of missing values in columns
Drop Rows
● Hyphen “-” operator: complement of a set in R
How to Drop Missing Values in R
● Use drop_na()
● Specify column names that contain missing values that you want to drop

DATA FORMATTING IN R
Data Formatting
● Data collected from different places and stored in different formats
● Bringing data into a common standard of expression to make meaningful comparisons
● Coherence: in statistics, an indication of the quality of the information within a single dataset
Reformat an Entire Column
● -
Incorrect Data Types
● -

DATA NORMALIZATION IN R
Data Normalization
● -
Methods of Normalizing Data
● -
Simple Feature Scaling in R
● -
Min-Max in R
● -
Z-score in R
● -

BINNING IN R
Data Normalization
● -
Methods of Normalizing Data
● -
Simple Feature Scaling in R
● -
Min-Max in R
● -
Z-score in R
● -
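The notes leave the three normalization methods blank. The following is a hedged sketch using their standard textbook definitions, which are assumptions here since the notes do not state them: simple feature scaling divides by the maximum, min-max rescales to the range 0 to 1, and z-score subtracts the mean and divides by the standard deviation. The `distance` vector is made-up illustration data.

```r
# Hypothetical numeric column to normalize (all values positive)
distance <- c(300, 500, 750, 900, 1200)

# Simple feature scaling: divide by the maximum, so the largest value becomes 1
simple_scaled <- distance / max(distance)

# Min-max: shift and rescale so the values span 0 to 1
min_max <- (distance - min(distance)) / (max(distance) - min(distance))

# Z-score: center on the mean, scale by the sample standard deviation
z_score <- (distance - mean(distance)) / sd(distance)

print(simple_scaled)
print(min_max)
print(z_score)
```

All three use base R only, so the same expressions can be placed inside a dplyr mutate() call to normalize a data frame column.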