Data Management II

The document discusses various data management techniques in R including sorting data, merging datasets, and using aggregate functions. It shows how to sort a dataset by columns in ascending and descending order, including numeric, character, and factor variables. It also demonstrates different types of joins to merge datasets including inner, outer, left, and right joins. The aggregate function is used to calculate the mean of a variable grouped by a factor variable.

What will we learn

• Sorting of data set

• Merging data sets

• Aggregating data to get group summaries such as the mean


Data Sorting in R

• Sorting data is one of the common activities in preparing data for analysis
• Sorting arranges the rows of a data set in order, either ascending or descending.
• We will be exploring all the ways in which sorting can be done.

#Import basic_salary2 data from day 2 folder
salary<-read.csv(file.choose())

#Attach the data so that columns such as ba can be referenced by name
attach(salary)
Data Sorting in R (Ascending)
• Sort salary by ba in ascending order
• order() sorts in ascending order by default

> ba_sorted<-salary[order(ba), ]
> head(ba_sorted)

   First_Name Last_Name Grade Location  Function    ba    ms
37      Archa  Narvekar   GR2   MUMBAI TECHNICAL 10940 11160
32       Anup      Save   GR2   MUMBAI     SALES 11960  7880
33     Yogesh    Lonkar   GR2   MUMBAI TECHNICAL 12390  6630
38      Shiva    Jathar   GR2   MUMBAI   FINANCE 12860 10940
41      Ketan   Kharkar   GR2   MUMBAI     SALES 13140  9800
34      Sagar    Chavan   GR2   MUMBAI   FINANCE 13390  6700
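
To see what order() returns before it is used as a row index, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration, not part of the salary file. The column is referenced as toy$ba, so attach() is not needed here.

```r
# Toy data standing in for the salary file (values are made up)
toy <- data.frame(
  First_Name = c("Asha", "Ravi", "Meena", "Kiran"),
  ba         = c(15000, 12000, 18000, 13500)
)

# order() returns the row positions that put ba in ascending order
idx <- order(toy$ba)

# Indexing the rows with those positions gives the sorted data frame
toy_sorted <- toy[idx, ]
print(toy_sorted)
```

The original row numbers are kept as row names, which is why the printed output of a sorted data frame shows them out of sequence.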

Data Sorting in R (Descending)
• Sort salary by ba in descending order
> ba_sorted_2<-salary[order(-ba), ]
> head(ba_sorted_2)

   First_Name Last_Name Grade Location  Function    ba    ms
12     Yogita      Raje   GR1    DELHI     SALES 29080  8795
11        Raj    Mohite   GR1    DELHI   FINANCE 26080 16970
10     Hameed     Singh   GR1    DELHI     SALES 23720 15120
4       Priya      Jain   GR1    DELHI     SALES 23280 13490
6      Mahesh      Rane   GR1    DELHI TECHNICAL 23160 14200
9       Nishi  Kulkarni   GR1     <NA>     SALES 22620 16150

• The '-' sign sorts numeric columns in descending order. Alternatively, you can use decreasing=TRUE in order()
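
The two ways of sorting in descending order can be compared side by side. This is a sketch on made-up data (toy and its values are assumptions). With no tied values in ba, both calls should give the same result.

```r
# Toy data (values are made up)
toy <- data.frame(
  First_Name = c("Asha", "Ravi", "Meena", "Kiran"),
  ba         = c(15000, 12000, 18000, 13500)
)

# Option 1: negate the numeric column
by_minus <- toy[order(-toy$ba), ]

# Option 2: ask order() for decreasing order
by_decreasing <- toy[order(toy$ba, decreasing = TRUE), ]

# Both approaches give the same rows in the same order here
same_result <- identical(by_minus, by_decreasing)
print(by_minus)
print(same_result)
```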
Data Sorting in R (Using Factor Variable)
• Sort data by column with characters / factors
#Sort salary by Grade

> gr_sorted<-salary[order(Grade), ]
> head(gr_sorted)

  First_Name Last_Name Grade Location  Function    ba    ms
1     Mahesh     Joshi   GR1    DELHI     SALES 17990 16070
2     Rajesh     Kolte   GR1    DELHI   FINANCE 19250 14960
3       Neha       Rao   GR1    DELHI   FINANCE 19235 15200
4      Priya      Jain   GR1    DELHI     SALES 23280 13490
5      Sneha     Joshi   GR1    DELHI   FINANCE 20660 15660
6     Mahesh      Rane   GR1    DELHI TECHNICAL 23160 14200

• Note that by default order() sorts in ascending order


Data Sorting in R (Using Factor Variable)
• Sort data by column with characters / factors
#Sort salary by Grade in descending order

> gr_sorted_2<-salary[order(Grade, decreasing=TRUE), ]
> head(gr_sorted_2)

   First_Name Last_Name Grade Location  Function    ba    ms
25      Priya    Mittal   GR2    DELHI TECHNICAL 15000 10680
26     Naresh     Sinha   GR2    DELHI TECHNICAL 13810 11540
27     Jivesh      Shah   GR2     <NA>   FINANCE 16000 13730
28      Jigar      Shah   GR2    DELHI   FINANCE 16230    NA
29     Gaurav     Singh   GR2    DELHI     SALES 13760 13220
30       Amit     Mehta   GR2    DELHI TECHNICAL 13660  6840

• To reverse the sorting order for factor variables, include the logical argument decreasing=TRUE (the '-' sign does not work on factors)
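
A short sketch of descending sorts on a factor, using made-up data (toy is an assumption). Besides decreasing=TRUE, it shows one possible alternative: xtfrm() converts the factor to its integer codes, so the '-' sign can then be applied.

```r
# Toy data with a factor column (values are made up)
toy <- data.frame(
  Grade = factor(c("GR1", "GR2", "GR1", "GR2")),
  ba    = c(15000, 12000, 18000, 13500)
)

# Descending order of a factor with decreasing = TRUE
gr_desc <- toy[order(toy$Grade, decreasing = TRUE), ]

# Alternative: xtfrm() gives the integer codes of the factor,
# which can be negated like any numeric column
gr_desc_alt <- toy[order(-xtfrm(toy$Grade)), ]

print(gr_desc)
print(gr_desc_alt)
```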

Sorting Data by Multiple Variables
• Sort data by giving multiple columns: one column with characters / factors and one with numerals
#Sort salary by Grade and ba

> grba_sorted<-salary[order(Grade, ba), ]
> head(grba_sorted)

   First_Name Last_Name Grade Location  Function    ba    ms
13     Anjali     Sonar   GR1   MUMBAI      <NA> 14410 10450
15      Rahul    Potdar   GR1   MUMBAI     SALES 15125    NA
14      Bipin     Bhide   GR1   MUMBAI   FINANCE 15230 11010
17    Mangesh       Oak   GR1   MUMBAI     SALES 15800 12420
18      Anand     Soman   GR1     <NA>   FINANCE 16540 12780
19     Malhar    Jadhav   GR1   MUMBAI TECHNICAL 17240 13220

• Here, data is first sorted in increasing order of Grade then by increasing order of ba within Grade
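
The sort directions can also be mixed across columns. The sketch below uses made-up data (toy is an assumption) and sorts by Grade ascending, first with ba ascending and then with ba descending within each Grade.

```r
# Toy data (values are made up)
toy <- data.frame(
  Grade = c("GR2", "GR1", "GR2", "GR1"),
  ba    = c(13000, 17000, 12000, 15000)
)

# Grade ascending, then ba ascending within each Grade
both_asc <- toy[order(toy$Grade, toy$ba), ]

# Grade ascending, then ba descending within each Grade
mixed <- toy[order(toy$Grade, -toy$ba), ]

print(both_asc)
print(mixed)
```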
Merging by Variables
#Import the following 2 data sets

sal_data

  Employee_ID First_Name Last_Name Basic_Salary
1      E-1001     Mahesh     Joshi        16070
2      E-1002     Rajesh     Kolte        14960
3      E-1004      Priya      Jain        13490
4      E-1005      Sneha     Joshi        15660
5      E-1007        Ram    Kanade        15850
6      E-1008      Nishi    Honrao        15880
7      E-1009     Hameed     Singh        15120

bonus_data

  Employee_ID Bonus
1      E-1001 12050
2      E-1003 11400
3      E-1004 10110
4      E-1006 10650
5      E-1008 11910
6      E-1010 11340

Types of Joins

• Left Join
• Right Join
• Inner Join
• Outer Join
Outer Joins
• Outer Join includes all employee IDs from both data sets

> outerjoin<-merge(sal_data,bonus_data, by=c("Employee_ID"), all=TRUE)
> outerjoin

   Employee_ID First_Name Last_Name Basic_Salary Bonus
1       E-1001     Mahesh     Joshi        16070 12050
2       E-1002     Rajesh     Kolte        14960    NA
3       E-1004      Priya      Jain        13490 10110
4       E-1005      Sneha     Joshi        15660    NA
5       E-1007        Ram    Kanade        15850    NA
6       E-1008      Nishi    Honrao        15880 11910
7       E-1009     Hameed     Singh        15120    NA
8       E-1003       <NA>      <NA>           NA 11400
9       E-1006       <NA>      <NA>           NA 10650
10      E-1010       <NA>      <NA>           NA 11340
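
The outer join can be reproduced without importing any files. The sketch below types the two data sets in directly, with values copied from the join outputs shown in this section. The row order of the result may differ from the printed output, for example depending on whether Employee_ID is stored as character or as factor, so the checks rely only on counts.

```r
# Salary data, typed in from the values shown in this section
sal_data <- data.frame(
  Employee_ID  = c("E-1001", "E-1002", "E-1004", "E-1005",
                   "E-1007", "E-1008", "E-1009"),
  First_Name   = c("Mahesh", "Rajesh", "Priya", "Sneha",
                   "Ram", "Nishi", "Hameed"),
  Last_Name    = c("Joshi", "Kolte", "Jain", "Joshi",
                   "Kanade", "Honrao", "Singh"),
  Basic_Salary = c(16070, 14960, 13490, 15660, 15850, 15880, 15120)
)

# Bonus data, typed in from the values shown in this section
bonus_data <- data.frame(
  Employee_ID = c("E-1001", "E-1003", "E-1004", "E-1006", "E-1008", "E-1010"),
  Bonus       = c(12050, 11400, 10110, 10650, 11910, 11340)
)

# all = TRUE keeps every Employee_ID found in either data set
outerjoin <- merge(sal_data, bonus_data, by = "Employee_ID", all = TRUE)

n_rows         <- nrow(outerjoin)                    # IDs in either data set
missing_bonus  <- sum(is.na(outerjoin$Bonus))        # IDs only in sal_data
missing_salary <- sum(is.na(outerjoin$Basic_Salary)) # IDs only in bonus_data
print(outerjoin)
```

Counting the NA values per column is a quick way to see how many rows came from only one of the two data sets.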
Inner Join

• Inner Join includes an employee ID only if it is present in both data sets

> innerjoin<-merge(sal_data,bonus_data, by=c("Employee_ID"))
> innerjoin

  Employee_ID First_Name Last_Name Basic_Salary Bonus
1      E-1001     Mahesh     Joshi        16070 12050
2      E-1004      Priya      Jain        13490 10110
3      E-1008      Nishi    Honrao        15880 11910

Left Join

• Left Join includes all employee IDs from the first data set

> leftjoin<-merge(sal_data,bonus_data, by=c("Employee_ID"), all.x=TRUE)
> leftjoin

  Employee_ID First_Name Last_Name Basic_Salary Bonus
1      E-1001     Mahesh     Joshi        16070 12050
2      E-1002     Rajesh     Kolte        14960    NA
3      E-1004      Priya      Jain        13490 10110
4      E-1005      Sneha     Joshi        15660    NA
5      E-1007        Ram    Kanade        15850    NA
6      E-1008      Nishi    Honrao        15880 11910
7      E-1009     Hameed     Singh        15120    NA
Right Join

• Right Join includes all employee IDs from the second data set

> rightjoin<-merge(sal_data,bonus_data, by=c("Employee_ID"), all.y=TRUE)
> rightjoin

  Employee_ID First_Name Last_Name Basic_Salary Bonus
1      E-1001     Mahesh     Joshi        16070 12050
2      E-1004      Priya      Jain        13490 10110
3      E-1008      Nishi    Honrao        15880 11910
4      E-1003       <NA>      <NA>           NA 11400
5      E-1006       <NA>      <NA>           NA 10650
6      E-1010       <NA>      <NA>           NA 11340
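
The four join types differ only in the all, all.x and all.y arguments of merge(). As a compact comparison, the sketch below runs all four on two tiny made-up tables (left_tbl, right_tbl and their columns are assumptions) and counts the rows each one returns.

```r
# Two tiny tables sharing the key column id (values are made up)
left_tbl  <- data.frame(id = c("a", "b", "c"), x = 1:3)
right_tbl <- data.frame(id = c("b", "c", "d"), y = c(20, 30, 40))

inner <- merge(left_tbl, right_tbl, by = "id")               # ids in both
left  <- merge(left_tbl, right_tbl, by = "id", all.x = TRUE) # all ids of left_tbl
right <- merge(left_tbl, right_tbl, by = "id", all.y = TRUE) # all ids of right_tbl
outer <- merge(left_tbl, right_tbl, by = "id", all = TRUE)   # ids in either

# Number of rows returned by each join type
counts <- sapply(list(inner = inner, left = left, right = right, outer = outer), nrow)
print(counts)
```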
Aggregate Function

#To calculate mean for variable ‘ba’ by Location variable

A<-aggregate(ba ~ Location, data = salary, FUN = mean)
A

  Location       ba
1    DELHI 19430.29
2   MUMBAI 15037.11

#The formula interface of aggregate() drops rows with missing values by default (na.action = na.omit).
#Therefore, na.rm=TRUE is not required in the mean function.
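
The handling of missing values can be checked on a small made-up data frame (toy and its values are assumptions). The sketch contrasts aggregate() with a formula against a plain call to mean(), which returns NA unless na.rm=TRUE is given.

```r
# Toy data with a missing ba value and a missing Location (values are made up)
toy <- data.frame(
  Location = c("DELHI", "DELHI", "MUMBAI", "MUMBAI", NA),
  ba       = c(20000, NA, 15000, 17000, 30000)
)

# Formula interface: rows with NA in ba or Location are dropped
# before the mean is computed for each Location
by_loc <- aggregate(ba ~ Location, data = toy, FUN = mean)

# mean() on its own does not drop missing values unless asked to
plain_mean <- mean(toy$ba)
clean_mean <- mean(toy$ba, na.rm = TRUE)

print(by_loc)
print(plain_mean)
print(clean_mean)
```

Note that the row with a missing Location is left out of the grouped result entirely, so its ba value does not contribute to any group mean.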
