r lang-Unit-02

The document provides an overview of reading and writing Excel and CSV files using R, including the installation and usage of the 'xlsx' package. It covers how to read data into data frames, perform data analysis, and write filtered data back to CSV files. Additionally, it introduces conditional statements and loops in R programming, explaining the syntax and usage of 'if' statements and 'for' loops.

Uploaded by

km587522
Copyright
© © All Rights Reserved
Nrupathunga University

Department of Computer Science


V Sem BCA (NEP)
Statistical Computing and R Programming Language
Unit -02
Unit 2: Reading and writing Excel, CSV files, Conditions and Loops, Functions - Calling Functions, Writing Functions, Statistical
functions: mean, median, mode.

1. Reading and Writing Excel Files


Microsoft Excel is the most widely used spreadsheet program; it stores data in the .xls or .xlsx format.
R can read directly from these files using Excel-specific packages such as XLConnect, xlsx, and gdata.
We will use the xlsx package, which can also write to Excel files.

1.1 Install xlsx Package

You can use the following command in the R console to install the "xlsx" package. R may ask to install
additional packages on which this package depends. If so, run the same command with the name of each
required package to install it.

install.packages("xlsx")

1.2 Verify and Load the "xlsx" Package

Use the following command to verify and load the "xlsx" package.

# Verify the package is installed.


any(grepl("xlsx",installed.packages()))

# Load the library into R workspace.


library("xlsx")

When the script is run we get the following output.

[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
1.3 Input as xlsx File

Open Microsoft Excel. Copy and paste the following data into the worksheet named sheet1.

id   name       salary   start_date   dept
1    Rick       623.3    1/1/2012     IT
2    Dan        515.2    9/23/2013    Operations
3    Michelle   611      11/15/2014   IT
4    Ryan       729      5/11/2014    HR
5    Gary       843.25   3/27/2015    Finance
6    Nina       578      5/21/2013    IT
7    Simon      632.8    7/30/2013    Operations
8    Guru       722.5    6/17/2014    Finance

Also copy and paste the following data to another worksheet and rename this worksheet to "city".

name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas

Save the Excel file as "input.xlsx". You should save it in the current working directory of the R workspace.

1.4 Reading the Excel File

The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a data frame
in the R environment.

# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)

When we execute the above code, it produces the following result −

  id     name salary start_date       dept
1  1     Rick 623.30 2012-01-01         IT
2  2      Dan 515.20 2013-09-23 Operations
3  3 Michelle 611.00 2014-11-15         IT
4  4     Ryan 729.00 2014-05-11         HR
5  5     Gary 843.25 2015-03-27    Finance
6  6     Nina 578.00 2013-05-21         IT
7  7    Simon 632.80 2013-07-30 Operations
8  8     Guru 722.50 2014-06-17    Finance

2. Reading and Writing CSV Files

In R, we can read data from files stored outside the R environment. We can also write data into files which
will be stored and accessed by the operating system. R can read and write into various file formats like csv,
excel, xml etc.

In this chapter we will learn to read data from a csv file and then write data into a csv file. The file should
be present in current working directory so that R can read it. Of course we can also set our own directory and
read files from there.
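
A minimal sketch of inspecting and changing the working directory, using only base R functions (getwd(), setwd(), file.exists(), file.path()). The temporary directory is used here purely for illustration; any directory you have access to would do.

```r
# Remember the directory R currently reads from and writes to.
original_dir <- getwd()
print(original_dir)

# Switch to another directory (assumption: the session's temporary directory).
setwd(tempdir())

# Check whether a file is present before trying to read it.
print(file.exists("input.csv"))

# A file can also be addressed by its full path,
# without changing the working directory at all.
csv_path <- file.path(tempdir(), "input.csv")
print(csv_path)

# Restore the original working directory.
setwd(original_dir)
```

Building the path with file.path() avoids hard-coding the path separator, which differs between operating systems.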

2.1 Input a CSV File

The csv file is a text file in which the values in the columns are separated by a comma. Let's consider
the following data present in the file named input.csv.

You can create this file using Windows Notepad by copying and pasting this data. Save the file
as input.csv using the "Save as type: All Files (*.*)" option in Notepad.

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
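
As an alternative to Notepad, the same file can be created from R itself. This is a sketch under one assumption: the file is written to the temporary directory (the csv_file path is hypothetical); use "input.csv" alone to place it in the working directory instead.

```r
# The same rows as listed above, one string per line of the file.
csv_lines <- c(
  "id,name,salary,start_date,dept",
  "1,Rick,623.3,2012-01-01,IT",
  "2,Dan,515.2,2013-09-23,Operations",
  "3,Michelle,611,2014-11-15,IT",
  "4,Ryan,729,2014-05-11,HR",
  "5,Gary,843.25,2015-03-27,Finance",
  "6,Nina,578,2013-05-21,IT",
  "7,Simon,632.8,2013-07-30,Operations",
  "8,Guru,722.5,2014-06-17,Finance"
)

# Write the lines to a CSV file (assumed location: the temporary directory).
csv_file <- file.path(tempdir(), "input.csv")
writeLines(csv_lines, csv_file)

# Read it back to confirm that the file is well formed.
data <- read.csv(csv_file)
print(dim(data))
```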

2.2 Reading a CSV File

Following is a simple example of the read.csv() function reading a CSV file available in your current working
directory –

data <- read.csv("input.csv")
print(data)

When we execute the above code, it produces the following result −

  id     name salary start_date       dept
1  1     Rick 623.30 2012-01-01         IT
2  2      Dan 515.20 2013-09-23 Operations
3  3 Michelle 611.00 2014-11-15         IT
4  4     Ryan 729.00 2014-05-11         HR
5  5     Gary 843.25 2015-03-27    Finance
6  6     Nina 578.00 2013-05-21         IT
7  7    Simon 632.80 2013-07-30 Operations
8  8     Guru 722.50 2014-06-17    Finance

2.3 Analyzing the CSV File

By default the read.csv() function gives the output as a data frame. This can be easily checked as
follows. Also we can check the number of columns and rows.

data <- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))

When we execute the above code, it produces the following result −

[1] TRUE
[1] 5
[1] 8

Once we read data in a data frame, we can apply all the functions applicable to data frames as explained in a
subsequent section.

Get the maximum salary


# Create a data frame.
data <- read.csv("input.csv")

# Get the max salary from data frame.


sal <- max(data$salary)
print(sal)
When we execute the above code, it produces the following result –

[1] 843.25

Get the details of the person with max salary


We can fetch rows meeting specific filter criteria similar to a SQL where clause.

# Create a data frame.


data <- read.csv("input.csv")

# Get the max salary from data frame.


sal <- max(data$salary)

# Get the person detail having max salary.


retval <- subset(data, salary == max(salary))
print(retval)

When we execute the above code, it produces the following result –

  id name salary start_date    dept
5  5 Gary 843.25 2015-03-27 Finance

Get all the people working in the IT department

# Create a data frame.


data <- read.csv("input.csv")

retval <- subset( data, dept == "IT")


print(retval)

When we execute the above code, it produces the following result –

  id     name salary start_date dept
1  1     Rick  623.3 2012-01-01   IT
3  3 Michelle  611.0 2014-11-15   IT
6  6     Nina  578.0 2013-05-21   IT

Get the persons in IT department whose salary is greater than 600

# Create a data frame.


data <- read.csv("input.csv")

info <- subset(data, salary > 600 & dept == "IT")


print(info)

When we execute the above code, it produces the following result –

  id     name salary start_date dept
1  1     Rick  623.3 2012-01-01   IT
3  3 Michelle  611.0 2014-11-15   IT
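
The same filter can also be written with logical indexing instead of subset(). The sketch below rebuilds the sample data in memory so that it runs on its own; the variable names info_subset and info_index are hypothetical.

```r
# Rebuild the sample data so the example is self-contained.
data <- data.frame(
  id = 1:8,
  name = c("Rick", "Dan", "Michelle", "Ryan", "Gary", "Nina", "Simon", "Guru"),
  salary = c(623.3, 515.2, 611, 729, 843.25, 578, 632.8, 722.5),
  dept = c("IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance"),
  stringsAsFactors = FALSE
)

# Filter with subset(): column names can be used directly.
info_subset <- subset(data, salary > 600 & dept == "IT")

# Filter with logical indexing: columns are referenced through data$...
# The empty slot after the comma keeps all columns.
info_index <- data[data$salary > 600 & data$dept == "IT", ]

print(info_index)
```

Both forms select the same rows. subset() is convenient for interactive work, while explicit indexing makes it clear where each column comes from.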

Get the people who joined in or after 2014

# Create a data frame.


data <- read.csv("input.csv")

# Use >= so that a start date of exactly 2014-01-01 is included.
retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))


print(retval)

When we execute the above code, it produces the following result −

  id     name salary start_date    dept
3  3 Michelle 611.00 2014-11-15      IT
4  4     Ryan 729.00 2014-05-11      HR
5  5     Gary 843.25 2015-03-27 Finance
8  8     Guru 722.50 2014-06-17 Finance

2.4 Writing into a CSV File

R can create a csv file from an existing data frame. The write.csv() function is used to create the csv file. This file is
created in the working directory.

# Create a data frame.


data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))

# Write filtered data into a new file.


write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result –

  X id     name salary start_date    dept
1 3  3 Michelle 611.00 2014-11-15      IT
2 4  4     Ryan 729.00 2014-05-11      HR
3 5  5     Gary 843.25 2015-03-27 Finance
4 8  8     Guru 722.50 2014-06-17 Finance

Here the column X holds the row names of retval, which write.csv() writes to the file by default. It can be
dropped by passing an additional parameter when writing the file.

# Create a data frame.


data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))

# Write filtered data into a new file.


write.csv(retval,"output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result –

  id     name salary start_date    dept
1  3 Michelle 611.00 2014-11-15      IT
2  4     Ryan 729.00 2014-05-11      HR
3  5     Gary 843.25 2015-03-27 Finance
4  8     Guru 722.50 2014-06-17 Finance
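
The effect of row.names can be checked directly by comparing the column names after each write. This is a small sketch using a two-row data frame and a temporary file (both chosen only for illustration).

```r
# A small data frame standing in for the filtered result.
retval <- data.frame(
  id = c(3L, 4L),
  name = c("Michelle", "Ryan"),
  salary = c(611, 729),
  stringsAsFactors = FALSE
)

# Assumption: write to a temporary file rather than the working directory.
out_file <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column,
# which read.csv() then names "X".
write.csv(retval, out_file)
with_rownames <- read.csv(out_file)
print(names(with_rownames))

# With row.names = FALSE the extra column is not written.
write.csv(retval, out_file, row.names = FALSE)
without_rownames <- read.csv(out_file)
print(names(without_rownames))
```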

3. Conditionals and Loops

"If" Statements

if syntax

if statements let you execute statements conditionally: that is, which statements execute depends on
whether some condition is true or false. For example (the listings in this section are written in Go):

func max(a int, b int) int {
    var m int
    if a > b {
        m = a
    } else {
        m = b
    }
    return m
}

The syntax of an if statement is:

if CONDITION {
    THEN PART
} else {
    ELSE PART
}

If CONDITION is true, then the statements in the THEN PART will be executed.

If CONDITION is false, then the statements in the ELSE PART will be executed.

The else clause is optional:


if a > b {
    fmt.Println("a is bigger")
}
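
Since this unit is about R while the listings above are in Go, here is a rough R counterpart of the max example. The name max_of_two is hypothetical (base R already provides max()); in R the condition goes inside parentheses, and else must follow the closing brace on the same line.

```r
# Return the larger of two numbers using if / else.
max_of_two <- function(a, b) {
  if (a > b) {
    m <- a
  } else {
    m <- b
  }
  m  # the last evaluated expression is the return value
}

print(max_of_two(3, 7))

# The else clause is optional in R as well.
a <- 9
b <- 2
if (a > b) {
  print("a is bigger")
}
```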


Test yourself! What kind of computer instruction do you think plays an important role in how if
statements are executed?

Conditions

The CONDITION part of an if statement can be any expression that evaluates to a Boolean value. For
example, the following comparison operators return Boolean values.

Boolean Operator    Meaning
e1 > e2             e1 is greater than e2
e1 < e2             e1 is less than e2
e1 >= e2            e1 is greater than or equal to e2
e1 <= e2            e1 is less than or equal to e2
e1 == e2            e1 is equal to e2
e1 != e2            e1 is not equal to e2
!e1                 true if and only if e1 is false


Examples:

a > 10 * b + c
10 == 10

square(10) < 101 1 + 2

The Boolean operators && and || (and and or) are particularly useful in if conditions.
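
R uses the same comparison operators and also provides && and || for single conditions. In addition, R has the element-wise forms & and |, which operate on whole vectors. A short sketch with illustrative values:

```r
a <- 5
b <- 12

# && and || combine single TRUE/FALSE values, as used in if conditions.
in_range <- a > 0 && a < 10
either_big <- a > 10 || b > 10
neither_big <- !(a > 10 || b > 20)

print(in_range)
print(either_big)
print(neither_big)

# & and | work element by element on vectors.
x <- c(1, 8, 15)
elementwise <- x > 5 & x < 12
print(elementwise)
```

Inside an if condition, && and || are the usual choice because the condition must be a single logical value.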

Additional if examples

A max function:

// max() returns the larger of 2 ints
func max(a, b int) int {
    if a > b {
        return a
    }
    return b
}

The same function can be written with an else clause too:

// max() returns the larger of 2 ints; equivalent to above
func max(a, b int) int {
    if a > b {
        return a
    } else {
        return b
    }
}

The else clause is optional, and you can have as many statements as you want inside the THEN PART and
ELSE PART:

if temperature > 100 {
    fmt.Println("Warning: too hot!")
    fmt.Println("Run away!")
}

Go requires that:

• the { must be on the same line as the if
• the } and { must be on the same line as the else

Another example:

// AbsInt() computes the absolute value of an integer.
func AbsInt(x int) int {
    if x < 0 {
        return -x
    }
    return x
}

Test yourself! What does the following print?

var a, b int = 3, 3
if a < 10 {
    a = a*a
}
if a * a > 3*b {
    var t int = a
    a = b
    b = t
}
if a < b {
    ...
} else {
    ...
}

if statements can be nested: an if statement can appear inside the THEN PART or ELSE PART of
another if statement. This is common, and lets you make complex decisions.

// returns the smallest even number among 2 ints; returns 0 if both are odd
func smallestEven(a, b int) int {
    if a % 2 == 0 {
        if b % 2 == 0 {
            // both a and b are even, so return smaller one
            if a < b {
                return a
            } else {
                return b
            }
        } else {
            // only a is even
            return a
        }
    } else if b % 2 == 0 { // ***
        // only b is even
        return b
    } else {
        // both a and b are odd
        return 0
    }
}

Reminder: % is the "mod" operator: x % y is the remainder when x is divided by y.

Notice that you can put an if directly following an else: see the line marked with *** above. This is the
same as:

if a % 2 == 0 {
    ...
} else {
    if b % 2 == 0 {
        ...
    } else {
        ...
    }
}

but uses one fewer set of { }.
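
For comparison, nested if statements and else if chains look much the same in R. The sketch below is an R counterpart of smallestEven; the name smallest_even is hypothetical, and note that the mod operator is written %% in R.

```r
# Return the smallest even number among two integers; 0 if both are odd.
smallest_even <- function(a, b) {
  if (a %% 2 == 0) {
    if (b %% 2 == 0) {
      # both a and b are even, so return the smaller one
      min(a, b)
    } else {
      # only a is even
      a
    }
  } else if (b %% 2 == 0) {
    # only b is even
    b
  } else {
    # both a and b are odd
    0
  }
}

print(smallest_even(4, 6))
print(smallest_even(3, 8))
print(smallest_even(3, 5))
```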

"For" Loops

for syntax

Sometimes we want to execute a sequence of instructions many times. For this, a loop is what we need.

Go has only 1 kind of loop (with 2 variants): the for loop.

The statements in the body of the loop will be executed until the loop condition is false. Each time through the
loop is called an iteration.

An example:

func factorial(n int) int {
    var f int = 1
    var i int
    for i = 1; i <= n; i=i+1 {
        f = f * i
    }
    return f
}

The syntax for a for loop is:

for INITIALIZATION_STATEMENT; CONDITION; POST-ITERATION_STATEMENT {
    FOR-BODY
}

There are 3 parts following the for and before the {, and these parts are separated by semicolons.
These parts work as follows:

INITIALIZATION_STATEMENT: a single statement that is executed one time before the loop starts.
CONDITION: the loop will repeatedly execute until this condition is false.
POST-ITERATION_STATEMENT: this is run after each time the FOR-BODY is executed.
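
R's for loop has a different shape: instead of an initialization, a condition, and a post-iteration statement, it iterates over the elements of a vector. A sketch of the factorial example in R (the name factorial_loop is hypothetical; base R already has factorial()):

```r
# Compute n! with a for loop over the sequence 1, 2, ..., n.
factorial_loop <- function(n) {
  f <- 1
  # seq_len(n) is 1:n, but is empty when n is 0, so the loop body is skipped.
  for (i in seq_len(n)) {
    f <- f * i
  }
  f
}

print(factorial_loop(5))
print(factorial_loop(0))
```

Using seq_len(n) rather than 1:n avoids the surprise that 1:0 counts down (1, 0) instead of being empty.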

Test yourself! How many times is the word "Hi" printed by this loop:

var i int
for i = 10; i < 20; i = i+1 {
    fmt.Println("Hi")
}



Note: any of the three parts of the for statement can be omitted. If both the INITIALIZATION_STATEMENT
and the POST-ITERATION_STATEMENT are omitted, you can omit the semicolons.

"While" loops

If we only include the CONDITION in a for loop, we will execute the FOR-BODY "while" CONDITION is true.
You could re-write factorial as:

var f int = 1
var i int = 1
for i <= n { // only condition in for
    f = f * i
    i = i + 1
}
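
R has a dedicated while keyword for this pattern. A sketch of the same factorial computation in R follows; factorial_while is a hypothetical name. R has no ++ operator, so the counter is updated with an ordinary assignment.

```r
# Compute n! with a while loop: repeat while the condition holds.
factorial_while <- function(n) {
  f <- 1
  i <- 1
  while (i <= n) {
    f <- f * i
    i <- i + 1  # the counter must be advanced by hand
  }
  f
}

print(factorial_while(5))
```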

Increment and Decrement operators

The ++ and -- operators add 1 or subtract 1 from a variable, as in i++ and i--. These are particularly useful in the
POST-ITERATION_STATEMENT part of a for loop, where you can write i++ instead of the longer (but
exactly equivalent) i = i + 1.

:= variable declarations

Notice in the factorial example that we had to create a variable that just served to count how many
times we had executed the loop. This is quite common. Go provides a shorthand for this so that you can
declare a variable inside of the INITIALIZATION_STATEMENT:

v := 1

The := operator both declares and initializes a variable. The above is equivalent to:

var v int = 1

Question: In v := 1, how does Go know what type v is?

The answer is that Go knows v must be an integer because 1 is an integer. This works for strings
and float64s too:

r := 3.14159
s := "Hi there"

These statements save you typing var and the type. We can now rewrite factorial in a clearer,
more typical way:

func factorial(n int) int {
    f := 1
    for i := 1; i <= n; i++ {
        f = f * i
    }
    return f
}

Question: What is the difference between these two snippets:

var f int = 1
for i := 1; i <= n; i++ {
    f = f * i
}

and

var f int = 1
i := 1
for i <= n {
    f = f * i
    i++
}

The answer is the scope of the variable i. In the first snippet, i lasts only for the loop, while in the second
example, i lasts after the loop completes.

Variable declarations in loop bodies

What will the following function print? Is it correct?

// BAD CODE
func sumSquares() {
    // print partial sums of the sequence of squares
    // of the numbers 1 to 10
    for i := 1; i <= 10; i = i + 1 {
        var j int
        j = j + i * i
        fmt.Println(i, j)
    }
}

This is wrong! It will print:

1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100

which are the first 10 squares, not their sums. Why does this happen?
Variable j is created and destroyed each time through the loop!
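
The fix is to create the accumulator once, before the loop starts. A sketch of the corrected computation in R (the names running_total and partial_sums are hypothetical):

```r
# Partial sums of the squares of 1 to 10.
partial_sums <- numeric(10)

# The accumulator is created once, outside the loop,
# so its value carries over from one iteration to the next.
running_total <- 0

for (i in 1:10) {
  running_total <- running_total + i * i
  partial_sums[i] <- running_total
}

print(partial_sums)
```

In R the same result is also available without a loop as cumsum((1:10)^2).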

Nested loops

Loops can be nested just like if statements. For example:

func printSquare(n int) {
    for i := 1; i <= n; i=i+1 {
        for j := 1; j <= n; j=j+1 {
            fmt.Print("#")
        }
        fmt.Println("")
    }
}

will print n lines, each containing n # characters.
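
Nested loops work the same way in R. The sketch below is an R counterpart of printSquare; the names print_square and line_text are hypothetical.

```r
# Print an n-by-n square of # characters.
print_square <- function(n) {
  for (i in seq_len(n)) {
    line_text <- ""
    # The inner loop builds one row of the square.
    for (j in seq_len(n)) {
      line_text <- paste0(line_text, "#")
    }
    # The outer loop prints the finished row and moves to the next line.
    cat(line_text, "\n", sep = "")
  }
}

print_square(3)
```

The inner loop runs n times for every pass of the outer loop, so the body of the inner loop executes n * n times in total.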

4. Functions

A function in R is an object containing multiple interrelated statements that are run together in a
predefined order every time the function is called. Functions in R can be built-in or created by the user
(user-defined). The main purposes of creating a user-defined function are to avoid repeating the same
block of code for a task that is performed frequently in a project, to prevent the hard-to-debug errors that
copy-paste operations tend to introduce, and to make the code more readable. A good practice is to create
a function whenever we would otherwise run a certain set of commands more than twice.

4.1 Built-in Functions in R


There are plenty of helpful built-in functions in R used for various purposes. Some of the most popular ones are:

• min(), max(), mean(), median() – return the minimum / maximum / mean / median value of a numeric
vector, correspondingly
• sum() – returns the sum of a numeric vector
• range() – returns the minimum and maximum values of a numeric vector
• abs() – returns the absolute value of a number
• str() – shows the structure of an R object
• print() – displays an R object on the console
• ncol() – returns the number of columns of a matrix or a dataframe
• length() – returns the number of items in an R object (a vector, a list, etc.)
• nchar() – returns the number of characters in a character object
• sort() – sorts a vector in ascending or descending (decreasing=TRUE) order
• exists() – returns TRUE or FALSE depending on whether or not a variable is defined in the R environment

Let's see some of the above functions in action:

vector <- c(3, 5, 2, 3, 1, 4)

print(min(vector))

print(mean(vector))

print(median(vector))

print(sum(vector))

print(range(vector))

print(str(vector))
print(length(vector))

print(sort(vector, decreasing=TRUE))

print(exists('vector')) ## note the quotation marks

[1] 1
[1] 3
[1] 3
[1] 18
[1] 1 5
num [1:6] 3 5 2 3 1 4
NULL
[1] 6
[1] 5 4 3 3 2 1
[1] TRUE

4.2 Creating a Function in R


While applying built-in functions facilitates many common tasks, often we need to create our own function to
automate the performance of a particular task. To declare a user-defined function in R, we use the
keyword function. The syntax is as follows:

function_name <- function(parameters){


function body
}
Above, the main components of an R function are: function name, function parameters, and function body. Let's
take a look at each of them separately.

Function Name
This is the name of the function object that will be stored in the R environment after the function definition and
used for calling that function. It should be concise but clear and meaningful so that the user who reads our code
can easily understand what exactly this function does. For example, if we need to create a function for calculating
the circumference of a circle with a known radius, we'd better call this function circumference rather
than function_1 or circumference_of_a_circle. (Side note: While commonly we use verbs in function names, it's ok to use
just a noun if that noun is very descriptive and unambiguous.)
Function Parameters
Sometimes, they are called formal arguments. Function parameters are the variables in the function
definition placed inside the parentheses and separated with a comma that will be set to actual values
(called arguments) each time we call the function. For example:

circumference <- function(r){
  2*pi*r
}

print(circumference(2))

[1] 12.56637

Above, we created a function to calculate the circumference of a circle with a known radius using the
formula C = 2πr, so the function has only one parameter, r. After defining the function, we called it with
the radius equal to 2 (hence, with the argument 2).
It's possible, even though rarely useful, for a function to have no parameters:

hello_world <- function(){
  'Hello, World!'
}

print(hello_world())

[1] "Hello, World!"

Also, some parameters can be set to default values (those related to a typical case) inside the function
definition, which then can be reset when calling the function. Returning to our circumference function, we
can set the default radius of a circle as 1, so if we call the function with no argument passed, it will
calculate the circumference of a unit circle (i.e., a circle with a radius of 1). Otherwise, it will calculate the
circumference of a circle with the provided radius:

circumference <- function(r=1){

2*pi*r
}

print(circumference())

print(circumference(2))

[1] 6.283185
[1] 12.56637

Function Body
The function body is a set of commands inside the curly braces that are run in a predefined order every
time we call the function. In other words, in the function body, we place what exactly we need the function
to do:

sum_two_nums <- function(x, y){
  x + y
}

print(sum_two_nums(1, 2))

[1] 3
Note that the statements in the function body (in the above example – the only statement x + y) should
be indented by 2 or 4 spaces, depending on the IDE where we run the code, but the important thing is to
be consistent with the indentation throughout the program. While it doesn't affect the code performance
and isn't obligatory, it makes the code easier to read.

It's possible to drop the curly braces if the function body contains a single statement. For example:

sum_two_nums <- function(x, y) x + y

print(sum_two_nums(1, 2))

[1] 3

As we saw from all the above examples, in R, it usually isn't necessary to explicitly include the return
statement when defining a function since an R function just automatically returns the last evaluated
expression in the function body. However, we still can add the return statement inside the function body
using the syntax return(expression_to_be_returned). It is often written explicitly when we need to return more
than one result from a function. For example:

mean_median <- function(vector){
  mean <- mean(vector)
  median <- median(vector)
  return(c(mean, median))
}

print(mean_median(c(1, 1, 1, 2, 3)))

[1] 1.6 1.0


Note that in the return statement above, we actually return a vector containing the necessary results, and
not just the variables separated by a comma (since the return() function can return only a single R
object). Instead of a vector, we could also return a list, especially if the results to be returned are
supposed to be of different data types.
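
As a sketch of the list alternative mentioned above, the function below returns several results of different types in a single named list. The name describe_values and the element names are hypothetical.

```r
# Return several results of different types in one named list.
describe_values <- function(values) {
  list(
    mean = mean(values),
    median = median(values),
    count = length(values),
    label = "summary of values"
  )
}

result <- describe_values(c(1, 1, 1, 2, 3))

# Individual results are accessed by name with the $ sign.
print(result$mean)
print(result$median)
print(result$count)
```

Naming the list elements means the caller does not have to remember the order in which the results were packed.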

Calling a Function in R
In all the above examples, we actually already called the created functions many times. To do so, we just
put the function name and added the necessary arguments inside the parentheses. In R, function
arguments can be passed by position, by name (so-called named arguments), by mixing position-based
and name-based matching, or by omitting the arguments at all.

If we pass the arguments by position, we need to follow the same sequence of arguments as defined in
the function:

subtract_two_nums <- function(x, y){
  x - y
}

print(subtract_two_nums(3, 1))
[1] 2

In the above example, x is equal to 3 and y – to 1, and not vice versa.

If we pass the arguments by name, i.e., explicitly specify what value each parameter defined in the
function takes, the order of the arguments doesn't matter:

subtract_two_nums <- function(x, y){
  x - y
}

print(subtract_two_nums(x=3, y=1))

print(subtract_two_nums(y=1, x=3))

[1] 2
[1] 2
Since we explicitly assigned x=3 and y=1, we can pass them either as x=3, y=1 or y=1, x=3 – the result
will be the same.

It's possible to mix position- and name-based matching of the arguments. Let's look at the example of the
function for calculating BMR (basal metabolic rate), or daily consumption of calories, for women based on
their weight (in kg), height (in cm), and age (in years). The formula that will be used in the function is
the Mifflin-St Jeor equation:

calculate_calories_women <- function(weight, height, age){
  (10 * weight) + (6.25 * height) - (5 * age) - 161
}

Now, let's calculate the calories for a woman 30 years old, with a weight of 60 kg and a height of 165 cm.
However, for the age parameter, we'll pass the argument by name and for the other two parameters, we'll
pass the arguments by position:

print(calculate_calories_women(age=30, 60, 165))


[1] 1320.25

In the case like above (when we mix matching by name and by position), the named arguments are
extracted from the whole succession of arguments and are matched first, while the rest of the arguments
are matched by position, i.e., in the same order as they appear in the function definition. However, this
practice isn't recommended and can lead to confusion.

Finally, we can omit some (or all) of the arguments altogether. This is possible when we set some (or all) of the
parameters to default values inside the function definition. Let's return to
our calculate_calories_women function and set the default age of a woman as 30 y.o.:

calculate_calories_women <- function(weight, height, age=30){
  (10 * weight) + (6.25 * height) - (5 * age) - 161
}

print(calculate_calories_women(60, 165))

[1] 1320.25
In the above example, we passed only two arguments to the function, despite it having three parameters
in its definition. However, since one of the parameters has a default value assigned to it when we pass
two arguments to the function, R interprets that the third missing argument should be set to its default
value and makes the calculations accordingly, without throwing an error.

When calling a function, we usually assign the result of this operation to a variable, to be able to use it
later:

circumference <- function(r){
  2*pi*r
}

circumference_radius_5 <- circumference(5)

print(circumference_radius_5)

[1] 31.41593
Using Functions Inside Other Functions
Inside the definition of an R function, we can use other functions. We've already seen such an example
earlier, when we used the built-in mean() and median() functions inside a user-defined
function mean_median:

mean_median <- function(vector){
  mean <- mean(vector)
  median <- median(vector)
  return(c(mean, median))
}

It's also possible to pass the output of calling one function directly as an argument to another function:

radius_from_diameter <- function(d){
  d/2
}

circumference <- function(r){
  2*pi*r
}

print(circumference(radius_from_diameter(4)))
[1] 12.56637

In the above piece of code, we created two simple functions first: for calculating the radius of a circle
given its diameter and for calculating the circumference of a circle given its radius. Since originally we
knew only the diameter of a circle (equal to 4), we called the radius_from_diameter function inside
the circumference function to calculate first the radius from the provided value of diameter and then
calculate the circumference of the circle. While this approach can be useful in many cases, we should be
careful with it and avoid passing too many functions as arguments to other functions since it can affect the
code readability.

Finally, functions can be nested, meaning that we can define a new function inside another function. Let's
say that we need a function that sums up the circle areas of 3 non-intersecting circles:

sum_circle_ares <- function(r1, r2, r3){
  circle_area <- function(r){
    pi*r^2
  }
  circle_area(r1) + circle_area(r2) + circle_area(r3)
}

print(sum_circle_ares(1, 2, 3))

[1] 43.9823
Above, we defined the circle_area function inside the sum_circle_ares function. We then called that
inner function three times (circle_area(r1), circle_area(r2), and circle_area(r3)) inside the outer
function to calculate the area of each circle for further summing up those areas. Now, if we try to call
the circle_area function outside the sum_circle_ares function, the program throws an error, because the
inner function exists and works only inside the function where it was defined:

print(circle_area(10))
Error in circle_area(10): could not find function "circle_area"
Traceback:

1. print(circle_area(10))

When nesting functions, we have to keep in mind two things:

1. Similar to creating any function, the inner function is supposed to be used at least 3
times inside the outer function. Otherwise, it isn't viable to create it.
2. If we want to be able to use the function independent of the bigger function, we should
create it outside the bigger function instead of nesting these functions. For example, if
we were going to use the circle_area function outside the sum_circle_ares function,
we would write the following code:

circle_area <- function(r){
  pi*r^2
}

sum_circle_ares <- function(r1, r2, r3){
  circle_area(r1) + circle_area(r2) + circle_area(r3)
}

print(sum_circle_ares(1, 2, 3))

print(circle_area(10))

[1] 43.9823
[1] 314.1593
Here, we again used the circle_area function inside the sum_circle_ares function. However, this
time, we were also able to call it outside that function and get the result rather than an error.
5.Dataset:

A data set is a collection of data, often presented in a table.

There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road
Tests), which is retrieved from the 1974 Motor Trend US Magazine.

1.Information about the dataset:

You can use the question mark (?) to get information about the mtcars data set:

?mtcars

2.Get Information

a.Use the dim() function to find the dimensions of the data set, and
the names() function to view the names of the variables:

Example:

# Create a variable of the mtcars data set for better organization
Data_Cars <- mtcars

# Use dim() to find the dimension of the data set


dim(Data_Cars)

# Use names() to find the names of the variables from the data set
names(Data_Cars)

b.Use the rownames() function to get the name of each row in the first
column, which is the name of each car:

Example:

Data_Cars <- mtcars

rownames(Data_Cars)

3.Print Variable Values

If you want to print all values that belong to a variable, access the data frame
by using the $ sign, and the name of the variable (for example cyl )

Example:

Data_Cars <- mtcars

Data_Cars$cyl

4.Sort Variable Values

To sort the values, use the sort() function:


Data_Cars <- mtcars

sort(Data_Cars$cyl)

5. Analyzing the Data

Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.

For example, we can use the summary() function to get a statistical summary of
the data:

Example:

Data_Cars <- mtcars

summary(Data_Cars)

The summary() function returns six statistical numbers for each variable:

• Min
• First quartile (25th percentile)
• Median
• Mean
• Third quartile (75th percentile)
• Max
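
To see these six numbers for a single variable, summary() can be applied to one column. The quartiles can also be requested directly with quantile(). This is a short sketch using the wt column; the variable names are hypothetical.

```r
Data_Cars <- mtcars

# The six summary statistics for one variable.
wt_summary <- summary(Data_Cars$wt)
print(wt_summary)

# The first quartile, median, and third quartile on their own.
wt_quartiles <- quantile(Data_Cars$wt, probs = c(0.25, 0.5, 0.75))
print(wt_quartiles)
```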

6.Max Min

In the previous section, we introduced the mtcars data set. We will continue
to use this data set in the rest of this unit.

R has several built-in math functions. For example, the min() and max() functions
can be used to find the lowest or highest value in a set:

Find the largest and smallest value of the variable hp (horsepower).

Data_Cars <- mtcars

max(Data_Cars$hp)
min(Data_Cars$hp)

Mean, Median, and Mode


In statistics, there are often three values that interest us:
• Mean - The average value
• Median - The middle value
• Mode - The most common value

Mean
To calculate the average value (mean) of a variable from the mtcars data set, find the sum of all
values, and divide the sum by the number of values.

Sorted observation of wt (weight)

1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465

2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215

3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570

3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Luckily, the mean() function in R does this for us:

Example
Find the average weight (wt) of a car:

Data_Cars <- mtcars

mean(Data_Cars$wt)
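
The definition above (sum of all values divided by the number of values) can be checked against mean() directly. A small sketch; the variable names are hypothetical:

```r
Data_Cars <- mtcars
wt <- Data_Cars$wt

# Mean by definition: the sum of the values divided by how many there are.
manual_mean <- sum(wt) / length(wt)

# The built-in function gives the same result.
builtin_mean <- mean(wt)

print(manual_mean)
print(builtin_mean)
```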

Median
The median value is the value in the middle, after you have sorted all the values.

If we take a look at the values of the wt variable (from the mtcars data set), we will see that
there are two numbers in the middle:
Sorted observation of wt (weight)

1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465

2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215

3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570

3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Note: If there are two numbers in the middle, you must divide the sum of those numbers by
two, to find the median.

Luckily, R has a function that does all of that for you: Just use the median() function to find the
middle value:

Example
Find the mid point value of weight (wt):

Data_Cars <- mtcars

median(Data_Cars$wt)

Result:

[1] 3.325
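
The rule in the note above can be spelled out by hand. The sketch below sorts the values, picks the middle one (or averages the two middle ones when the count is even), and compares the result with median(). The variable names are hypothetical.

```r
Data_Cars <- mtcars

# Sort the values and count them.
wt_sorted <- sort(Data_Cars$wt)
n <- length(wt_sorted)

if (n %% 2 == 0) {
  # Even count: average the two middle values.
  manual_median <- (wt_sorted[n / 2] + wt_sorted[n / 2 + 1]) / 2
} else {
  # Odd count: take the single middle value.
  manual_median <- wt_sorted[(n + 1) / 2]
}

print(manual_median)
print(median(Data_Cars$wt))
```

With the 32 observations of wt, the two middle values are 3.215 and 3.435, whose average is the 3.325 shown above.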

Mode
The mode is the value that appears the most times.

R does not have a built-in function to calculate the statistical mode (the built-in mode() function
reports the storage mode of an object instead). However, we can create our own function to find it.

If we take a look at the values of the wt variable (from the mtcars data set), we will see that the
value 3.440 appears most often:
Sorted observation of wt (weight)

1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465

2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215

3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570

3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Instead of counting it ourselves, we can use the following code to find the mode:

Example
Data_Cars <- mtcars

names(sort(-table(Data_Cars$wt)))[1]

Result:

[1] "3.44"

From the example above, we now know that the value that appears most often in the mtcars wt variable
is 3.44. Since wt is recorded in units of 1000 lbs, this corresponds to 3,440 lbs.
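
The one-line expression above returns only the first of the most frequent values. As a sketch of the "create our own function" idea, the hypothetical get_mode() below returns every value tied for the highest count. Like the expression above, it returns the values as character strings, because they are taken from the names of the frequency table.

```r
# Return all values that occur most often (as character strings).
get_mode <- function(values) {
  counts <- table(values)                      # frequency of each distinct value
  is_most_frequent <- as.vector(counts) == max(counts)
  names(counts)[is_most_frequent]
}

Data_Cars <- mtcars
print(get_mode(Data_Cars$wt))

# With ties, every tied value is returned.
print(get_mode(c(1, 2, 2, 3, 3)))
```

Wrapping the result in as.numeric() converts it back to numbers when the input was numeric.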
