
DS EXP3

The document outlines various data manipulation techniques in R, focusing on analyzing an E-commerce sales dataset and managing participant registration for a conference. It includes procedures for loading, filtering, and transforming data, as well as using lists and arrays to handle participant details and temperature readings. Additionally, it provides source code examples for implementing these operations in R.

Uploaded by

LIGHTNING BOLT

EXP. NO: 3
Data manipulation in R
DATE:

Aim:
To study data manipulation techniques in R and execute them.

Source code:

(a) XYZ Online Store is analysing its E-commerce Sales Dataset to understand
trends, correct errors, and make predictions. The dataset contains 30 records with
details like Order ID, Date, Category, Product, Revenue, Quantity Sold, and
Customer Rating. To analyse sales performance, determine the total number of
records, sort the dataset by descending revenue, and identify the top 5 highest
revenue-generating products. Extract specific sales values, such as orders with
revenue above ₹10,000, count products sold in the Electronics category, and
retrieve orders with customer ratings of 4.5 or higher. Identify and correct data
entry errors, such as a mistakenly entered revenue value of ₹50 instead of ₹500.
Finally, for predictive analysis, simulate a 90-day sales cycle by repeating the
existing dataset for the next two months. Perform these operations using R to gain
actionable insights from the sales data.

Procedure:

Step 1 : Load Data


Step 2 : View & Inspect Data
Step 3 : Select & Filter Data
Step 4 : Transform & Mutate Columns
Step 5 : Rename or Drop Columns
Step 6 : Sort & Arrange Data
Step 7 : Summarize & Aggregate Data
Step 8 : Merge & Join Data
Step 9 : Reshape Data (Pivoting)
Step 10 : Export Data
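
The tasks in the problem statement for (a) can be sketched in base R roughly as follows. This is a minimal sketch, not the original solution: the six-row data frame, its column names, and its values are hypothetical stand-ins for the 30-record dataset, which is not reproduced in this record, and the Date column is left out for brevity.

```r
# Hypothetical miniature of the sales dataset (the real one has 30 records).
sales <- data.frame(
  OrderID  = 1:6,
  Category = c("Electronics", "Clothing", "Electronics",
               "Books", "Electronics", "Clothing"),
  Product  = c("Laptop", "Jacket", "Phone", "Novel", "Tablet", "Shoes"),
  Revenue  = c(55000, 2500, 30000, 50, 12000, 4000),
  Quantity = c(1, 2, 1, 1, 1, 2),
  Rating   = c(4.7, 4.1, 4.5, 3.9, 4.6, 4.4),
  stringsAsFactors = FALSE
)

# Total number of records
total_records <- nrow(sales)

# Sort by descending revenue and take the top 5 products
sorted_sales <- sales[order(-sales$Revenue), ]
top5 <- head(sorted_sales$Product, 5)

# Orders with revenue above 10,000
high_revenue <- sales[sales$Revenue > 10000, ]

# Number of records in the Electronics category
electronics_count <- sum(sales$Category == "Electronics")

# Orders with a customer rating of 4.5 or higher
high_rated <- sales[sales$Rating >= 4.5, ]

# Correct the data entry error: 50 was entered instead of 500
sales$Revenue[sales$Revenue == 50] <- 500

# Simulate a 90-day cycle by stacking the dataset three times
sales_90 <- sales[rep(seq_len(nrow(sales)), times = 3), ]
rownames(sales_90) <- NULL

print(top5)
print(nrow(sales_90))
```

With the real 30-record dataset, the same `rep()` call would yield 90 rows, one per simulated day.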

# Column-wise participant list: each element holds one field for all participants
participants <- list(
  ParticipantID = c(1, 2, 3, 4, 5),
  Name = c("John Doe", "Alice Smith", "Bob Johnson", "Mark Lee", "Eva Green"),
  Event = c("Cloud Computing Talk", "AI Workshop", "Blockchain Workshop",
            "Cloud Computing Talk", "Cybersecurity Talk"),
  RegistrationStatus = c("Confirmed", "Pending", "Confirmed", "Confirmed", "Pending")
)

ARJUN SUDHEER (71812201021)


third_participant <- list(
ParticipantID = participants$ParticipantID[3],
Name = participants$Name[3],
Event = participants$Event[3],
RegistrationStatus = participants$RegistrationStatus[3]
)
cat("Details of third participant:\n")
print(third_participant)
second_participant <- list(
Name = participants$Name[2],
RegistrationStatus = participants$RegistrationStatus[2]
)
cat("\nName and status of second participant:\n")
print(second_participant)
confirmed_participants <- any(participants$RegistrationStatus == "Confirmed")
cat("\nIs there any participant with 'Confirmed' status? ", confirmed_participants, "\n")
cloud_computing_talk_participant <- any(participants$Event == "Cloud Computing Talk")
cat("Is there a participant registered for the 'Cloud Computing Talk'? ",
cloud_computing_talk_participant, "\n")
new_participant <- list(
  ParticipantID = 6,
  Name = "Sophia Lee",
  Event = "Blockchain Workshop",
  RegistrationStatus = "Confirmed"
)
# Append the new participant's fields to each element of the list
participants$ParticipantID <- c(participants$ParticipantID, new_participant$ParticipantID)
participants$Name <- c(participants$Name, new_participant$Name)
participants$Event <- c(participants$Event, new_participant$Event)
participants$RegistrationStatus <- c(participants$RegistrationStatus,
                                     new_participant$RegistrationStatus)
cat("\nUpdated list of participants after adding Sophia Lee:\n")
print(participants)
remove_index <- which(participants$Name == "Mark Lee")
participants$ParticipantID <- participants$ParticipantID[-remove_index]
participants$Name <- participants$Name[-remove_index]
participants$Event <- participants$Event[-remove_index]
participants$RegistrationStatus <- participants$RegistrationStatus[-remove_index]
cat("\nUpdated list after removing Mark Lee:\n")
print(participants)
selected_participants <- list(
ParticipantID = participants$ParticipantID[2:4],
Name = participants$Name[2:4],
Event = participants$Event[2:4],
RegistrationStatus = participants$RegistrationStatus[2:4]
)
cat("\nParticipants from index 2 to 4:\n")
print(selected_participants)
cat("\nLooping through the list of participants:\n")
for (i in seq_along(participants$Name)) {
cat(participants$Name[i], " - Status:", participants$RegistrationStatus[i], "\n")
}
new_participants <- list(
ParticipantID = 7:11,
Name = c("Daniel Green", "Mia White", "Liam Brown", "Olivia Black", "Ethan Blue"),
Event = c("Data Science Workshop", "Cloud Computing Workshop", "AI Workshop",
"Machine Learning Workshop", "Big Data Workshop"),
RegistrationStatus = c("Confirmed", "Pending", "Confirmed", "Confirmed", "Pending")
)
participants_combined <- list(
ParticipantID = c(participants$ParticipantID, new_participants$ParticipantID),
Name = c(participants$Name, new_participants$Name),
Event = c(participants$Event, new_participants$Event),
RegistrationStatus = c(participants$RegistrationStatus,
new_participants$RegistrationStatus)
)
cat("\nCombined list of participants after adding new workshop participants:\n")
print(participants_combined)

OUTPUT:

(b) Create a dataset for the Global Tech Conference, which is being organized by a
company that uses a list in R to manage the participants and their registration
details. The list includes Participant ID, Name, Event (workshops or talks), and
Registration Status (confirmed or pending). The conference organizers need to
manage the participant details efficiently by adding, removing, and processing
data dynamically.
Operations to Perform Using Lists in R
1. Access List Items
Retrieve the details of the third participant.
Get the name and status of the second participant.
2. Check if an Item Exists
Check if there is any participant with the status "Confirmed".
Verify if there is a participant registered for the "Cloud Computing Talk".
3. Add Items to the List
A new participant, Sophia Lee, registers for the Blockchain Workshop, and
their registration status is "Confirmed". Add this participant to the list.
4. Remove Items from the List
A participant, Mark Lee, has canceled his registration. Remove his entry
from the list.

5. Range of Indexes
Extract the details of participants from index 2 to index 4 (inclusive) for
processing.
6. Loop Through the List
Loop through the list and print the names of all participants and their
registration status.
7. Join Two Lists
A new set of participants joins the conference for a new workshop. Combine
the original list with a second list of 5 additional participants.

Procedure:

Step 1 : Create a list to store participant details like ID, Name, Event, and
Registration Status.
Step 2 : Add participants by appending new data to the list.
Step 3 : Remove participants by using NULL or subset operations.
Step 4 : Update registration status by modifying the respective list elements.
Step 5 : Filter data based on specific criteria (e.g., confirmed registrations).
Step 6 : Process data dynamically using loops or functions for analysis or
updates.
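
Steps 3 to 5 of this procedure (removal with NULL, status updates, and filtering) can be sketched as below. This is a minimal illustration, assuming the same nested layout as the source code that follows (one inner list per participant); the three entries are a shortened, hypothetical copy of that data.

```r
# Shortened, hypothetical participant list in the nested layout
participants <- list(
  list(ID = 101, Name = "Alice Johnson", Event = "AI Workshop", Status = "Confirmed"),
  list(ID = 102, Name = "Bob Smith", Event = "Cloud Computing Talk", Status = "Pending"),
  list(ID = 103, Name = "Charlie Brown", Event = "Cybersecurity Panel", Status = "Pending")
)

# Step 4: update a registration status by modifying the list element
participants[[2]]$Status <- "Confirmed"

# Step 3: remove a participant by assigning NULL to its position
participants[[1]] <- NULL

# Step 5: keep only the confirmed registrations
confirmed <- Filter(function(p) p$Status == "Confirmed", participants)

for (p in confirmed) {
  cat("Confirmed:", p$Name, "\n")
}
```

Assigning NULL shifts the remaining entries down by one position, so any index computed earlier should be recalculated after a removal.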

Source code:

participants <- list(
  list(ID = 101, Name = "Alice Johnson", Event = "AI Workshop", Status = "Confirmed"),
  list(ID = 102, Name = "Bob Smith", Event = "Cloud Computing Talk", Status = "Pending"),
  list(ID = 103, Name = "Charlie Brown", Event = "Cybersecurity Panel", Status = "Confirmed"),
  list(ID = 104, Name = "David White", Event = "Data Science Workshop", Status = "Pending"),
  list(ID = 105, Name = "Mark Lee", Event = "AI Workshop", Status = "Confirmed")
)
third_participant <- participants[[3]]
print(third_participant)

second_participant_info <- list(Name = participants[[2]]$Name,
                                Status = participants[[2]]$Status)
print(second_participant_info)
confirmed_status <- any(sapply(participants, function(x) x$Status == "Confirmed"))
print(confirmed_status)
cloud_computing_registered <- any(sapply(participants,
  function(x) x$Event == "Cloud Computing Talk"))
print(cloud_computing_registered)
new_participant <- list(ID = 106, Name = "Sophia Lee", Event = "Blockchain Workshop",
Status = "Confirmed")
participants <- append(participants, list(new_participant))
participants <- participants[sapply(participants, function(x) x$Name != "Mark Lee")]
subset_participants <- participants[2:4]
print(subset_participants)
for (p in participants) {
  cat("Name:", p$Name, "- Status:", p$Status, "\n")
}
new_participants <- list(
  list(ID = 107, Name = "Emma Davis", Event = "Python Workshop", Status = "Confirmed"),
  list(ID = 108, Name = "Liam Wilson", Event = "IoT Panel", Status = "Pending"),
  list(ID = 109, Name = "Noah Brown", Event = "AI Ethics Talk", Status = "Confirmed"),
  list(ID = 110, Name = "Olivia Clark", Event = "Data Analytics", Status = "Pending"),
  list(ID = 111, Name = "Mason Taylor", Event = "Machine Learning Workshop", Status = "Confirmed")
)
all_participants <- c(participants, new_participants)
print(all_participants)

OUTPUT:

(c) Loop Through a Matrix
A weather station collects temperature data from multiple cities. The data is stored
in a matrix, where:
Rows represent different cities
Columns represent temperature readings over multiple days
Task:
Use nested loops to find the hottest and coldest city based on average
temperature.

Procedure:

Step 1: Create a matrix where rows represent cities and columns represent temperature
readings over multiple days.
Step 2 : Initialize variables to store the hottest and coldest city indices based on average
temperature.
Step 3 : Use nested loops to iterate through each row (city) and calculate the average
temperature.
Step 4 : Compare averages to determine the hottest and coldest city.
Step 5 : Store and display the results with city names and their corresponding average
temperatures.

Source code:

temperature_matrix <- matrix(c(30, 32, 33, 29, 31, 28, 24, 25, 26, 27, 35, 33), nrow=3,
byrow=TRUE)
hottest_city_index <- 0
coldest_city_index <- 0
hottest_avg_temp <- -Inf
coldest_avg_temp <- Inf
for (i in 1:nrow(temperature_matrix)) {
avg_temp <- mean(temperature_matrix[i,])
if (avg_temp > hottest_avg_temp) {
hottest_avg_temp <- avg_temp
hottest_city_index <- i
}
if (avg_temp < coldest_avg_temp) {
coldest_avg_temp <- avg_temp
coldest_city_index <- i
}
}
cat("The hottest city is city", hottest_city_index, "with an average temperature of",
hottest_avg_temp, "\n")
cat("The coldest city is city", coldest_city_index, "with an average temperature of",
coldest_avg_temp, "\n")
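
The task and Step 3 ask for nested loops, whereas the code above relies on mean() for each row. A variant with an explicit inner loop over the days might look like the sketch below. It reuses the same matrix values; the city names are hypothetical placeholders, since the record does not name the cities.

```r
temperature_matrix <- matrix(c(30, 32, 33, 29,
                               31, 28, 24, 25,
                               26, 27, 35, 33),
                             nrow = 3, byrow = TRUE)
city_names <- c("City_A", "City_B", "City_C")  # hypothetical names

avg_temps <- numeric(nrow(temperature_matrix))
for (i in seq_len(nrow(temperature_matrix))) {      # outer loop: cities
  total <- 0
  for (j in seq_len(ncol(temperature_matrix))) {    # inner loop: days
    total <- total + temperature_matrix[i, j]
  }
  avg_temps[i] <- total / ncol(temperature_matrix)
}

# Row averages are 31, 27 and 30.25
hottest <- which.max(avg_temps)
coldest <- which.min(avg_temps)

cat("The hottest city is", city_names[hottest],
    "with an average temperature of", avg_temps[hottest], "\n")
cat("The coldest city is", city_names[coldest],
    "with an average temperature of", avg_temps[coldest], "\n")
```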

Output:

(d) Agricultural Weather Monitoring System Arrays in R
Consider an agricultural research center analyzing soil temperature across four
different farmlands (Farm_A, Farm_B, Farm_C, Farm_D) for two weeks.
Rows: Represent days (1 to 14).
Columns: Represent farmlands (Farm_A, Farm_B, Farm_C, Farm_D).
Third Dimension: Represents three different depths in the soil (10cm,
50cm, 100cm).
Questions:
1. How would you create an array to store soil temperature data for all
farmlands over two weeks, at three soil depths?
2. Write an R function to compute the average soil temperature at each depth
across all farms.

3. How can you extract the soil temperature readings for Farm_C on Day 7 at
all depths?
4. If a new farm (Farm_E) is added to the dataset, how would you modify the
array to include its data?

Procedure:

Step 1 : Create an array using array() to store soil temperature data with dimensions (14
days, 4 farms, 3 depths).
Step 2 : Write a function that calculates the average soil temperature at each depth by
applying apply() across farms and days.
Step 3 : Extract Farm_C data for Day 7 at all depths using indexing (array[7, 3, ]).
Step 4 : Modify the array by increasing the second dimension to 5 (adding Farm_E) and
updating data accordingly.
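
To see what apply() over the third dimension computes in Step 2, and how the indexing in Step 3 works, it may help to try both on a very small array first. The 2 x 2 x 2 array below is purely illustrative (hypothetical values 1 to 8), not part of the soil dataset.

```r
# Tiny illustrative array: 2 days x 2 farms x 2 depths, filled with 1..8
small <- array(1:8, dim = c(2, 2, 2),
               dimnames = list(Days = 1:2,
                               Farms = c("Farm_A", "Farm_B"),
                               Depths = c("10cm", "50cm")))

# MARGIN = 3 collapses days and farms, leaving one mean per depth
depth_means <- apply(small, 3, mean)

# The same result computed by hand from each depth slice
manual_means <- c(mean(small[, , "10cm"]), mean(small[, , "50cm"]))

# Day 2, Farm_B, all depths (same pattern as array[7, 3, ] in Step 3)
reading <- small[2, "Farm_B", ]

print(depth_means)
print(reading)
```

The 10cm slice holds the values 1 to 4 and the 50cm slice holds 5 to 8, so the per-depth means are 2.5 and 6.5.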

Source code:

# Step 1: Create an array to store soil temperature data for all farmlands over two weeks
# at three soil depths

# Define the dimensions


days <- 14
farms <- 4
depths <- 3

# Create the initial array with NA as placeholders


soil_temp_array <- array(data = NA, dim = c(days, farms, depths),
dimnames = list(Days = 1:days,
Farms = c("Farm_A", "Farm_B", "Farm_C", "Farm_D"),
Depths = c("10cm", "50cm", "100cm")))

# Print the empty array (just to show structure)


print("Initial soil temperature array (empty):")
print(soil_temp_array)

# Step 2: Manually fill the array with synthetic temperature data

set.seed(123) # For reproducibility (random number generation)

# Generate random temperature values for each farm at different depths (for illustration)
soil_temp_array[ , "Farm_A", ] <- matrix(runif(42, 18, 30), ncol = 3) # Farm_A
soil_temp_array[ , "Farm_B", ] <- matrix(runif(42, 15, 28), ncol = 3) # Farm_B
soil_temp_array[ , "Farm_C", ] <- matrix(runif(42, 16, 29), ncol = 3) # Farm_C
soil_temp_array[ , "Farm_D", ] <- matrix(runif(42, 17, 31), ncol = 3) # Farm_D

# Print the filled array


print("Soil temperature array with sample data:")
print(soil_temp_array)

# Step 3: Function to compute the average soil temperature at each depth across all farms

average_soil_temp <- function(soil_temp_array) {


# Calculate the average for each depth (across all farms and days)
avg_temp_at_depth <- apply(soil_temp_array, c(3), mean, na.rm = TRUE)
return(avg_temp_at_depth)
}

# Call the function and display the average temperatures


avg_temp <- average_soil_temp(soil_temp_array)
print("Average soil temperatures at each depth across all farms:")
print(avg_temp)

# Step 4: Extract the soil temperature readings for Farm_C on Day 7 at all depths

day_7_farm_C <- soil_temp_array[7, "Farm_C", ]


print("Soil temperature readings for Farm_C on Day 7 at all depths:")
print(day_7_farm_C)

# Step 5: Add a new farm (Farm_E) to the dataset and modify the array

# Create a new array with an additional farm (Farm_E)


new_soil_temp_array <- array(data = NA, dim = c(days, farms + 1, depths),
dimnames = list(Days = 1:days,
Farms = c("Farm_A", "Farm_B", "Farm_C", "Farm_D",
"Farm_E"),
Depths = c("10cm", "50cm", "100cm")))

# Copy the existing readings for Farm_A to Farm_D into the new array
new_soil_temp_array[ , 1:farms, ] <- soil_temp_array

# Example: assign random data to Farm_E (just for illustration)
new_soil_temp_array[ , "Farm_E", ] <- matrix(runif(42, 18, 30), ncol = 3)

# Print the new array with Farm_E added


print("New array with Farm_E added:")
print(new_soil_temp_array)

Output:


(e) Customer Survey Data Analysis for an E-Commerce Website using Factor

An e-commerce website conducts a customer survey to record the preferred
shopping category of 100 customers. The available categories are:
Electronics
Clothing
Home & Kitchen
Books

Questions:
1. How can this categorical data be efficiently stored in R?
2. Write R code to determine how many customers prefer Electronics.
3. The company decides to introduce a new category, Sports & Fitness. How
can the factor levels be updated to include this category?
4. How can the factor levels be reordered so that Books appears first instead
of the default order?

Procedure:

Step 1 : Store categorical data using a factor in R for efficient analysis and memory
optimization.
Step 2 : Count Electronics preferences using sum(survey_data == "Electronics").
Step 3 : Add a new category by updating levels with levels(survey_data) <-
c(levels(survey_data), "Sports & Fitness").
Step 4 : Reorder factor levels using survey_data <- factor(survey_data, levels = c("Books",
"Electronics", "Clothing", "Home & Kitchen", "Sports & Fitness")).
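
The four steps can be tried on a handful of responses before running them on 100 customers. The sketch below uses five hypothetical responses and sets the levels explicitly in survey order; note that if the levels were left to R's default (alphabetical) ordering, Books would already come first.

```r
categories <- c("Electronics", "Clothing", "Home & Kitchen", "Books")

# Five hypothetical survey responses
responses <- c("Books", "Electronics", "Clothing", "Electronics", "Home & Kitchen")

# Step 1: store the responses as a factor with explicit levels
survey_data <- factor(responses, levels = categories)

# Step 2: count the customers who prefer Electronics
electronics_count <- sum(survey_data == "Electronics")

# Step 3: add a new level; existing values are unchanged
levels(survey_data) <- c(levels(survey_data), "Sports & Fitness")
counts <- table(survey_data)  # the new level appears with a count of 0

# Step 4: move "Books" to the front, keeping the other levels in order
survey_data <- factor(survey_data,
                      levels = c("Books", setdiff(levels(survey_data), "Books")))

print(counts)
print(levels(survey_data))
```

Using table() alongside sum() is a design choice: it reports every level at once, including levels that no customer has chosen yet.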

Source code:

# Step 1: Create a vector of customer preferences (for illustration purposes, we randomly
# generate preferences)
set.seed(123) # For reproducibility
categories <- c("Electronics", "Clothing", "Home & Kitchen", "Books")

# Simulate survey data for 100 customers (randomly chosen category for each customer)
customer_preferences <- sample(categories, 100, replace = TRUE)

# Convert the customer preferences to a factor


customer_preferences_factor <- factor(customer_preferences, levels = categories)

# Step 2: Determine how many customers prefer Electronics


electronics_count <- sum(customer_preferences_factor == "Electronics")

# Print the result for Electronics


print(paste("Number of customers who prefer Electronics:", electronics_count))

# Step 3: Add a new category, Sports & Fitness


levels(customer_preferences_factor) <- c(levels(customer_preferences_factor),
                                         "Sports & Fitness")

# Step 4: Reorder the factor levels so that "Books" appears first
customer_preferences_factor <- factor(customer_preferences_factor, levels = c("Books",
levels(customer_preferences_factor)[levels(customer_preferences_factor) != "Books"]))

# Print the updated factor levels and the new customer preferences
print("Updated customer preferences (with reordered levels):")
print(customer_preferences_factor)

Output:

(f) Sales Data Analysis using Data Frames


A sales dataset contains the following columns:
Customer_ID: Unique ID for each customer
Product: Name of the product purchased
Price: Purchase price
Quantity: Number of items bought
Purchase_Date: Date of purchase
Questions:
1. Write R code to create this dataset as a data frame with at least 5 sample
records.
2. How can records be filtered where the Price is greater than $100?
3. Write R code to compute the total revenue (Price × Quantity) for each
purchase.

4. How can a new column called Discounted_Price be added, applying a 10%
discount on purchases where Price is greater than $50?
5. How can the most purchased product be determined from the dataset?

Procedure:

Step 1 : Create a data frame using data.frame() with at least 5 sample records.
Step 2 : Filter records where Price > 100 using subset(sales_data, Price > 100).
Step 3 : Compute total revenue by adding a new column: sales_data$Total_Revenue <-
sales_data$Price * sales_data$Quantity.
Step 4 : Add Discounted_Price column using ifelse(sales_data$Price > 50,
sales_data$Price * 0.9, sales_data$Price).
Step 5 : Determine the most purchased product using
which.max(table(sales_data$Product)).
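
"Most purchased" in Step 5 can be read in two ways: the product that appears in the most purchase records, or the product with the highest total quantity sold. The sketch below contrasts the two on a small hypothetical data frame (product names and quantities are made up for illustration); which reading applies depends on what the analysis needs.

```r
# Hypothetical purchases: Mouse appears twice, Laptop once with a large quantity
sales_data <- data.frame(
  Product  = c("Mouse", "Laptop", "Mouse", "Keyboard"),
  Quantity = c(1, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Reading 1: the product with the most purchase records.
# which.max() returns a named index, so names() recovers the product.
product_counts <- table(sales_data$Product)
most_frequent <- names(which.max(product_counts))

# Reading 2: the product with the highest total quantity sold
qty_by_product <- aggregate(Quantity ~ Product, data = sales_data, FUN = sum)
most_units <- qty_by_product$Product[which.max(qty_by_product$Quantity)]

cat("Most purchase records:", most_frequent, "\n")
cat("Most units sold:", most_units, "\n")
```

Here the two readings disagree: Mouse has the most records (2), while Laptop has the most units (5). When every product appears exactly once, the record counts all tie and which.max() simply returns the first one.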

Source code:

# Step 1: Create the sales dataset as a data frame with at least 5 sample records
sales_data <- data.frame(
Customer_ID = c(101, 102, 103, 104, 105),
Product = c("Laptop", "Headphones", "Keyboard", "Mouse", "Monitor"),
Price = c(1200, 80, 50, 30, 200),
Quantity = c(1, 2, 1, 3, 1),
Purchase_Date = as.Date(c("2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04",
"2025-01-05"))
)

# Print the original sales data


print("Original Sales Data:")
print(sales_data)

# Step 2: Filter records where the Price is greater than $100


filtered_data <- sales_data[sales_data$Price > 100, ]
print("Filtered Data (Price > $100):")
print(filtered_data)

# Step 3: Compute the total revenue (Price × Quantity) for each purchase and add it as a
# new column
sales_data$Total_Revenue <- sales_data$Price * sales_data$Quantity
print("Sales Data with Total Revenue:")
print(sales_data)

# Step 4: Add a new column called Discounted_Price with a 10% discount where Price > $50
sales_data$Discounted_Price <- ifelse(sales_data$Price > 50, sales_data$Price * 0.9,
sales_data$Price)
print("Sales Data with Discounted Price (for Price > $50):")
print(sales_data)

# Step 5: Determine the most purchased product (highest total quantity sold)
product_sales <- aggregate(Quantity ~ Product, data = sales_data, sum)
most_purchased_product <- product_sales[which.max(product_sales$Quantity), ]
print("Most Purchased Product:")
print(most_purchased_product)

Output:

RESULT:
Thus data manipulations in R have been studied and executed successfully.
