DS EXP3
NO:3
Data manipulation in R
DATE:
Aim:
To study data manipulation operations in R and execute them.
Source code:
(a) XYZ Online Store is analysing its E-commerce Sales Dataset to understand
trends, correct errors, and make predictions. The dataset contains 30 records with
details like Order ID, Date, Category, Product, Revenue, Quantity Sold, and
Customer Rating. To analyse sales performance, determine the total number of
records, sort the dataset by descending revenue, and identify the top 5 highest
revenue-generating products. Extract specific sales values, such as orders with
revenue above ₹10,000, count products sold in the Electronics category, and
retrieve orders with customer ratings of 4.5 or higher. Identify and correct data
entry errors, such as a mistakenly entered revenue value of ₹50 instead of ₹500.
Finally, for predictive analysis, simulate a 90-day sales cycle by repeating the
existing dataset for the next two months. Perform these operations using R to gain
actionable insights from the sales data.
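The operations described above can be sketched in base R as follows. The dataset itself is not reproduced in this record, so the sketch builds a hypothetical 30-row `sales` data frame; the column names, product names, and all values are assumptions for illustration only. "Products sold in the Electronics category" is interpreted here as the number of Electronics orders, which is also an assumption.

```r
# Hypothetical 30-record dataset (all names and values are illustrative assumptions)
set.seed(1)
sales <- data.frame(
  Order_ID = 1:30,
  Date     = as.Date("2025-01-01") + 0:29,
  Category = rep(c("Electronics", "Clothing", "Books"), times = 10),
  Product  = paste0("Product_", 1:30),
  Revenue  = round(runif(30, 500, 20000)),
  Quantity = sample(1:10, 30, replace = TRUE),
  Rating   = round(runif(30, 3, 5), 1)
)
sales$Revenue[4] <- 50   # simulate the data entry error (50 entered instead of 500)

# Total number of records
total_records <- nrow(sales)

# Sort by descending revenue and keep the top 5 products
sorted_sales <- sales[order(-sales$Revenue), ]
top5 <- head(sorted_sales, 5)

# Extract specific values
high_revenue      <- subset(sales, Revenue > 10000)        # orders above 10,000
electronics_count <- sum(sales$Category == "Electronics")  # Electronics orders
high_rated        <- subset(sales, Rating >= 4.5)          # ratings of 4.5 or higher

# Correct the data entry error: 50 should have been 500
sales$Revenue[sales$Revenue == 50] <- 500

# Simulate a 90-day cycle: the original month plus two repeated months
sales_90 <- sales[rep(seq_len(nrow(sales)), times = 3), ]
sales_90$Date <- sales_90$Date + rep(c(0, 30, 60), each = nrow(sales))
rownames(sales_90) <- NULL

print(total_records)
print(top5[, c("Product", "Revenue")])
print(nrow(sales_90))
```

Shifting the dates by 0, 30, and 60 days is one way to make the repeated records cover the next two months; plain `rep()` without the date shift would also satisfy the task as stated.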
Procedure:
OUTPUT:
Procedure:
Step 1 : Create a list to store participant details like ID, Name, Event, and
Registration Status.
Step 2 : Add participants by appending new data to the list.
Step 3 : Remove participants by using NULL or subset operations.
Step 4 : Update registration status by modifying the respective list elements.
Step 5 : Filter data based on specific criteria (e.g., confirmed registrations).
Step 6 : Process data dynamically using loops or functions for analysis or
updates.
Source code:
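A minimal sketch of the six steps above, assuming each participant is stored as a small named list inside an outer list. The participant names, events, and statuses are hypothetical.

```r
# Step 1: a list of participants; each participant is itself a named list
# (all names, events, and statuses are hypothetical)
participants <- list(
  list(ID = 1, Name = "Asha",  Event = "Coding",   Status = "Confirmed"),
  list(ID = 2, Name = "Ravi",  Event = "Quiz",     Status = "Pending"),
  list(ID = 3, Name = "Meena", Event = "Robotics", Status = "Pending")
)

# Step 2: add a participant by appending to the end of the list
participants[[length(participants) + 1]] <-
  list(ID = 4, Name = "John", Event = "Quiz", Status = "Pending")

# Step 3: remove the participant with ID 2 by assigning NULL
ids <- sapply(participants, function(p) p$ID)
participants[[which(ids == 2)]] <- NULL

# Step 4: update the registration status of participant ID 4
for (i in seq_along(participants)) {
  if (participants[[i]]$ID == 4) {
    participants[[i]]$Status <- "Confirmed"
  }
}

# Step 5: keep only confirmed registrations
confirmed <- Filter(function(p) p$Status == "Confirmed", participants)

# Step 6: process the filtered list in a loop
for (p in confirmed) {
  cat(p$ID, p$Name, p$Event, p$Status, "\n")
}
```

Assigning `NULL` with `[[ ]]` deletes the element and shifts the remaining ones down, so indices computed before the removal should not be reused afterwards.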
OUTPUT:
Procedure:
Step 1: Create a matrix where rows represent cities and columns represent temperature
readings over multiple days.
Step 2 : Initialize variables to store the hottest and coldest city indices based on average
temperature.
Step 3 : Use nested loops to iterate through each row (city) and calculate the average
temperature.
Step 4 : Compare averages to determine the hottest and coldest city.
Step 5 : Store and display the results with city names and their corresponding average
temperatures.
Source code:
# Rows are cities, columns are daily temperature readings
temperature_matrix <- matrix(c(30, 32, 33, 29, 31, 28, 24, 25, 26, 27, 35, 33), nrow = 3,
                             byrow = TRUE)
hottest_city_index <- 0
coldest_city_index <- 0
hottest_avg_temp <- -Inf
coldest_avg_temp <- Inf
# Compare each city's average against the running hottest and coldest values
for (i in seq_len(nrow(temperature_matrix))) {
  avg_temp <- mean(temperature_matrix[i, ])
  if (avg_temp > hottest_avg_temp) {
    hottest_avg_temp <- avg_temp
    hottest_city_index <- i
  }
  if (avg_temp < coldest_avg_temp) {
    coldest_avg_temp <- avg_temp
    coldest_city_index <- i
  }
}
cat("The hottest city is city", hottest_city_index, "with an average temperature of",
    hottest_avg_temp, "\n")
cat("The coldest city is city", coldest_city_index, "with an average temperature of",
    coldest_avg_temp, "\n")
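The procedure mentions nested loops and city names, while the listing above uses mean() on each row and reports city indices. As a sketch of the nested-loop variant, the following sums each row manually and labels the results. The city names are hypothetical, since none are given in this record.

```r
# Same readings as in the listing above: 3 cities x 4 days
temperature_matrix <- matrix(c(30, 32, 33, 29, 31, 28, 24, 25, 26, 27, 35, 33),
                             nrow = 3, byrow = TRUE)
# Hypothetical city names (not given in the source)
city_names <- c("City_A", "City_B", "City_C")

avg_temps <- numeric(nrow(temperature_matrix))
for (i in seq_len(nrow(temperature_matrix))) {     # outer loop: cities
  total <- 0
  for (j in seq_len(ncol(temperature_matrix))) {   # inner loop: days
    total <- total + temperature_matrix[i, j]
  }
  avg_temps[i] <- total / ncol(temperature_matrix)
}
names(avg_temps) <- city_names

# Pick the city names with the highest and lowest averages
hottest <- names(avg_temps)[which.max(avg_temps)]
coldest <- names(avg_temps)[which.min(avg_temps)]
cat("The hottest city is", hottest, "with an average temperature of", avg_temps[hottest], "\n")
cat("The coldest city is", coldest, "with an average temperature of", avg_temps[coldest], "\n")
```

For these readings the row averages are 31, 27, and 30.25, so the first city is the hottest and the second is the coldest.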
Output:
3. How can you extract the soil temperature readings for Farm_C on Day 7 at
all depths?
4. If a new farm (Farm_E) is added to the dataset, how would you modify the
array to include its data?
Procedure:
Step 1 : Create an array using array() to store soil temperature data with dimensions (14
days, 4 farms, 3 depths).
Step 2 : Write a function that calculates the average soil temperature at each depth by
applying apply() across farms and days.
Step 3 : Extract Farm_C data for Day 7 at all depths using indexing (array[7, 3, ]).
Step 4 : Modify the array by increasing the second dimension to 5 (adding Farm_E) and
updating data accordingly.
Source code:
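# Step 1: Create an array to store soil temperature data for all farmlands over two weeks at three soil depths
farms <- c("Farm_A", "Farm_B", "Farm_C", "Farm_D")
soil_temp_array <- array(NA_real_, dim = c(14, 4, 3), dimnames = list(NULL, farms, NULL))
# Step 2: Generate random temperature values for each farm at different depths (for illustration)
soil_temp_array[ , "Farm_A", ] <- matrix(runif(42, 18, 30), ncol = 3) # Farm_A
soil_temp_array[ , "Farm_B", ] <- matrix(runif(42, 15, 28), ncol = 3) # Farm_B
soil_temp_array[ , "Farm_C", ] <- matrix(runif(42, 16, 29), ncol = 3) # Farm_C
soil_temp_array[ , "Farm_D", ] <- matrix(runif(42, 17, 31), ncol = 3) # Farm_D
# Step 3: Function to compute the average soil temperature at each depth across all farms
average_by_depth <- function(temps) apply(temps, 3, mean)
print(average_by_depth(soil_temp_array))
# Step 4: Extract the soil temperature readings for Farm_C on Day 7 at all depths
print(soil_temp_array[7, "Farm_C", ])
# Step 5: Add a new farm (Farm_E) to the dataset and modify the array
extended_array <- array(NA_real_, dim = c(14, 5, 3), dimnames = list(NULL, c(farms, "Farm_E"), NULL))
extended_array[ , farms, ] <- soil_temp_array
extended_array[ , "Farm_E", ] <- matrix(runif(42, 16, 30), ncol = 3) # Farm_E (assumed illustrative range)
print(dim(extended_array))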
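Because the soil readings are random, the behaviour of apply() and of three-dimensional indexing is easier to check on a small array with known values. The following sketch uses a toy 2 x 2 x 3 array filled with 1 to 12. The array, its size, and the name `Farm_X` are assumptions for illustration only.

```r
# Toy array: 2 days x 2 farms x 3 depths, filled with 1..12 (illustrative values)
toy <- array(1:12, dim = c(2, 2, 3),
             dimnames = list(NULL, c("Farm_A", "Farm_B"), NULL))

# MARGIN = 3 keeps the depth dimension and averages over days and farms
depth_means <- apply(toy, 3, mean)
print(depth_means)            # 2.5 6.5 10.5

# All depths for Farm_B on day 2
farm_b_day2 <- toy[2, "Farm_B", ]
print(farm_b_day2)            # 4 8 12

# Grow the farm dimension by one: allocate a larger array and copy the old data
bigger <- array(NA_real_, dim = c(2, 3, 3),
                dimnames = list(NULL, c("Farm_A", "Farm_B", "Farm_X"), NULL))
bigger[ , c("Farm_A", "Farm_B"), ] <- toy
print(dim(bigger))            # 2 3 3
```

The new farm's slice stays `NA` until readings are assigned, which makes missing data visible rather than silently filled.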
Output:
Procedure:
Step 1 : Store categorical data using a factor in R for efficient analysis and memory
optimization.
Step 2 : Count Electronics preferences using sum(survey_data == "Electronics").
Step 3 : Add a new category by updating levels with levels(survey_data) <-
c(levels(survey_data), "Sports & Fitness").
Step 4 : Reorder factor levels using survey_data <- factor(survey_data, levels = c("Books",
"Electronics", "Clothing", "Home & Kitchen", "Sports & Fitness")).
Source code:
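categories <- c("Electronics", "Clothing", "Books", "Home & Kitchen") # categories named in the procedure
# Simulate survey data for 100 customers (randomly chosen category for each customer)
customer_preferences <- sample(categories, 100, replace = TRUE)
customer_preferences_factor <- factor(customer_preferences, levels = categories)
print(sum(customer_preferences_factor == "Electronics")) # customers who prefer Electronics
# Add the "Sports & Fitness" category and reorder the factor levels
customer_preferences_factor <- factor(customer_preferences_factor, levels = c("Books", "Electronics", "Clothing", "Home & Kitchen", "Sports & Fitness"))
# Print the updated factor levels and the new customer preferences
print("Updated customer preferences (with reordered levels):")
print(customer_preferences_factor)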
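Since the simulated survey is random, the factor operations from the procedure can be checked more easily on a short hand-written vector. This is a sketch only; the five responses are illustrative assumptions.

```r
# Small hand-written survey (illustrative responses)
survey_data <- factor(c("Books", "Electronics", "Clothing",
                        "Electronics", "Home & Kitchen"))
original_values <- as.character(survey_data)

# Count customers who prefer Electronics
electronics_count <- sum(survey_data == "Electronics")
print(electronics_count)   # 2

# Add a new level; appending to levels() leaves existing values unchanged
levels(survey_data) <- c(levels(survey_data), "Sports & Fitness")

# Reorder the levels with factor(); the stored values are preserved
survey_data <- factor(survey_data,
                      levels = c("Books", "Electronics", "Clothing",
                                 "Home & Kitchen", "Sports & Fitness"))

print(levels(survey_data))
print(table(survey_data))
```

Reordering is done with factor(..., levels = ...) rather than by assigning a permuted vector to levels(), because assigning to levels() renames the existing levels in place and would relabel the data.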
Output:
Procedure:
Step 1 : Create a data frame using data.frame() with at least 5 sample records.
Step 2 : Filter records where Price > 100 using subset(sales_data, Price > 100).
Step 3 : Compute total revenue by adding a new column: sales_data$Total_Revenue <-
sales_data$Price * sales_data$Quantity.
Step 4 : Add Discounted_Price column using ifelse(sales_data$Price > 50,
sales_data$Price * 0.9, sales_data$Price).
Step 5 : Determine the most purchased product by summing Quantity per product with
aggregate() and selecting the largest total with which.max().
Source code:
# Step 1: Create the sales dataset as a data frame with at least 5 sample records
sales_data <- data.frame(
  Customer_ID = c(101, 102, 103, 104, 105),
  Product = c("Laptop", "Headphones", "Keyboard", "Mouse", "Monitor"),
  Price = c(1200, 80, 50, 30, 200),
  Quantity = c(1, 2, 1, 3, 1),
  Purchase_Date = as.Date(c("2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04",
                            "2025-01-05"))
)
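# Step 2: Filter the purchases where Price is greater than 100
high_price_sales <- subset(sales_data, Price > 100)
print("Purchases with Price > 100:")
print(high_price_sales)
# Step 3: Compute the total revenue (Price × Quantity) for each purchase and add it as a new column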
sales_data$Total_Revenue <- sales_data$Price * sales_data$Quantity
print("Sales Data with Total Revenue:")
print(sales_data)
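# Step 4: Add a Discounted_Price column (10% off when Price > 50, otherwise unchanged)
sales_data$Discounted_Price <- ifelse(sales_data$Price > 50, sales_data$Price * 0.9, sales_data$Price)
print("Sales Data with Discounted Price:")
print(sales_data)
# Step 5: Determine the most purchased product (highest total quantity sold)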
product_sales <- aggregate(Quantity ~ Product, data = sales_data, sum)
most_purchased_product <- product_sales[which.max(product_sales$Quantity), ]
print("Most Purchased Product:")
print(most_purchased_product)
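Two readings of "most purchased" are possible: the product that appears in the most orders, or the product with the highest total quantity. The sketch below contrasts table() with aggregate() on a hypothetical `purchases` data frame constructed so that the two answers differ.

```r
# Hypothetical purchases: Keyboard is ordered three times in small quantities,
# Mouse is ordered once in a large quantity
purchases <- data.frame(
  Product  = c("Keyboard", "Keyboard", "Keyboard", "Mouse"),
  Quantity = c(1, 1, 1, 10)
)

# table() counts rows per product: the most frequently ordered product
most_frequent <- names(which.max(table(purchases$Product)))

# aggregate() sums units per product: the product with the highest total quantity
totals <- aggregate(Quantity ~ Product, data = purchases, sum)
most_units <- as.character(totals$Product[which.max(totals$Quantity)])

print(most_frequent)   # "Keyboard"
print(most_units)      # "Mouse"
```

In the five-record dataset above every product appears exactly once, so table() would give a tie and which.max() would simply return the first level; summing Quantity avoids that ambiguity.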
Output: