Assignment

Uploaded by

shafaq tanveer
Copyright
© All Rights Reserved

Titanic Data Preprocessing

EMAN

r Sys.Date()

{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and
MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the
output of any embedded R code chunks within the document. You can embed an R code chunk like this:
{r cars}
summary(cars)

Including Plots
You can also embed plots, for example:
{r pressure, echo=FALSE}
plot(pressure)
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that
generated the plot.
library(dplyr)
library(tidyr)
library(caret)
library(knitr)
library(rmarkdown)

Load the dataset


titanic_data <- read.csv("titanic.csv")

Inspect the first few rows and structure


head(titanic_data)
str(titanic_data)
View(titanic_data) # interactive data viewer; pass the data frame, not the file name

Handling missing values


missing_values <- sapply(titanic_data, function(x) sum(is.na(x)))
missing_values
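
One caveat worth checking here: `read.csv()` keeps blank text fields as empty strings rather than `NA`, so counting with `is.na()` alone can miss them. The sketch below counts both kinds of gap on a small made-up data frame; `toy` and `count_missing` are illustrative names, and the values are not taken from titanic.csv.

```r
# Toy data frame standing in for titanic_data (values are made up)
toy <- data.frame(
  Age = c(22, NA, 35, NA),
  Embarked = c("S", "", "C", NA),
  stringsAsFactors = FALSE
)

# Count NA values plus empty strings in each column
count_missing <- function(x) sum(is.na(x) | (is.character(x) & x == ""))
missing_counts <- sapply(toy, count_missing)
missing_counts
```

An alternative is to pass `na.strings = c("", "NA")` to `read.csv()` so that blanks arrive as `NA` in the first place.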

Imputing missing values (median)


titanic_data$Age[is.na(titanic_data$Age)] <- median(titanic_data$Age, na.rm = TRUE)

Imputing the most frequent value (mode)
most_frequent_embarked <- as.character(names(sort(table(titanic_data$Embarked), decreasing = TRUE)[1]))
titanic_data$Embarked[is.na(titanic_data$Embarked)] <- most_frequent_embarked
# Drop the Cabin column
titanic_data <- titanic_data %>% select(-Cabin)
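
As a rough illustration of mode imputation, the sketch below fills gaps in a made-up character vector. `mode_value` is a hypothetical helper, not part of any package, and it assumes that empty strings should be treated as missing alongside `NA`.

```r
# Hypothetical helper: most frequent non-missing value of a vector
mode_value <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(sort(table(x), decreasing = TRUE))[1]
}

embarked <- c("S", "C", "S", NA, "", "Q", "S")  # made-up values
fill <- mode_value(embarked)

# Replace both NA and empty-string entries with the mode
embarked[is.na(embarked) | embarked == ""] <- fill
embarked
```

If two values tie for most frequent, this picks whichever `sort()` places first, so the choice in that case is arbitrary.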

Encode categorical variables


titanic_data$Sex <- as.factor(titanic_data$Sex)
titanic_data$Embarked <- as.factor(titanic_data$Embarked)
titanic_data <- titanic_data %>%
  mutate(Sex = as.numeric(Sex == "female"),
         Embarked_S = as.numeric(Embarked == "S"),
         Embarked_C = as.numeric(Embarked == "C"),
         Embarked_Q = as.numeric(Embarked == "Q")) %>%
  select(-Embarked)
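
The same kind of indicator columns can also be produced without writing one comparison per level. The sketch below uses base R's `model.matrix()` on a made-up factor; it is an alternative shown for comparison, not the approach used above, and the `Embarked_` column names are assigned by hand to mirror the ones in this document.

```r
embarked <- factor(c("S", "C", "Q", "S"))  # made-up values

# "- 1" drops the intercept so every level gets its own indicator column
dummies <- model.matrix(~ embarked - 1)
colnames(dummies) <- paste0("Embarked_", levels(embarked))
dummies
```

Keeping one column per level means the columns always sum to 1 in each row; for linear models it is common to drop one of them to avoid that redundancy.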

Feature engineering
titanic_data$FamilySize <- titanic_data$SibSp + titanic_data$Parch + 1

Create ‘IsAlone’ feature


titanic_data$IsAlone <- ifelse(titanic_data$FamilySize == 1, 1, 0)
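
To make the arithmetic concrete, here is a small worked example on three made-up passengers (the `toy` data frame is illustrative only). The `+ 1` counts the passenger themselves, so someone with no siblings, spouse, parents, or children aboard has a family size of 1 and is flagged as alone.

```r
# Three made-up passengers
toy <- data.frame(SibSp = c(1, 0, 3), Parch = c(0, 0, 2))

# Family size = siblings/spouses + parents/children + the passenger
toy$FamilySize <- toy$SibSp + toy$Parch + 1

# 1 if travelling alone, 0 otherwise
toy$IsAlone <- ifelse(toy$FamilySize == 1, 1, 0)
toy
```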

Drop unnecessary columns


titanic_data <- titanic_data %>% select(-PassengerId, -Name, -Ticket)

Splitting the dataset


set.seed(123) # For reproducibility
train_index <- createDataPartition(titanic_data$Survived, p = 0.8, list = FALSE)
train_data <- titanic_data[train_index, ]
test_data <- titanic_data[-train_index, ]
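
A quick way to see what the split is doing is to compare class shares in the two partitions. The sketch below runs `createDataPartition()` on a made-up outcome vector `y`; it passes the outcome as a factor, which is an assumption on top of the code above, so that caret samples within each class rather than within percentile groups of a numeric vector.

```r
library(caret)

set.seed(123)
# Made-up outcome: 60 zeros and 40 ones
y <- factor(rep(c(0, 1), times = c(60, 40)))
idx <- createDataPartition(y, p = 0.8, list = FALSE)

# Class shares should be close in both partitions
prop.table(table(y[idx]))
prop.table(table(y[-idx]))
```

If the shares differ noticeably between the two tables on real data, that is a hint the split was not stratified the way you expected.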

Output the dimensions of the training and testing sets


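dim(train_data) # rows and columns in the training set
dim(test_data) # rows and columns in the testing set
table(train_data$Survived) # class counts in the training set
table(test_data$Survived) # class counts in the testing set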
