
21NKU14 - Preprocessing Assignment

The document outlines 6 steps for pre-processing a dataset: 1) Understanding the data by viewing dimensions, column names, and structure; 2) Looking at the data by viewing the first/last rows; 3) Visualizing the data with bar plots and box plots; 4) Dealing with outliers by capping extreme values; 5) Dealing with missing values by imputing the mean for total profit; and 6) Scaling features like total profit for machine learning algorithms.

Uploaded by

S.K. Praveen

“PRE-PROCESSING STEPS”

STEP 1: UNDERSTANDING THE DATA:

1.VIEW THE DIMENSIONS

> class(X100_Data)

[1] "spec_tbl_df" "tbl_df" "tbl"

[4] "data.frame"

> dim(X100_Data)

[1] 100 14

dim() returns the dimensions of the dataset: 100 rows and 14 columns.

> nrow(X100_Data)

[1] 100

nrow() returns only the number of rows.

> ncol(X100_Data)

[1] 14

ncol() returns only the number of columns.

2.VIEW THE COLUMN NAMES

> names(X100_Data)

[1] "Region" "Country" "Item Type"

[4] "Sales Channel" "Order Priority" "Order Date"

[7] "Order ID" "Ship Date" "Units Sold"

[10] "Unit Price" "Unit Cost" "Total Revenue"

[13] "Total Cost" "Total Profit"

names() lists the column headings of the dataset: 14 headings, one for each of the 14 columns.
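Several of these column names contain spaces, so they must be quoted with backticks (or passed as strings to [[ ]]) when a column is accessed. A minimal sketch on a made-up data frame; the name toy and its values are assumptions for illustration, not rows from X100_Data:

```r
# Toy data frame; the values are made up for illustration.
toy <- data.frame(
  `Item Type`  = c("Cereal", "Fruits"),
  `Units Sold` = c(10, 20),
  check.names  = FALSE  # keep the spaces in the column names
)

names(toy)                      # column names, spaces preserved
units_a <- toy$`Units Sold`     # backticks quote a name containing a space
units_b <- toy[["Units Sold"]]  # equivalent access using a string
```

The same backtick form is what the later steps use for X100_Data$`Total Profit`.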

3.STRUCTURE OF THE DATASET


> str(X100_Data)

spec_tbl_df [100 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

$ Region : chr [1:100] "Australia and Oceania" "Central America and the Caribbean" "Europe" "Sub-Saharan Africa" ...

$ Country : chr [1:100] "Tuvalu" "Grenada" "Russia" "Sao Tome and Principe" ...

$ Item Type : chr [1:100] "Baby Food" "Cereal" "Office Supplies" "Fruits" ...

$ Sales Channel : chr [1:100] "Offline" "Online" "Offline" "Online" ...

$ Order Priority: chr [1:100] "H" "C" "L" "C" ...

$ Order Date : chr [1:100] "5/28/2010" "8/22/2012" "05-02-2014" "6/20/2014" ...

$ Order ID : num [1:100] 6.69e+08 9.64e+08 3.41e+08 5.14e+08 1.15e+08 ...

$ Ship Date : chr [1:100] "6/27/2010" "9/15/2012" "05-08-2014" "07-05-2014" ...

$ Units Sold : num [1:100] 9925 2804 1779 8102 5062 ...

$ Unit Price : num [1:100] 255.28 205.7 651.21 9.33 651.21 ...

$ Unit Cost : num [1:100] 159.42 117.11 524.96 6.92 524.96 ...

$ Total Revenue : num [1:100] 2533654 576783 1158503 75592 3296425 ...

$ Total Cost : num [1:100] 1582244 328376 933904 56066 2657348 ...

$ Total Profit : num [1:100] 951411 248406 224599 19526 639078 ...

- attr(*, "spec")=

.. cols(

.. Region = col_character(),

.. Country = col_character(),

.. `Item Type` = col_character(),

.. `Sales Channel` = col_character(),

.. `Order Priority` = col_character(),

.. `Order Date` = col_character(),


.. `Order ID` = col_double(),

.. `Ship Date` = col_character(),

.. `Units Sold` = col_double(),

.. `Unit Price` = col_double(),

.. `Unit Cost` = col_double(),

.. `Total Revenue` = col_double(),

.. `Total Cost` = col_double(),

.. `Total Profit` = col_double()

.. )

- attr(*, "problems")=<externalptr>

The structure of the dataset shows which columns are character and which are numeric. For example,
Unit Price is numeric (e.g., 255.28) and Sales Channel is character (e.g., "Offline").

> summary(X100_Data)

Region Country

Length:100 Length:100

Class :character Class :character

Mode :character Mode :character

Item Type Sales Channel

Length:100 Length:100

Class :character Class :character

Mode :character Mode :character

Order Priority Order Date

Length:100 Length:100
Class :character Class :character

Mode :character Mode :character

Order ID Ship Date Units Sold

Min. :114606559 Length:100 Min. : 124

1st Qu.:338922488 Class :character 1st Qu.:2836

Median :557708561 Mode :character Median :5382

Mean :555020412 Mean :5129

3rd Qu.:790755081 3rd Qu.:7369

Max. :994022214 Max. :9925

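For each numeric column, summary() reports the minimum, 1st quartile, median, mean, 3rd quartile,
and maximum. For character columns it reports only Length, Class, and Mode, where Mode is the
storage mode and not the statistical mode. A count of NA values appears only for columns that contain them.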
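The behaviour of summary() on numeric versus character data, and the NA count, can be seen on a small example. This is a sketch on made-up vectors (profit and the channel names are assumptions, not values from X100_Data):

```r
# Toy numeric vector with one missing value (made-up data).
profit <- c(100, 250, NA, 400, 900)

s <- summary(profit)  # Min, quartiles, Median, Mean, Max, plus an NA's entry
print(s)

# For a character vector, summary() reports Length, Class and Mode instead.
print(summary(c("Online", "Offline")))
```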

STEP 2: LOOKING AT THE DATA:

> head(X100_Data)

> head(X100_Data,n=15)
head() shows the first 6 rows by default. The command head(X100_Data, n=15) shows
the first 15 rows.

> tail(X100_Data)

It shows the bottom 6 rows by default.
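The row-selection behaviour of head() and tail() can be checked on a small made-up data frame (toy is an assumption for illustration, not the assignment's dataset):

```r
# Toy data frame with 20 rows (made-up values).
toy <- data.frame(id = 1:20, value = (1:20)^2)

first6   <- head(toy)          # default n = 6
first3   <- head(toy, n = 3)   # first 3 rows
last6    <- tail(toy)          # last 6 rows
no_last2 <- head(toy, n = -2)  # a negative n drops the last 2 rows
```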

STEP 3: VISUALIZING THE DATA:

1.Bar plot:

> Regiontable<-table(X100_Data$Region)

> Regiontable

> barplot(Regiontable,col = c("violet","Red","Green","black","Blue","Yellow","Orange"),xlab = "Region",ylab = "Count")
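The bar plot is built from a frequency table, so the bar heights are counts per region. A minimal sketch of that pattern, assuming a made-up region vector (the values and the sorting step are illustrative, not taken from the assignment):

```r
# Made-up region labels for illustration.
region <- c("Europe", "Asia", "Europe", "Sub-Saharan Africa", "Asia", "Europe")

counts <- table(region)                           # frequency of each region
counts_sorted <- sort(counts, decreasing = TRUE)  # tallest bar first

# One colour per bar; the bar heights are the counts.
barplot(counts_sorted, col = c("violet", "red", "green"),
        xlab = "Region", ylab = "Count")
```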

2.Box plot:

> par(mfrow=c(1,2))

> boxplot(X100_Data$`Total Profit`)


> par(mfrow=c(1,1))

The box plot shows outliers in Total Profit (points plotted beyond the whiskers).
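The points a box plot draws as outliers can also be listed numerically with boxplot.stats(), which applies the same 1.5 * IQR rule. A sketch on made-up values (profit is an assumption, not the Total Profit column):

```r
# Made-up profits with one extreme value.
profit <- c(120, 150, 180, 200, 210, 230, 260, 5000)

stats <- boxplot.stats(profit)  # same rule boxplot() uses (coef = 1.5)
outliers <- stats$out           # values that would be drawn as individual points
print(outliers)
```

Note that boxplot.stats() works from hinges, which can differ slightly from the quartiles returned by quantile().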

STEP 4: DEALING WITH OUTLIERS:

> X100_Data1<-X100_Data

> boxplot(X100_Data$`Total Profit`)

> boxplot(X100_Data$`Total Profit`[X100_Data$`Total Profit`<1000000])

> boxplot(X100_Data$`Total Profit`, horizontal = TRUE)

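> x<-X100_Data$`Total Profit`

> qnt<-quantile(x,probs = c(.25,.75),na.rm=TRUE)

> caps<-quantile(x,probs=c(.05,.95),na.rm=TRUE)

> H <- 1.5 * IQR(x, na.rm = TRUE)

> x[x < (qnt[1] - H)] <- caps[1]

> x[x > (qnt[2] + H)] <- caps[2]

> `Total Profit`<-x

> boxplot(`Total Profit`,main="Boxplot of Total Profit",horizontal=TRUE,col='Grey')

Values below Q1 - 1.5*IQR are replaced with the 5th percentile and values above Q3 + 1.5*IQR with the
95th percentile. The capped values are stored in the vector `Total Profit`; X100_Data and X100_Data1 are
left unchanged.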
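The capping steps above can be wrapped in a small function. This is a sketch only: cap_outliers and its arguments are hypothetical names introduced here, and the input values are made up.

```r
# Hypothetical helper (not part of the original assignment).
cap_outliers <- function(x, lower = 0.05, upper = 0.95, coef = 1.5) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)    # Q1 and Q3
  caps <- quantile(x, probs = c(lower, upper), na.rm = TRUE)  # replacement values
  H    <- coef * IQR(x, na.rm = TRUE)                         # fence distance

  low  <- !is.na(x) & x < qnt[1] - H   # below the lower fence
  high <- !is.na(x) & x > qnt[2] + H   # above the upper fence
  x[low]  <- caps[1]
  x[high] <- caps[2]
  x
}

profit <- c(1:19 * 10, 5000)    # made-up values: 10, 20, ..., 190 and one extreme
capped <- cap_outliers(profit)
```

To keep the result in a data frame, the returned vector would have to be assigned back to a column, for example X100_Data1$`Total Profit` <- cap_outliers(X100_Data1$`Total Profit`).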

STEP 5: DEALING WITH MISSING VALUES:

> any(is.na(X100_Data))

[1] FALSE

> sum(is.na(X100_Data[]))

[1] 0

> colSums(is.na(X100_Data[]))

> nrow(X100_Data)

[1] 100

> nrow(X100_Data1)
[1] 100

> m<-mean(X100_Data1$`Total Profit`,na.rm=TRUE)

> m

[1] 441682

The checks above show that the dataset contains no missing (NA) values, so no imputation is actually
needed here. The mean of Total Profit (441682) is the value that would be used to fill in any missing
Total Profit entries.
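Mean imputation itself can be sketched on a vector that does contain missing values. The vector profit below is made up for illustration, since X100_Data has no NA values:

```r
# Made-up profits with two missing values.
profit <- c(100, NA, 300, NA, 500)

m <- mean(profit, na.rm = TRUE)  # mean of the observed values only
imputed <- profit
imputed[is.na(imputed)] <- m     # replace each NA with that mean
```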

STEP 6: SCALING THE FEATURES:

1.To display the data in vector X:

> x<-X100_Data$`Total Profit`

> x

Feature scaling can be an important pre-processing step for many machine-learning algorithms.

> scale(x)

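By default (center = TRUE, scale = TRUE), scale() subtracts the mean from each value and divides by the standard deviation. The center argument takes either a logical value or a numeric-alike vector of values to subtract.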
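What scale() computes with its default arguments can be checked by hand. A minimal sketch on made-up values (x here is an assumption, not the Total Profit column):

```r
x <- c(10, 20, 30, 40, 50)         # made-up values

z <- scale(x)                      # returns a one-column matrix
z_manual <- (x - mean(x)) / sd(x)  # the same calculation by hand

# scale() records the values it used as attributes of the result.
used_center <- attr(z, "scaled:center")  # the mean that was subtracted
used_scale  <- attr(z, "scaled:scale")   # the standard deviation divided by
```

Because the result is a matrix, as.vector(z) is needed to get back a plain numeric vector.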
