Data Science Project
applying this knowledge in
future endeavors.
Chirag Rai
12th Science
CERTIFICATE
This is to certify that Chirag Rai, a
student of class 12 B1 (Science) of Army
Public School, Bengdubi, has successfully
completed the project on ‘Exploratory
Data Analysis’, ‘Decision Tree’ and
‘Linear Regression’ under the guidance
of Mr. Somnath PaulChoudhury. This
project stands as a testament to their
diligence, dedication and hard work.
Date: 21 Jan 2025
_________________________
_______________________
TEACHER-IC’S signature
EXAMINER’S signature
TABLE OF CONTENTS
ACKNOWLEDGMENT
CERTIFICATE
PROJECT 1 [Exploratory Data Analysis]
What is EDA?
What is the Iris Dataset?
R Code
Output Plots
Summary of EDA
PROJECT 2 [Decision Tree]
What is a Decision Tree?
R Code & Output
PROJECT 3 [Linear Regression]
Introduction to Linear Regression
What is the Boston Dataset?
R Code & Output
Plot
Model Evaluation
BIBLIOGRAPHY
PROJECT 1
Exploratory Data Analysis on the Iris Dataset
EXPLORATORY DATA
ANALYSIS
Exploratory Data Analysis (EDA) is key to understanding your data. Using R, start by importing and cleaning your data with functions like read.csv() and na.omit(). Next, generate basic descriptive statistics using summary(). Visualization is crucial: use the ggplot2 package to create box plots and histograms. Check the correlations between variables with cor(). Finally, apply Principal Component Analysis (PCA) to reduce the number of variables while retaining most of the information.
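As a rough sketch of that workflow in R, the lines below string those steps together. The built-in iris data frame is used here so the script runs as-is; for an external file you would start with read.csv() instead, and the column choices are only illustrative.

# Minimal EDA sketch (assumes ggplot2 is installed).
library(ggplot2)

data(iris)
df <- na.omit(iris)        # drop rows with missing values (iris has none)

summary(df)                # basic descriptive statistics

# One histogram and one boxplot as examples of ggplot2 visualisation
ggplot(df, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2, fill = "blue", color = "black") +
  labs(title = "Distribution of Sepal Length")

ggplot(df, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Sepal Length by Species")

cor(df[, 1:4])             # correlation matrix of the numeric columns

# Principal Component Analysis on the scaled numeric columns
pca <- prcomp(df[, 1:4], scale. = TRUE)
summary(pca)               # proportion of variance explained by each component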
WHAT IS THE IRIS DATASET IN R?
It is simple, small, and easy to understand. It contains 150 records of iris
flowers, with four numerical features (sepal length, sepal width, petal length,
and petal width) and a categorical target variable (species). This simplicity
makes it ideal for students who are just beginning to explore data science
and machine learning concepts. The dataset's manageable size (150 rows)
ensures that students can easily process and analyze the data without
needing advanced tools or heavy computational power.
Finally, hands-on experience with the Iris dataset helps students build
confidence in working with real-world data. It provides a practical context for
students to apply their knowledge of data cleaning, exploration, and model
evaluation. Whether using R or Python, both of which are commonly taught
in high school curriculums, students can gain experience in writing code,
interpreting results, and drawing conclusions from their analyses.
R CODE
> library('ggplot2')
> data("iris")
> # Rplot01
> ggplot(iris, aes(x = Sepal.Length)) +
+   geom_histogram(binwidth = 0.2, fill = 'blue', color = 'black') +
+   labs(title = 'Distribution of Sepal Length')
> # Rplot02
> ggplot(iris, aes(x = Sepal.Width)) +
+   geom_histogram(binwidth = 0.2, fill = 'green', color = 'black') +
+   labs(title = 'Distribution of Sepal Width')
> # Rplot03
> ggplot(iris, aes(x = Petal.Length)) +
+   geom_histogram(binwidth = 0.2, fill = 'red', color = 'black') +
+   labs(title = 'Distribution of Petal Length')
> # Rplot04
> ggplot(iris, aes(x = Petal.Width)) +
+   geom_histogram(binwidth = 0.2, fill = 'purple', color = 'black') +
+   labs(title = 'Distribution of Petal Width')
> # Rplot05
> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Sepal Length by Species')
> # Rplot06
> ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Sepal Width by Species')
> # Rplot07
> ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Petal Length by Species')
> # Rplot08
> ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Petal Width by Species')
> cor_matrix <- cor(iris[, 1:4])
> cor_matrix
OUTPUT
[Output plots: histograms of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width; boxplots of each measurement by Species; and the printed correlation matrix.]
SUMMARY OF EDA
The Iris dataset contains 150 flowers with 4 numerical features: Sepal
Length, Sepal Width, Petal Length, and Petal Width, across 3 species:
Setosa, Versicolor, and Virginica.
1. Descriptive Statistics:
o Sepal.Length and Petal.Length have higher values with a wider
range.
o Petal.Width and Sepal.Width are more concentrated, especially
Petal.Width around 1.0–1.5 cm.
2. Distributions:
o Histograms show that Sepal.Length and Petal.Length have a
roughly normal distribution, while Sepal.Width and Petal.Width are
more skewed.
3. Boxplots:
o Show clear species differences, with Setosa having smaller petals
and sepals, while Virginica has the largest.
4. Correlation:
o Strong positive correlations between Petal.Length and Petal.Width
(≈0.96), and between Sepal.Length and Petal.Length (≈0.87).
o Weak negative correlations between Sepal.Width and the petal
dimensions.
Key Takeaways:
This analysis reveals the structure of the data, providing insights into how
measurements of flowers can be used for classification.
PROJECT 2
Generate a Decision Tree model to classify the famous iris dataset
Decision Tree
A decision tree is a popular machine learning algorithm
used for making decisions or predictions. It works by
breaking down a complex decision-making process into
simpler, step-by-step questions based on the features (or
attributes) of the data. The tree starts with a root node,
which asks the first question or decision point, such as "Is
the person's age greater than 30?" Depending on the
answer (yes/no), the data is split into two branches, and
each branch leads to a further question. This process
continues until the tree reaches a leaf node, which gives
the final outcome or prediction.
R Code & Output
> library(rpart)
> library(rpart.plot)
> v <- iris$Species
> table(v)
> summary(iris)
> head(iris)
> set.seed(522)
> iris[, 'train'] <- ifelse(runif(nrow(iris)) < 0.75, 1, 0)
> trainSet <- iris[iris$train == 1, ]
> testSet <- iris[iris$train == 0, ]
> trainColNum <- grep('train', names(trainSet))
> trainSet <- trainSet[, -trainColNum]
> testSet <- testSet[, -trainColNum]
> treeFit <- rpart(Species ~ ., data = trainSet, method = 'class')
> rpart.plot(treeFit, box.col = c("red", "yellow"))
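The transcript above fits and plots the tree but does not score it on the held-out data. The sketch below is one possible extra step, not part of the original transcript; it assumes the treeFit and testSet objects created above and uses rpart's standard predict() with type = "class".

# Evaluate the fitted tree on the held-out test set (illustrative sketch).
predicted <- predict(treeFit, newdata = testSet, type = "class")

# Confusion matrix: rows are true species, columns are predicted species
confusion <- table(actual = testSet$Species, predicted = predicted)
confusion

# Overall accuracy: proportion of test flowers classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy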
PROJECT 3
Predictive Modeling Using Linear Regression
INTRODUCTION TO
LINEAR REGRESSION
Linear regression is a statistical method that models the relationship between a dependent variable Y and an independent variable X. In simple linear regression the model is:

Y = β0 + β1X + ε

Where:
β0 is the intercept,
β1 is the slope (indicating how much Y changes for a unit change in X), and
ε is the random error term.
In multiple linear regression, the model uses more than one independent variable to predict the dependent variable, and the equation expands to include a term for each feature:

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
How It Works:
The model fits the best possible line or hyperplane by minimizing the
sum of squared errors between the actual and predicted values.
Once trained, the model can predict the dependent variable for new,
unseen data.
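As a quick illustration of this fitting-and-predicting process in R, here is a minimal sketch using the built-in cars dataset; the dataset and object names are chosen only for illustration and are not part of this project.

# Simple linear regression sketch: stopping distance (dist) vs speed.
data(cars)

fit <- lm(dist ~ speed, data = cars)   # estimates beta0 (intercept) and beta1 (slope)
summary(fit)                           # coefficients, R-squared, residual errors

# Predict stopping distance for new, unseen speeds
new_speeds <- data.frame(speed = c(10, 20, 25))
predict(fit, newdata = new_speeds)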
Key Concepts:
Assumptions: linearity of the relationship, independence of observations, constant variance of the errors (homoscedasticity), and approximately normally distributed residuals.
Applications: predicting house prices, sales forecasting, estimating trends, and quantifying the effect of one variable on another.
Linear regression is a powerful tool for prediction and understanding
relationships between variables, but its assumptions and limitations should
be considered.
WHAT IS BOSTON
DATASET IN R?
The Boston dataset in R is a popular dataset available in the
MASS package, primarily used for regression analysis. It contains
data on 506 observations and 14 features related to housing
in Boston. The goal is usually to predict the median value of
owner-occupied homes (medv), which is the dependent
variable.
Features: crim (per-capita crime rate), zn, indus, chas, nox, rm (average rooms per dwelling), age, dis, rad, tax, ptratio (pupil-teacher ratio), black, lstat (% lower-status population), and medv (median home value, the target variable).
# Load dataset
library(MASS)
data(Boston)
head(Boston)
Example Regression:
model <- lm(medv ~ ., data = Boston)
summary(model)
> model <- lm(medv ~ ., data = trainingData)
> summary(model)
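The page that creates trainingData, testData and predictions is missing from this copy of the report. The sketch below shows one common way that step could look; the 75/25 split ratio, the seed, and the object names are assumptions chosen to match the surrounding code, not the project's original code.

# Hypothetical reconstruction of the missing train/test step.
library(MASS)      # provides the Boston dataset
set.seed(123)      # assumed seed, for reproducibility

trainIndex   <- sample(seq_len(nrow(Boston)), size = 0.75 * nrow(Boston))
trainingData <- Boston[trainIndex, ]
testData     <- Boston[-trainIndex, ]

model <- lm(medv ~ ., data = trainingData)

# Predicted medv values for the held-out test set, used in the plot below
predictions <- predict(model, newdata = testData)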
PLOT
> ggplot(data = testData, aes(x = medv, y = predictions)) +
+   geom_point() +
+   geom_abline(slope = 1, intercept = 0, color = 'red') +
+   labs(x = "Actual value", y = "Predicted value",
+        title = "medv: Actual vs Predicted")
MODEL EVALUATION
Mean Squared Error (MSE): a common metric used to measure the performance of a regression model. It quantifies the average of the squared differences between the actual values and the predicted values. MSE is widely used because it penalizes larger errors more heavily, making it sensitive to outliers.
Root Mean Squared Error (RMSE): a widely used metric for evaluating the performance of regression models. It measures the average magnitude of the errors between predicted and actual values, but unlike MSE, RMSE is in the same units as the target variable, making it more interpretable.
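A minimal sketch of computing both metrics for the model above is shown below; it assumes the testData and predictions objects from the (partly reconstructed) code in the previous sections.

# MSE: mean of squared differences between actual and predicted medv
mse <- mean((testData$medv - predictions)^2)

# RMSE: square root of MSE, in the same units as medv (thousands of dollars)
rmse <- sqrt(mse)

mse
rmse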
BIBLIOGRAPHY
My Parents
RStudio