Data Science Project

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to Mr. Somnath PaulChoudhury, my subject teacher, for his invaluable guidance, support, and encouragement throughout this data science project. His expertise and insights have been instrumental in shaping this work.
I am profoundly thankful to my parents for their unwavering support and assistance. Their patience, encouragement and belief in my abilities were crucial in the successful completion of this project.
Finally, I acknowledge my own dedication and hard work. Completing this project has been a significant learning experience, and I look forward to applying this knowledge in future endeavors.
Chirag Rai
12th Science

CERTIFICATE
This is to certify that Chirag Rai, a student of class 12 B1 (Science) of Army Public School, Bengdubi, has successfully completed the project on 'Exploratory Data Analysis', 'Decision Tree' and 'Linear Regression' under the guidance of Mr. Somnath PaulChoudhury. This project stands as a testament to his diligence, dedication and hard work.
Date: 21 Jan, 2025

_________________________
_______________________
TEACHER-IC’S signature
EXAMINER’S signature

TABLE OF CONTENTS
ACKNOWLEDGEMENT
CERTIFICATE
PROJECT 1 [Exploratory Data Analysis]
 What is EDA?
 What is Iris Dataset?
 R Code
 Output Plots
 Summary of EDA
PROJECT 2 [Decision Tree]
 What is Decision Tree?
 R Code & Output
PROJECT 3 [Linear Regression]
 Introduction to Linear Regression
 What is Boston dataset?
 R Code & Output
 Plot
 Model Evaluation
BIBLIOGRAPHY

PROJECT 1
Exploratory Data
Analysis On Iris
Dataset

EXPLORATORY DATA
ANALYSIS
Exploratory Data Analysis (EDA) is key to understanding your data. Using R, start by importing and cleaning your data with functions like read.csv() and na.omit(). Next, generate basic descriptive statistics using summary().

Visualization is crucial: use the ggplot2 package to create box plots and histograms. Check correlations between variables with cor(). Finally, apply Principal Component Analysis (PCA) to reduce the number of variables while retaining information.

EDA in R helps uncover patterns, spot anomalies, and frame hypotheses for further analysis.
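The steps above can be sketched in R. This is a minimal illustrative example using the built-in iris dataset (which has no missing values, so na.omit() here only demonstrates the cleaning step):

```r
# Load the built-in iris data (read.csv() would be used for an external file)
data(iris)

# Cleaning step: drop any rows with missing values
iris_clean <- na.omit(iris)

# Basic descriptive statistics for every column
summary(iris_clean)

# Correlations between the four numeric measurements
cor(iris_clean[, 1:4])

# Principal Component Analysis on the scaled numeric columns
pca <- prcomp(iris_clean[, 1:4], scale. = TRUE)
summary(pca)   # proportion of variance captured by each component
```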

WHAT IS IRIS DATASET IN
R?
The Iris dataset is simple, small, and easy to understand. It contains 150 records of iris
flowers, with four numerical features (sepal length, sepal width, petal length,
and petal width) and a categorical target variable (species). This simplicity
makes it ideal for students who are just beginning to explore data science
and machine learning concepts. The dataset's manageable size (150 rows)
ensures that students can easily process and analyze the data without
needing advanced tools or heavy computational power.

The Iris dataset offers an excellent opportunity to practice basic statistical analysis and data visualization techniques. Students can calculate
descriptive statistics such as mean, median, and standard deviation to better
understand the data. They can also create visualizations like scatter plots,
box plots, and histograms to explore how different species of flowers are
distributed based on the measurements. These activities help students gain
foundational skills in analyzing and interpreting data, which are crucial in
fields like statistics and data science.

Additionally, the Iris dataset serves as a great introduction to machine learning. Students can apply supervised learning algorithms, such as k-nearest neighbors (k-NN), decision trees, and logistic regression, to
predict the species of a flower based on its measurements. These algorithms
are easy to implement and understand, making them ideal for high school
students. Using the Iris dataset, students can learn the basics of model
training, evaluation, and prediction, as well as important concepts like
accuracy and confusion matrices.

Finally, the hands-on experience with the Iris dataset helps students build
confidence in working with real-world data. It provides a practical context for
students to apply their knowledge of data cleaning, exploration, and model
evaluation. Whether using R or Python, both of which are commonly taught
in high school curriculums, students can gain experience in writing code,
interpreting results, and drawing conclusions from their analyses.

R CODE
> library(ggplot2)
> data(iris)

> # Rplot01
> ggplot(iris, aes(x = Sepal.Length)) +
+   geom_histogram(binwidth = 0.2, fill = 'blue', color = 'black') +
+   labs(title = 'Distribution of Sepal Length')

> # Rplot02
> ggplot(iris, aes(x = Sepal.Width)) +
+   geom_histogram(binwidth = 0.2, fill = 'green', color = 'black') +
+   labs(title = 'Distribution of Sepal Width')

> # Rplot03
> ggplot(iris, aes(x = Petal.Length)) +
+   geom_histogram(binwidth = 0.2, fill = 'red', color = 'black') +
+   labs(title = 'Distribution of Petal Length')

> # Rplot04
> ggplot(iris, aes(x = Petal.Width)) +
+   geom_histogram(binwidth = 0.2, fill = 'purple', color = 'black') +
+   labs(title = 'Distribution of Petal Width')

> # Rplot05
> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Sepal Length by Species')

> # Rplot06
> ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Sepal Width by Species')

> # Rplot07
> ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Petal Length by Species')

> # Rplot08
> ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Petal Width by Species')

> # Correlation matrix of the four numeric columns
> cor_matrix <- cor(iris[, 1:4])
> cor_matrix

OUTPUT
[Output plots omitted: histograms of each of the four measurements, boxplots of each measurement by species, and the printed correlation matrix.]
SUMMARY OF EDA
The Iris dataset contains 150 flowers with 4 numerical features: Sepal
Length, Sepal Width, Petal Length, and Petal Width, across 3 species:
Setosa, Versicolor, and Virginica.

1. Descriptive Statistics:
o Sepal.Length and Petal.Length have higher values with a wider
range.
o Petal.Width and Sepal.Width are more concentrated, especially
Petal.Width around 1.0–1.5 cm.
2. Distributions:
o Histograms show that Sepal.Length and Petal.Length have a
roughly normal distribution, while Sepal.Width and Petal.Width are
more skewed.

3. Boxplots:
o Show clear species differences, with Setosa having smaller petals
and sepals, while Virginica has the largest.
4. Correlation:
o Strong positive correlations between Sepal.Length and Petal.Length
(0.87), and between Petal.Length and Petal.Width (0.96).
o Weak negative correlations between Sepal.Width and the petal
dimensions.

Key Takeaways:

 Species Differences: Setosa is easily separable from the other two species based on smaller flower dimensions.
 Correlations: Petal dimensions are strongly related to each other, and
sepal dimensions show moderate relationships with petals.

This analysis reveals the structure of the data, providing insights into how
measurements of flowers can be used for classification.

PROJECT 2
Generate a Decision
Tree model to
classify the famous
iris dataset

Decision Tree
A decision tree is a popular machine learning algorithm
used for making decisions or predictions. It works by
breaking down a complex decision-making process into
simpler, step-by-step questions based on the features (or
attributes) of the data. The tree starts with a root node, which asks the first question or decision point, such as "Is
the person's age greater than 30?" Depending on the
answer (yes/no), the data is split into two branches, and
each branch leads to a further question. This process
continues until the tree reaches a leaf node, which gives
the final outcome or prediction.

For example, if you're predicting whether a student will pass or fail based on their study time and attendance, the
decision tree might first ask, "Did the student attend
more than 80% of classes?" If the answer is yes, it might
then ask, "Did the student study more than 5 hours per
week?" Based on the answers, the tree will classify the
student as either likely to pass or fail.

Decision trees are easy to understand because they visually represent the decision-making process, making
them great for both classification (e.g., whether an email
is spam or not) and regression tasks (e.g., predicting
house prices). They can handle both categorical data (like
"Yes" or "No") and continuous data (like numerical
values). However, they can sometimes overfit the data,
meaning they may be too closely tailored to the training
data and not perform well on new data.
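The pass/fail example above can be sketched with rpart; the student records below are invented purely for illustration:

```r
library(rpart)

# Hypothetical student data: attendance (%), weekly study hours, and outcome
students <- data.frame(
  attendance = c(90, 60, 85, 95, 50, 75, 88, 92, 40, 70),
  study_hrs  = c(6, 2, 7, 8, 1, 3, 5, 6, 2, 4),
  result     = factor(c("pass", "fail", "pass", "pass", "fail",
                        "fail", "pass", "pass", "fail", "fail"))
)

# Fit a classification tree; minsplit is lowered because the toy set is tiny
fit <- rpart(result ~ attendance + study_hrs, data = students,
             method = "class",
             control = rpart.control(minsplit = 2))

# Classify a new student from their attendance and study time
predict(fit, data.frame(attendance = 85, study_hrs = 6), type = "class")
```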

R code & Output
> library(rpart)
> library(rpart.plot)
> v <- iris$Species
> table(v)
> summary(iris)
> head(iris)

> set.seed(522)
> iris[, 'train'] <- ifelse(runif(nrow(iris)) < 0.75, 1, 0)
> trainSet <- iris[iris$train == 1, ]
> testSet <- iris[iris$train == 0, ]
> trainColNum <- grep('train', names(trainSet))
> trainSet <- trainSet[, -trainColNum]
> testSet <- testSet[, -trainColNum]
> treeFit <- rpart(Species ~ ., data = trainSet, method = 'class')
> rpart.plot(treeFit, box.col = c("red", "yellow"))

PROJECT 3
Predictive Modeling
Using Linear
Regression
INTRODUCTION TO
LINEAR REGRESSION

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It
predicts the value of the dependent variable based on the values of the
independent variables. The goal is to fit a straight line (in simple linear
regression) or a hyperplane (in multiple linear regression) that best
represents the data.

In simple linear regression, the relationship is modeled with the equation:

Y = β0 + β1X + ε

Where:

 Y is the dependent variable (what we want to predict),
 X is the independent variable (the predictor),
 β0 is the intercept, and
 β1 is the slope (indicating how much Y changes for a unit change in X).

In multiple linear regression, the model uses more than one independent
variable to predict the dependent variable, and the equation expands to
include multiple terms for each feature.

How It Works:

 The model fits the best possible line or hyperplane by minimizing the
sum of squared errors between the actual and predicted values.
 Once trained, the model can predict the dependent variable for new,
unseen data.

Key Concepts:

 Slope (β1): Shows how much the dependent variable changes for a unit change in the independent variable.
 Intercept (β0): The value of the dependent variable when the independent variable is zero.
 R-Squared (R²): Measures how well the model explains the variability of the dependent variable.

Assumptions:

Linear regression assumes a linear relationship, independence of errors, constant variance of errors (homoscedasticity), and normally distributed residuals.

Applications:

 Predicting sales based on advertising spend.
 Estimating house prices from features like size and location.
 Understanding relationships in medical and economic data.

Advantages and Disadvantages:

 Advantages: Simple, interpretable, and fast.
 Disadvantages: Sensitive to outliers and assumes a linear
relationship, which may not always hold.

Linear regression is a powerful tool for prediction and understanding
relationships between variables, but its assumptions and limitations should
be considered.
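These ideas can be illustrated with a tiny simple-linear-regression sketch in R; the advertising-spend figures below are made up:

```r
# Hypothetical data: advertising spend vs. sales
ads   <- c(1, 2, 3, 4, 5, 6, 7, 8)
sales <- c(3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 15.2, 16.9)

# Fit Y = b0 + b1*X by minimising the sum of squared errors
fit <- lm(sales ~ ads)

coef(fit)                  # intercept (b0) and slope (b1)
summary(fit)$r.squared     # R-squared: share of variance explained

# Predict sales for a new level of spend
predict(fit, data.frame(ads = 10))
```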

WHAT IS BOSTON
DATASET IN R?
The Boston dataset in R is a popular dataset available in the
MASS package, primarily used for regression analysis. It contains
data on 506 observations and 14 features related to housing
in Boston. The goal is usually to predict the median value of
owner-occupied homes (medv), which is the dependent
variable.

Features:

1. crim: Crime rate per capita.

2. zn: Proportion of residential land zoned for large plots.

3. indus: Proportion of non-retail business acres.

4. chas: Charles River dummy variable (1 if near the river).

5. nox: Nitrogen oxide concentration.

6. rm: Average number of rooms per dwelling.

7. age: Proportion of older homes.

8. dis: Distance to employment centers.

9. rad: Accessibility to radial highways.

10. tax: Property tax rate.

11. ptratio: Pupil-teacher ratio.

12. b: Proportion of African American residents.

13. lstat: Percentage of lower-status population.

14. medv: Median home value (target).

Using the Dataset:

# Load dataset

library(MASS)

data(Boston)

# View first few rows

head(Boston)

Example Regression:

To predict `medv`, you can use linear regression:

# Linear regression model

model <- lm(medv ~ ., data = Boston)

summary(model)


The dataset is commonly used for practicing regression analysis and understanding the relationship between various factors
affecting housing prices.

R CODE & OUTPUT


> library(MASS)
> library(caret)    # provides createDataPartition()
> data(Boston)
> trainIndex <- createDataPartition(Boston$medv, p = 0.7, list = FALSE)
> trainData <- Boston[trainIndex, ]
> testData <- Boston[-trainIndex, ]
> model <- lm(medv ~ ., data = trainData)
> summary(model)

> predictions <- predict(model, newdata = testData)
> mse <- mean((predictions - testData$medv)^2)
> rmse <- sqrt(mse)
> mse
> rmse

> r2 <- cor(predictions, testData$medv)^2
> r2

PLOT
> ggplot(testData, aes(x = medv, y = predictions)) +
+   geom_point() +
+   geom_abline(slope = 1, intercept = 0, color = 'red') +
+   labs(x = "Actual value", y = "Predicted value",
+        title = "medv: Actual vs Predicted")

MODEL EVALUATION
Mean Squared Error (MSE): a common metric used to measure the performance of a regression model. It quantifies the average of the squared differences between the actual and predicted values, MSE = (1/n) Σ (yᵢ - ŷᵢ)². MSE is widely used because it penalizes larger errors more heavily, making it sensitive to outliers.
Root Mean Squared Error (RMSE): a widely used metric for evaluating the performance of regression models. It is the square root of MSE and measures the average magnitude of the errors between predicted and actual values, but unlike MSE, RMSE is in the same units as the target variable, making it more interpretable.

BIBLIOGRAPHY
My Parents

RStudio

CBSE Study Material
