Data Science Project

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to Mr. Somnath PaulChoudhury, my subject teacher, for his invaluable guidance, support, and encouragement throughout this data science project. His expertise and insights have been instrumental in shaping this work.
I am profoundly thankful to my parents for their unwavering support and assistance. Their patience, encouragement and belief in my abilities were crucial in the successful completion of this project.
Finally, I acknowledge my own dedication and hard work. Completing this project has been a significant learning experience, and I look forward to applying this knowledge in future endeavors.
Chirag Rai
12th Science

CERTIFICATE
This is to certify that Chirag Rai, a student of class 12 B1 (Science) of Army Public School, Bengdubi, has successfully completed the project on 'Exploratory Data Analysis', 'Decision Tree' and 'Linear Regression' under the guidance of Mr. Somnath PaulChoudhury. This project stands as a testament to his diligence, dedication and hard work.
Date: 21 Jan, 2025

_________________________
_______________________
TEACHER-IC’S signature
EXAMINER’S signature

TABLE OF CONTENTS
ACKNOWLEDGEMENT
CERTIFICATE
PROJECT 1 [Exploratory Data Analysis]
 What is EDA?
 What is Iris Dataset?
 R Code
 Output Plots
 Summary of EDA
PROJECT 2 [Decision Tree]
 What is Decision Tree?
 R Code & Output
PROJECT 3 [Linear Regression]
 Introduction to Linear Regression
 What is Boston dataset?
 R Code & Output
 Plot
 Model Evaluation
BIBLIOGRAPHY

PROJECT 1
Exploratory Data
Analysis On Iris
Dataset

EXPLORATORY DATA
ANALYSIS
Exploratory Data Analysis (EDA) is key to understanding your data. Using R, start by importing and cleaning your data with functions like read.csv() and na.omit(). Next, generate basic descriptive statistics using summary().

Visualization is crucial: use the ggplot2 package to create box plots and histograms. Check correlations between variables with cor(). Finally, apply Principal Component Analysis (PCA) to reduce the number of variables while retaining information.

EDA in R helps uncover patterns, spot anomalies, and frame hypotheses for further analysis.
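The steps above can be sketched in R. This is a minimal illustrative example using the built-in iris dataset (which has no missing values, so na.omit() here only demonstrates the cleaning step):

```r
# Load the built-in iris data (read.csv() would be used for an external file)
data(iris)

# Cleaning step: drop any rows with missing values
iris_clean <- na.omit(iris)

# Basic descriptive statistics for every column
summary(iris_clean)

# Correlations between the four numeric measurements
cor(iris_clean[, 1:4])

# Principal Component Analysis on the scaled numeric columns
pca <- prcomp(iris_clean[, 1:4], scale. = TRUE)
summary(pca)   # proportion of variance captured by each component
```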

WHAT IS IRIS DATASET IN
R?
The Iris dataset is simple, small, and easy to understand. It contains 150 records of iris
flowers, with four numerical features (sepal length, sepal width, petal length,
and petal width) and a categorical target variable (species). This simplicity
makes it ideal for students who are just beginning to explore data science
and machine learning concepts. The dataset's manageable size (150 rows)
ensures that students can easily process and analyze the data without
needing advanced tools or heavy computational power.

The Iris dataset offers an excellent opportunity to practice basic statistical analysis and data visualization techniques. Students can calculate
descriptive statistics such as mean, median, and standard deviation to better
understand the data. They can also create visualizations like scatter plots,
box plots, and histograms to explore how different species of flowers are
distributed based on the measurements. These activities help students gain
foundational skills in analyzing and interpreting data, which are crucial in
fields like statistics and data science.

Additionally, the Iris dataset serves as a great introduction to machine learning. Students can apply supervised learning algorithms, such as k-nearest neighbors (k-NN), decision trees, and logistic regression, to
predict the species of a flower based on its measurements. These algorithms
are easy to implement and understand, making them ideal for high school
students. Using the Iris dataset, students can learn the basics of model
training, evaluation, and prediction, as well as important concepts like
accuracy and confusion matrices.

Finally, the hands-on experience with the Iris dataset helps students build
confidence in working with real-world data. It provides a practical context for
students to apply their knowledge of data cleaning, exploration, and model
evaluation. Whether using R or Python, both of which are commonly taught
in high school curriculums, students can gain experience in writing code,
interpreting results, and drawing conclusions from their analyses.

R CODE
> library(ggplot2)
> data(iris)

> # Rplot01
> ggplot(iris, aes(x = Sepal.Length)) +
+   geom_histogram(binwidth = 0.2, fill = 'blue', color = 'black') +
+   labs(title = 'Distribution of Sepal Length')

> # Rplot02
> ggplot(iris, aes(x = Sepal.Width)) +
+   geom_histogram(binwidth = 0.2, fill = 'green', color = 'black') +
+   labs(title = 'Distribution of Sepal Width')

> # Rplot03
> ggplot(iris, aes(x = Petal.Length)) +
+   geom_histogram(binwidth = 0.2, fill = 'red', color = 'black') +
+   labs(title = 'Distribution of Petal Length')

> # Rplot04
> ggplot(iris, aes(x = Petal.Width)) +
+   geom_histogram(binwidth = 0.2, fill = 'purple', color = 'black') +
+   labs(title = 'Distribution of Petal Width')

> # Rplot05
> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Sepal Length by Species')

> # Rplot06
> ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Sepal Width by Species')

> # Rplot07
> ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Petal Length by Species')

> # Rplot08
> ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
+   geom_boxplot() +
+   labs(title = 'Petal Width by Species')

> # Correlation matrix of the four numeric columns
> cor_matrix <- cor(iris[, 1:4])
> cor_matrix

OUTPUT
[Output plots omitted: histograms of each of the four measurements, boxplots of each measurement by species, and the printed correlation matrix.]
SUMMARY OF EDA
The Iris dataset contains 150 flowers with 4 numerical features: Sepal
Length, Sepal Width, Petal Length, and Petal Width, across 3 species:
Setosa, Versicolor, and Virginica.

1. Descriptive Statistics:
o Sepal.Length and Petal.Length have higher values with a wider
range.
o Petal.Width and Sepal.Width are more concentrated, especially
Petal.Width around 1.0–1.5 cm.
2. Distributions:
o Histograms show that Sepal.Length and Petal.Length have a
roughly normal distribution, while Sepal.Width and Petal.Width are
more skewed.

3. Boxplots:
o Show clear species differences, with Setosa having smaller petals
and sepals, while Virginica has the largest.
4. Correlation:
o Strong positive correlations between Sepal.Length and Petal.Length
(0.87), and between Petal.Length and Petal.Width (0.96).
o Weak negative correlations between Sepal.Width and the petal
dimensions.

Key Takeaways:

 Species Differences: Setosa is easily separable from the other two species based on smaller flower dimensions.
 Correlations: Petal dimensions are strongly related to each other, and
sepal dimensions show moderate relationships with petals.

This analysis reveals the structure of the data, providing insights into how
measurements of flowers can be used for classification.

PROJECT 2
Generate a Decision
Tree model to
classify the famous
iris dataset

Decision Tree
A decision tree is a popular machine learning algorithm
used for making decisions or predictions. It works by
breaking down a complex decision-making process into
simpler, step-by-step questions based on the features (or
attributes) of the data. The tree starts with a root node, which asks the first question or decision point, such as "Is
the person's age greater than 30?" Depending on the
answer (yes/no), the data is split into two branches, and
each branch leads to a further question. This process
continues until the tree reaches a leaf node, which gives
the final outcome or prediction.

For example, if you're predicting whether a student will pass or fail based on their study time and attendance, the
decision tree might first ask, "Did the student attend
more than 80% of classes?" If the answer is yes, it might
then ask, "Did the student study more than 5 hours per
week?" Based on the answers, the tree will classify the
student as either likely to pass or fail.

Decision trees are easy to understand because they visually represent the decision-making process, making
them great for both classification (e.g., whether an email
is spam or not) and regression tasks (e.g., predicting
house prices). They can handle both categorical data (like
"Yes" or "No") and continuous data (like numerical
values). However, they can sometimes overfit the data,
meaning they may be too closely tailored to the training
data and not perform well on new data.
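The pass/fail example above can be sketched with rpart; the student records below are invented purely for illustration:

```r
library(rpart)

# Hypothetical student data: attendance (%), weekly study hours, and outcome
students <- data.frame(
  attendance = c(90, 60, 85, 95, 50, 75, 88, 92, 40, 70),
  study_hrs  = c(6, 2, 7, 8, 1, 3, 5, 6, 2, 4),
  result     = factor(c("pass", "fail", "pass", "pass", "fail",
                        "fail", "pass", "pass", "fail", "fail"))
)

# Fit a classification tree; minsplit is lowered because the toy set is tiny
fit <- rpart(result ~ attendance + study_hrs, data = students,
             method = "class",
             control = rpart.control(minsplit = 2))

# Classify a new student from their attendance and study time
predict(fit, data.frame(attendance = 85, study_hrs = 6), type = "class")
```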

R code & Output
> library(rpart)
> library(rpart.plot)
> v <- iris$Species
> table(v)
> summary(iris)
> head(iris)

> set.seed(522)
> iris[, 'train'] <- ifelse(runif(nrow(iris)) < 0.75, 1, 0)
> trainSet <- iris[iris$train == 1, ]
> testSet <- iris[iris$train == 0, ]
> trainColNum <- grep('train', names(trainSet))
> trainSet <- trainSet[, -trainColNum]
> testSet <- testSet[, -trainColNum]
> treeFit <- rpart(Species ~ ., data = trainSet, method = 'class')
> rpart.plot(treeFit, box.col = c("red", "yellow"))

PROJECT 3
Predictive Modeling
Using Linear
Regression
INTRODUCTION TO
LINEAR REGRESSION

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It
predicts the value of the dependent variable based on the values of the
independent variables. The goal is to fit a straight line (in simple linear
regression) or a hyperplane (in multiple linear regression) that best
represents the data.

In simple linear regression, the relationship is modeled with the equation:

Y = β0 + β1X + ε

Where:

 Y is the dependent variable (what we want to predict),
 X is the independent variable (the predictor),
 β0 is the intercept, and
 β1 is the slope (indicating how much Y changes for a unit change in X).

In multiple linear regression, the model uses more than one independent
variable to predict the dependent variable, and the equation expands to
include multiple terms for each feature.

How It Works:

 The model fits the best possible line or hyperplane by minimizing the
sum of squared errors between the actual and predicted values.
 Once trained, the model can predict the dependent variable for new,
unseen data.

Key Concepts:

 Slope (β1): Shows how much the dependent variable changes for a unit change in the independent variable.
 Intercept (β0): The value of the dependent variable when the independent variable is zero.
 R-Squared (R²): Measures how well the model explains the variability of the dependent variable.

Assumptions:

Linear regression assumes a linear relationship, independence of errors, constant variance of errors (homoscedasticity), and normally distributed residuals.

Applications:

 Predicting sales based on advertising spend.
 Estimating house prices from features like size and location.
 Understanding relationships in medical and economic data.

Advantages and Disadvantages:

 Advantages: Simple, interpretable, and fast.
 Disadvantages: Sensitive to outliers and assumes a linear
relationship, which may not always hold.

Linear regression is a powerful tool for prediction and understanding
relationships between variables, but its assumptions and limitations should
be considered.
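These ideas can be illustrated with a tiny simple-linear-regression sketch in R; the advertising-spend figures below are made up:

```r
# Hypothetical data: advertising spend vs. sales
ads   <- c(1, 2, 3, 4, 5, 6, 7, 8)
sales <- c(3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 15.2, 16.9)

# Fit Y = b0 + b1*X by minimising the sum of squared errors
fit <- lm(sales ~ ads)

coef(fit)                  # intercept (b0) and slope (b1)
summary(fit)$r.squared     # R-squared: share of variance explained

# Predict sales for a new level of spend
predict(fit, data.frame(ads = 10))
```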

WHAT IS BOSTON
DATASET IN R?
The Boston dataset in R is a popular dataset available in the
MASS package, primarily used for regression analysis. It contains
data on 506 observations and 14 features related to housing
in Boston. The goal is usually to predict the median value of
owner-occupied homes (medv), which is the dependent
variable.

Features:

1. crim: Crime rate per capita.

2. zn: Proportion of residential land zoned for large plots.

3. indus: Proportion of non-retail business acres.

4. chas: Charles River dummy variable (1 if near the river).

5. nox: Nitrogen oxide concentration.

6. rm: Average number of rooms per dwelling.

7. age: Proportion of older homes.

8. dis: Distance to employment centers.

9. rad: Accessibility to radial highways.

10. tax: Property tax rate.

11. ptratio: Pupil-teacher ratio.

12. b: Proportion of African American residents.

13. lstat: Percentage of lower-status population.

14. medv: Median home value (target).

Using the Dataset:

# Load dataset

library(MASS)

data(Boston)

# View first few rows

head(Boston)

Example Regression:

To predict `medv`, you can use linear regression:

# Linear regression model

model <- lm(medv ~ ., data = Boston)

summary(model)


The dataset is commonly used for practicing regression analysis and understanding the relationship between various factors
affecting housing prices.

R CODE & OUTPUT


> library(MASS)
> library(caret)    # provides createDataPartition()
> data(Boston)
> trainIndex <- createDataPartition(Boston$medv, p = 0.7, list = FALSE)
> trainData <- Boston[trainIndex, ]
> testData <- Boston[-trainIndex, ]
> model <- lm(medv ~ ., data = trainData)
> summary(model)

> predictions <- predict(model, newdata = testData)
> mse <- mean((predictions - testData$medv)^2)
> rmse <- sqrt(mse)
> mse
> rmse

> r2 <- cor(predictions, testData$medv)^2
> r2

PLOT
> ggplot(testData, aes(x = medv, y = predictions)) +
+   geom_point() +
+   geom_abline(slope = 1, intercept = 0, color = 'red') +
+   labs(x = "Actual value", y = "Predicted value",
+        title = "medv: Actual vs Predicted")

MODEL EVALUATION
Mean Squared Error (MSE): a common metric used to measure the performance of a regression model. It quantifies the average of the squared differences between the actual and predicted values, MSE = (1/n) Σ (yᵢ - ŷᵢ)². MSE is widely used because it penalizes larger errors more heavily, making it sensitive to outliers.
Root Mean Squared Error (RMSE): a widely used metric for evaluating the performance of regression models. It is the square root of MSE and measures the average magnitude of the errors between predicted and actual values, but unlike MSE, RMSE is in the same units as the target variable, making it more interpretable.

BIBLIOGRAPHY
My Parents

RStudio

CBSE Study Material
