Mini Project PPT, Sumit Malan
Deemed to be University
Mini Project
Topic: Rainfall Prediction System
2.1 Missing Values: From our EDA step, we learned that a few instances contain
null values, so handling them becomes an important step. To impute the missing
values, we group the instances by location and date and replace each null value
with the mean of its group.
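The group-wise mean imputation described above can be sketched with pandas. This is a minimal illustration on made-up data; the column names (Location, MinTemp) and values here are hypothetical stand-ins for the real dataset's columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the rainfall dataset
df = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
    "MinTemp":  [13.4, np.nan, 17.5, 18.1],
})

# Replace each null with the mean of its Location group
df["MinTemp"] = df["MinTemp"].fillna(
    df.groupby("Location")["MinTemp"].transform("mean")
)
```

The real pipeline would group by both location and date, but the `groupby(...).transform("mean")` pattern is the same.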
2.2 Categorical Values: A categorical feature is one that has two or more
categories with no intrinsic ordering among them. We have a few categorical
features - WindGustDir, WindDir9am, WindDir3pm - each with 16 unique values.
Since the models are based on mathematical equations and calculations, they
work with numbers rather than text. Therefore, we have to encode the
categorical data.
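One common way to encode such features is scikit-learn's LabelEncoder, which maps each category string to an integer. A minimal sketch, using a hypothetical subset of the 16 compass directions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the WindGustDir feature
df = pd.DataFrame({"WindGustDir": ["W", "WNW", "NE", "W"]})

# LabelEncoder assigns integers to the alphabetically sorted categories
le = LabelEncoder()
df["WindGustDir"] = le.fit_transform(df["WindGustDir"])
```

Note that label encoding imposes an arbitrary order on the categories; one-hot encoding is an alternative when that order could mislead a linear model.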
3. Model Implementation: We chose different classifiers, each belonging to a
different model family (such as Linear, Tree-based, and Distance-based).
Logistic Regression is a classification algorithm used to predict a binary
outcome (1 / 0, Yes / No, True / False) given a set of independent variables.
In simple words, it predicts the probability of occurrence of an event by
fitting the data to a logit function. This makes Logistic Regression a good
fit, as ours is a binary classification problem.
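A minimal sketch of fitting a logistic regression with scikit-learn; the synthetic data here is a hypothetical stand-in for the preprocessed rainfall features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for the rainfall features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns the probability of each class for a sample
proba = clf.predict_proba(X[:1])[0]
```

`predict_proba` exposes the event probability that the text describes; thresholding it at 0.5 gives the hard 1/0 prediction.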
Decision Tree: In this technique, we split the population or sample into two
or more homogeneous sets (sub-populations) based on the most significant
differentiator among the input variables. This characteristic of the Decision
Tree makes it a good fit for our problem, as our target variable is a binary
categorical variable.
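The recursive splitting described above is what scikit-learn's DecisionTreeClassifier performs at fit time. A brief sketch on hypothetical synthetic data (the depth limit shown is an illustrative choice, not the project's setting):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the rainfall features and binary target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each internal node splits on the feature that best separates the classes
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
preds = tree.predict(X)
```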
Random Forest is a supervised ensemble learning algorithm. Here we have a
collection of decision trees, known as a forest. To classify a new object
based on its attributes, each tree gives a classification - we say the tree
"votes" for that class. The forest chooses the classification with the most
votes (over all the trees in the forest).
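The voting scheme above can be sketched with scikit-learn's RandomForestClassifier, again on hypothetical synthetic data; the forest size of 100 trees is an illustrative default, not the project's tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the rainfall features and binary target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 100 trees, each trained on a bootstrap sample; predict() returns
# the class with the most votes across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pred = forest.predict(X[:1])
```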
Model Evaluation: To evaluate our classifiers, we used the metrics below.
Accuracy is the ratio of the number of correct predictions to the total
number of input samples. It works well only if there is an equal number of
samples in each class. Since our data is imbalanced, we also consider other
metrics.
Area Under Curve (AUC) is used for binary classification problems. The AUC of
a classifier equals the probability that the classifier ranks a randomly
chosen positive example higher than a randomly chosen negative example.
Precision is the number of correct positive results divided by the number of
positive results predicted by the classifier.
Recall is the number of correct positive results divided by the number of all
relevant samples (all samples that should have been identified as positive).
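The four metrics above are all available in scikit-learn. A minimal sketch on hypothetical hand-picked labels, chosen so the counts are easy to verify (3 true positives, 1 false positive, 1 false negative, 3 true negatives):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]

acc  = accuracy_score(y_true, y_pred)    # correct / total
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)      # TP / (TP + FN)
auc  = roc_auc_score(y_true, y_score)    # uses scores, not hard labels
```

Note that AUC is computed from the predicted probabilities (how the classifier ranks examples), while accuracy, precision, and recall are computed from the hard predictions.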
RESULT:
Experiment 1 - Original Dataset: After all the preprocessing steps (described
above in the Methodology section), we ran every implemented classifier on the
same input data. The results report the two chosen metrics (10-fold stratified
accuracy and Area Under Curve) for all the classifiers.
Accuracy-wise, Gradient Boosting with a learning rate of 0.25 performed best;
coverage-wise, Random Forest and Decision Tree performed worst.
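The 10-fold stratified accuracy reported above can be computed with scikit-learn's StratifiedKFold, which preserves the class ratio in every fold (important given the imbalanced data). A minimal sketch on hypothetical synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the preprocessed rainfall dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 10 folds, each keeping the same positive/negative ratio as the full data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=skf, scoring="accuracy")
mean_acc = scores.mean()
```

The same `cross_val_score` call works unchanged for the tree-based classifiers by swapping in a different estimator.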