
Graphic Era

Deemed to be University

Mini Project
Topic: Rainfall Prediction System

Submitted by: Sumit Malan
Mentor Name: Ms. Meenakshi Maindola
Course: B.Tech (CSE)
University Roll No.: 2019162
Introduction
Rainfall prediction plays a crucial role in various industries and sectors. Accurate and reliable forecasts are essential for effective planning and decision-making. From agriculture to construction, transportation to energy production, rainfall forecasts help organizations optimize their operations and mitigate risks.
Problem Statement
• Rainfall prediction is a complex task due to the inherent variability and dynamics of weather systems.
• Factors such as atmospheric conditions, temperature, humidity, and wind patterns contribute to the complexity of rainfall prediction.
• Accurately predicting rainfall patterns requires advanced modeling techniques and access to large amounts of historical weather data.
• Rainfall patterns are influenced by various factors such as topography, vegetation, and land use, making it challenging to accurately predict rainfall in specific regions.
• Climate change and global warming introduce additional uncertainties into rainfall prediction, as they can alter weather patterns and precipitation levels.
METHODOLOGY
The overall architecture includes four major components: Data Exploration and Analysis, Data Pre-processing, Model Implementation, and Model Evaluation.

1. Data Exploration and Analysis: Exploratory Data Analysis is valuable for machine learning problems since it brings us closer to the certainty that future results will be valid, correctly interpreted, and applicable to the desired business context. A pair-wise correlation matrix is computed to understand the interactions between different fields in the data set.
2. Data Preprocessing: Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and lacking in certain behaviors, and is likely to contain many errors.
We carried out the preprocessing steps below:

2.1 Missing Values: Our EDA step showed that a few instances contain null values, so imputation becomes an important step. To impute the missing values, we group the instances by location and date and replace the null values with their respective group means.
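The grouped imputation described above can be sketched as follows. This is a minimal illustration with hypothetical column names (`Location`, `Date`, `Rainfall`), not the project's actual code:

```python
# Illustrative sketch: impute missing values with the mean of the
# (Location, Date) group, falling back to the overall column mean
# for groups that are entirely missing. Column names are assumptions.
import pandas as pd

def impute_by_group(df, group_cols, target_col):
    """Fill NaNs in target_col with its group mean, then the global mean."""
    group_mean = df.groupby(group_cols)[target_col].transform("mean")
    filled = df[target_col].fillna(group_mean)
    return filled.fillna(df[target_col].mean())

df = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Perth", "Perth"],
    "Date": ["2020-01", "2020-01", "2020-01", "2020-01"],
    "Rainfall": [1.0, None, 3.0, 5.0],
})
df["Rainfall"] = impute_by_group(df, ["Location", "Date"], "Rainfall")
```

Grouping first keeps the imputed value local to a station and date rather than pulling in the continent-wide average.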

2.2 Categorical Values: A categorical feature has two or more categories with no intrinsic ordering among them. We have a few categorical features (WindGustDir, WindDir9am, WindDir3pm), each with 16 unique values. Since the models are based on mathematical equations and calculations, they work with numbers rather than text, so we have to encode the categorical data.
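One common way to encode such unordered categories is one-hot encoding; a small sketch (using a toy sample of wind directions, not the full data set) could look like this:

```python
# Sketch: one-hot encode a wind-direction column so the models
# receive numeric indicator columns instead of text labels.
import pandas as pd

df = pd.DataFrame({"WindGustDir": ["N", "SE", "N", "WSW"]})
encoded = pd.get_dummies(df, columns=["WindGustDir"], prefix="WindGustDir")
# One indicator column per observed direction, e.g. WindGustDir_N.
```

With 16 directions per feature this produces 16 columns per original column, which is manageable here but hints at the dimensionality issue discussed later.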
3. Model Implementation: We chose different classifiers, each belonging to a different model family (Linear, Tree-based, Distance-based).
Logistic Regression is a classification algorithm used to predict a binary outcome (1/0, Yes/No, True/False) given a set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function. This makes Logistic Regression a good fit, as ours is a binary classification problem.
Decision Tree: In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant differentiator among the input variables. This characteristic of decision trees makes them a good fit for our problem, as our target variable is a binary categorical variable.
Random Forest is a supervised ensemble learning algorithm. Here, we have a collection of decision trees, known as a forest. To classify a new object based on its attributes, each tree gives a classification; we say the tree votes for that class. The forest chooses the classification with the most votes (over all the trees in the forest).
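The three classifier families above can be trained side by side with scikit-learn. This is a minimal sketch on synthetic data (the real project uses the preprocessed weather data), meant only to show the shared fit/score workflow:

```python
# Sketch: fit the three discussed classifier families on a toy
# binary-classification task and collect their test accuracies.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Keeping the models behind a common dict makes the later per-experiment comparisons a one-line loop.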
4. Model Evaluation: To evaluate our classifiers we used the evaluation metrics below.
Accuracy is the ratio of the number of correct predictions to the total number of input samples. It works well only if there is an equal number of samples in each class; since our data is imbalanced, we also consider other metrics.
Area Under Curve (AUC) is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier ranks a randomly chosen positive example higher than a randomly chosen negative example.
Precision is the number of correct positive results divided by the number of positive results predicted by the classifier.
Recall is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
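The four metrics can be computed with scikit-learn. The labels and scores below are made-up illustrations, not results from the project:

```python
# Sketch: the four evaluation metrics on hypothetical predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hard class predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted P(rain)

acc = accuracy_score(y_true, y_pred)    # correct / total
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
auc = roc_auc_score(y_true, y_prob)     # rank-based, threshold-free
```

Note that AUC is computed from the probabilities rather than the hard predictions, which is why it is robust to the choice of classification threshold.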
RESULT:
Experiment 1 - Original Dataset: After all the preprocessing steps (as described in the Methodology section), we ran each implemented classifier on the same input data and recorded the two metrics considered (10-fold stratified-CV accuracy and Area Under Curve) for every classifier.
Accuracy-wise, Gradient Boosting with a learning rate of 0.25 performed best; coverage-wise, Random Forest and Decision Tree performed worst.

Experiment 2 - Undersampled Dataset: After all the preprocessing steps (as described in the Methodology section), including the undersampling step, we ran each implemented classifier on the same input data and recorded the same two metrics (10-fold stratified-CV accuracy and Area Under Curve) for every classifier.
In both accuracy and coverage, Logistic Regression performed best and Decision Tree performed worst.

Experiment 3 - Oversampled Dataset: After all the preprocessing steps (as described in the Methodology section), including the oversampling step, we ran each implemented classifier on the same input data and recorded the same two metrics (10-fold stratified-CV accuracy and Area Under Curve) for every classifier.
In both accuracy and coverage, Decision Tree performed best and Logistic Regression performed worst.
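The resampling used in Experiments 2 and 3 can be illustrated with plain pandas sampling. This is a hedged sketch on a made-up imbalanced frame (column names are assumptions), showing random undersampling of the majority class and random oversampling of the minority class:

```python
# Sketch: balance a 90/10 class split by undersampling the majority
# class (Experiment 2) or oversampling the minority class with
# replacement (Experiment 3).
import pandas as pd

df = pd.DataFrame({"RainTomorrow": [0] * 90 + [1] * 10, "x": range(100)})
minority = df[df.RainTomorrow == 1]
majority = df[df.RainTomorrow == 0]

under = pd.concat([majority.sample(len(minority), random_state=0), minority])
over = pd.concat([majority,
                  minority.sample(len(majority), replace=True, random_state=0)])
```

Undersampling discards majority-class information, while oversampling duplicates minority rows; the differing winners across the two experiments reflect that trade-off.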
Discussion:
The first important thing we learned is the importance of knowing your data. While imputing missing values, we grouped by two other features and calculated the group mean instead of directly calculating the mean over all instances. This way our imputed values were closer to the correct information.

Another thing we learned about is leaky features. While exploring our data, we found that one of our features (RiskMM) had been used to generate the target variable, so it made no sense to use this feature for predictions.

We learned about the curse of dimensionality while dealing with categorical variables, which we solved using feature hashing.
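Feature hashing caps the encoded dimensionality at a fixed number of buckets instead of one column per category. A minimal sketch with scikit-learn (toy wind-direction inputs, bucket count chosen arbitrarily for illustration):

```python
# Sketch: hash category strings into 8 fixed buckets, so the output
# width does not grow with the number of distinct categories.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
directions = [["N"], ["SE"], ["WSW"], ["N"]]  # one list of tokens per row
X = hasher.transform(directions).toarray()
# Identical inputs hash to identical rows; width stays 8 regardless
# of how many distinct directions appear.
```

The trade-off is possible hash collisions between categories, which is usually acceptable when the bucket count is chosen generously.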
Conclusion:
We explored and applied several preprocessing steps and learned their impact on the overall performance of our classifiers. We also carried out a comparative study of all the classifiers with different input data and observed how the input data can affect the model predictions.

We can conclude that Australian weather is uncertain and that there is no strong correlation between rainfall and the respective region and time. We identified certain patterns and relationships in the data which helped in determining important features; refer to the appendix section.

As we have a huge amount of data, we could apply deep learning models such as a Multilayer Perceptron, a Convolutional Neural Network, and others. It would be interesting to perform a comparative study between the machine learning classifiers and deep learning models.
Future Work
While our research has provided valuable insights into rainfall prediction, there
are several areas for further exploration and improvement. Some potential
avenues for future work include:
1. Data Collection: Expanding the dataset used for training and testing the models by incorporating data from additional weather stations and sources. This would provide a more comprehensive and diverse dataset, leading to more accurate predictions.
2. Feature Engineering: Exploring new variables and features that could improve the predictive power of the models. This could include factors such as humidity, wind speed, and atmospheric pressure.
3. Model Optimization: Investigating different machine learning algorithms and techniques to optimize the models' performance. This could involve ensemble methods, deep learning, or hybrid models that combine multiple approaches.
4. Real-Time Prediction: Developing real-time prediction models that can provide up-to-date rainfall forecasts. This would require efficient data processing and modeling techniques to handle large volumes of streaming data.
