Water Quality Analysis and Prediction Using Machine Learning
Abstract: The main objective of this research is to estimate water quality using machine learning techniques. Water is a vital resource that affects many facets of human health and existence. People who live in metropolitan areas are often concerned about the quality of their water, which makes monitoring it critical. Water sample collection and laboratory analysis are time- and resource-intensive processes. Analyzing water quality is complicated because of the many variables that affect it, and the notion of quality is inextricably linked to the various purposes for which water is used. The goal of this study is to estimate water quality from several measured parameters using a machine learning method, Random Forest regression. The model uses parameters such as pH, turbidity, dissolved oxygen, conductivity, and others.

Keywords: Water Quality, Machine Learning, Random Forest regression, Decision Tree, Web User Interface (UI)

I. INTRODUCTION

Water is the most crucial resource for sustaining a wide range of life. Rapid industrialization has caused water quality to deteriorate at an astonishing pace. Poor water quality is recognized as the foremost reason for the spread of serious illnesses. Approximately 2 billion people worldwide consume water that has been contaminated with faeces. Drinking such contaminated water puts people at risk of water-borne illnesses. According to research by WaterAid India, one in ten individuals does not have access to safe drinking water. Water sources are increasingly polluted due to overuse. A key issue may be the limited purification methodology available to ensure water quality. According to a UNICEF report, water-borne ailments are estimated to cost India's economy over 600 million USD per year.

In the current scenario, determining water quality is a time-consuming and expensive process that involves extensive statistical analysis and considerable effort. The main objective of this work is to provide a water quality score and, based on the score calculated by the machine learning model, to suggest whether the water needs to be treated or purified.

II. LITERATURE SURVEY

The methods used to address water quality problems have been reviewed. In previous research, traditional laboratory analysis and statistical analysis are frequently employed to determine water quality, while other studies use machine learning approaches to identify the best possible solution to the water quality problem.

In [1], Osim Kumar Pal et al. argued that predicting the potability of water benefits water distribution and environmental protection. Contaminated water causes substantial waterborne infections and presents a hazard to human health, so estimating the quality of drinking water may lower the frequency of water-related diseases. Recent machine learning methods have demonstrated promising accuracy for water quality estimation. The study used five distinct learning algorithms to estimate drinking water quality. First, data was acquired from several public sources and reported in compliance with the water quality criteria of the World Health Organization (WHO). The Artificial Neural Network (ANN) achieved the highest accuracy, 99%, with a 0.75% training error during the training phase. Random Forest achieved an 87.76% F1 score and 82.45% prediction accuracy. The ANN achieved an F1 score of 96.51%.
In [2], Amir Hamzeh Haghibi et al. proposed an AI-based framework for water quality prediction. The study investigated the performance of AI approaches, namely ANN, the group method of data handling, and Support Vector Machines (SVM), for predicting water quality components of the River Tireh in Iran. Various types of transfer and kernel functions were tested separately to build the ANN and SVM. The results showed that both models had a good ability to predict water quality components. Seven parameters were considered: pH, SO4, Na, Ca, Cl, Mg, and HCO3.

In [3], Salisu Yusuf Muhammad et al. used machine learning techniques to implement a classification model for water quality. Based on machine learning algorithms, the article presented a suitable classification approach for water quality. The study analyzed and compared the performance of several classification models and algorithms to determine the main features that contributed to classifying the water quality of the River Kinta in Malaysia. Five models with their respective algorithms were assessed and compared. The K-Star algorithm, from the lazy family of models, was the most effective with 86.67% accuracy. In general, wastewater is dangerous to human lives, and developing sound models to address this problem is necessary.

In [4], Consolata Gakii et al. presented a classification model for water quality analysis based on decision trees. The study proposed a classification methodology that uses a decision tree to monitor water quality data from several areas of Kenya, in East Africa. Water quality is critical in ensuring that inhabitants consume water that is safe to drink. Using a decision tree as a data analysis approach to predict whether water is drinkable, based on water quality parameters, can assist research institute technicians by predicting which water test results should proceed to the next stage of the inquiry. Secondary data from the Kenya Water Institution was used to build the model, and the WEKA software was used to run it. Decision tree classification was used to separate clean from unclean water. Alkalinity, pH, and conductivity tests may all play a part in determining water quality. Five decision tree classifiers, namely the J48 algorithm, Logistic Model Tree, Random Forest, Decision Stump, and Hoeffding Tree, were used to build the model and assess its accuracy. The J48 decision tree achieved the highest accuracy, 94%, whereas Decision Stump attained the lowest, 83%.

In [5], C. Ashwini et al. presented a paper using the Internet of Things (IoT) and machine learning to monitor water quality. The paper provided a practical solution to prevent the water in private overhead tanks from being contaminated. IoT devices were used to observe the characteristics of the water, and machine learning algorithms were used to predict future contamination. To obtain the required water parameters, the proposed system included several types of sensors connected to a NodeMCU module. The system aids water conservation and is also cost effective. Seven parameters were considered and evaluated: pH, temperature, turbidity, color, dissolved oxygen, total organic carbon, and conductivity.

III. PROPOSED APPROACH AND METHODOLOGY

A. Methodology Flow Diagram

Fig. 1. Flow Diagram of Methodology

B. Data Collection

The dataset collected from Kaggle consists of 1992 samples from various Indian rivers for the years 2004 to 2014. It contains several river parameters, such as year, Dissolved Oxygen, Temperature, pH, Conductivity, Biochemical Oxygen Demand (B.O.D), Nitrate Nitrogen, and Fecal Coliforms.

The trained model is used to predict the water quality of user-entered data. The data is accepted through a web application. After the water quality score is calculated, the predicted score is displayed in the web application together with a suggestion of what measure should be taken before consuming the water.

Fig. 2. Interface for accepting user inputs

C. Data Preprocessing

The data collected from Kaggle contains null values and duplicate values, so it must be cleaned before it is analyzed. After cleaning, the values are normalized to a 0 to 100 range using q-values so that the Water Quality Index (WQI) can be calculated from the available parameters. After the WQI is calculated, all actual data values are normalized using the z-score, which places the values on the same scale. The rest of the procedure follows.
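The cleaning and z-score steps described in the preprocessing section can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the column names (`ph`, `do`) and the sample rows are hypothetical, the population standard deviation is an assumed choice, and the paper does not state which tools were actually used.

```python
from statistics import mean, pstdev


def clean_rows(rows):
    """Drop rows that contain a missing value, then drop exact duplicates.

    The original row order is preserved.
    """
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # null value present: discard the row
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"ph": 7.0, "do": 6.0},
    {"ph": 7.0, "do": 6.0},   # duplicate row
    {"ph": None, "do": 5.5},  # row with a null value
    {"ph": 8.0, "do": 4.0},
    {"ph": 6.0, "do": 8.0},
]

cleaned = clean_rows(rows)
ph_z = z_scores([row["ph"] for row in cleaned])
```

After standardization every column has mean 0 and standard deviation 1, so parameters measured in different units contribute on a comparable scale.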
3   pH                       6.5 to 8.5
4   Chemical Oxygen Demand   250 mg/L
5   Conductivity             0.1 to 1.5 mS/cm
6   Nitrate-Nitrogen         45 mg/L
7   Alkalinity               200 mg/L

Fig. 3.2. Univariate Analysis of normalized parameters (box plot)

Fig. 3.3. Univariate Analysis on WQI of each parameter (box plot)

D. Data Analysis

1) Normalization
To normalize the water quality parameters, Q-value normalization is employed to map each parameter to the range 0 to 100, which simplifies the index computation. Every parameter that is used in the model is normalized.
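One common way to realize a 0 to 100 Q-value is a piecewise-linear rating curve per parameter, followed by a weighted average to obtain the index. The sketch below assumes that approach. The curve breakpoints and the weights are hypothetical placeholders (only the 6.5 to 8.5 pH range is taken from the limits listed earlier), and the paper does not give the exact curves or weights it used.

```python
def q_value(x, points):
    """Map a raw reading to a 0-100 score with a piecewise-linear rating curve.

    `points` is a list of (reading, score) pairs sorted by reading.
    Readings outside the covered range are clamped to the end scores.
    """
    if x <= points[0][0]:
        return float(points[0][1])
    if x >= points[-1][0]:
        return float(points[-1][1])
    for (x0, q0), (x1, q1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            # Linear interpolation inside the segment that contains x.
            return q0 + (q1 - q0) * (x - x0) / (x1 - x0)


def wqi(q_scores, weights):
    """Weighted average of the per-parameter Q-values."""
    total_weight = sum(weights[name] for name in q_scores)
    weighted_sum = sum(q_scores[name] * weights[name] for name in q_scores)
    return weighted_sum / total_weight


# Hypothetical rating curves and weights, for illustration only.
PH_CURVE = [(2.0, 0), (6.5, 100), (8.5, 100), (12.0, 0)]
DO_CURVE = [(0.0, 0), (6.0, 100)]
WEIGHTS = {"ph": 0.4, "do": 0.6}

sample = {"ph": 7.2, "do": 3.0}
q = {
    "ph": q_value(sample["ph"], PH_CURVE),
    "do": q_value(sample["do"], DO_CURVE),
}
score = wqi(q, WEIGHTS)  # about 70 for these made-up inputs
```

Because every Q-value lies between 0 and 100 and the weights are normalized by their sum, the resulting index also stays between 0 and 100.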
Multivariate analysis is the statistical analysis of data in which several measurements are taken on each experimental unit, and in which the correlations between those measurements and their structure are crucial.

Heat map: A heat map visualization expresses numerical data visually by using color to signify the value of each data point. The warm-to-cool color scheme is the most widely used, with warm colors indicating high-value data points and cool colors representing low-value data points.

Fig. 3.6a and 3.6b. Multivariate Analysis (heat map)

3) Correlation Analysis

Correlation analysis is used to uncover possible correlations between parameters, in order to identify dependent variables and to estimate hard-to-measure variables from readily obtainable parameters.

Decision tree learning falls under supervised learning and can be used for both classification and regression problems. A decision tree uses a tree representation to solve the problem, with each leaf node corresponding to a class label and the attributes represented on the internal nodes of the tree. Decision trees can handle both categorical and numerical data. The decision tree topology is a tree-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds the resulting decision.

2) Random Forest Regression

A random forest is a meta-estimator that fits a number of decision trees on distinct sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.

G. Data Splitting

The final step before applying the machine learning algorithm is data splitting. After the model is trained, it is tested on a held-out portion of the data to evaluate its performance and compute the accuracy measurements.

In Fig. 5, the data is split for training and testing, with 80% of the data used for training and 20% for testing. After splitting, the training data is fitted with a random forest regression algorithm.

The Mean Absolute Error (MAE) is the average of the absolute differences between the actual and predicted values in the dataset. It measures the average magnitude of the residuals.
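The 80/20 split, the random forest regression fit, and the MAE computation described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the data is synthetic (it stands in for the normalized river parameters and the WQI target), and `n_estimators=100` and the random seeds are assumed values. For n test samples, MAE = (1/n) * sum(|y_i - yhat_i|).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# 80% of the samples for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest on the training portion only (hyperparameters are assumptions).
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# MAE from the library, and the same quantity computed by hand.
mae = mean_absolute_error(y_test, y_pred)
mae_manual = float(np.mean(np.abs(y_test - y_pred)))

# Reference point: MAE of always predicting the training mean.
baseline_mae = float(np.mean(np.abs(y_test - y_train.mean())))
```

Comparing `mae` with `baseline_mae` gives a quick sanity check that the model has learned something beyond the average target value.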
Fig. 6.4. UI after calculating results

V. CONCLUSION

The examination of water quality is a complex issue because of the many variables that affect it. This idea, in particular, is inextricably linked to the many purposes for which water is used. Different needs call for various standards. There is a

REFERENCES

[1] Osim Kumar Pal (2022), “The Quality of Drinkable Water using Machine Learning Techniques”, International Journal of Advanced Engineering Research and Science.
[2] Amir Hamzeh Haghibi, Ali Heidar Nasrolahi, Abbas Parsaie (2018), “Water quality prediction using machine learning”, Journal of Water quality research.
[3] Salisu Yusuf Muhammad, Mokhairi Makhtar, Azilawati Rozaimee, Azwa Abdul Aziz and Azrul Amri Jamal (2015), “Classification Model for Water Quality using Machine Learning Techniques”, International Journal of Software Engineering and Its Applications, Vol. 9, No. 6, pp. 45-52.
[4] Consolata Gakii and Jeniffer Jepkoech (2019), “A Classification Model for Water Quality Analysis Using Decision Tree”, European Journal of Computer Science and Information Technology, Vol. 7, No. 3, pp. 1-8.
[5] C. Ashwini, Uday Pratap Singh, Ekta Pawar, Shristi (2019), “Water Quality Monitoring Using Machine Learning And IOT”, International Journal of Scientific & Technology Research, Vol. 8, Issue 10.
[6] J. I. Ubah, L. C. Orakwe, K. N. Ogbu, J. I. Awu, I. E. Ahaneku and E. C. Chukwuma (2021), “Forecasting water quality parameters using artificial neural network for irrigation purposes”, Nature.com/scientific-reports.
[7] Nida Nasir, Afreen Kansal, Omar Alshaltone, Feras Barneih, Mustafa Sameer, Abdallah Shanableh, Ahmed Al-Shamma'a (2022), “Water quality classification using machine learning algorithms”, Journal of Water Process Engineering.
[8] Zhao Fu (2020), “Water Quality Prediction Based on Machine Learning Techniques”, UNLV Theses, Dissertations, Professional Papers, and Capstones.
[9] Gazzaz N.M., Yusoff M.K., Aris A.Z., Juahir H., Ramli M.F. (2012), “Artificial neural network modeling of the water quality index for Kinta River (Malaysia) using water quality variables as predictors”, Marine Pollution Bulletin, Vol. 64, Issue 11, pp. 2409-2420.
[10] M K Daud, Muhammad Nafees, Shafaqat Ali, Muhammad Rizwa, Raees Ahmad Bajwa, Muhammad Bilal Shakoor, Muhammad Umair Arshad, Shahzad Ali Shahid Chatha, Farah Deeba, Waheed Murad, Ijaz Malook, Shui Jin Zhu (2017), “Drinking Water Quality Status and Contamination in Pakistan”, BioMed Res. Int., 2017, 7908183.
[11] Jianhua Dong, Guoyin Wang, Huyong Yan, Ji Xu and Xuerui Zhang (2015), “A survey of smart water quality monitoring system”, Environmental Science and Pollution Research, Vol. 22, pp. 4893–4906.
[12] Ali Najah Ahmed, Faridah Binti Othman, Haitham Abdulmohsin Afan, Rusul Khaleel Ibrahim, Chow Ming Fai, Md Shabbir Hossain, Mohammad Ehteram, Ahmed Elshafie (2019), “Machine learning methods for better water quality prediction”, Journal of Hydrology.
[13] Ishaani Priyadarshini, Ahmed Alkhayyat, Ahmed J. Obaid, Rohit Sharma (2022), “Water pollution reduction for sustainable urban development using machine learning techniques”, Cities, Vol. 130.
[14] Mengyuan Zhu, Jiawei Wang, Xiao Yang, Yu Zhang, Linyu Zhang, Hongqiang Ren, Bing Wu, Lin Ye (2022), “A review of the application of machine learning in water quality evaluation”, Eco-Environment & Health, Vol. 1, Issue 2, pp. 107-116.
[15] Jui-Sheng Chou, Chia-Chun Ho, Ha-Son Hoang (2018), “Determining quality of water in reservoir using machine learning”, Ecological Informatics, Vol. 44, pp. 57-75.
[16] Mourade Azrour, Jamal Mabrouki, Ghizlane Fattah, Azedine Guezzaz and Faissal Aziz (2021), “Machine learning algorithms for efficient water quality prediction”, Modeling Earth Systems and Environment, Vol. 8, pp. 2793–2801.
[17] Vesna Ranković, Jasna Radulovic, Ivana D. Radojevic, Aleksandar Ostojić, Ljiljana Čomić (2010), “Neural network modeling of dissolved oxygen in the Gruža reservoir, Serbia”, Ecological Modelling, Vol. 221(8), pp. 1239-1244.
[18] Rajiv Das Kangabam, Sarojini Devi Bhoominathan, Suganthi Kanagaraj and Munisamy Govindaraju (2017), “Development of a water quality index (WQI) for the Loktak Lake in India”, Applied Water Science, Vol. 7, pp. 2907–2918.
[19] Thair Khayyun, Abdul Hameed M Jawad Al Obaidy, Ayad Mustafa (2014), “Prediction of Water Quality of Euphrates River by Using Artificial Neural Network Model (Spatial And Temporal Study)”, International Research Journal of Natural Sciences, Vol. 2, No. 3, pp. 25-38.
[20] Vaishnavi V. Daigavane, M. A. Gaikwad (2017), “Water Quality Monitoring System Based on IOT”, Advances in Wireless and Mobile Communications, ISSN 0973-6972.
[21] Yogalakshmi S. and Mahalakshmi A. (2021), “Efficient Water Quality Prediction for Indian Rivers Using Machine Learning”, Asian Journal of Applied Science and Technology, Vol. 5, Issue 1, pp. 100-109.
[22] Umair Ahmed, Rafia Mumtaz, Hirra Anwar, Asad A. Shah, Rabia Irfan and José García-Nieto (2019), “Efficient Water Quality Prediction Using Supervised Machine Learning”, Environmental Chemistry of Water Quality Monitoring.