A Soft Sensor Based Inference Engine for Water Quality Assessment and Prediction

Abstract

Toxic contaminants with potential exposure to health risks now appear in urban drainage systems. These contaminants may be measured in specific concentrations or gross terms with impact on potential of hydrogen (pH) and dissolve oxygen (DO). Often, urban slum settlements are at the effluents of these drainage systems. These communities mostly lack functional water treatment facilities. They drink and sanitize from run-off, and shallow wells in alluvial aquifers. This is responsible for the poor health and low quality living conditions of urban slum dwellers. It becomes imperative to implement reliable, real-time, water quality assessment and prediction systems to resolve known and emerging water quality issues in order to improve the quality of life in such communities. Thus, this work developed an inference engine for robust water quality assessment and prediction using some statistical and intelligent discriminant functions. Results show that machine learning algorithms including the Logistic Regression, Decision Trees, Random Forest, XGBoost, and Neural Networks schemes reliably predicted water potability in the absence of two missing instrumentation parameters namely: pH and DO. Typically, the inference engine estimated pH in the absence of DO. Subsequently, the Dissolved Oxygen was estimated using the predicted pH. In outcomes, computational metrics of the pH prediction model returns Mean Squared Error (MSE) of 0.0163 for training and 0.1024 for validation. Similarly, the Dissolved Oxygen was estimated by the Random Forest model with a mean squared error (MSE) of 1.0335 for training and 0.7150 for validation. Decision engines built on statistical discriminant functions reliably predicted water potability with validation accuracy and F1 score of 95.19% and 96.94% for Logistic Regression, 100% and 100% for Decision Trees, 98.97% and 99.35% for Random Forest, and 100% and 100% for XGBoost. Also, the Neural Networks scheme predicted water potability with an MSE of 98.97% and 99.35% for training and validation respectively. The developed soft sensor solution is applicable in situations with complete or incomplete instrumentation data. Summarily, the developed inference engines provide a reliable framework for fault diagnosis and serve as backup sensors in case of failure. This enhanced the validation and reliability of sensor data for water quality assessment and prediction.

Keywords

Discriminant functions inference engine instrumentation data instrumentation parameters machine learning soft sensors water potability water quality assessment

Introduction

Water is very important to human existence. It is by far one of the most important resources on Earth (Elubid et al., 2019). According to several studies, less than 1% of the available water resources are fit for human use (Yaro et al., 2023). Therefore, it is important to have a technique for evaluating the quality of groundwater systems using some index parameters. Since access to clean and potable water is crucial for human well-being and the entire ecosystem, the geographic distribution of freshwater is an important metric for assessing public health and its sustainability (Elubid et al., 2019). The distribution of fresh water varies widely across the globe and mostly influenced by geographical factors. Within the same geographic location, fresh water distribution may differ between urban and rural areas (Aji Setiawan et al., 2020). In urban areas, centralized water treatment and distribution systems are basic facilities. On the other hand, communities mostly rely on decentralized sources such as wells, boreholes, or rivers (Maryati et al., 2022). Potable water is water that is safe to drink and free from harmful contaminants, pathogens, and pollutants. Regions with abundant freshwater sources and well-developed infrastructure and water treatment systems generally have a higher percentage of potable water. Most people in industrialized nations have seamless access to drinkable water. These include people in Western Europe, the United States, Canada, etc. (Bernabé-Crespo & Loáiciga, 2024). Nevertheless, in developing nations, the availability of safe drinking water is of significant concern. Based on data provided by the World Health Organization (WHO), it is estimated that a staggering 2.2 billion individuals worldwide do not have adequate access to potable water (WHO & UNICEF, 2019). In the aforementioned places, there exists a heightened incidence of waterborne diseases. This gives rise to apparent health compromise and relatively high mortality (Lee et al., 2020). The Water Quality Index (WQI) is employed to categorize the quality of drinking water based on multiple physical, chemical, and biological parameters. This tool is among the many accessible options for representing data pertaining to the nature of water. It aggregates various water quality parameters (Uddin et al., 2021). This rating assesses the influence of several factors on the overall characteristics of water. The metrics under consideration include Electrical Conductivity (EC), Total Dissolved Solids (TDS), Chloride, Calcium, Magnesium, Sodium, Potassium, Sulphate, Total Hardness, Bicarbonate, pH, Carbonate, and Sodium Adsorption Ratio (SAR), among others. The interpretation of these parameters can be very cumbersome for the general public and even policy makers. That is why the WQI is very important since it aggregates all the parameters.

Water quality parameters consist of a range of physical, chemical, and biological attributes that ascertain the appropriateness of water for diverse applications. These include drinking, irrigation, and the sustenance of aquatic organisms. These parameters are used to measure the quality of water and determine whether it meets specific water quality standards (Omer, 2019). There are various water quality parameters that are frequently employed in the evaluation of water quality. They include temperature, pH, dissolved oxygen, and total dissolved solid. Others are Biological oxygen demands, turbidity, bacteria, pathogens, etc. These parameters are measured through various methods, including laboratory testing, field monitoring, and remote sensing. The objective is to ensure that they meet specific water quality standards to protect public health, aquatic life, and the environment.

Sensors play critical role in monitoring water quality. They support continuous measurement of key water quality parameters. However, occasionally these sensors could fail for a variety of reasons. When they do, they can have significant consequences for water quality monitoring. These may be inaccurate or incomplete data, false alarms, or missed alerts. The penalty may be poor public health, increased maintenance costs and downtime. This can lead to a range of issues, including delays in official response to contamination events, inaccurate risk assessments, and incorrect management decisions. To address these issues, it is important to have a reliable system in place for sensor maintenance and replacement. This can involve regular calibration and testing of sensors, as well as contingency plans for responding to sensor failures. Additionally, it is important to have backup sensors and redundancy in the monitoring system to ensure that data collection can continue even if individual sensors fail.

Soft sensors are essentially parameter estimation backup tools for measuring systems. They facilitate what-if analysis, real-time prediction for plant control, sensor validation, and fault diagnosis strategies. Basically, they are software that measure and process several signals in a plant in order to estimate the value of another variable of the system (Lima et al., 2022). In plants, a measuring device may get damaged due to extreme working conditions. Sometimes, the cost of acquiring some devices might be a concern. The challenge could also include inability to install specific measuring devices due to certain limitations in technologies. In situations like this, soft sensors are found useful as an alternative way of measuring, or estimating missing parameters or variables in the plant.

A soft sensor could be model-driven (white box) or data-driven (black box). The model-driven soft sensor is modeled with the first principle approach by understanding the physics of the system while the data-driven soft sensor relies on information derived from the historical data of the system from which empirical models are built. Soft sensors have helped governments to implement regulations in industrialized countries. An example is the Kyoto treaty aimed at reducing the emission of greenhouse gasses (Fortuna et al., 2006). In a nutshell, a soft sensor, like an observer, is a plant model for estimating plant variables.

In recent times, there has been a notable surge in interest within the domain of water quality evaluation, and potability prediction. More attention is now on the use of predictive models and machine learning algorithms (Ighalo & Adeniyi, 2020). Prior to the emergence of intelligent tools, evaluation of water quality predominantly depended on manual examination and chemical analysis of water samples obtained from different sources. These methods are known to take significant time and financial resources to realize. In addition, it is important to note that the static sampling of water quality parameters may not effectively account for the dynamic fluctuations that might occur due to a multitude of environmental influences (Ma et al., 2020). To overcome these constraints, water quality experts have resorted to employing data-driven methodologies to determine and possibly predict the potability of water from different sources. In this regard, the application of machine learning (ML) algorithms methods has demonstrated significant potential in the prediction of water potability. Thus, this work considered a range of machine learning methods for water quality assessment and prediction. The methods include models that are based on decision trees, support vector machines, random forests, and neural networks.

The models utilized in this study have been trained using historical data on water quality. The outcomes demonstrated the capability of the models to forecast the potability of water by considering a diverse set of inputs, such as pH levels, turbidity, conductivity, and dissolved solids, as well as a variety of chemical and biological characteristics.

The selection of relevant predictive variables is a crucial factor in the prediction of water potability. Researchers have conducted investigations into the significance of several water quality criteria in the prediction of potability. For example, the work of Sharma et al. (2019) revealed that some measures, such as turbidity, sulfate, and electrical conductivity, exhibited significant correlations with water quality. Similarly, an independent investigation conducted by Aziz et al. (2020) underscored the importance of pH, chloride, and total dissolved solids (TDS) as reliable indicators for properly predicting potability.

Furthermore, in the quest for real-time monitoring of water quality, accessibility and data quality are pivotal factors in the advancement and accuracy of prediction models. Up to this time, researchers have relied on historical water quality data, meteorological data and some geographic information for developing prediction models. Nevertheless, some issues relating to data consistency, the presence of missing values (which is the focus of this study), and the overall quality of the data have been established (Miller et al., 2005). In order to handle these concerns, various data preprocessing approaches, including imputation and data augmentation, have been employed to enhance the dependability of predictive models. Thus, assessment of the accuracy of water potability prediction models is crucial in ascertaining dependability and precision. In this process, metrics such as accuracy, precision, recall, and F1-score are frequently employed by researchers to evaluate the performance of a model (Rani et al., 2023). Additionally, receiver operating characteristic (ROC) curves and area under the curve (AUC) values are used to measure the model’s ability to discriminate between potable and non-potable water.

In some model-specific outcomes of soft sensors for water potability, Usman Kaoje et al. (2021) employed a machine learning methodology to forecast the water quality indicators of the Kelantan River. This was achieved by utilizing historical data obtained from many stations. The prediction model was developed using a Support Vector Machine (SVM). The study made predictions about six water quality measures, including dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), ammonia nitrogen (NH₃-N), and suspended solids (SS). The study’s findings revealed that suspended solid parameters had the most favorable performance in terms of prediction, as evidenced by its highest R² score values. In contrast, the R² score for the prediction of the COD parameter was found to be the lowest, suggesting that the dataset poses challenges in terms of its suitability for modeling using SVM. On the other hand, an examination of the impact of attribute number reveals a positive correlation between the prediction accuracy of the four parameters (DO, BOD, NH₃-N, and SS) and the model’s performance. However, optimal estimation of the pH parameter is achieved by employing the minimum amount of features identified in the scheme.

From the foregoing, it is apparent that predicting water potability requires the knowledge of all key quality parameters and involving them in the model to predict potability. This might suffer a setback in case of sensor(s) omission, failure, or malfunction on the field. These are very common situations on the field. Therefore, it is useful to develop smart soft sensors that estimate missing key parameters when the need arises. They also stand in as backup sensors pending when the sensors’ fault is diagnosed and resolved. This study therefore focuses on the use of a soft sensor as a parameter estimation tool for fitting values of missing water quality parameters to predict water potability.

Materials and Methods

Data availability

The dataset used in this study is “Water Quality Dataset for Water Quality Prediction Tool” by Geetha J. (2023) obtained from Hydroshare (https://www.hydroshare.org/resource/4ab43e1b507b496b9b42749701daed5c/). It consists of 1,361 samples acquired from 31 unique capital cities in India. The cities are listed thus: Shimla, Chandigarh, Jammu, Dehradun, Lucknow, Patna, Kolkata, New Delhi, Bhopal, Jaipur, Ranchi, Dispur, Kohima, Gangtok, Gandhinagar, Mumbai, Raipur, Bhubaneswar, Hyderabad, Bengaluru, Chennai, Thiruvananthapuram, Pondicherry, Panaji, Telangana, Bangalore, Daman, Silvassa, Imphal, Agartala, and Shillong. The dataset has fourteen (14) features and eight (8) significant quality parameters. These include temperature, D.O., pH, conductivity, BOD, nitrate, fecal coliform, and total coliform. Lastly, it has a target label “class” which is a categorical data with the value “yes” if the water is potable and “no” if it is not. A cross-section of the data is shown in Table 1 while the classification of the target labels is shown in Table 2.

Table 1.

A Cross-Section of Water Quality Dataset Showing the Features and Labels Obtained From a Snapshot of the Data Head on Jupyter Notebook (Accessed Date: 10/12/2023).

Station code	Locations	Lat	Lon	Capital city	State	Temperature	D.O	pH	Conductivity	B.O.D	Nitrate	Fecal coliform	Total coliform	class
1001	BEAS AT U/S MANALI	32.24495	77.19108	Shimla	HIMACHAL PRADESH	9	9	8	85	0.1	0.2	106	397	yes
1002	BEAS AT D/S KULU	31.96058	77.11401	Shimla	HIMACHAL PRADESH	10	9	8	102	0.3	0.4	153	954	yes
1003	BEAS AT D/S AUT	26.88789	75.81148	Shimla	HIMACHAL PRADESH	11	9	8	96	0.2	0.3	58	653	yes
1004	BEAS AT U/S PANDON DAM	47.35194	19.63362	Shimla	HIMACHAL PRADESH	13	9	8	94	0.2	0.4	34	317	yes
1005	BEAS AT EXIT OF TUNNEL DEHAL POWER HOUSE	25.99279	91.82611	Shimla	HIMACHAL PRADESH	14	10	8	112	0.2	0.5	213	1072	yes

Table 2.

Classification of Targets Before and After Preprocessing.

Stage	Total samples	Yes	No
Before preprocessing	1,361	1,063	268
After preprocessing	967	751	216

Data preprocessing

All samples in the dataset with missing values were first removed, reducing the number of samples to 967. The new classification of target labels is shown in Table 2. The station code, location, latitude, longitude, capital city, and state which are not significant parameters contributing to water potability were taken off. As shown in the correlation analysis in Figure 1, the fecal coliform feature is highly correlated with the total coliform feature with a correlation coefficient of 0.85. Therefore, the fecal coliform feature was also taken off. The final features used are: temperature, D.O., pH, conductivity, B.O.D, nitrate concentration, and total coliform.

Figure 1.

Correlation coefficient among water quality parameters.

Before the stages of model training and validation, the training and validation datasets were normalized due to the highly disparate scales used to measure the various water quality parameters.

The StandardScaler, also known as Z-score normalization or standardization, is a method used to transform features (variables) in a dataset such that the standard deviation of each feature is 1 and the mean is 0 (Rahmad Ramadhan & Yolanda, 2024).

Mathematically, the formula for standardization is:

X_{s t a n d a r d i z e d} = \frac{X - μ}{σ}

Where:

$X_{s t a n d a r d i z e d}$ is the standardized value of the data point X, $X$ is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature.

Machine learning algorithms

Many machine learning algorithms are used in this work to build the water potability decision making model. The algorithms used in this study are logistic regression, decision trees classifier, decision trees regressor, random forest classifier, random forest regressor, XGBoost classifier, and XGBoost Regressor. Random forest regressor and XGBoost regressor were also used to develop soft sensors (which are parameter estimators) to estimate pH and dissolved oxygen respectively. The flow chart in Figure 2 describes the processes and steps taken to develop the models. These include loading of the dataset, filtering, scaling, training, and validating the models.

Figure 2.

Flow chart for machine learning models for soft sensor development and application.

Performance metrics

I. Mean Squared Error (MSE): is a metric used to measure the average squared difference between the predicted and actual values within a dataset. It offers insights into the predictive accuracy of regression models by showing how close a regression line is to a set of data points.

M S E = \frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - {\hat{Y}}_{i})}^{2}

$M S E :$ Mean Square error

$n :$ Number of Data points

$Y_{i} :$ Observed Values

$\hat{Y} :$ Predicted Values

II. R-Squared (R²): R-Squared is a statistical measure of how well a regression model fits data it was modeled for. R-Squared measures how much variability in the target is explained by the variables in the model with values ranging from 0 to 1. A value of 0 indicates that the model does not capture the relationship between the dependent and independent variables while a value of 1 indicates that the model perfectly fits the data.

R^{2} = 1 - \frac{R e s i d u a l S u m o f S q u a r e s}{T o t a l S u m o f S q u a r e s}

o r R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \underline{y})}^{2}}

$R^{2} :$ R-Squared Value

$n :$ Number of Data points

$y_{i} :$ Observed Values

${\hat{y}}_{i} :$ Predicted Values

$\underline{y} :$ Mean of Observed Values

III. Accuracy: Accuracy is the ratio of accurately predicted samples to the total number of samples in a dataset. It gives an overall measure of the likelihood of the model accurately predicting an outcome. However, this metric may not be appropriate when there is imbalance in class distribution.

A c c u r a c y = \frac{N u m b e r o f C o r r e c t P r e d i c t i o n s}{T o t a l N u m b e r o f P r e d i c t i o n s}

IV. Precision: Precision measures the ratio of the true positives to the total number of positive predictions made by the model. This metric is useful when minimal or zero false positives are wanted such as in medical diagnoses.

P r e c i s i o n = \frac{T r u e P o s i t i v e s}{T r u e P o s i t i v e s + F a l s e P o s i t i v e s}

V. Recall: Recall measures the ratio of true positives to the total number of actual positives in the dataset. This metric is important in situations where false negatives are costly.

R e c a l l = \frac{T r u e P o s i t i v e s}{T r u e P o s i t i v e s + F a l s e N e g a t i v e s}

VI. F1-score: F1-score is the output of the harmonic mean of precision and recall. This is a more effective metric for evaluating inaccurate predictions compared to the accuracy.

F 1 S c o r e = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

VII. Confusion Matrix: Confusion matrix is a representation of the performance of the classifier model by showing the number of true positives, the number of true negatives, false positives, false negatives, and true positives.

Results and Discussion

Soft sensors: Water quality parameters estimation

In this study, we developed a soft sensor designed to estimate or predict the pH level of water based on five key water quality parameters: temperature, conductivity, biochemical oxygen demand (B.O.D.), nitrate, and total coliform counts. Additionally, we created a second soft sensor to estimate dissolved oxygen (D.O.) levels, incorporating the same five parameters along with the estimated pH value obtained from the first sensor. The 80-20 split strategy (Joseph, 2022) was used for all the regressor models. However, the 70-30 split strategy was also attempted but the performance was poor considering that the MSE for all the statistical models were greater than their respective 80-20 counterparts both in training and validation (Suwadi et al., 2022).

PH estimation model

Different machine learning algorithms namely: decision tree regressor, random forest regressor, XGBoost regressor, and neural networks were implemented to develop models that estimate the pH.

For the pH model, as seen in Figure 3, the minimum validation error is 0.1024 while the maximum is 0.1581 obtained from the random forest and neural network schemes respectively. The random forest algorithm outperformed other algorithms except the decision tree which is closest to it.

Figure 3.

Comparison of the MSEs of the ML Algorithms for predicting the pH.

Figure 4 shows how the predicted pH compares with the actual pH with Random Forest algorithm. The plot of the predicted pH shows a good agreement with the plot of the actual pH except for a few spikes that could have been classified as outliers. From the regression plots shown in Figure 5a, 5b, and 5c, the R² values for validation are 0.2933, 0.4963, and 0.3018 for decision tree regressor, random forest regressor, and XGBoost regressor, respectively.

Figure 4.

Comparative Performance of Random Forest Predicted and Actual pH: (a) decision tree regressor for pH, (b) random forest regressor for pH, and (c) XGBoost regressor for pH.

Figure 5.

The plot of predicted and actual values of pH obtained from different ML Algorithms.

Dissolved oxygen estimation model

Similar to the pH estimation, random forest, decision trees, XGBoost, and neural network models were used to predict dissolved oxygen. In this case, the minimum validation MSE is 0.7075 while the maximum is 1.2626 which are obtained for Decision tree and XGBoost respectively. Both extreme MSEs here are more than what we have in the pH model. This behavior is justified by the introduction of the predicted pH values used to estimate the Dissolved Oxygen. Unlike the pH model, the neural network algorithm performs fairly well with lower training and validation MSEs. For the D.O. model, as seen in Figure 6, the minimum validation error is 0.7075 while the maximum is 1.2626 obtained from the XGBoost and decision tree schemes respectively. The XGBoost algorithm outperformed other algorithms in this case from our research.

Figure 6.

Comparison of the MSE ML Algorithms for predicting Dissolved Oxygen.

Figure 7 illustrates the comparison between the actual D.O. values and D.O. predicted using the XGBoost algorithm. The plot of the predicted D.O. (in red) shows a good agreement with the plot of the actual D.O. (in blue) just as in case of pH, but also has few spikes that could have been classified as outliers.

Figure 7.

Comparison between the actual D.O. and D.O. predicted using the XGBoost algorithm.

Also, as shown in Figure 8a–c, in the regression plots of the predicted D.O. values and actual D.O. values, the R² values for validation are 0.3132, 0.6110, and 0.6040 for decision tree regressor, random forest regressor, and XGBoost regressor respectively.

Figure 8.

Actual vs predicted values of dissolved oxygen obtained from different ML algorithms: (a) decision tree regressor for D.O, (b) random forest regressor for D.O, and (c) XGBoost regressor for D.O.

Water potability prediction

Water is considered to be potable when it is safe for drinking and free from harmful contaminants, pathogens, and pollutants. According to Geetha Jenifel (2023) from the dataset used in this study, the water is potable when the target label class has the value “yes.” When the water is not potable, it has the value “no.” The normalized features (temperature, D.O, pH, conductivity, B.O.D, nitrate, and total coliform) were used to train a logistic regression model. A 70-30 split strategy was used to train the logistic regression model. This gave 81.09% validation accuracy. Feature engineering was carried out on the dataset to improve the performance of the model. To be precise, with different degrees of polynomials, the polynomial regression method was used to transform the features up to the eighteenth degree. The optimum order occurring at the tenth degree happened to be the optimal degree beyond which the training accuracy continued to increase while the validation accuracy started to decline. This indicated an overfitting problem beyond tenth order. Decision tree classifier, random forest classifier, and XGBoost classifier algorithms were also implemented for predicting potability. The 80-20 split strategy was used for all these.

Ultimately, a Neural network model was trained to recognize the pattern and extract water quality features using a three-layer network. Its hyper-parameters are as follows:

Layer 1: 120 Neurons, Relu activation function, and L2 with lambda = 0.0001

Layer 2: 60 Neurons, Relu activation function, and L2 with lambda = 0.0001

Layer 3: 1 Neurons, Linear activation function

The model was trained for 5,439 epochs.

The performances of the different schemes are summarized in Table 3. F1 score and accuracy score obtained for prediction by the logistic regression was 95.19 and 96.94, respectively. Other schemes gave relatively higher accuracies and F1 scores.

Table 3.

Summary of the Performance Metrics and Hyper-Parameters for Water Potability.

ML algorithm	Validation accuracy (%)	Validation F1 score (%)	Polynomial degree	No of layers	No of estimators	Maximum depth	Minimum samples split
Polynomial Regression	95.19	96.94	10	N/A	N/A	N/A	N/A
Decision Tree Classifier	100	100	N/A	N/A	N/A	4	50
Random Forest Classifier	98.97	99.35	N/A	N/A	5	5	100
XGBoost Classifier	100	100	N/A	N/A	500	N/A	N/A
Neural Network	98.97	99.35	N/A	3	N/A	N/A	N/A

As shown in Figure 9, The Decision Tree and XG Boost algorithms correctly identified all potable and non-potable water samples. However, Random Forest Classifier and Neural Network models wrongly classified two (2) samples that are potable as non-potable.

Figure 9.

The confusion matrix for the different machine learning models prediction models: (a) decision tree classifier, (b) random forest classifier, (c) XGBoost classifier, and (d) neural network.

Graphical user interface for water potability prediction

A graphical user interface was developed for the prediction algorithm. The application was built with Python and Django framework. The developed machine learning models were deployed on the website applications.

Plate 1 shows a state where all parameters are provided to the system except pH and D.O. After clicking the “check potability” button on the web page as shown in Plate 1, pH and D.O. values are predicted and displayed as shown in Plate 2 as approximated values. The prediction of the water potability is also shown as water is potable or water is not potable.

Plate 1:

Plate 1: User interface for the water potability prediction website application showing the inputs

Plate 2:

Plate 2: User interface for the water potability prediction website application showing the result

Summary and Conclusions

Sensors are essential in monitoring and assessing water quality because of the critical nature of their role. It is important they are also monitored and are in good working condition with high precision. When they fail and are not able to acquire and transmit data properly, it impacts negatively on the whole monitoring framework. This study addresses such challenges by developing robust inference engines referred to as soft sensors. The soft sensors are built with advanced machine learning models capable of predicting two critical parameters: pH and D.O. in case of incomplete instrumentation data. The results show high predictive accuracy across models. Among the pH soft sensors, Random Forest had the best performance with 0.1024 validation MSE while XGBoost achieved the best among the D.O. soft sensors with 0.9483 validation MSE. The developed soft sensors provide a reliable framework for fault diagnosis and serve as backup sensors in case of failure. This work highlights the potential of soft sensor-based solutions for improving water quality assessment and prediction systems, especially in urban slum settlements. Urban slum communities most times lack functional water treatment facilities. They rely on runoff and shallow wells for water supply. The inference engine can help identify contamination risks and inform necessary interventions. By advancing water quality monitoring and prediction through the developed soft sensors, the public health and living conditions in these settlements can be significantly improved.

Footnotes

Acknowledgements

The authors would like to appreciate the funding for this research from the Mastercard Foundation through the African Engineering and Technology Network (AFRETEC) which is managed by Carnegie Mellon University Africa under the Inclusive Digital Transformation Research Grant given to University of Lagos, Akoka, Nigeria. The views expressed in this document are solely those of authors and do not necessarily reflect those of the Carnegie Mellon University Africa or the Mastercard Foundation.

Abbreviations

BOD Biochemical Oxygen Demand

COD Chemical Oxygen Demand

D. O. Dissolved Oxygen

MSE Mean Squared Error

ML Machine Learning

NH₃-N Ammonia Nitrogen

R² R-Squared

SDG Sustainable Development Goal

SVM Support Vector Machine

SS suspended solids

WQI Water Quality Index

XGB Extreme Gradient Boosting

Author Contributions

Study conception and design: T.A., F.O., and M.O; data collection: M.A. and U.A.; analysis and interpretation of results: T.A., M.A., F.O., and K.O.; draft manuscript preparation: M.A., T.A., F.O., A.A., and U.A. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Micheal A Ogundero

Theophilus A Fashanu

Foluso O Agunbiade

Kehinde Orolu

Ahmed A Yinusa

Usman A Daudu

Muhammed O H Amuda

References

Aziz

N. A.

Abdulrazzaq

Mansur

M. N.

(2020). GIS-based watershed morphometric analysis using dem data in Diyala river, IRAQ. The Iraqi Geological Journal, 53(1C), 36–49. https://doi.org/10.46717/igj.53.1c.3rx-2020.04.03

Bernabé-Crespo

M. B.

Loáiciga

(2024). Managing potable water in Southeastern Spain, Los Angeles, and Sydney: Transcontinental Approaches to Overcome Water Scarcity. Water Resources Management, 38(4), 1299–1313. https://doi.org/10.1007/s11269-023-03721-8

Elubid

B. A.

Huang

Ahmed

E. H.

Zhao

Elhag

K. M.

Abbass

Babiker

M. M.

(2019). Geospatial distributions of groundwater quality in Gedaref state using geographic information system (GIS) and Drinking Water Quality Index (DWQI). International Journal of Environmental Research and Public Health, 16(5), Article 731.

Fortuna

Graziani

Rizzo

Xibilia

M. G.

(2006). Soft sensors for monitoring and control of industrial processes. Springer. https://doi.org/10.1007/978-1-84628-480-9

Geetha

(2023). Water Quality Dataset for water quality prediction tool [Dataset]. HydroShare. https://doi.org/10.4211/hs.4ab43e1b507b496b9b42749701daed5c

Ighalo

J. O.

Adeniyi

A. G.

(2020). A comprehensive review of water quality monitoring and assessment in Nigeria. Chemosphere, 260, Article 127569. https://doi.org/10.1016/j.chemosphere.2020.127569

Joseph

V. R.

(2022). Optimal ratio for data splitting. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15, 531–538. https://doi.org/10.1002/sam.11583

Lee

Perera

Glickman

Taing

(2020). Water-related disasters and their health impacts: A global review. Progress in Disaster Science, 8, Article 100123. https://doi.org/10.1016/j.pdisas.2020.100123

Lima

R. P.

Mauricio Villanueva

J. M.

Gomes

H. P.

Flores

T. K.

(2022). Development of a soft sensor for flow estimation in water supply systems using artificial neural networks. Sensors, 22(8), Article 3084. https://doi.org/10.3390/s22083084

10.

Shekhar

N. R.

Biswas

Sahu

A. K.

(2020). Determination of physicochemical parameters and levels of heavy metals in food waste water with environmental effects. Bioinorganic Chemistry and Applications, 2020(1), Article 8886093. https://doi.org/10.1155/2020/8886093

11.

Maryati

Firman

Humaira

A. N. S.

(2022). A sustainability assessment of decentralized water supply systems in Bandung City, Indonesia. Utilities Policy, 76, Article 101373. https://doi.org/10.1016/j.jup.2022.101373

12.

Miller

Whitehill

Deere

(2005). A national approach to risk assessment for drinking water catchments in Australia. Water Science & Technology: Water Supply, 5(2), 123–134. https://doi.org/10.2166/ws.2005.0029

13.

Omer

N. H.

(2019). Water quality parameters. In: Kevin Summers

(Ed), Water quality - science, assessments and policy (pp. 1–15). IntechOpen. https://doi.org/10.5772/intechopen.89657

14.

Rahmad Ramadhan

Yolanda

A. M

. (2024). A comparative study of Z-score and min-max normalization for rainfall classification in Pekanbaru. Journal of Data Science, 2024(4), 1–8. https://doi.org/10.61453/jods.v2024no04

15.

Rani

J. P.

Nivasini

Rubavathi

C. Y.

Jona

(2023). Machine learning based real time water quality monitoring system [Conference session]. International conference on adaptive and intelligent systems (pp. 1366–1370). https://doi.org/10.1109/icais56108.2023.10073761

16.

Setiawan

Budiman

Sayuti

(2020). Design an automatic fresh water distribution system with microcontroller for community sanitation needs. Journal of Physics: Conference Series, 1469(1), Article 012108. https://doi.org/10.1088/1742-6596/1469/1/012108

17.

Sharma

Kumar

Derrick

M. D.

Singh

S. K.

(2019). Appraisal of river water quality using open-access earth observation data set: A study of river Ganga at Allahabad (India). Sustainable Water Resources Management, 5, 755–765. https://doi.org/10.1007/s40899-018-0251-7

18.

Suwadi

N. A.

Derbali

Sani

N. S.

Lam

M. C.

Arshad

Khan

Kim

K. I.

(2022). An optimized approach for predicting water quality features based on machine learning. Wireless Communications and Mobile Computing, 2022(1), Article 3397972. https://doi.org/10.1155/2022/3397972

19.

Uddin

M. G.

Nash

Olbert

A. I.

(2021). A review of water quality index models and their use for assessing surface water quality. Ecological Indicators, 122, Article 107218. https://doi.org/10.1016/j.ecolind.2020.107218

20.

Usman Kaoje

Abdul Rahman

M. Z.

Idris

N. H.

Razak

K. A.

Wan Mohd Rani

W. N.

Tam

T. H.

Mohd Salleh

M. R

. (2021). Physical flood vulnerability assessment using geospatial indicator-based approach and participatory analytical hierarchy process: A case Study in Kota Bharu, Malaysia. Water, 13(13), Article 1786. https://doi.org/10.3390/w13131786

21.

WHO & UNICEF. (2019). Progress on drinking water, sanitation and hygiene: 2000–2017: Special focus on inequalities (p. 140). https://data.unicef.org/resources/progress-drinking-water-sanitation-hygiene-2019/

22.

Yaro

A. S.

Malý

Prazák

(2023). Outlier detection in time-series receive signal strength observation using Z-score method with Sn Scale estimator for indoor localization. Applied Sciences, 13(6), Article 3900. https://doi.org/10.3390/app13063900