Abstract
Groundwater prediction in data-scarce and environmentally sensitive regions presents a persistent challenge due to limited
observational data, spatial heterogeneity, and the nonlinear nature of hydrogeological processes. In this study, we propose
HydroPredictor, a hybrid machine learning framework that integrates CatBoost's efficient handling of categorical features with the nonlinear
feature learning capacity of a regularized Multi-Layer Perceptron (MLP). The model was trained on a georeferenced dataset of 315
samples from the Feija Basin in southeastern Morocco, incorporating ten environmental predictors such as elevation, rainfall, soil
permeability, NDVI, and topographic wetness index. The pipeline includes Optuna-based hyperparameter optimization and 5-fold
cross-validation to assess robustness and generalization. HydroPredictor achieved a test accuracy of 89.23%, with an F1-score of
0.8937 and Area Under the Curve (AUC) values exceeding 0.90 across all groundwater potential classes. Statistical validation using the
Friedman and Wilcoxon signed-rank tests (p < 0.05) confirmed that it significantly outperformed conventional models, including
Random Forest, Support Vector Machine (SVM), and standalone MLP. Furthermore, HydroPredictor demonstrated superior
generalization compared to prior models in the literature (e.g., RF-SSA: AUC = 0.840; GBDT: AUC = 0.88), while maintaining minimal
overfitting (∆Accuracy = 0.35%). By combining interpretable tree-based embeddings with deep neural representations, HydroPredictor
provides a robust and scalable solution for groundwater classification in data-limited settings, offering a reproducible and operationally
relevant tool for sustainable groundwater resource management under climatic and environmental uncertainty.
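
As an illustration of the statistical validation step mentioned above, the following minimal sketch computes the Friedman chi-square statistic from per-fold scores of several models. It is not the authors' code. The function name friedman_statistic and the per-fold accuracies are hypothetical placeholders, not results from this study. The sketch omits the tie correction and relies on the chi-square approximation, which is only rough for five folds.

```python
# Minimal sketch of a Friedman test across models evaluated on the same folds.
# All scores below are HYPOTHETICAL placeholders, not values from the study.

def friedman_statistic(scores):
    """Return (chi2, rank_sums) for a folds-by-models table of scores.

    scores: list of rows, one row per fold, one column per model.
    Ranks are assigned within each fold (1 = lowest score); tied scores
    share the average rank. No tie correction is applied to chi2.
    """
    n = len(scores)        # number of folds (blocks)
    k = len(scores[0])     # number of models (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # Extend j over a run of tied scores.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                ranks[order[t]] = avg_rank
            i = j + 1
        for m in range(k):
            rank_sums[m] += ranks[m]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, rank_sums


# Hypothetical per-fold accuracies; columns: hybrid, Random Forest, SVM, MLP.
fold_scores = [
    [0.89, 0.85, 0.82, 0.84],
    [0.90, 0.84, 0.83, 0.86],
    [0.88, 0.86, 0.81, 0.83],
    [0.91, 0.85, 0.84, 0.83],
    [0.89, 0.83, 0.82, 0.85],
]

chi2, rank_sums = friedman_statistic(fold_scores)

# Critical value of the chi-square distribution with k - 1 = 3 degrees of
# freedom at alpha = 0.05.
CRITICAL_VALUE_DF3 = 7.815
significant = chi2 > CRITICAL_VALUE_DF3

print(f"Friedman chi2 = {chi2:.2f}, rank sums = {rank_sums}, significant = {significant}")
```

When the Friedman statistic exceeds the critical value, pairwise Wilcoxon signed-rank tests between the proposed model and each baseline are the usual follow-up. In practice, library routines such as scipy.stats.friedmanchisquare and scipy.stats.wilcoxon would normally replace this hand-written version.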