Abstract
This research explores the classification of environmental sounds on the UrbanSound8K dataset, introducing a feature fusion technique that combines spectrograms, scalograms, and Mel-Frequency Cepstral Coefficients (MFCCs), a combination not previously explored in the literature. While spectrograms, scalograms, and MFCCs individually achieved classification accuracies of 92%, 89%, and 89%, respectively, fusing these features raised accuracy to 97%, a gain of 5–8 percentage points. The proposed method exploits the complementary nature of these features, capturing the temporal, spectral, and perceptual characteristics of audio signals to yield more robust and comprehensive representations. The fused features are processed by an enhanced AlexNet architecture customized for multi-dimensional inputs. The resulting model demonstrated stronger noise robustness, faster convergence, and better generalization than models trained on any individual feature. These findings pave the way for future applications, including real-time environmental sound classification on IoT devices and mobile platforms, as well as broader domains such as wildlife monitoring and industrial noise detection.
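To make the fusion idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes librosa for the log-Mel spectrogram and MFCCs, PyWavelets for a CWT scalogram, OpenCV for resizing, a hypothetical clip path, and an arbitrary 224x224 target shape so the three feature maps can be stacked as channels for an AlexNet-style network.

```python
# Illustrative feature-fusion sketch (assumed libraries and shapes, not the paper's exact pipeline).
import numpy as np
import librosa
import pywt
import cv2

def fused_features(path, target=(224, 224), sr=22050):
    # UrbanSound8K clips are at most 4 s long
    y, sr = librosa.load(path, sr=sr, duration=4.0)

    # Channel 1: log-Mel spectrogram (time-frequency energy)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    spec = librosa.power_to_db(mel, ref=np.max)

    # Channel 2: CWT scalogram (time-scale view); signal is decimated here only to keep the sketch fast
    coeffs, _ = pywt.cwt(y[::8], scales=np.arange(1, 65), wavelet="morl")
    scal = np.abs(coeffs)

    # Channel 3: MFCCs (perceptual/cepstral summary)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Resize each map to a common shape, scale to [0, 1], and stack as image channels
    chans = [cv2.resize(f.astype(np.float32), target) for f in (spec, scal, mfcc)]
    chans = [(c - c.min()) / (c.max() - c.min() + 1e-8) for c in chans]
    return np.stack(chans, axis=-1)  # shape (224, 224, 3), suitable for an AlexNet-style CNN
```

A fused tensor of this form can be fed to a standard image-classification CNN, which is one simple way to realize the multi-dimensional input described above.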