Abstract
Transformers have been widely applied to vision tasks, particularly visual recognition and object detection. Detection transformers enable end-to-end object detection, and their self-attention modules yield efficient, high-performing detection models. However, the transformer decoder neither initializes its query content properly nor injects task-specific prior knowledge that could strengthen the model's inductive bias. This paper applies an encoder-decoder transformer to object detection in dense fog. A High-Resolution Network (HRNet) serves as the backbone of the architecture to extract deep feature representations. The proposed method is validated and compared against other detection techniques on the Foggy Cityscapes dataset in terms of average precision (AP), number of parameters, and frames per second (FPS). The results indicate that the proposed technique improves detection accuracy in dense foggy conditions.
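To make the described pipeline concrete, below is a minimal DETR-style sketch of the architecture the abstract outlines: a backbone extracts a deep feature map, and an encoder-decoder transformer turns learned object queries into class and box predictions. This is an illustrative assumption, not the authors' implementation: the HRNet backbone is stood in for by a small convolutional stack, positional encodings are omitted for brevity, and all layer sizes, the class count, and the `DetectionTransformerSketch` name are hypothetical.

```python
import torch
import torch.nn as nn


class DetectionTransformerSketch(nn.Module):
    def __init__(self, num_classes=8, d_model=256, num_queries=100):
        super().__init__()
        # Placeholder backbone: a stand-in for HRNet's deep feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=4, padding=1), nn.ReLU(),
        )
        # Encoder-decoder transformer operating on flattened image features.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        # Learned object queries initialize the decoder's query content.
        self.queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

    def forward(self, images):
        feats = self.backbone(images)               # (B, C, H', W')
        b, c, h, w = feats.shape
        src = feats.flatten(2).transpose(1, 2)      # (B, H'*W', C) token sequence
        tgt = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(src, tgt)             # (B, num_queries, C)
        return self.class_head(hs), self.box_head(hs).sigmoid()


# Usage: one forward pass on a dummy "foggy" image.
model = DetectionTransformerSketch()
logits, boxes = model(torch.randn(1, 3, 512, 512))
print(logits.shape, boxes.shape)  # torch.Size([1, 100, 9]) torch.Size([1, 100, 4])
```

Each of the `num_queries` decoder outputs is scored against the classes (plus a "no object" label), which is what lets a detection transformer predict a set of boxes end to end without anchors or non-maximum suppression.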