Underwater images suffer from severe degradations, including color distortions, reduced visibility, and loss of structural details caused by wavelength-dependent attenuation and scattering. Existing enhancement methods focus primarily on spatial-domain processing, neglecting the frequency domain’s potential to capture global color distributions and long-range dependencies. To address these limitations, we propose FUSION, a dual-domain deep learning framework that jointly leverages spatial and frequency domain information. FUSION processes each RGB channel independently through multi-scale convolutional kernels and adaptive attention mechanisms in the spatial domain, while simultaneously extracting global structural information via FFT-based frequency attention. A Frequency Guided Fusion module integrates complementary features from both domains, followed by inter-channel fusion and adaptive channel recalibration to ensure balanced color distributions. Extensive experiments on benchmark datasets (UIEB, EUVP, SUIM-E) demonstrate that FUSION achieves state-of-the-art performance, consistently outperforming existing methods in reconstruction fidelity (highest PSNR of 23.717 dB and SSIM of 0.883 on UIEB), perceptual quality (lowest LPIPS of 0.112 on UIEB), and visual enhancement metrics (best UIQM of 3.414 on UIEB), while requiring significantly fewer parameters (0.28 M) and lower computational complexity, making it well suited for real-time underwater imaging applications.
The FUSION framework enhances underwater images by processing them in both spatial and frequency domains. Starting with an input image of size H×W×3, we split it into its three color channels (D_R, D_G, D_B). In the spatial path, each channel undergoes multi-scale convolutions (3×3 for red, 5×5 for green, and 7×7 for blue) followed by channel and spatial attention (CBAM) and a residual connection to preserve fine details. Concurrently, in the frequency path, each channel is transformed via a 2D FFT to extract its magnitude, which is refined through two 1×1 convolutions and a Frequency Attention mechanism that produces weighted magnitude maps. The refined magnitude is recombined with original phase information and passed through an IFFT to yield frequency-domain features.
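The following is a minimal PyTorch sketch of the per-channel frequency path described above, assuming a sigmoid-gated attention map and illustrative channel widths (the spatial path with multi-scale convolutions and CBAM is omitted for brevity); it is not the reference implementation.

import torch
import torch.nn as nn


class FrequencyBranch(nn.Module):
    """Sketch of the frequency path: FFT -> magnitude refinement -> phase recombination -> IFFT."""

    def __init__(self, feat: int = 16):  # hidden width is an assumption
        super().__init__()
        # Two 1x1 convolutions that refine the magnitude spectrum.
        self.refine = nn.Sequential(
            nn.Conv2d(1, feat, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, kernel_size=1),
        )
        # Frequency attention: a weight map applied to the refined magnitude (assumed sigmoid gate).
        self.attn = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a single color channel, shape (B, 1, H, W)
        spec = torch.fft.fft2(x, norm="ortho")            # 2D FFT
        mag, phase = spec.abs(), spec.angle()             # split into magnitude and phase
        mag = self.refine(mag) * self.attn(mag)           # refined, attention-weighted magnitude
        spec = torch.polar(mag, phase)                    # recombine with the original phase
        return torch.fft.ifft2(spec, norm="ortho").real   # back to spatial-domain features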
For each channel, spatial and frequency features are concatenated and fused via a small convolutional block (Frequency Guided Fusion), followed by adding back the original input channel in a residual fashion. The three fused channels are then concatenated to form a joint representation, which is projected to a higher-dimensional space and further combined with aggregated frequency features through a learned transform. A global CBAM module refines this fused representation, and a decoder reconstructs a preliminary enhanced RGB image. Finally, an Adaptive Channel Calibration step computes per-channel scaling factors from global image statistics and applies them to balance color distributions, yielding the final enhanced output.
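Below is a minimal sketch of the Adaptive Channel Calibration step, assuming the global image statistics are channel-wise means and that a small gating network produces the per-channel scaling factors; the layer sizes and sigmoid gating are illustrative assumptions rather than the exact design.

import torch
import torch.nn as nn


class AdaptiveChannelCalibration(nn.Module):
    """Sketch: compute per-channel scaling factors from global statistics and rebalance colors."""

    def __init__(self, channels: int = 3, hidden: int = 8):  # hidden size is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: preliminary enhanced image, shape (B, 3, H, W)
        stats = x.mean(dim=(2, 3))                              # global per-channel statistics
        scale = self.gate(stats).unsqueeze(-1).unsqueeze(-1)    # per-channel scaling factors
        return x * scale                                        # balanced color distribution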
Visual comparison of FUSION against other state-of-the-art methods on the UIEB test set; note the restored natural colors and enhanced details.
FUSION effectively recovers contrast and corrects color casts compared to competing approaches on EUVP.
Visual examples showing the impact of removing key modules: frequency attention, frequency branch, fusion, channel calibration, local attention, and global attention.
FUSION achieves a favorable balance between parameter count (0.28 M) and computational cost (36.73 GFLOPs), outperforming larger models in both quality and efficiency.
@InProceedings{FUSION,
  author    = {Jaskaran Singh Walia and Shravan Venkatraman and Pavithra LK},
  title     = {FUSION: Frequency-guided Underwater Spatial Image recOnstructioN},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025}
}