HTFNet: Hybrid Time-Frequency UNet for binaural audio synthesis

Wenjie Zhang, Changjun He, Yinghan Cao, Shiyun Xu, Mingjiang Wang*
Harbin Institute of Technology, Harbin Institute of Technology (Shenzhen)
EURASIP Journal on Audio, Speech, and Music Processing

*Corresponding Author

Abstract

Binaural audio is an essential technique for achieving highly realistic spatial localization. Currently, direct binaural recording and HRTF-based synthesis are the two mainstream approaches for producing binaural audio. However, the former often incurs high costs, while the latter typically lacks constraints on the orientation of the source, which we believe leaves room for performance improvement. To address this, we propose a hybrid time-frequency domain UNet framework for binaural audio synthesis, namely HTFNet. Specifically, we first convert the mono audio into an initial binaural signal, which is then processed separately in the time and frequency domains to extract both local and global features. Gated Conv Transformer Blocks (GCTBs) are used to capture the global context, while a Pos-Ori Attention Module (POAM) is introduced to integrate the spatial information of the sound source and capture its movement. During the reconstruction phase, Dilated Residual Convolution Blocks (DRCBs) are incorporated to capture features in both the time and frequency domains. Extensive experiments demonstrate that the proposed method outperforms other state-of-the-art methods in phase estimation (Phase-\( \mathscr{L}_{2} \): 0.763, IPD-\( \mathscr{L}_{2} \): 1.072, Wave-\( \mathscr{L}_{2} \): 0.138, Amp-\( \mathscr{L}_{2} \): 0.036).
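The four scores quoted above follow the metric family commonly used in binaural synthesis work: an L2 error on the raw waveform (Wave-\( \mathscr{L}_{2} \)), on the STFT magnitude (Amp-\( \mathscr{L}_{2} \)), on the STFT phase (Phase-\( \mathscr{L}_{2} \)), and on the interaural phase difference between left and right channels (IPD-\( \mathscr{L}_{2} \)). The sketch below shows one plausible NumPy implementation of these definitions; the exact window, hop size, and averaging used in the paper are assumptions and may differ.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive Hann-windowed STFT (illustrative, not optimized)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def wave_l2(est, ref):
    # Mean squared error between binaural waveforms, shape (2, samples).
    return np.mean((est - ref) ** 2)

def amp_phase_l2(est, ref):
    # L2 errors on STFT magnitude and wrapped phase, in the style of the
    # Amp-L2 / Phase-L2 metrics above (assumed definitions).
    E = np.stack([stft(c) for c in est])
    R = np.stack([stft(c) for c in ref])
    amp = np.mean((np.abs(E) - np.abs(R)) ** 2)
    dphi = np.angle(E) - np.angle(R)
    dphi = np.angle(np.exp(1j * dphi))  # wrap difference to [-pi, pi]
    return amp, np.mean(dphi ** 2)

def ipd_l2(est, ref):
    # Interaural phase difference error: compare the left-right STFT
    # phase difference of the estimate against that of the reference.
    El, Er = stft(est[0]), stft(est[1])
    Rl, Rr = stft(ref[0]), stft(ref[1])
    d = np.angle(El * np.conj(Er)) - np.angle(Rl * np.conj(Rr))
    d = np.angle(np.exp(1j * d))
    return np.mean(d ** 2)
```

All four functions return 0 when estimate and reference coincide, and grow with waveform, magnitude, phase, or interaural mismatch respectively.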

Samples from Binaural Speech Datasets

[Audio demo grid: Sample1 through Sample4, each with the mono input, the binaural ground truth, and the outputs of WaveNet, WarpNet, NFS, BinauralGrad, and HTFNet.]

⚠️ Please use headphones when listening to these audio samples.