Effective techniques for scaling audio encoder pretraining
2025
This work presents advances in audio pretraining objectives designed to produce semantically rich embeddings capable of addressing a wide range of audio-related tasks. Despite significant progress in the field, current methods often emphasize full fine-tuning in downstream applications, which can obscure the true potential of pretrained audio encoders. In this study, we introduce an audio encoder that achieves state-of-the-art (SOTA) performance under both fine-tuning and linear probing, using a carefully curated set of pragmatic techniques. Building on previous research, we incorporate masked prediction and introduce SpecAug within a patch-level curriculum masking strategy that progressively increases training difficulty, along with a mask-aware position bias. To comprehensively assess the encoder’s capabilities, we examine the impact of scaling both dataset size and model capacity, conducting linear-probing evaluations with a frozen encoder as well as full fine-tuning. Our model outperforms recent SOTA methods across a variety of downstream tasks. Additionally, we explore tokenizing the resulting audio embeddings for use as discrete inputs, deepening our understanding of the model’s capabilities.
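The abstract mentions a patch-level curriculum masking strategy whose difficulty grows over training. Below is a minimal sketch of one way such a schedule could work, assuming a spectrogram split into fixed-size patches and a mask ratio that ramps up linearly with training progress; the function names (`mask_ratio_schedule`, `sample_patch_mask`) and the specific start/end ratios are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of patch-level curriculum masking (assumed details,
# not the paper's exact recipe): the fraction of masked patches increases
# over training so the masked-prediction task gets progressively harder.
import torch


def mask_ratio_schedule(step: int, total_steps: int,
                        start: float = 0.3, end: float = 0.7) -> float:
    """Linearly ramp the fraction of masked patches as training progresses."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress


def sample_patch_mask(batch: int, num_patches: int, ratio: float) -> torch.Tensor:
    """Return a boolean mask of shape (batch, num_patches); True = masked patch."""
    num_masked = int(num_patches * ratio)
    scores = torch.rand(batch, num_patches)           # random score per patch
    idx = scores.argsort(dim=1)[:, :num_masked]       # lowest-scoring patches get masked
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask


# Example: at step 10k of 100k, roughly 34% of the 512 patches are hidden.
ratio = mask_ratio_schedule(step=10_000, total_steps=100_000)
mask = sample_patch_mask(batch=8, num_patches=512, ratio=ratio)
```

The resulting boolean mask would then select which patch embeddings are replaced before masked prediction; how the mask interacts with SpecAug and the mask-aware position bias is specific to the paper and not reproduced here.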