Speech super-resolution is the task of estimating the missing frequency content of a speech signal from its existing band-limited content. Such frequency loss is common and can result from low sampling rates, low-quality microphones, or various transmission factors; it is an increasingly frequent problem because, although bandwidth for high-quality communication is now widely available, many end devices still use older standards and protocols. Although a number of solutions to this problem exist, most are not amenable to real-world use because of computational or algorithmic constraints. In this paper, we present a compact, efficient, minimal-latency solution to speech super-resolution that is suitable for real-time streaming data. We propose a novel causal architecture that can be easily deployed in real-world settings, along with a novel adversarial training process and an initialization procedure that speed up convergence and improve output quality. Objective and subjective evaluations show that our proposed model outperforms the latest solutions in this space despite being significantly smaller and faster.
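The band-limiting described above can be illustrated with a minimal sketch (not the paper's method): a 16 kHz signal containing a 2 kHz and a 6 kHz tone is low-pass filtered and decimated to 8 kHz, and the 6 kHz component, which lies above the new 4 kHz Nyquist limit, is lost. This lost high band is exactly what speech super-resolution must estimate.

```python
import numpy as np

# Illustrative example only: emulate an 8 kHz recording of a 16 kHz
# signal and show that content above the new Nyquist limit vanishes.
fs_high, fs_low = 16000, 8000
t = np.arange(fs_high) / fs_high  # one second of signal
x = np.sin(2 * np.pi * 2000 * t) + np.sin(2 * np.pi * 6000 * t)

# Band-limit by zeroing FFT bins above the new Nyquist (4 kHz),
# then decimate by 2 to obtain the low-rate signal.
X = np.fft.rfft(x)
cutoff_bin = int(len(x) * (fs_low / 2) / fs_high)
X[cutoff_bin:] = 0.0
x_lp = np.fft.irfft(X, n=len(x))
x_low = x_lp[::2]

def band_energy(sig, fs, f_lo):
    # Total spectral power at or above frequency f_lo (Hz).
    spec = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1 / fs)
    return spec[freqs >= f_lo].sum()

# The 6 kHz tone is present in the original but gone after band-limiting.
print(band_energy(x, fs_high, 4000) > 1.0)      # True
print(band_energy(x_low, fs_low, 4000) < 1e-6)  # True
```

A super-resolution model receives `x_low` and must reconstruct the missing 4–8 kHz band.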