Spectrogram Inversion: Cheap, Good and Real-Time!

PUBLISHED ON AUG 28, 2025 — CATEGORIES: publications

If you work with real-time audio and/or with spectrograms, you may be interested in our latest publication at Interspeech 2025, resulting from my 2024 internship at Meta Reality Labs in Cambridge (UK).


Imagine you want to perform a low-latency generative audio task, such as speech enhancement or live translation, on a tiny computer:

Smart glasses were our motivation for a low-latency and low-compute audio pipeline.

To limit computation, many “cheap” pipelines benefit from working directly with spectrograms. The problem is that, while spectrograms are very nice to work with, phases are not! Luckily, the Gradient Theorem articulates a really neat connection between spectrograms and their phases:

(left) Speech STFT log-magnitudes. (center) Corresponding phases. (right) Frequency-derivative of the phases. Note the similarity to the magnitudes!

This allows us to obtain information about the phases directly from the magnitudes! And as it turns out, it leads to very high-quality and low-latency results when combined with deep learning.
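To make the three quantities in the figure concrete, here is a minimal pure-Python sketch of how they can be computed from a signal. It is not code from the paper: the naive DFT, the Hann window, the default `win_len` and `hop`, and the helper names `hann`, `stft`, `wrap` and `panels` are all assumptions made for illustration. The frequency-derivative is approximated as the wrapped difference between neighbouring frequency bins.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, win_len=64, hop=16):
    """Naive STFT: one list of win_len // 2 + 1 complex bins per frame."""
    window = hann(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(win_len)]
        bins = [
            sum(x * cmath.exp(-2j * math.pi * k * i / win_len)
                for i, x in enumerate(chunk))
            for k in range(win_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def wrap(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def panels(frames, eps=1e-12):
    """Return (log-magnitudes, phases, wrapped frequency-differences of phases).

    eps avoids log(0) on silent bins (an illustrative choice).
    """
    log_mag = [[math.log(abs(b) + eps) for b in f] for f in frames]
    phase = [[cmath.phase(b) for b in f] for f in frames]
    dphase = [[wrap(f[k + 1] - f[k]) for k in range(len(f) - 1)]
              for f in phase]
    return log_mag, phase, dphase
```

A quick sanity check: for a single frame of 16 samples containing one impulse at sample 3, every entry of `dphase` equals -2π·3/16. In other words, the frequency-derivative of the phase tells you where in the frame the energy sits, which is the kind of structure that raw phases hide.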

In our work, we made the whole pipeline much faster and smaller without compromising quality, by proposing a tiny, causal CNN and leveraging a neat numerical trick that solves a least-squares system in linear time and memory (down from cubic).
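This post does not spell out the numerical trick, so the following is only a generic illustration of how structure can turn a cubic-cost least-squares solve into a linear-cost one, and not necessarily the method from the paper. The toy problem is an assumption: recover a sequence `x` from direct observations `y` and from estimates `d` of its neighbouring differences. Its normal equations are tridiagonal, so the Thomas algorithm solves them in O(n) time and memory. The names `solve_tridiagonal`, `fit_from_differences` and `weight` are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system, O(n) time and memory.

    Row i reads: lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix
    should be diagonally dominant (true for the problem below).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    g = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    g[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        g[i] = (rhs[i] - lower[i] * g[i - 1]) / denom
    x = [0.0] * n
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = g[i] - c[i] * x[i + 1]
    return x


def fit_from_differences(y, d, weight=1.0):
    """Minimise weight * sum_i (x[i] - y[i])**2 + sum_i (x[i+1] - x[i] - d[i])**2.

    y has n >= 2 entries and d has n - 1 entries. Setting the gradient to
    zero gives a tridiagonal system, which is solved in linear time.
    """
    n = len(y)
    diag = [weight + (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = [-1.0] * n
    rhs = []
    for i in range(n):
        left = d[i - 1] if i > 0 else 0.0
        right = d[i] if i < n - 1 else 0.0
        rhs.append(weight * y[i] + left - right)
    return solve_tridiagonal(off, diag, off, rhs)
```

A dense solver would treat the same system as a full n-by-n matrix and pay cubic time for it, whereas here each unknown is touched a constant number of times. When `y` and `d` are consistent with each other, the fit returns `y` unchanged.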

Our proposed causal CNN, comprising only 8k trainable parameters.
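For readers unfamiliar with the term, "causal" means that each output only depends on current and past inputs, which is what makes frame-by-frame, real-time processing possible. The sketch below shows a single causal 1D convolution in plain Python. It is an illustration of the concept only: the function name `causal_conv1d` and the kernel values are assumptions, and it says nothing about the actual layers of our network.

```python
def causal_conv1d(x, kernel):
    """Causal convolution: y[t] = sum_j kernel[j] * x[t - j].

    Inputs before time 0 are treated as zero, so y[t] never looks at
    x[t + 1] or later.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            if t - j >= 0:
                acc += w * x[t - j]
        out.append(acc)
    return out
```

Changing a future input leaves all earlier outputs untouched, so each new frame can be processed as soon as it arrives, without waiting for lookahead.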

The result: Efficient, real-time and high-quality speech spectrogram inversion! Check out our paper and presentation linked below for more details, and of course, a big shout-out 🗣 to my collaborators and supervisors from Meta!



@inproceedings{fernandez25_interspeech,
  title     = {{Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem}},
  author    = {Andres Fernandez and Juan Azcarreta Ortiz and Çağdaş Bilen and Jesus {Monge Alvarez}},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {3449--3453},
  doi       = {10.21437/Interspeech.2025-439},
  issn      = {2958-1796},
}
TAGS: algebra, audio, convex optimization, deep learning, live electronics, machine learning, signal processing