The work presented here is based on the amazing research done in 2019 by my colleagues at CVSSP Surrey:
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley
The results of that research were not only highly interesting, but also highly practical. As a consequence, a real-time demonstration was released under a liberal license:
The demo shows good detection quality while running in real time on commodity hardware (i.e. relatively inexpensive and non-specialized systems like a single CPU core in a 5-year-old laptop). The slide below, from the Future Technologies for Home Wellbeing event where we presented the latest version live using a microphone and a laptop, illustrates the main workflow and performed task, as well as some potential applications (PDF available here):
This post summarizes the software architecture and its main components. It has been built with reproducibility, stability, flexibility and maintainability in mind, and is hosted in the following repository:
I’m the current maintainer and author of the last version. Apart from that, all credit goes to my colleagues for their excellent previous work!
Since the deep learning backend was PyTorch, and Python runs on many different platforms, the most natural thing for a desktop demonstration was to stay within Python. In this context, Tkinter is a natural choice for the GUI, since it has long been an integral part of Python and comes preinstalled. While less powerful than e.g. Qt, it is powerful enough for our purposes, simpler, and subject to a very liberal (BSD-like) license.
Furthermore, we make use of the tkinter.ttk sub-library, which allows us to separate widget and layout configuration from aesthetic choices like colors and fonts, using a style-based mechanism reminiscent of CSS classes.
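As a minimal sketch of this mechanism (the style name and values below are made up for the example): a named style is configured once and then attached to any number of widgets, keeping layout code free of aesthetic details.

import tkinter as tk
from tkinter import ttk

root = tk.Tk()

# define a named style once (similar in spirit to a CSS class)...
style = ttk.Style()
style.configure("Demo.TLabel", font=("Helvetica", 22), foreground="#333333")

# ...and attach it to any widget, without touching the layout code
label = ttk.Label(root, text="Sound Event Detection", style="Demo.TLabel")
label.pack()

root.mainloop()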
The result was the DemoFrontend class, a tk.Canvas extension with the following capabilities: it hosts the top, mid and bottom widget areas, it toggles the Start/Stop switch button when pressed, dispatching into the respective start() and stop() actions, and it handles leaving the app through exit_demo(). This way, not only is the frontend class easily customizable (e.g. adding/removing areas, changing styles…), its connection to the backend also has a clear interface: developers just have to override the start(), stop() and exit_demo() methods with the desired backend functionality, and, if needed, the widgets can be easily updated by accessing them through the self.top_widgets, self.mid_widgets and self.bottom_widgets attributes.
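As a rough illustration (the import of DemoFrontend and its constructor arguments are omitted here and may differ from the actual repository code), a backend hooks in roughly like this:

class MyBackendDemo(DemoFrontend):
    """Hypothetical subclass wiring a backend into the frontend."""

    def start(self):
        # called when the Start/Stop switch is turned on:
        # e.g. spawn the audio and detection threads here
        print("starting audio and detection threads...")

    def stop(self):
        # called when the Start/Stop switch is turned off:
        # e.g. signal the threads to finish
        print("stopping threads...")

    def exit_demo(self):
        # called on exit: make sure no subthread is still running
        # before the window gets destroyed
        self.stop()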
The result looks as follows:
The first thing that a real-time audio backend needs is a place to continually store the latest sound signals. Since we are only interested in the last few seconds, we would like a data structure that doesn't need more memory than that, and that is always able to deliver the latest contents sorted in their original order (i.e. it is a FIFO data structure). A ring buffer achieves exactly this:
[Figure: a 12-slot ring buffer; reading from the oldest slot onwards yields the slots in the order 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]
The Python implementation using numpy's advanced indexing is quite simple and performant:
import numpy as np


class RingBuffer():
    """
    A 1D ring buffer using numpy arrays, designed to efficiently handle
    real-time audio buffering. Modified from
    https://scimusing.wordpress.com/2013/10/25/ring-buffers-in-pythonnumpy/
    """
    def __init__(self, length, dtype=np.float32):
        """
        :param int length: Number of samples in this buffer
        """
        self._length = length
        self._buf = np.zeros(length, dtype=dtype)
        self._bufrange = np.arange(length)
        self._idx = 0  # the oldest location

    def update(self, arr):
        """
        Adds a 1D array to the ring buffer. Note that ``len(arr)`` must be
        smaller than ``self._length``, otherwise it will error.
        """
        len_arr = len(arr)
        assert len_arr < self._length, "RingBuffer too small for this update!"
        idxs = (self._idx + self._bufrange[:len_arr]) % self._length
        self._buf[idxs] = arr
        self._idx = idxs[-1] + 1  # this will be the new oldest location

    def read(self):
        """
        Returns a copy of the whole ring buffer, unwrapped in a way that the
        first element is the oldest, and the last is the newest.
        """
        idxs = (self._idx + self._bufrange) % self._length  # read from oldest
        result = self._buf[idxs]
        return result
We can see that the data structure keeps track of all the indexing, and users are left with the simplest interface: rb.update(arr) to add an arr of length smaller than the ring buffer, and rb.read() to retrieve the full buffer in FIFO order.
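For example, usage looks like this (assuming the RingBuffer class above is in scope; the numbers mirror the demo configuration):

import numpy as np

# 2-second buffer at 32 kHz
rb = RingBuffer(32000 * 2)

# the audio callback pushes small chunks like this one
chunk = np.random.uniform(-1, 1, 1024).astype(np.float32)
rb.update(chunk)

# the detector can then grab the full, FIFO-ordered signal at any time
signal = rb.read()
print(signal.shape)  # (64000,)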
Usually, we parallelize a program as a way of speeding it up, by distributing the computations. This case is different:
In order to handle user input in real time (e.g. button clicked, window closed), the GUI must be constantly checking for updates. Tkinter does this with the so-called mainloop, which must be started whenever the app is run.
Although the deep neural network used for detection is relatively fast, it still takes a substantial fraction of a second to perform a single detection. If we perform this detection inside the mainloop, every other computation in the app will be interrupted during that time. This heavily affects app responsiveness and, even worse, we may miss audio data from the microphone.
Similarly, the audio server will be constantly streaming small packets of sound from the microphone into the buffer. We also wouldn't like this to interfere with the mainloop, or vice versa.
The solution for all these problems is multithreading: we run the mainloop in the main thread, and the recording loop and detection loop on separate subthreads, as illustrated by the image below:
We see that, thanks to the nature of the ring buffer, we don't need to worry about synchronizing the subthreads: the audio loop will write to the ring buffer at the desired rate, and the detector will read the ring buffer at its own pace (as fast as the thread permits), getting the full expected length with temporal consistency in every read.
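As a rough sketch (the thread bodies and names below are simplified stand-ins, not the actual demo code), the setup could look like this:

import threading
import time

import numpy as np


def audio_loop(rb, running, samplerate=32000, chunk_size=1024):
    """Simplified stand-in for the recording loop: keeps pushing small
    chunks into the ring buffer at the microphone's pace."""
    while running.is_set():
        chunk = np.zeros(chunk_size, dtype=np.float32)  # microphone data here
        rb.update(chunk)
        time.sleep(chunk_size / samplerate)


def detection_loop(rb, running):
    """Simplified stand-in for the detection loop: reads the full buffer
    at its own pace and would run the model on it."""
    while running.is_set():
        signal = rb.read()  # always the last 2 seconds, in FIFO order
        time.sleep(0.5)  # model inference would happen here instead


running = threading.Event()
running.set()
rb = RingBuffer(32000 * 2)
threading.Thread(target=audio_loop, args=(rb, running), daemon=True).start()
threading.Thread(target=detection_loop, args=(rb, running), daemon=True).start()
# ...while the Tkinter mainloop runs in the main thread:
# root.mainloop()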
Strictly speaking, the detector could read the buffer while the audio loop is in the middle of writing to it; the canonical way to prevent this is to make the update() operation atomic, e.g. using semaphores (thanks again, Edsger!). In our case, we found that this doesn't affect demo performance, probably because it rarely happens, and whenever it happens it affects only ~2.5% of the buffer on one of its endpoints, so we left it as-is. Still, it should be addressed in more critical applications.

One last thing: whenever closing the Tkinter app while a slow subthread is running (e.g. if the user presses Exit before Stop in the middle of a detection), we have to wait for the thread to end; otherwise this could lead to segmentation faults or even worse things, depending on the nature of the subthread.
Even when making use of the recommended thread.running = False flag to tell the thread to stop, usual threading mechanisms like thread.join and Event.wait will freeze the Tkinter application. When working with Tkinter, we have to explicitly tell the mainloop to wait. The solution is to use the Tk.after method (check our implementation for details).
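A minimal sketch of this pattern (not the actual demo code; the repository implementation differs in its details): instead of joining the thread, we poll it from within the mainloop via after, so the GUI stays responsive until the thread is done.

import threading
import time
import tkinter as tk


def poll_thread(root, thread, poll_ms=100):
    """Checks the worker thread from within the mainloop instead of calling
    thread.join(); destroys the window once the thread has finished."""
    if thread.is_alive():
        root.after(poll_ms, poll_thread, root, thread, poll_ms)
    else:
        root.destroy()


root = tk.Tk()
worker = threading.Thread(target=time.sleep, args=(3,))  # a "slow" subthread
worker.start()
poll_thread(root, worker)  # window closes ~3 seconds later, GUI never freezes
root.mainloop()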
To provide flexibility while keeping maintainability, we follow a pattern for interfacing that is unfortunately unpopular in Python machine learning research code, based on three ingredients:
Explicit interfaces: Every parameter list receives an exhaustive list of the elements needed by the implementation, i.e. we don't pass composite data structures like conf objects to be dissected, and we don't make use of globals inside the body.
Segregated interfaces: Every parameter list contains precisely what is needed to run and nothing more.
Single, programmatic access point for a strongly typed configuration: Instead of implementing a Python file with default arguments, a serialization method to import arguments from the filesystem and a CLI system to input arguments programmatically, we make use of the popular OmegaConf library to take care of everything in one place.
This is how the configuration currently looks:
import os
from dataclasses import dataclass
from typing import Optional

# AUDIOSET_LABELS_PATH is a constant defined elsewhere in the repository


@dataclass
class ConfDef:
    """
    Check ``DemoApp`` docstring for details on the parameters. Defaults should
    work reasonably well out of the box.
    """
    ALL_LABELS_PATH: str = AUDIOSET_LABELS_PATH
    SUBSET_LABELS_PATH: Optional[str] = None
    MODEL_PATH: str = os.path.join(
        "models", "Cnn9_GMP_64x64_300000_iterations_mAP=0.37.pth")
    #
    SAMPLERATE: int = 32000
    AUDIO_CHUNK_LENGTH: int = 1024
    RINGBUFFER_LENGTH: int = int(32000 * 2)
    #
    MODEL_WINSIZE: int = 1024
    STFT_HOPSIZE: int = 512
    STFT_WINDOW: str = "hann"
    N_MELS: int = 64
    MEL_FMIN: int = 50
    MEL_FMAX: int = 14000
    # frontend
    TOP_K: int = 6
    TITLE_FONTSIZE: int = 28
    TABLE_FONTSIZE: int = 22
This is the one-stop shop to define the exposed parameters, their types and their default values. Then, in the main routine, this can be merged with e.g. the CLI as follows (the same can be done with a YAML file via OmegaConf.load):
from omegaconf import OmegaConf

CONF = OmegaConf.structured(ConfDef())
cli_conf = OmegaConf.from_cli()
CONF = OmegaConf.merge(CONF, cli_conf)
Then, users can change any parameter by adding e.g. TOP_K=12 to the command line, and the number will automatically be interpreted as an integer (or an error will be thrown otherwise). Nice and clean! The only thing that I miss from argparse is the ability to self-document the parameters with the -h flag.
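For instance, the type checking can also be exercised programmatically (a small sketch, assuming the ConfDef above is in scope):

from omegaconf import OmegaConf
from omegaconf.errors import ValidationError

conf = OmegaConf.structured(ConfDef())
conf = OmegaConf.merge(conf, OmegaConf.from_dotlist(["TOP_K=12"]))
print(conf.TOP_K, type(conf.TOP_K))  # 12 <class 'int'>

try:
    # a non-integer value for an int field is rejected on merge
    OmegaConf.merge(conf, OmegaConf.from_dotlist(["TOP_K=not_a_number"]))
except ValidationError as e:
    print("rejected:", e)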
All of this is in line with some of the SOLID principles in software development. While in many cases they are too strict (we ignore some of them since they are impractical for small prototypes like this one), they are definitely useful and interesting to consider!
As already mentioned, this is free software, and we encourage any users, testers and developers to give it a try! If this sounds good 👂, feel free to contact us through the repo issues.