Deep Learning for Real-Time Sound Event Detection on CPU 🎶 ⚠️ 🌊 💬 🐶

PUBLISHED ON NOV 4, 2021 — CATEGORIES: utilities

Prior Work

The work presented here is based on the amazing research done in 2019 by my colleagues at CVSSP Surrey:


PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley

The results of that research were not only highly interesting, but also highly practical. As a consequence, a real-time demonstration was released under a liberal license:

The demo shows good detection quality while running in real time on commodity hardware (i.e. relatively inexpensive and non-specialized systems like a single CPU core in a 5-year-old laptop). The slide below, from the Future Technologies for Home Wellbeing event where we presented the latest version live using a microphone and a laptop, illustrates the main workflow and performed task, as well as some potential applications (PDF available here):

This post summarizes the software architecture and its main components. It has been built with reproducibility, stability, flexibility and maintainability in mind, and is hosted in the following repository:

I’m the current maintainer and author of the last version. Apart from that, all credit goes to my colleagues for their excellent previous work!

Tkinter GUI

Since the deep learning backend was PyTorch, and Python runs on many different platforms, the most natural thing for a desktop demonstration was to stay within Python. In this context, Tkinter is a natural choice for the GUI, since it has long been an integral part of Python, and comes preinstalled. While less powerful than e.g. Qt, it is powerful enough, simpler, and is subject to a very liberal license (BSD-like).

Furthermore, we make use of the tkinter.ttk sub-library, which allows separating widget and layout configuration from aesthetic choices like colors and fonts, using a style-based mechanism reminiscent of CSS classes.

The result was the DemoFrontend class, a tk.Canvas extension with the following capabilities:

  • Compartmentalized division of the window layout into top, middle and bottom areas
  • Configurable number of rows in the middle area
  • Change of appearance in the Start/Stop switch button when pressed, and dispatching into respective start() and stop() actions
  • Parametrized colors and font sizes for flexible themes
  • Layout and images fully responsive to window resizing

This way, not only is the frontend class easily customizable (e.g. adding/removing areas, changing styles…), but its connection to the backend also has a clear interface: developers just have to override the start(), stop() and exit_demo() methods with the desired backend functionality. If needed, the widgets can be easily updated by accessing them through the self.top_widgets, self.mid_widgets and self.bottom_widgets attributes.

The result looks as follows:

Ring Buffer

The first thing that a real-time audio backend needs is a place to continually store the latest sound signals. Since we are only interested in the last few seconds, we would like a data structure that doesn’t need more memory than that, and that is always able to deliver the latest contents sorted in their original order (i.e. it is a FIFO data structure). A ring buffer achieves exactly this:

Schematic of 12-element, clockwise ring buffer. Prior to the update, the “newest” element was at position 12, and the “oldest” at position 1. After the update, the 12 elements are sorted by position as follows (from oldest to newest): [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]

The Python implementation using numpy’s advanced indexing is quite simple and performant:

import numpy as np


class RingBuffer():
    """
    A 1D ring buffer using numpy arrays, designed to efficiently handle
    real-time audio buffering. Modified from
    """

    def __init__(self, length, dtype=np.float32):
        """
        :param int length: Number of samples in this buffer
        """
        self._length = length
        self._buf = np.zeros(length, dtype=dtype)
        self._bufrange = np.arange(length)
        self._idx = 0  # the oldest location

    def update(self, arr):
        """
        Adds 1D array to ring buffer. Note that ``len(arr)`` must be
        smaller than ``self._length``, otherwise it will error.
        """
        len_arr = len(arr)
        assert len_arr < self._length, "RingBuffer too small for this update!"
        # positions to overwrite, wrapping around the end of the buffer
        idxs = (self._idx + self._bufrange[:len_arr]) % self._length
        self._buf[idxs] = arr
        self._idx = idxs[-1] + 1  # this will be the new oldest location

    def read(self):
        """
        Returns a copy of the whole ring buffer, unwrapped in a way that the
        first element is the oldest, and the last is the newest.
        """
        idxs = (self._idx + self._bufrange) % self._length  # read from oldest
        result = self._buf[idxs]  # advanced indexing returns a copy
        return result

We can see that the data structure keeps track of all the indexing, and users are left with the simplest interface: rb.update(arr) to add an arr of length smaller than the ring buffer, and rb.read() to retrieve the full buffer in FIFO order.
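To make the index arithmetic concrete, here is a small sketch that reproduces the 12-element schematic using plain Python lists instead of numpy. The helper names write_positions and read_positions are made up for this illustration and are not part of the repository:

```python
LENGTH = 12  # same size as the ring buffer in the schematic


def write_positions(oldest, n, length=LENGTH):
    """0-based slots that get overwritten when n new samples arrive."""
    return [(oldest + i) % length for i in range(n)]


def read_positions(oldest, length=LENGTH):
    """0-based slots visited when reading from oldest to newest."""
    return [(oldest + i) % length for i in range(length)]


oldest = 0  # slot 1 in the 1-based numbering of the schematic
written = write_positions(oldest, 4)  # an update with 4 new samples
oldest = (written[-1] + 1) % LENGTH   # the slot after the last write is now the oldest

# Convert back to 1-based positions to compare with the schematic
order = [p + 1 for p in read_positions(oldest)]
print(order)  # [5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4]
```

The modulo operation is what makes the buffer "circular": positions past the end wrap around to the beginning, so no data ever has to be shifted in memory.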

Multiple threads

Usually, we parallelize a program as a way of speeding it up, by distributing the computations. This case is different:

  • In order to handle user input in real time (e.g. button clicked, window closed), the GUI must be constantly checking for updates. Tkinter does this with the so-called mainloop, which must be started whenever the app is run.

  • Although the deep neural network used for detection is relatively fast, it still takes a substantial fraction of a second to perform a single detection. If we perform this detection inside of the mainloop, every other computation in the app will be interrupted during that time. This heavily affects app responsiveness, and even worse, we may miss audio data from the microphone.

  • Similarly, the audio server will be constantly streaming small packets of sound from the microphone into the buffer. We also wouldn’t like this to interfere with the mainloop, or vice versa.

The solution for all these problems is multithreading: We run the mainloop in the main thread, and the recording loop and detection loop on separate subthreads, as illustrated by the image below:

We see that, thanks to the nature of the ring buffer, we don’t need to worry about synchronizing the subthreads: the audio loop will write to the ring buffer at the desired rate, and the detector will read the ring buffer at its own pace (as fast as the thread permits), getting the full expected length with temporal consistency in every read.

General warning about consistency in multithreading
OK, not really: one thing that may happen is a race condition. Since the ring buffer update takes several steps, the detector could read from the buffer while it is being updated, which leads to a packet being displaced from the end to the beginning. This can be addressed by making the update() operation atomic, e.g. using semaphores (thanks again Edsger!). In our case, we found that this doesn’t affect demo performance, probably because it rarely happens, and when it does, it affects only ~2.5% of the buffer at one of its endpoints, so we left it as-is. Still, it should be addressed in more critical applications.
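For those more critical applications, one possible way of protecting the buffer is sketched below. Note that this is not what the demo does: the LockedRingBuffer class is a hypothetical, list-based simplification, and it uses a threading.Lock (essentially a binary semaphore) so that a read can never observe a half-finished update:

```python
import threading


class LockedRingBuffer:
    """Hypothetical ring buffer whose update() and read() are mutually exclusive."""

    def __init__(self, length):
        self._length = length
        self._buf = [0] * length
        self._idx = 0  # the oldest location
        self._lock = threading.Lock()

    def update(self, values):
        assert len(values) <= self._length, "Buffer too small for this update!"
        with self._lock:  # readers have to wait until the whole packet is written
            for v in values:
                self._buf[self._idx] = v
                self._idx = (self._idx + 1) % self._length

    def read(self):
        with self._lock:  # writers have to wait until the snapshot is taken
            return self._buf[self._idx:] + self._buf[:self._idx]


rb = LockedRingBuffer(4)
rb.update([1, 2, 3])
rb.update([4, 5])
print(rb.read())  # [2, 3, 4, 5]
```

The price is that the recording thread may briefly block while the detector takes its snapshot, so the critical sections should be kept as short as possible.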

One last thing: when closing the Tkinter app while a slow subthread is running (e.g. if the user presses Exit before Stop in the middle of a detection), we have to wait for the thread to end. Otherwise, this could lead to segmentation faults or even worse things, depending on the nature of the subthread.

Even when making use of the recommended thread.running = False flag to tell the thread to stop, usual threading mechanisms like thread.join and Event.wait will freeze the Tkinter application. When working with Tkinter, we have to explicitly tell the mainloop to wait. The solution is to use the Tk.after method (check our implementation for details).
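The general idea can be sketched as follows. This is a minimal illustration and not the repository code: the Worker class, the close_when_finished function and the 50 ms polling interval are assumptions made for the example. Instead of blocking on join(), the mainloop re-schedules a cheap check until the thread has ended:

```python
import threading
import time


class Worker(threading.Thread):
    """Hypothetical subthread that loops until its ``running`` flag is cleared."""

    def __init__(self):
        super().__init__(daemon=True)
        self.running = True

    def run(self):
        while self.running:
            time.sleep(0.01)  # stand-in for one slow detection step


def close_when_finished(root, thread, poll_ms=50):
    """Destroy ``root`` once ``thread`` has ended, without blocking the mainloop."""
    if thread.is_alive():
        # Ask Tk to call us again later; events keep being handled meanwhile
        root.after(poll_ms, close_when_finished, root, thread, poll_ms)
    else:
        root.destroy()


def run_demo():
    """Opens a window with an Exit button (requires a display)."""
    import tkinter as tk
    root = tk.Tk()
    worker = Worker()
    worker.start()

    def on_exit():
        worker.running = False  # tell the thread to stop...
        close_when_finished(root, worker)  # ...and wait for it politely

    tk.Button(root, text="Exit", command=on_exit).pack()
    root.protocol("WM_DELETE_WINDOW", on_exit)
    root.mainloop()
```

Calling run_demo() opens a small window; pressing Exit (or closing the window) clears the flag and closes the app only after the worker has actually finished.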

Interfacing and Configuration

To provide flexibility while keeping maintainability, we follow a pattern for interfacing that is unfortunately unpopular in Python machine learning research code, based on three ingredients:

  • Explicit interfaces: Every parameter list receives an exhaustive list of the elements needed by the implementation. That is, we don’t pass composite data structures like conf objects to be dissected, and we don’t make use of globals inside the body.

  • Segregated interfaces: Every parameter list contains precisely what is needed to run and nothing more.

  • Single, programmatic access point for a strongly typed configuration: Instead of implementing a Python file with default arguments, a serialization method to import arguments from the filesystem and a CLI system to input arguments programmatically, we make use of the popular OmegaConf library to take care of everything in one place.

This is what the configuration currently looks like:

import os
from dataclasses import dataclass
from typing import Optional


@dataclass  # OmegaConf.structured expects a dataclass (or attrs class)
class ConfDef:
    """
    Check ``DemoApp`` docstring for details on the parameters. Defaults should
    work reasonably well out of the box.
    """
    SUBSET_LABELS_PATH: Optional[str] = None
    MODEL_PATH: str = os.path.join(
        "models", "Cnn9_GMP_64x64_300000_iterations_mAP=0.37.pth")
    SAMPLERATE: int = 32000
    AUDIO_CHUNK_LENGTH: int = 1024
    RINGBUFFER_LENGTH: int = int(32000 * 2)
    MODEL_WINSIZE: int = 1024
    STFT_HOPSIZE: int = 512
    STFT_WINDOW: str = "hann"
    N_MELS: int = 64
    MEL_FMIN: int = 50
    MEL_FMAX: int = 14000
    # frontend
    TOP_K: int = 6
    TITLE_FONTSIZE: int = 28
    TABLE_FONTSIZE: int = 22

This is the one-stop shop to define the exposed parameters, their types and their default values. Then, in the main routine, this can be merged with e.g. the CLI as follows (the same can be done with a YAML file loaded via OmegaConf.load):

from omegaconf import OmegaConf

CONF = OmegaConf.structured(ConfDef())
cli_conf = OmegaConf.from_cli()
CONF = OmegaConf.merge(CONF, cli_conf)

Then, users can change any parameter by adding e.g. TOP_K=12, and the number will automatically be interpreted as an integer (or throw an error otherwise). Nice and clean! The only thing that I miss from argparse is the ability to self-document the parameters with the -h flag.

Why does this pattern help with flexibility and maintainability?
  • It is flexible, because at any point, any parameter present at the top-level interface can be made available to the user. This is thanks to the explicit and segregated interfaces.
  • It is maintainable, because the implementations can be changed at will without worrying about side effects: developers just have to make sure interfaces are satisfied, and that the single access point is updated once done. And since the configuration is strongly typed, there is no need to worry about implicit typing issues either.
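As a toy illustration of explicit and segregated interfaces, consider the sketch below. The frames_per_buffer helper and the Conf stand-in class are hypothetical and not part of the repository; the sketch also assumes that MODEL_WINSIZE is the STFT window size and that no padding is applied. The helper receives exactly the three numbers it needs, and only the top level knows about the configuration object:

```python
def frames_per_buffer(buffer_length, winsize, hopsize):
    """
    Hypothetical helper: number of full STFT frames that fit in a buffer,
    assuming no padding. It receives exactly the values it needs.
    """
    if buffer_length < winsize:
        return 0
    return 1 + (buffer_length - winsize) // hopsize


class Conf:
    """Stand-in for the configuration object, using defaults from this post."""
    RINGBUFFER_LENGTH = 64000
    MODEL_WINSIZE = 1024
    STFT_HOPSIZE = 512
    TOP_K = 6  # irrelevant for the helper, so it is never passed to it


# The top level unpacks the configuration and passes the fields on explicitly
n_frames = frames_per_buffer(
    Conf.RINGBUFFER_LENGTH, Conf.MODEL_WINSIZE, Conf.STFT_HOPSIZE)
print(n_frames)  # 124
```

Since the helper doesn’t know that a configuration object exists, it can be tested or reused in isolation, and exposing a new parameter to the user only requires touching the top level.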

All of this is in line with some of the SOLID principles in software development. While in many cases they are too strict (we ignore some of them since they are impractical for small prototypes like this one), they are definitely useful and interesting to consider!

Thank you

As already mentioned, this is free software and we welcome any users, testers and developers to give it a try! If this sounds good 👂, feel free to contact us through the repo issues to that end.

TAGS: audio, gui, live electronics, machine learning, pytorch, signal processing, tkinter, video