A piece of research conducted by Dr. Corrigan-Kavanagh in the context of our AI for Sound project at the University of Surrey (supported by EPSRC grant EP/T019751/1) involves conducting Virtual World Cafes to investigate the interests of citizens and stakeholders, and to involve them directly in our AI for sound design process. In order to analyze the outcomes, we had to transcribe a few hours of audio conversations. And since the process involves personal data, it required the development and approval of an Ethics protocol, with constraints on how the data can be stored and used.
Ideally, we would like to fully automate the transcription process, which, thanks to the latest advances in Deep Learning, is considered pretty much a solved task (particularly for English speech), and there are plenty of services that provide this functionality out of the box.
But in practice, such systems aren't perfect: we observed that certain background sounds, microphones or reverberation could lower transcription quality, and that particularly strong, non-standard accents would severely hinder the outcome (none taken!). In the best case, the system works 95% of the time and minor corrections here and there are still required.
Furthermore, most existing transcription services (be it by machines or by humans) require the data to leave the lab. This was contemplated in our Ethics protocol, but only for certain services, which didn't quite work out for us. In general, depending on the nature of the data and other factors, Ethics may be a constraint.
All of these issues could be addressed by a centralized tool where many different transcription models can be used, without any data leaving the system. There is a myriad of existing tools that allow for transcription, but we found none that satisfied all of our requirements.
The closest match we found is ELAN, an amazing Java application released under the GPLv3 license that runs on all popular platforms and features a Python backend. But ELAN is primarily an annotation tool, and it wasn't 100% clear that integrating Deep Learning models would be straightforward, or that the resulting workflow would be efficient enough.
For this reason, we decided to rapidly develop an in-house solution, and release it to the public under an OSS license for added value. Behold the stt_gui!
In the next sections I’ll briefly discuss its main features and components. Note that the tool is a prototype and not intended as an enterprise solution. This said, users, feedback and contributions are most welcome!
The tool is centered around the task of rapidly converting audio into text. For that, it exposes a single window with 3 elements:
Text editor: The central element is a regular text editor that allows users to type in text, as well as the typical operations like loading/saving contents into the filesystem, and cut/copy/paste/undo/redo.
Audio pane: The right element allows users to add various audio formats into a list, either by loading them from the filesystem or recording them. The element includes an audio player to reproduce the currently selected audio file. Added files can also be removed.
Profile pane: Profiles are the plugins of the system: Runnable, parametrizable operations that can be run multiple times. The left element allows users to add any currently implemented profiles into a list, enter the desired parameters and run them. A single type of profile can be added multiple times to the pane. This can be useful when e.g. different parametrizations are being used/compared.
Users can add arbitrary functionality to the application through the plugins. Currently, the tool has 4 plugins installed: one speech-to-text profile powered by Silero, two that convert selected text to upper- and lowercase, and one example profile that developers can use as a reference to create new profiles.
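For reference, this is roughly how a Silero STT model can be loaded and run through torch.hub (a sketch based on the public silero-models examples, not the app's actual plugin code; speech.wav is a placeholder file name):

import torch

# Load an English Silero speech-to-text model (downloaded on first use)
device = torch.device("cpu")
model, decoder, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-models", model="silero_stt",
    language="en", device=device)
read_batch, split_into_batches, read_audio, prepare_model_input = utils

# Transcribe a single audio file
batches = split_into_batches(["speech.wav"], batch_size=1)
model_input = prepare_model_input(read_batch(batches[0]), device=device)
for example in model(model_input):
    print(decoder(example.cpu()))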
See image below: the Silero profile was added and used to transcribe one dictated recording and one audio file from disk:
Users can find most commands on the Edit and Run menus, as well as usage instructions and a list of the active keybindings on the Help menu.
In order to improve efficiency, most commands can be triggered through specific keybindings (e.g. Ctrl+Enter to run the currently selected profile). This way, users are able to efficiently navigate an audio recording (e.g. to check the correctness of the transcription) and dictate or type any needed corrections, with very few clicks and almost no mouse use in the process. The GIF below showcases this process:
While working with Qt from Python is an extremely efficient process, the library was natively created for C++, and this shows in some of the issues we encountered. This section covers those issues, as well as some interesting software-related aspects of the development. Feel free to visit the GitHub link given above for the full code and documentation!
This is one of the corners where Qt for Python exposed some issues: Qt's audio recorder exposes an on_audio_probed(audio_buffer) method that is called periodically while recording. The problem in Python is that the audio_buffer object appears empty, and the actual buffer can't be retrieved through the API.
The workaround I found, documented in this StackOverflow post, is to extract the audio_buffer's pointer location from its printed name. Then, given the array length (known) and assumptions on the numeric datatype (can be known), read the corresponding array directly from memory and interpret it as audio data. If all goes well, this leads to a smooth and accurate recording when the chunks are concatenated.
As you can see, this is a complete hack that no one should expect to work in a stable manner (one of the ways it could break is if Python's garbage collector moves the object around), but so far it seems decently stable, and it is the only solution out there. A bug has been filed with Qt, and hopefully the issue has been fixed by the time you are reading this 🚀
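For illustration, here is a minimal sketch of the idea (not the app's exact code; it assumes the buffer address can be parsed from the printed representation of audio_buffer.constData(), and that the samples are 16-bit integers):

import ctypes
import re

import numpy as np

def buffer_to_array(audio_buffer, dtype=np.int16):
    # Parse the memory address out of the printed pointer representation
    ptr_repr = repr(audio_buffer.constData())
    address = int(re.search(r"0x[0-9a-fA-F]+", ptr_repr).group(0), 16)
    # Read the raw bytes directly from memory and reinterpret them as samples
    raw_bytes = ctypes.string_at(address, audio_buffer.byteCount())
    return np.frombuffer(raw_bytes, dtype=dtype)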
One of the most useful features for an application is the undo/redo functionality. Qt’s brilliant undo framework provides plenty of support for that, allowing developers to implement completely arbitrary undo/redo actions, even in a composite manner (e.g. multiple actions that can be undone in a single step).
In most applications, that is as simple as adding the following functionality to the main window:
# Central undo stack for the whole application
self.undo_stack = QtWidgets.QUndoStack(self)
# Optional widget that displays the contents of the stack
self.undo_view = QtWidgets.QUndoView(self.undo_stack)
self.undo_view.setWindowTitle("Undo View")
# Closing the undo view should not quit the application
self.undo_view.setAttribute(QtCore.Qt.WA_QuitOnClose, False)
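As a small illustration of that composite behaviour (the two command classes below are hypothetical placeholders, not part of the app), several commands can be grouped into a macro that is undone and redone as a single step:

# Hypothetical commands grouped into one undoable macro
self.undo_stack.beginMacro("Replace transcription")
self.undo_stack.push(ClearTextCommand(self.txt_editor))
self.undo_stack.push(AppendTextCommand(self.txt_editor, new_text))
self.undo_stack.endMacro()  # both commands now undo/redo as one step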
Furthermore, many high-level Qt objects already come with a built-in undo stack and their own set of built-in commands. Unfortunately, this is another corner where Qt for Python can be suboptimal by not making them accessible: if we want to bypass or merge the built-in functionality, things can become hairy.
That was indeed the case for the app's text editor, which is an extension of QtWidgets.QPlainTextEdit. We had to overcome two main issues: keeping the editor's built-in undo stack consistent with the main application stack, and bypassing some of the editor's built-in key handling.
For the first issue, the solution below is, in my opinion, the most elegant: whenever the text editor logs an "undoable command" to its own stack, we log an analogous wrapper command to the main stack. The wrapper command looks like this:
class UndoWrapperCommand(QtWidgets.QUndoCommand):
    """
    Wrapper pushed onto the main undo stack whenever the text editor
    registers an undoable change on its own stack. Undoing/redoing it
    simply forwards the action to the editor's document.
    """
    COMMAND_NAME = "Text Changes"

    def __init__(self, txt_editor, parent=None):
        super().__init__(self.COMMAND_NAME, parent)
        self.txt_editor = txt_editor

    def undo(self):
        self.txt_editor.document().undo()

    def redo(self):
        self.txt_editor.document().redo()
With this, the user only has to navigate the main stack. Whenever a wrapper command is encountered, it will execute the corresponding undo action on the text editor’s stack, and this will ensure that both stacks are always kept consistent.
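As a rough sketch of how such a wrapper can be hooked up (the connection below is an assumption about where this happens in the app; main_undo_stack is an illustrative attribute name):

# In the text editor's constructor: whenever the document registers a new
# undoable change, push a wrapper command onto the main application stack.
# Note: QUndoStack.push() calls redo() right away, which is harmless here
# because a freshly edited document has nothing to redo.
self.document().undoCommandAdded.connect(
    lambda: self.main_undo_stack.push(UndoWrapperCommand(self)))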
The second problem was solved by adding an event filter to the text editor. For that, the following line is added to the editor's constructor:
self.installEventFilter(self)
And then, we implement the actual filter as a method:
def eventFilter(self, obj, evt):
    """
    Intercepts events sent to the text editor. Key presses that fulfill
    the catch condition are re-emitted as a signal instead of being
    handled by the editor's built-in machinery.
    """
    catch = False
    # documentation for keys and modifiers:
    # https://doc.qt.io/qtforpython-5/PySide2/QtCore/Qt.html
    if evt.type() == QtCore.QEvent.KeyPress:
        catch = self.catch_keyevent_condition(evt)
    #
    if catch:
        # block event but send it as signal
        self.eventCatched.emit(evt)
        return True
    else:
        # otherwise act normally
        return super().eventFilter(obj, evt)
As we can see, whenever our desired catch_keyevent_condition is fulfilled, the event will be intercepted and sent to the main application as a signal, effectively bypassing the editor's built-in functionality.
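For illustration, such a condition could look like the following hypothetical sketch (the actual app defines its own set of rules); here we would intercept Ctrl+Z and Ctrl+Y so that undo/redo requests reach the main stack instead of the editor's built-in one:

def catch_keyevent_condition(self, keyevt):
    """
    Hypothetical example: catch Ctrl+Z and Ctrl+Y key presses.
    """
    ctrl_pressed = bool(keyevt.modifiers() & QtCore.Qt.ControlModifier)
    return ctrl_pressed and keyevt.key() in (QtCore.Qt.Key_Z,
                                             QtCore.Qt.Key_Y)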
Both the audio and profile panes feature a space where users can arbitrarily add and remove elements, from the GUI itself. This is achieved by extending the ScrollbarList class:
class ScrollbarList(QtWidgets.QWidget):
    """
    Scrollable widget holding a dynamic list of elements that can be added
    and removed from the GUI itself.
    Override the ``setup_adder_layout(lyt)`` method with your preferred GUI
    for adding elements. Then override ``add_element(*args, **kwargs)`` to
    add elements accordingly to ``self.list_layout``.
    """

    def __init__(self, parent, horizontal=False):
        """
        :param horizontal: If true, the list grows horizontally.
        """
        super().__init__(parent)
        self.horizontal = horizontal
        # Structure: main_layout holds the adder_layout and a scroll area,
        # whose inner widget holds inner_layout -> list_layout
        self.main_layout = (QtWidgets.QHBoxLayout(self) if horizontal
                            else QtWidgets.QVBoxLayout(self))
        self.inner_layout = (QtWidgets.QHBoxLayout() if horizontal
                             else QtWidgets.QVBoxLayout())
        # The adder layout usually holds an "add" button, but it can also
        # hold more complex adding mechanisms.
        self.adder_layout = (QtWidgets.QHBoxLayout() if not horizontal
                             else QtWidgets.QVBoxLayout())
        self.setup_adder_layout(self.adder_layout)
        # the list layout will be dynamically modified by the user
        self.list_layout = (QtWidgets.QHBoxLayout() if horizontal
                            else QtWidgets.QVBoxLayout())
        # inner layout holds the list layout (the adder layout is added to
        # the main layout below)
        # self.inner_layout.addLayout(self.adder_layout)
        self.inner_layout.addLayout(self.list_layout)
        # create and populate scroll area
        scroll_bars = (False, True) if horizontal else (True, False)
        scroller, scroller_widget = get_scroll_area(*scroll_bars)
        scroller_widget.setLayout(self.inner_layout)
        # finally add scroll area to main layout
        self.main_layout.addLayout(self.adder_layout)
        self.main_layout.addWidget(scroller)

    def setup_adder_layout(self, lyt):
        """
        Default implementation: a single button that adds one element.
        """
        add_elt_b = QtWidgets.QPushButton("Add Element")
        lyt.addWidget(add_elt_b)
        add_elt_b.pressed.connect(self.add_element)

    def add_element(self):
        """
        Default implementation: adds a button that removes itself when
        pressed.
        """
        # Create and add a button
        dummy_button = QtWidgets.QPushButton("Delete Me!")
        self.list_layout.addWidget(dummy_button)

        # When pressed, the button deletes itself from the list
        def delete_me():
            b_idx = self.list_layout.indexOf(dummy_button)
            b = self.list_layout.takeAt(b_idx)
            recursive_delete_qt(b)
        dummy_button.pressed.connect(delete_me)
As the docstring says, this is usually achieved by overriding two methods (a minimal subclass sketch follows below):
setup_adder_layout: Populates the "adder layout" with whatever means we want for adding elements to the list. In the default implementation, a button is added that calls add_element when clicked.
add_element: Populates the list. In the default implementation, it adds a button that removes itself when clicked. This allows users to remove existing elements in any arbitrary order.
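For illustration, a minimal hypothetical subclass could look like this (the actual audio and profile panes are more elaborate):

class LabelList(ScrollbarList):
    """
    Hypothetical example: an "Add Label" button that appends simple labels.
    """
    def setup_adder_layout(self, lyt):
        add_b = QtWidgets.QPushButton("Add Label")
        lyt.addWidget(add_b)
        add_b.pressed.connect(self.add_element)

    def add_element(self):
        self.list_layout.addWidget(QtWidgets.QLabel("New entry"))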
One priority of the design was that the tool should be easily extendable with further transcription solutions. To avoid major refactoring and redesign every time this happens, a frequent solution is to adopt a plugin mechanism that imposes a minimal set of constraints, so that new plugins are guaranteed to fit into the existing app without hindering their flexibility. This way, plugin developers don't have to worry about most of the app code, only about fulfilling that small set of constraints.
For this reason, we adopted that solution in this app through the Profile plugin system. This system has a few added difficulties: since a single type of Profile can be added and run multiple times, the system must be populated with factories that create runnable objects, and not the runnable objects themselves, which adds an extra layer of abstraction.

In its simplest form, a plugin must contain the following minimal functionality:
A Worker, extending ProfileWorker and implementing a run(params) -> result method that will be run on a separate thread. For long jobs, developers must periodically check the self._abort flag and return None if it is true (the flag gets activated if the user decides to abort the job while it is running). Developers may also use the worker.update_progress(int) method to update the progress bar.
A Profile, specifying the name and signature of the exposed parameters, and implementing two methods:
run: This method must instantiate the worker with the desired worker.run parameters (usually after fetching them from the app and from profile.form.get_state()), and then call self.run_worker(worker) to run the job. If the job is expected to run on the interactive dialog (see image below), then this method must instead instantiate a ProfileDialog and call self.run_worker(worker, dialog).
on_accept(result): This method implements the changes that a successful job should impose on the app (e.g. pasting some transcribed text into the text editor). It is automatically called by the system.

That's all a developer needs to implement! See the following example to illustrate the relative simplicity of plugin code:
class ExampleWorker(ProfileWorker):
    """
    Toy worker that optionally "quacks" a number of times, updating the
    progress bar and checking for user aborts along the way.
    """

    def run(self, quack, quack_count, minval, maxval):
        """
        Returns a string with the result of the example computation.
        """
        result = "Result of example computation"
        nvals = maxval - minval
        if quack:
            for i in range(quack_count):
                if self._abort:
                    return
                self.update_progress(int(i / quack_count * nvals) + minval)
                result += f"\n{i+1} Quack!"
                time.sleep(0.1)
        return result


class ExampleProfile(Profile):
    """
    Reference profile that developers can use as a template for new plugins.
    """
    NAME = "Example Profile"
    SIGNATURE = [("Quack?", BoolCheckBox, False),
                 ("Quack count", PositiveIntSpinBox, 10)]

    def run(self):
        """
        Gathers the form parameters and runs the worker through a dialog.
        """
        # We will run jobs through the GUI dialog to showcase it
        dialog = ProfileDialog(
            self, body_text="Running example...",
            with_progress_bar=True, default_accept_button=False)
        # Extract state from dialog and form
        minval, maxval = dialog.PROGRESS_BAR_RANGE
        form_dict = self.form.get_state()
        quack = form_dict["Quack?"]
        quack_count = form_dict["Quack count"]
        # Create and run worker!
        worker = ExampleWorker(quack, quack_count, minval, maxval)
        self.run_worker(worker, dialog=dialog)

    def on_accept(self, result):
        """
        Called by the system once the job finishes and the user accepts.
        """
        dialog = InfoDialog(
            "User accepted!", "(press OK to continue)",
            accept_button_name="OK", print_msg=False)
        dialog.accept_b.setDefault(True)
        dialog.exec_()
In order to facilitate this, the plugin architecture consists of 4 elements, implemented at stt_gui.stt_app.profiles.__init__.py:
ProfileList: This is the ScrollbarList for the profiles, as explained before.
Profile: This parent class must be extended by all other profiles. It lays out the graphical elements and implements the run_worker method, which takes care of running the job on a separate thread (and optionally through a ProfileDialog).
ProfileWorker: This parent class implements functionality to handle any possible real-time interaction with the ProfileDialog, so developers only have to worry about implementing run.
ProfileDialog: As mentioned, the profile.run_worker method can directly run workers and send the results to profile.on_accept. But an alternative is to run the job on the GUI: this way, users are informed in real time about the parameters and progress of the job, and are allowed to abort it at any time and confirm results before calling on_accept. All this functionality is handled by ProfileDialog (see image below).

We would very much welcome testers and developers. As a motivation, here is a very nice piece of conceptual art, I Am Sitting in a Room, by Alvin Lucier:
In it, a short speech is recorded and played back into the room over and over, until eventually the stationary response of the feedback system is all that is left. When developing new transcription plugins, I think this is an interesting way of benchmarking their runtime and their robustness against feedback artifacts.
Another interesting alternative is this version by YouTuber ontologist, who applied the same idea by downloading and re-uploading the same video to YouTube again and again. Here, the artifacts are produced by the lossy compression codec:
👽 📶 ₴ɆɆ ɎØɄ ₳ⱤØɄ₦Đ!