Wiener filters are a class of adaptive filters for signal processing. In contrast to fixed filters (e.g. a differentiator, which acts as a high-pass filter whose response is known a priori), the response of adaptive filters depends on the specific input and assumptions.
Proposed by MIT Professor Norbert Wiener in the 1940s, they can be used in an extremely broad variety of applications, like prediction, denoising and deconvolution. Here we’ll study the Wiener deconvolution, illustrated in the image below:
You may think that in the Deep Learning era™, linear filters from the 40s are heavily outdated and irrelevant. The truth is that a good understanding of Wiener deconvolution is very useful to understand state-of-the-art preprocessing techniques for some tasks. E.g. features based on Generalized Cross-Correlation, like GCC-PHAT and SALSA, are behind the (currently) best-performing deep models for audio source localization (both feature types share the same signal model and linear properties as the Wiener deconvolution). Understanding these features allows us to understand how neural networks can leverage them, and this can make modeling, training and finetuning substantially easier, if not better.
Wiener filters are optimal in the sense that they are the solution to a minimization problem: they seek to minimize the $\ell_2$ error between the original signal and the filter output. A popular variant of the Wiener deconvolution filter is a relaxation that assumes all noise present to be uncorrelated with the signal to be recovered. Most sources derive the result making use of expected values and complex calculus (see e.g. Wikipedia), as follows:
Interestingly, the original text by Wiener himself, circulated in 1942 and later published as a book called Extrapolation, Interpolation, and Smoothing of Stationary Time Series (EISSTS), follows a more algebraic approach (See chapter 3, Formulation of the General Filter Problem and onwards):
While inspiring, the EISSTS derivation can be a tough nut to crack. In this post we will revisit it with more compact/modern notation, leading to several advantages and insights:
Presenting a much shorter and simpler derivation of the objective using linear algebra only (no statistics or complex calculus involved)
Yielding an exact (i.e. zero-error) solution $W_o$ with all terms laid out explicitly in vectorized form
Expressing the popular zero-correlation filter $W_\perp$ in terms of $W_o$ plus an error term $\varepsilon_\perp$, which allows us to quantify and characterize the error of $W_\perp$.
We conclude with some experiments to illustrate the points discussed.
Our derivation relies on the following building blocks:
Complex arithmetic:
$$
\begin{eqnarray*}
u = &&a + ib \quad \in \mathbb{C}\\
v = &&c + id \quad \in \mathbb{C}\\
\overline{u} = &&a - ib \\
\mathfrak{Re}(u) = &&a \quad \quad \in \mathbb{R}\\
\mathfrak{Im}(u) = &&b \quad \quad \in \mathbb{R}\\
\vert u \vert^2 = &&u\overline{u} = a^2 + b^2\\
u\overline{v} = &&\overline{v}u = \overline{v\overline{u}} = (ac + bd) + i(bc - ad)\\
\mathfrak{Re}(uv) = &&\mathfrak{Re}(u)\mathfrak{Re}(v) - \mathfrak{Im}(u)\mathfrak{Im}(v)\\
u\overline{v} + \overline{u}v = &&2 \mathfrak{Re}(u\overline{v}) = 2\mathfrak{Re}(v\overline{u})\\
\end{eqnarray*}
$$
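A few of these identities can be quickly verified numerically, e.g. with NumPy (a minimal illustrative snippet, not part of the post's script):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = rng.normal(size=4)
u, v = a + 1j * b, c + 1j * d

assert np.isclose(u * np.conj(u), a**2 + b**2)       # |u|^2 = a^2 + b^2
assert np.isclose((u * v).real, a * c - b * d)       # Re(uv) = Re(u)Re(v) - Im(u)Im(v)
assert np.isclose(u * np.conj(v) + np.conj(u) * v,
                  2 * (u * np.conj(v)).real)         # u conj(v) + conj(u) v = 2 Re(u conj(v))
```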
Discrete Fourier Transform:
We treat discrete signals as vectors, where $t$ represents discrete domain units (e.g. seconds or meters), and $f$ discrete frequency units. The symbol $\odot$ represents element-wise multiplication.
$$
\begin{eqnarray*}
x(t) \in \mathbb{R}^D \quad \iff &&X(f) \in \mathbb{C}_{Hermitian}^D\\
\mathcal{F}\{x(t)\} = &&X(f)\\
\mathcal{F}^{-1}\{X(f)\} = &&x(t)\\
X(f)Y(f) := &&X(f)\odot Y(f)\\
\end{eqnarray*}
$$
Notation: $*$ corresponds to convolution, and $\star$ to (cross-)correlation.
$$ \begin{eqnarray*} \mathcal{F}\{x(t) * y(t)\} = &&\mathcal{F}\{x(t)\} \odot \mathcal{F}\{y(t)\} = X(f)Y(f)\\[5pt] \mathcal{F}\{x(t) \star y(t)\} = &&\mathcal{F}\{x(t)\} \odot \overline{\mathcal{F}\{y(t)\}} = X(f)\overline{Y(f)} \end{eqnarray*} $$
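Since we work with the DFT, these theorems hold for the circular versions of convolution and correlation. A minimal NumPy check, computing the circular operations directly and comparing against the spectral products:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
x, y = rng.normal(size=D), rng.normal(size=D)
X, Y = np.fft.fft(x), np.fft.fft(y)

k = np.arange(D)
conv = np.array([np.sum(x * y[(m - k) % D]) for m in range(D)])  # circular convolution
corr = np.array([np.sum(x[(k + m) % D] * y) for m in range(D)])  # circular correlation

assert np.allclose(np.fft.fft(conv), X * Y)           # F{x * y} = X Y
assert np.allclose(np.fft.fft(corr), X * np.conj(Y))  # F{x ⋆ y} = X conj(Y)
```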
Energy and Parseval’s Theorem:
Given any vector in $\mathbb{R}^D$ or $\mathbb{C}^D$, the energy operator $E$ returns a non-negative, real-valued scalar.
$$
\begin{eqnarray*}
E[x^2] := &&\langle x, x \rangle = \sum_{k=1}^D x(k)^2\\
E[\vert X \vert^2] := &&\langle X, X \rangle = \sum_{k=1}^D X(k)\overline{X(k)}\\
E[x^2] = && \frac{1}{D} E[\vert X \vert^2]\\
\end{eqnarray*}
$$
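Again, a minimal numerical check (NumPy's unnormalized FFT matches the $\frac{1}{D}$ convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
x = rng.normal(size=D)
X = np.fft.fft(x)

# Parseval: E[x^2] = (1/D) * E[|X|^2]
assert np.isclose(np.sum(x**2), np.sum(np.abs(X)**2) / D)
```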
While simple, the signal model of convolving and adding noise covers a very broad spectrum of scenarios.
In our signal model, we have a signal $s(t)$ that has been distorted via convolution with a response $r(t)$, and then subject to additive noise $n(t)$ (which can be any form of disturbance). We assume that all signals are real-valued, discrete-time series and that $s, n$ have zero mean. The result is the known observation $o(t)$, as follows:
$$ \begin{eqnarray*} o(t) = &&\big[ s(t) * r(t) \big] + n(t), \quad o, s, r, n : && \mathbb{Z} \rightarrow \mathbb{R}\\[5pt] O(f) = &&\big[ S(f)R(f) \big] + N(f)\\ \sum_{i} s(i) = &&\sum_{i} n(i) = 0 \end{eqnarray*} $$
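As a minimal sketch of this signal model (toy NumPy example; the sizes and the moving-average response are illustrative choices, and the convolution is circular):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

s = rng.normal(size=D); s -= s.mean()         # zero-mean signal
n = 0.1 * rng.normal(size=D); n -= n.mean()   # zero-mean additive noise
r = np.zeros(D); r[:5] = 0.2                  # arbitrary (moving-average) response

S, R, N = np.fft.fft(s), np.fft.fft(r), np.fft.fft(n)
O = S * R + N                                 # O = SR + N
o = np.real(np.fft.ifft(O))                   # o = (s * r) + n
```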
Furthermore, we assume to know the power spectral density (PSD) of $s$ and $n$, i.e. their energy distribution as a function of frequency. Since we have finite samples, this requires us to assume wide-sense stationarity in $s$ and $n$:
$$ \begin{eqnarray*} \vert S(f) \vert^2, \vert N(f) \vert^2 \quad \text{are known} \end{eqnarray*} $$
Our last assumption is that the correlation between $s$ and $n$ is provided. As already mentioned, the typical assumption in Wiener deconvolution is that $n$ is some form of stochastic (and ergodic) noise process that is uncorrelated to both $s$ and $s * r$, i.e.: $$ \langle s, n \rangle = \langle s*r, n \rangle = 0 \iff \langle S, N \rangle = \langle SR, N \rangle = 0 $$
We will analyze this relaxation in detail.
The idea of Wiener filters is to look for a linear operator $w$ that will optimally approximate the signal $s$ when applied to the observation $o$. In the case of Wiener deconvolution, this operator is a convolution, and the relation is best expressed in the Fourier domain: $$ \begin{eqnarray*} \hat{s} = &&w * o \quad \iff \quad \hat{S} = WO\\ \end{eqnarray*} $$
This optimal approximation is expressed as minimizing the $\ell_2$ distance between $s$ and our reconstruction $\hat{s}$: $$ \begin{eqnarray*} \underset{w}{\text{argmin}} &&\lVert s - \hat{s} \rVert_2^2 \quad \iff \quad \underset{W}{\text{argmin}} &&E \Big[ \vert S - \hat{S} \vert^2 \Big]\\\ \end{eqnarray*} $$
For this reason this process is often called optimal restoration and $w$ is often called an optimal linear filter. Here we simply note that there are other forms of optimality that this $\ell_2$ objective does not achieve (e.g. $\ell_2$ is known to be generally bad for sparse recovery, and $\ell_1$ can be used instead, leading to a different class of filters).
As already stated, the spectral version of the objective to be minimized is the following:
$$ \begin{eqnarray*} &&E\Big[ \vert S - \hat{S} \vert^2 \Big] = E\Big[ \vert S - WO \vert^2 \Big] \end{eqnarray*} $$
Inspired by EISSTS (3.175), we first observe that all 3 spectra $S, W, O$ have the exact same dimensionality, and $W$ is unconstrained. Therefore an exact recovery $S = \hat{S}$ is possible; solving for $W$ gives:
$$ \begin{eqnarray*} S = \hat{S} = W_o O \iff &&S\overline{O} = W_o O\overline{O} = W_o \vert O \vert^2\\ \iff &&W_o = \frac{S\overline{O}}{\vert O \vert^2}\\ \end{eqnarray*} $$
This means that convolving with $W_o$ provides a zero-error, exact recovery. Note that, if $\vert O(f) \vert^2 = 0$ for some $f$, then also $S(f)\overline{O}(f) = 0$, so $W_o(f)$ is unconstrained, but still optimal and computable.
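A minimal toy check of this exact recovery (same flavor of setup as the signal-model sketch, but here without removing the means so that every DFT bin is generically nonzero and the division is well defined):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
s, n = rng.normal(size=D), 0.1 * rng.normal(size=D)
r = np.zeros(D); r[:5] = 0.2
S, R, N = np.fft.fft(s), np.fft.fft(r), np.fft.fft(n)
O = S * R + N

W_o = S * np.conj(O) / np.abs(O)**2       # exact filter W_o = S conj(O) / |O|^2
s_hat = np.real(np.fft.ifft(W_o * O))
assert np.allclose(s_hat, s)              # zero-error recovery (up to float precision)
```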
Compared to the long-winded derivations in the literature, the one above is almost upsettingly simple, so is it actually helpful? We still have 2 open points:
This expression depends on things we don’t claim to know, like $S$.
There is no apparent relation to the well-known Wiener result.
In the next block we will see that both points can be resolved with further algebraic manipulation:
The first point is resolved with the $O = SR+N$ identity:
$$ \begin{eqnarray*} W_o = && \frac{S\overline{O}}{\vert O \vert^2} = \frac{S(\overline{SR} + \overline{N})}{\vert O \vert^2} = \frac{\vert S \vert^2 \overline{R} + S \overline{N}}{\vert O \vert^2}\\ \end{eqnarray*} $$
This last expression no longer relies on unknown quantities, so the optimal filter $W_o$ is computable (recall that we assumed $\vert S \vert^2$ and the cross-term $S \overline{N}$ to be known, and that the response $R$ is treated as known throughout).
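As a quick check, this computable form agrees with the original $\frac{S\overline{O}}{\vert O \vert^2}$ expression on the toy example (illustrative setup as before):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
s, n = rng.normal(size=D), 0.1 * rng.normal(size=D)
r = np.zeros(D); r[:5] = 0.2
S, R, N = np.fft.fft(s), np.fft.fft(r), np.fft.fft(n)
O = S * R + N

W_o = S * np.conj(O) / np.abs(O)**2                                      # from the observation
W_o_known = (np.abs(S)**2 * np.conj(R) + S * np.conj(N)) / np.abs(O)**2  # from |S|^2, R, S conj(N)
assert np.allclose(W_o, W_o_known)
```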
In order to address the second point we introduce the following identities:
$$ \begin{eqnarray*} Z := &&\frac{S}{\vert S \vert^2}\\ \vert O \vert^2 = && \vert SR \vert^2 + \vert N \vert^2 + SR\overline{N} + \overline{SR}N\\ = && \vert S \vert^2 \Big( \overbrace{\vert R \vert^2 + \frac{\vert N \vert^2}{\vert S \vert^2}}^\Delta + \overbrace{ZR\overline{N} + \overline{ZR}N}^\gamma \Big)\\[5pt] \frac{x}{a+b} = &&\frac{x}{a} - \frac{bx}{a(a+b)} \end{eqnarray*} $$
The Wiener filter typically encountered in the literature is then a relaxation given by assuming uncorrelated noise, i.e. $ZR\overline{N} = \overline{ZR}N = Z \overline{N} = 0$ for all frequencies. Applying this assumption to $W_o$ yields the corresponding expression for $W_\perp$:
$$ \begin{eqnarray*} W_o = &&\frac{\vert S \vert^2 \overline{R} + S \overline{N}}{\vert O \vert^2} = \frac{\vert S \vert^2 \overline{R} + S \overline{N}}{\vert S \vert^2 (\Delta + \gamma)} = \frac{\overline{R} + Z \overline{N}}{\Delta + \gamma}\\ \Rightarrow W_\perp = &&\frac{\overline{R} + 0}{\Delta + 0} = \frac{\overline{R}}{\Delta} = \frac{\overline{R}}{\vert R \vert^2 + \frac{\vert N \vert^2}{\vert S \vert^2}} \end{eqnarray*} $$
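A minimal sketch of $W_\perp$ on the same toy setup (the recovery is now only approximate, since the toy noise is not exactly uncorrelated per frequency):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
s, n = rng.normal(size=D), 0.1 * rng.normal(size=D)
r = np.zeros(D); r[:5] = 0.2
S, R, N = np.fft.fft(s), np.fft.fft(r), np.fft.fft(n)
O = S * R + N

Delta = np.abs(R)**2 + np.abs(N)**2 / np.abs(S)**2
W_perp = np.conj(R) / Delta               # classic Wiener deconvolution filter
s_hat = np.real(np.fft.ifft(W_perp * O))  # approximate recovery (eps_perp != 0 here)
```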
Crucially, we can rearrange the optimum $W_o$ to express it in terms of the uncorrelated filter $W_\perp$, plus some correction $\varepsilon_\perp$:
$$ \begin{eqnarray*} W_o = &&W_\perp + \varepsilon_\perp\\ \Rightarrow \varepsilon_\perp = &&W_o - W_\perp = \frac{\overline{R}}{\Delta + \gamma} - W_\perp + \frac{Z \overline{N}}{\Delta + \gamma}\\ = &&\overbrace{\frac{\overline{R}}{\Delta}}^{W_\perp} - \frac{\overline{R} \gamma }{\Delta ( \Delta + \gamma)} - W_\perp + \frac{S \overline{N}}{\vert O \vert^2}\\[10pt] = &&\frac{S \overline{N}}{\vert O \vert^2} - \frac{\overline{R} \gamma \vert S \vert^2}{\Delta \vert O \vert^2}\\ = &&\frac{1}{\vert O \vert^2} \left( S \overline{N} - W_\perp (SR\overline{N} + \overline{SR}N) \right)\\ = &&\frac{1}{\vert O \vert^2} \left( \overbrace{S \overline{N}}^I - 2\overbrace{\mathfrak{Re}(SR\overline{N}) W_\perp}^{II} \right) \end{eqnarray*} $$
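The decomposition $W_o = W_\perp + \varepsilon_\perp$ can be verified numerically on the toy setup:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
s, n = rng.normal(size=D), 0.1 * rng.normal(size=D)
r = np.zeros(D); r[:5] = 0.2
S, R, N = np.fft.fft(s), np.fft.fft(r), np.fft.fft(n)
O = S * R + N
O2 = np.abs(O)**2

W_o = S * np.conj(O) / O2
W_perp = np.conj(R) / (np.abs(R)**2 + np.abs(N)**2 / np.abs(S)**2)

# eps_perp = ( S conj(N) - 2 Re(S R conj(N)) W_perp ) / |O|^2
eps_perp = (S * np.conj(N) - 2 * (S * R * np.conj(N)).real * W_perp) / O2
assert np.allclose(W_o, W_perp + eps_perp)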
Finally, we can reformulate $\varepsilon_\perp$ as a function of $S\overline{N}$ parametrized by $R$. Let's start by isolating the $S \overline{N}$ component from the $II$ term:
$$ \begin{eqnarray*} II = &&2\mathfrak{Re}(SR\overline{N}) W_\perp = \frac{2}{\Delta}\mathfrak{Re}(SR\overline{N}) \overline{R}\\ = &&\frac{2}{\Delta}\left( \mathfrak{Re}(S\overline{N})\mathfrak{Re}(R) - \mathfrak{Im}(S\overline{N})\mathfrak{Im}(R) \right) \left(\mathfrak{Re}(R) - i \mathfrak{Im}(R) \right)\\ = &&\frac{2}{\Delta}\Big[ \mathfrak{Re}(S\overline{N}) \left( \mathfrak{Re}^2(R) - i \mathfrak{Re}(R)\mathfrak{Im}(R) \right)\\ &&\quad- \mathfrak{Im}(S\overline{N}) \left( \mathfrak{Re}(R)\mathfrak{Im}(R) - i \mathfrak{Im}^2(R) \right) \Big]\\ \end{eqnarray*} $$
Now we can replace $II$ in the $\varepsilon_\perp$ definition, yielding:
$$ \begin{eqnarray*} \varepsilon_\perp = && \frac{1}{\vert O \vert^2} \Big[ \mathfrak{Re}(S \overline{N}) + i \mathfrak{Im}(S \overline{N})\\ &&\qquad- \mathfrak{Re}(S\overline{N}) \frac{2 \left(\mathfrak{Re}^2(R) - i \mathfrak{Re}(R)\mathfrak{Im}(R) \right)}{\Delta}\\ &&\qquad+ \mathfrak{Im}(S\overline{N}) \frac{2 \left(\mathfrak{Re}(R)\mathfrak{Im}(R) - i \mathfrak{Im}^2(R) \right)}{\Delta} \Big]\\[10pt] = &&\mathfrak{Re}(S\overline{N}) \cdot \frac{1 - \frac{2}{\Delta} \left(\mathfrak{Re}^2(R) - i \mathfrak{Re}(R)\mathfrak{Im}(R) \right)}{\vert O \vert^2}\\ &&+ \mathfrak{Im}(S\overline{N}) \cdot \frac{i + \frac{2}{\Delta} \left(\mathfrak{Re}(R)\mathfrak{Im}(R) - i \mathfrak{Im}^2(R) \right)}{\vert O \vert^2}\\ = &&\mathfrak{Re}(S\overline{N}) \cdot \frac{1 - \frac{2}{\Delta} \left(\mathfrak{Re}^2(R) - i \mathfrak{Re}(R)\mathfrak{Im}(R) \right)}{\vert O \vert^2}\\ &&+ i \mathfrak{Im}(S\overline{N}) \cdot \frac{1 - \frac{2}{\Delta} \left(\mathfrak{Im}^2(R) + i \mathfrak{Re}(R)\mathfrak{Im}(R) \right)}{\vert O \vert^2}\\ = &&f_R(S\overline{N}) \qquad : \mathbb{C} \rightarrow \mathbb{C} \end{eqnarray*} $$
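And a quick numerical check that this $f_R(S\overline{N})$ expression matches the earlier $\varepsilon_\perp$ formula (same illustrative toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
s, n = rng.normal(size=D), 0.1 * rng.normal(size=D)
r = np.zeros(D); r[:5] = 0.2
S, R, N = np.fft.fft(s), np.fft.fft(r), np.fft.fft(n)
O = S * R + N
O2 = np.abs(O)**2
Delta = np.abs(R)**2 + np.abs(N)**2 / np.abs(S)**2
W_perp = np.conj(R) / Delta

SN = S * np.conj(N)
ReR, ImR = R.real, R.imag

# f_R(S conj(N)), written out as in the last expression above
f_R = (SN.real * (1 - (2 / Delta) * (ReR**2 - 1j * ReR * ImR))
       + 1j * SN.imag * (1 - (2 / Delta) * (ImR**2 + 1j * ReR * ImR))) / O2

# It matches eps_perp = ( S conj(N) - 2 Re(S R conj(N)) W_perp ) / |O|^2
eps_perp = (SN - 2 * (SN * R).real * W_perp) / O2
assert np.allclose(f_R, eps_perp)
```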
Given the $\varepsilon_\perp$ expression as a function of $I$ and $II$, we can see that $W_\perp$ is indeed optimal if the noise is uncorrelated, because in that case $\varepsilon_\perp = 0$ and $W_\perp = W_o$. Otherwise, the error for $W_\perp$ grows in the direction of $S(f)\overline{N}(f)$ minus the direction of $W_\perp(f)$. This makes sense: if we do have interactions but the filter doesn’t correct them, $\varepsilon_\perp$ grows.
We also see that the error $\varepsilon_\perp$ inherits its stability from the main filter. At a given frequency, the error is inversely proportional to the observed energy $\vert O(f) \vert^2$. This could indicate instability for low observed energy, but it is compensated by the fact that low observed energy generally means low energy in $S$ and $N$. Similarly, $W_\perp$ tends to 1 when we have low observed energy, so both terms $I$ and $II$ tend to stabilize and introduce a constant error on the order of $\frac{S \overline{N}}{\vert O \vert^2}$.
As for the $f_R(S\overline{N})$ reformulation, it is mainly useful to explicitly encode the relation between our assumptions about $S\overline{N}$ and the resulting $\varepsilon_\perp$ (e.g. if $N$ is white noise, how is $\varepsilon_\perp$ distributed?). Furthermore, it is computationally convenient for two reasons: most computations can be done ahead of time, and it is fully differentiable, so such assumptions could be learned from data instances using gradient descent.
To wrap this up, we will now check some of the facts discussed so far with a little image processing experiment in Python. You can find the full source code in this gist (also locally hosted here), under GPLv3 License.
To make it visual, we will use an image as an example. Given the date and topic, I couldn’t think of anything more appropriate than this image of the beautiful Wiener Christkindlmarkt (note that the calculations and code naturally apply to other modalities, like audio and video):
We will simulate a typical distortion scenario with 3 artifacts: (1) a motion-like blur along a “trajectory”, (2) a gaussian blur, and (3) additive white noise.
Artifacts 1 and 2 can be modeled by convolving with a “trajectory”, and then convolving with a gaussian blob. In practice, since convolution is associative, we can achieve both at the same time by convolving with a single combined response like the following:
If we then add white noise with a signal-to-noise ratio of 10, the result is our observation:
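A rough sketch of such a distortion pipeline could look as follows (illustrative only: the actual trajectory kernel, SciPy helpers and SNR convention used in the gist may differ; here the SNR is taken as an energy ratio):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import fftconvolve

def distort(img, snr=10.0, seed=0):
    """Roughly simulate the 3 artifacts: trajectory blur + gaussian blur + white noise."""
    rng = np.random.default_rng(seed)
    # Artifacts 1+2: a "trajectory" kernel smoothed by a gaussian blob, combined
    # into a single response (convolution is associative)
    response = np.zeros((15, 15))
    response[7, :] = 1.0                         # simple horizontal trajectory
    response = gaussian_filter(response, sigma=1.5)
    response /= response.sum()
    blurred = fftconvolve(img, response, mode="same")
    # Artifact 3: white noise scaled so that energy(signal) / energy(noise) = snr
    noise = rng.normal(size=img.shape)
    noise *= np.sqrt((blurred**2).sum() / (snr * (noise**2).sum()))
    return blurred + noise, response
```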
As we can see, the quality has been drastically reduced, but this is only a problem if we lack knowledge about the distortion: if we apply the optimal $W_o$ to the observation, the reconstruction ends up being identical to the original image (up to numerical error). If our knowledge is more limited, we can assume uncorrelated noise and apply $W_\perp$ instead, resulting in the following:
While some artifacts can still be seen, the recovery is quite decent! The residual energy is brought from over 23000% down to $\sim$11% (percentage w.r.t. undistorted energy, see Python script for details), and qualitatively it can be seen that the image becomes clearer and less noisy. As studied before, the error given by $W_\perp$ is characterized by $\varepsilon_\perp$, and yields the following residual:
Note the interesting fact that, while the input to $f_R(S\overline{N})$ is uniform noise, the output is quite sparse and structured (pretty much all landmarks in the image can be recognized). In this sense, $f_R(S\overline{N})$ acts like a de-hashing function, mapping from random noise to sparse semantics. This result also confirms that the $\ell_2$ objective underlying $W_\perp$ is generally bad at capturing sparse patterns such as contours, and that a sparsity-aware strategy is a natural idea to reduce the impact of $\varepsilon_\perp$.
And with that we wrap up this post. Of course, linear filter theory has been around since the 1940s, and numerous improvements left out here have long been achieved. E.g., we haven't even tackled how to estimate $\{\vert S \vert^2, \vert R \vert^2, \vert N \vert^2\}$ from the data; we simply used the true values, although in practice we usually aren't supposed to know all of them. Still, I hope this helps you waltz through the foundations of Wiener deconvolution as much as it helped me! 🎻 🎶