The Orthogonal Matching Pursuit Update Step

PUBLISHED ON SEP 10, 2021 — CATEGORIES: proofs


This post is intended to complement the 1994 paper by Pati, Rezaiifar and Krishnaprasad:

Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition [pdf], [online]

Specifically, I provide some auxiliary derivations and sketches to help understand the Orthogonal Matching Pursuit (OMP) update step, which is key to understanding the improvement that OMP provides over regular MP.


The MP Family

MP encompasses a wide theoretical corpus and family of algorithms that can be applied, among other things, to sparse signal recovery, analysis and compression. One typical formulation of the sparse recovery problem is the following:

$$ \min_{a} \quad \lVert a \rVert_0 \quad s.t. \quad Da=f,\quad a \in \mathbb{R}^N, D \in \mathbb{R}^{M \times N} $$

I.e. we seek the coefficient vector $a$ with the fewest non-zero entries such that the matrix $D$ (said to be a dictionary whose columns are the atoms $x_n$) exactly reconstructs a given signal $f$. In the general case this is a very hard combinatorial problem, and two main strategies to make it tractable are popular:

  • Replace the $\ell_0$ pseudonorm with $\ell_1$, convexifying the problem. This is the idea behind Basis Pursuit. The original 2001 paper by Chen, Donoho and Saunders provides great background and, as a bonus, also extensive comparisons with other algorithms including MP: [pdf], [online].

  • Bypass the combinatorial search by incrementally selecting atoms from the dictionary and never looking back, the so-called greedy methods. The MP family, including OMP, belongs to this category.

A formulation of the vanilla Matching Pursuit algorithm can be found in the 1993 paper by Mallat and Zhang ([pdf], [online]). The procedure for the MP algorithm can be coarsely defined as follows:

  1. Compute dot products $\alpha_i := \langle f, x_i \rangle$ between the target signal and all atoms, and choose an atom whose $\vert \alpha_i \vert$ is above a given threshold
  2. Update the signal as follows: $f \leftarrow (f - \alpha_i x_i)$
  3. Repeat steps 1 and 2 until a convergence criterion is met.
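
To make this loop concrete, here is a minimal numpy sketch (my own, not code from the MP paper). It assumes a dictionary $D$ with unit-norm columns and picks the atom with the largest $\vert \alpha_i \vert$ instead of thresholding:

```python
import numpy as np

def matching_pursuit(f, D, num_iters=100, tol=1e-8):
    """Vanilla MP sketch: D is (M, N) with unit-norm columns (atoms), f is (M,)."""
    residual = f.astype(float)
    coeffs = np.zeros(D.shape[1])
    for _ in range(num_iters):
        alphas = D.T @ residual                     # dot products with all atoms (step 1)
        i = np.argmax(np.abs(alphas))               # greedy atom choice
        coeffs[i] += alphas[i]                      # atoms may be re-selected, so accumulate
        residual = residual - alphas[i] * D[:, i]   # residual update (step 2)
        if np.linalg.norm(residual) < tol:          # convergence criterion (step 3)
            break
    return coeffs, residual
```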

Among other things, the MP paper provides a proof of convergence for step 3 (refer there for more details). While elegant and fast, this formulation suffers from several issues that later members of the family set out to address.

OMP was proposed 1 year later, and introduces a substantial improvement by refining the update step. Consider the image below from the 1994 OMP paper mentioned at the beginning:

Figure 1 from the OMP paper: reconstruction of vector $f$ using vectors $x_1$ and $x_2$ takes an unbounded number of iterations when using regular MP, while it should take just 2, since all three vectors are in $\mathbb{R}^2$.

The reason for this problem is that MP leaves the previously found coefficients untouched due to its greedy nature. OMP addresses precisely this by being less greedy: at each iteration, all coefficients are efficiently backwards-updated, guaranteeing an optimal recovery of any vector lying in the span of $k$ atoms in at most $k$ steps. The improved update step takes longer to compute, but the algorithm makes up for it by requiring fewer iterations to converge in most cases, thus being faster and providing sparser solutions. For this reason OMP is generally preferred over MP.

But exactly how much better is it? Apart from computational speed, a very important criterion to assess such an algorithm is the quality of the recovery, where using as few non-zero entries as possible and achieving a reconstruction as close to the original as possible are the desirable goals.

Subsequent advances in the field of Compressed Sensing led to a theoretical framework that now allows us to compare different algorithms for sparse recovery, by quantifying the requirements that a system must meet in order to allow exact recovery of the signal $f$ (also in noisy conditions). The main focus is usually the Restricted Isometry Property (RIP) of the dictionary, which tells us how “bad” a dictionary can be while still providing exact recovery. This framework was originally proposed in a 2004 paper by Candes, Romberg and Tao [pdf], [online], and the headline is that exact recovery of sparse signals can be possible even when sampling way below the Nyquist limit!

As it turns out, some algorithms have better recovery properties than others, and this is reflected in larger bounds for the RIP constant. I found this presentation ([online version]) by Prof. Chakraborty particularly enlightening. The presentation covers generalized OMP (gOMP) and other techniques, and is a nice example of how to compare different algorithms via their RIP bounds. This kind of analysis, together with speed, is among the main factors taken into account when deciding on a particular algorithm. In this context, Cai and Wang provided a characterization for OMP in 2011 [paper], [online].

Also relevant is the 2007 paper by Blumensath and Davies [paper] ([online]), which provides a lean algebraic definition of OMP and Orthogonal Least Squares (OLS) as well as key insights to understand and compare them. On top of OMP, OLS introduces one extra step that, while slower, guarantees choosing the atom that yields the smallest residual after orthogonalization, unlike OMP, which selects before orthogonalizing and therefore cannot guarantee this.

Now that we have some general pointers on the main motivations and approaches surrounding OMP, it is a good time to dive into the paper’s details.


Auxiliary Derivations

The notation may differ slightly from the paper for clarity/economy reasons. I also use real-valued notation throughout, but $v^T$ can be read as the Hermitian transpose for complex signals.

In section 2, the authors mention that “it may be shown that the correct update from the $k^{th}$-order model to the model of order $(k+1)$, is given by”:

$$ \begin{align} a_n^{k+1} = &a_n^{k} - \alpha_k b_n^k, \quad n = 1, \dots, k\\ a_{k+1}^{k+1} = &\alpha_k = \frac{\langle R_kf, x_{k+1} \rangle}{\lVert \gamma_k \rVert_2^2} \end{align} $$

But how exactly do we arrive at that result? To show it, we start by defining the signal model for $f$, which corresponds to the optimal approximation after $k$ steps, with reconstruction $f_k := Da$ and corresponding residual $R_kf$:

$$ f := f_k + R_kf = Da + R_kf = \sum_{n=1}^k a_n x_n + R_kf\\ \text{with } \langle R_kf, x_n \rangle = 0 \quad \forall n $$

The optimality is defined in terms of minimizing the euclidean distance between the original signal $f$ and the reconstruction $f_k$, i.e. it is not an $\ell_0$ or an $\ell_1$ objective, but $\ell_2$. Note that this assumption of optimality is encoded in the form of zero-valued dot products: when the $a_n$ coefficients are chosen so that the euclidean distance between $f_k$ and $f$ is minimal, then $f_k$ and $R_kf$ are orthogonal. And not only that: $R_kf$ is orthogonal to each atom $x_i$. Let’s show this.

Orthogonality between optimal reconstruction $f_k$ and residual $R_kf$

Finding $a$ that minimizes the euclidean distance between $f$ and $f_k$ is a least squares problem that can be solved via normal equations. Since it is a quadratic function, the idea is to set the derivative to zero:

$$ \begin{align*} a^* = \min_{a} \quad \lVert R_kf \rVert_2^2 = &\lVert f - f_k \rVert_2^2 = \lVert f - Da \rVert_2^2\\ = &f^Tf - 2f^TDa + a^T D^T D a\\ \frac{\partial}{\partial a} \lVert R_kf \rVert_2^2 = 0 \iff & 0 - 2 D^Tf + 2 D^T D a^* = 0\\ \iff &a^* = (D^T D)^{-1} D^T f =: A^{-1} D^T f \end{align*} $$

Now replacing $a$ with $a^*$ will show that the residual is orthogonal to every atom, and therefore the residual is also orthogonal to the reconstruction (which is a linear combination of atoms):

$$ \begin{align*} \langle x_i, R_kf \rangle = &\langle x_i, f - Da^* \rangle = \langle x_i, f - D A^{-1} D^T f \rangle\\ = &x_i^T f - x_i^TD A^{-1} D^T f = x_i^T f - A_i^T A^{-1} D^T f\\ = &x_i^T f - x_i^T f = 0 \end{align*} $$
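
A quick numerical sanity check of this orthogonality (my own sketch, assuming a random full-column-rank $D$, and using numpy's lstsq rather than explicitly inverting $D^TD$):

```python
import numpy as np

rng = np.random.default_rng(0)
M, k = 10, 4
D = rng.normal(size=(M, k))                      # k atoms, generically independent
f = rng.normal(size=M)

a_star, *_ = np.linalg.lstsq(D, f, rcond=None)   # least squares solution a*
residual = f - D @ a_star                        # R_k f

print(D.T @ residual)              # ~0: the residual is orthogonal to every atom
print(residual @ (D @ a_star))     # ~0: hence also orthogonal to the reconstruction f_k
```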

Now that we’re sure that $f_k$ is optimal for (and orthogonal to) a given basis $D$, let’s add another element, yielding our signal model for the next iteration:

$$ f := f_{k+1} + R_{k+1}f = D^\prime a^\prime + R_{k+1}f = \sum_{n=1}^{k+1} a_n^\prime x_n + R_{k+1}f\\ \text{with } \langle R_{k+1}f, x_n \rangle = 0 \quad \forall n $$

But now, the authors note that, since we aren’t assuming orthogonality in $D$, the newly added atom $x_{k+1}$ may interact with the existing ones, hence the $a$ coefficients must be updated so that the residual remains orthogonal to all atoms (i.e. so that the reconstruction remains optimal). In order to analyze the interference of $x_{k+1}$, the authors propose to decompose it into two orthogonal components, in what they call the “auxiliary model” for $x_{k+1}$:

$$ x_{k+1} = \sum_{n=1}^{k} b_n x_n + \gamma_k, \quad \text{with } \langle \gamma_k, x_n \rangle = 0 \quad \forall n $$

The proof from above can also be used to show that orthogonality between $\gamma_k$ and each $x_n$ implies that the $b_n$ coefficients are chosen to be the optimal reconstruction of $x_{k+1}$ in terms of euclidean distance. We have already shown that such $b$ exist and how to obtain them via the normal equations (the paper proposes a more efficient iterative procedure).
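
The same least squares machinery yields the auxiliary model numerically. A small sketch under the same random-data assumptions as before (via lstsq again, rather than the paper's iterative procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
M, k = 10, 4
D = rng.normal(size=(M, k))                      # current atoms x_1 ... x_k
x_new = rng.normal(size=M)                       # candidate atom x_{k+1}

b, *_ = np.linalg.lstsq(D, x_new, rcond=None)    # projection coefficients b_n
gamma = x_new - D @ b                            # component orthogonal to span(D)

print(D.T @ gamma)                               # ~0: gamma is orthogonal to every atom
print(np.allclose(x_new, D @ b + gamma))         # True: x_{k+1} = sum_n b_n x_n + gamma
```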

To continue the derivation we will now make use of a convenient property of the $\ell_2$ objective:

Orthogonal decomposition of $\ell_2$ norm
Given 2 orthogonal vectors $u, v$: $$ \begin{align*} \lVert w - &(u + v) \rVert_2^2 = w^Tw - 2w^T(u+v) + (u+v)^T(u+v)\\ = &w^Tw + u^Tu + v^Tv - 2w^Tu -2w^Tv + 2u^Tv\\ = &(w^Tw - 2w^Tu + u^Tu) -2w^Tv + v^Tv + 0\\ = &\lVert w - u \rVert_2^2 + \lVert w - v \rVert_2^2 - \lVert w \rVert_2^2 \end{align*} $$
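
A tiny numerical check of this identity, with $u$ and $v$ made orthogonal by construction (my own sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
w, u, v = rng.normal(size=(3, 5))
v = v - (v @ u) / (u @ u) * u   # remove the u-component so that u and v are orthogonal

lhs = np.linalg.norm(w - (u + v))**2
rhs = np.linalg.norm(w - u)**2 + np.linalg.norm(w - v)**2 - np.linalg.norm(w)**2
print(np.isclose(lhs, rhs))     # True
```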

I.e. under orthogonality, the $\ell_2$ objective can be decomposed into further $\ell_2$ objectives. We can now proceed to derive the OMP update results.

OMP coefficient updates

We start by substituting the auxiliary model of $x_{k+1}$ into the updated signal model: $$ \begin{align*} f := &\sum_{n=1}^{k+1} a_n^\prime x_n + R_{k+1}f = \sum_{n=1}^{k} a_n^\prime x_n + a_{k+1}^\prime x_{k+1} + R_{k+1}f\\ = & \sum_{n=1}^{k} (a_n^\prime + a_{k+1}^\prime b_n) x_n + a_{k+1}^\prime \gamma_k + R_{k+1}f \end{align*} $$

Now, we can orthogonally decompose the $\ell_2$ objective for our new $f$:

$$ \begin{align*} a^{\prime*} = \min_{a} &\lVert R_{k+1}f \rVert_2^2\\ = &\lVert f - (\sum_{n=1}^{k} (a_n^\prime + a_{k+1}^\prime b_n) x_n + a_{k+1}^\prime \gamma_k) \rVert_2^2\\ = &\lVert f - (\sum_{n=1}^{k} (a_n^\prime + a_{k+1}^\prime b_n) x_n) \rVert_2^2 + \lVert f - a_{k+1}^\prime \gamma_k \rVert_2^2\\ &- \lVert f\rVert_2^2 \end{align*} $$

As a result of the orthogonality between $\gamma_k$ and all $x_n$, we were able to decompose the problem into two simpler ones, where the second sub-problem does not depend on the first one. We will arrive at the update equations provided by the paper by solving them sequentially.

The second sub-problem is a quadratic expression, so once again we apply the normal equations to obtain the optimum:

$$ \begin{align*} a_{k+1}^{\prime*} &= \min_{a_{k+1}^{\prime}} \lVert f - a_{k+1}^\prime \gamma_k \rVert_2^2\\ \Rightarrow a_{k+1}^{\prime*} &= \frac{\langle f, \gamma_k \rangle}{\lVert \gamma_k \rVert_2^2} \end{align*} $$

Now we leverage the fact that all atoms in $D$ are orthogonal to both $R_kf$ and $\gamma_k$, and arrive at the $\alpha_k$ result from the paper:

$$ \begin{align*} \langle f, \gamma_k \rangle = &\langle Da + R_kf, \gamma_k \rangle = \langle R_kf, \gamma_k \rangle\\ = &\langle R_kf, x_{k+1} - Db \rangle = \langle R_kf, x_{k+1} \rangle\\ \Rightarrow a_{k+1}^{\prime*} &= \frac{\langle R_kf, x_{k+1} \rangle}{\lVert \gamma_k \rVert_2^2} \end{align*} $$
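
We can verify numerically that the two expressions for $\alpha_k$ coincide. The following sketch (mine, with random $D$, $f$ and $x_{k+1}$) just reuses the lstsq-based quantities from before:

```python
import numpy as np

rng = np.random.default_rng(3)
M, k = 10, 4
D = rng.normal(size=(M, k))
f = rng.normal(size=M)
x_new = rng.normal(size=M)                       # x_{k+1}

a, *_ = np.linalg.lstsq(D, f, rcond=None)        # optimal a over the k atoms
Rkf = f - D @ a                                  # residual R_k f
b, *_ = np.linalg.lstsq(D, x_new, rcond=None)    # auxiliary model of x_{k+1}
gamma = x_new - D @ b

alpha_from_f = (f @ gamma) / (gamma @ gamma)            # <f, gamma> / ||gamma||^2
alpha_from_residual = (Rkf @ x_new) / (gamma @ gamma)   # <R_k f, x_{k+1}> / ||gamma||^2
print(np.isclose(alpha_from_f, alpha_from_residual))    # True
```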

Finally, to solve the first sub-problem, note how replacing each $(a_n^\prime + a_{k+1}^\prime b_n)$ with $a_n$ yields an objective that we have already solved:

$$ \min_{a} \quad \lVert f - (\sum_{n=1}^{k} a_n x_n) \rVert_2^2 = \lVert f - Da \rVert_2^2 $$

It follows that each $(a_n^\prime + a_{k+1}^\prime b_n)$ must equal the previously computed $a_n$, which was the optimal result for $f, D$. The update provided in the paper enforces exactly this:

$$ a_n^\prime + a_{k+1}^\prime b_n = a_n \quad \iff \quad a_n^\prime = a_n - a_{k+1}^\prime b_n \quad \forall n = 1, \dots, k $$
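
The whole update can also be checked numerically: the backwards-updated coefficients should coincide with a direct least squares re-fit over all $k+1$ atoms. A sketch under the same random-data assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(4)
M, k = 10, 4
D = rng.normal(size=(M, k))
f = rng.normal(size=M)
x_new = rng.normal(size=M)                       # x_{k+1}

a, *_ = np.linalg.lstsq(D, f, rcond=None)        # old coefficients a_n
Rkf = f - D @ a
b, *_ = np.linalg.lstsq(D, x_new, rcond=None)    # auxiliary model coefficients b_n
gamma = x_new - D @ b

alpha = (Rkf @ x_new) / (gamma @ gamma)          # a_{k+1}'
a_updated = np.append(a - alpha * b, alpha)      # OMP update: a_n' = a_n - alpha * b_n

D_new = np.column_stack([D, x_new])
a_direct, *_ = np.linalg.lstsq(D_new, f, rcond=None)   # direct (k+1)-atom solution
print(np.allclose(a_updated, a_direct))          # True: both give the same coefficients
```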

A few key insights help us better understand this update:

  • The $f_k$ reconstruction was already optimal for the atoms in $D$, so adding an orthogonal $\gamma$ wouldn’t alter the coefficients. Therefore any interference introduced by $x_{k+1}$ must be neutralized, and this can be done via naive subtraction using the auxiliary model as shown.

  • Note that $f = f_k + R_kf = f_{k+1} + R_{k+1}f$, i.e. we can use the constant $f$ to pivot between the $k^{th}$ and $(k+1)^{th}$ stages. This, together with the fact that the auxiliary model for $x_{k+1}$ also has most of its components in the $k^{th}$ stage, allows us to isolate the genuinely new components and express them as a function of already known ones in an iterative fashion.


Sketches

So far we used orthogonal decompositions of the signal model, auxiliary model and objective to algebraically derive the results from the paper. Before finishing, and in line with the pizzazz displayed so far, I include here some drawings to try to provide geometrical intuition on the MP vs. OMP update.

The scenario is very much in the spirit of Figure 1 from the OMP paper, but in 3 dimensions in order to accommodate a third step: we have a signal $f$ that we want to reconstruct, and we are given a reconstruction $f_2 = a_1x_1 + a_2x_2$ that is optimal insofar as it is the closest we can get to $f$ by just combining $x_1$ and $x_2$.

Now we decided to add $x_3$ to the mix, and we can use either MP or OMP:

When using MP (sketch below), $f_2$ stays fixed, and $x_3$ must be added “on top of it”. Since $x_3$ doesn’t point in the exact direction of the residual $f - f_2$, we will always have an error (in green). The $f_3$ solution entails the smallest error possible given this constraint. Note that the new residual (dotted green line) is orthogonal to the direction $x_3$ along which we moved, analogously to how the former residual (dotted vertical yellow line) is orthogonal to the plane spanned by $x_1, x_2$ that contains $f_2$.

Matching Pursuit recovery of a 3-dimensional vector f in terms of x1, x2 and x3. Although f is in the span of the x vectors, we can observe an error (in green) due to the greedy nature of MP.

When using OMP (sketch below), we decompose $x_3$ into two orthogonal components: $b_1x_1 + b_2x_2$ and $\gamma$ (brown vertical line). In the last section we saw that the optimal reconstruction is achieved when we add a scaled $\gamma$ on top of the former $f_2$ until it is closest to $f$ (in this example the reconstruction is exact and $f_3=f$). This is illustrated here. We also see the effect of the $a_n^\prime = a_n - a_{k+1}^\prime b_n$ update: instead of adding $x_3$ on top of $f_2$, which leads to the error seen for MP, we use here $a_1^\prime, a_2^\prime$, which in this case leads to an exact reconstruction.

Orthogonal Matching Pursuit recovery of a 3-dimensional vector f in terms of x1, x2 and x3. Since f is in the span of the x vectors, an optimal recovery is achieved after 3 steps.
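
The geometry of these sketches can also be reproduced numerically. The toy example below (my own, not from the paper) uses three non-orthogonal unit-norm atoms spanning $\mathbb{R}^3$ and an $f$ inside their span: greedy MP generally leaves a residual after 3 iterations, while re-fitting all coefficients over the selected atoms, as OMP does, recovers $f$ exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 3))                 # three non-orthogonal atoms in R^3
X = X / np.linalg.norm(X, axis=0)           # normalize each column (atom)
f = X @ np.array([1.0, -2.0, 0.5])          # f lies in the span of the atoms

# MP: greedy residual updates, previously found coefficients are never revisited
residual = f.copy()
for _ in range(3):
    alphas = X.T @ residual
    i = np.argmax(np.abs(alphas))
    residual = residual - alphas[i] * X[:, i]
print(np.linalg.norm(residual))             # generally > 0 after 3 iterations

# OMP-style re-fit over all selected atoms (here: all three)
a, *_ = np.linalg.lstsq(X, f, rcond=None)
print(np.linalg.norm(f - X @ a))            # ~0: exact recovery
```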


Now, who said that high school math wasn’t helpful in real life?

TAGS: algebra, compressed sensing, matching pursuit, proof, signal processing, sparsity