This post has two aims:
- Present the XDiag algorithm (introduced in [ETW2024] and discussed in more detail in [E2025])
- Reinterpret XDiag in terms of the previously proposed Hutch++ [MMW2021], which leads to a combined algorithm giving us more control over the runtime/memory tradeoff

This is in a very practical spirit, as part of my own efforts developing the skerch library to perform sketched computations at scale. At the end, I also include a brief experiment showing the above points, which is a matter of a few lines of code using skerch. For a more rigorous discussion, see the sources listed below.
Consider a square, linear operator $A \in \mathbb{C}^{N \times N}$, of (potentially large) dimension $N$, where our only way to access its contents is via (potentially slow) linear measurements $v \to Av$. The task at hand is to estimate $diag(A)$.
The naïve approach is to perform $N$ measurements of the form $A_{ii} = e_i^T A e_i$, for $i \in \{1, 2, \dots, N\}$. There are two substantial problems with this approach:

- If $N$ is large and measurements are slow, performing all $N$ of them may be prohibitively expensive.
- Each measurement only reveals a single entry: if we stop after $k < N$ measurements, we know nothing at all about the remaining $N - k$ entries.
Can we do better than this?
The plan here is to do $k < N$ random measurements that globally capture the diagonal, yielding an estimate that gets better and better as we increase the number of measurements, hopefully becoming good enough well before reaching $N$. This is the essence of the Girard-Hutchinson (GH) diagonal estimator:
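In the notation used later in this post (test vectors $\omega_i \sim \mathcal{D}$ drawn i.i.d. from an isotropic distribution, e.g. Gaussian or Rademacher, so that $\mathbb{E}[\omega \omega^H] = I$):

$$ diag(A) = \mathbb{E}_{\omega \sim \mathcal{D}} \big[ \bar{\omega} \odot A \omega \big] \approx \frac{1}{k} \sum_{i=1}^{k} \bar{\omega}_i \odot A \omega_i $$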
The advantage of the GH estimator is that it converges towards the actual diagonal values, getting better and better as $k$ increases. The problem is that it still suffers from slow convergence: for an error tolerance of $\varepsilon$, we need $\mathcal{O}(1 / \varepsilon^2)$ measurements which, for a reasonably small $\varepsilon$, may very well be larger than $N$ and thus worse than naïve!
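As a reference for later, here is a minimal NumPy sketch of the plain GH estimator (assuming a dense, real-valued $A$ and Gaussian probes; an illustration only, not skerch's actual API):

```python
import numpy as np

def gh_diag(A, k, rng=None):
    """Plain Girard-Hutchinson diagonal estimator with k Gaussian probes."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    Omega = rng.standard_normal((N, k))            # k random test vectors
    return ((A @ Omega) * Omega).sum(axis=1) / k   # (1/k) sum_i omega_i ⊙ (A omega_i)
```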
The good news is that this convergence can be drastically improved to $\mathcal{O}(1 / \varepsilon)$ if we perform rank deflation (see [MMW2021] and [BN2022] for probabilistic bounds). Sketched rank-$k$ deflation works as follows:
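In its standard form (as in [MMW2021]): draw a random test matrix $\Omega \in \mathbb{C}^{N \times k}$, measure $Y = A \Omega$, and orthogonalize it into $Q = orth(Y) = orth(A \Omega)$. The projection $Q Q^H A$ then captures the dominant low-rank part of $A$, and we can split:

$$ A = \underbrace{Q Q^H A}_{\text{deflated (low-rank) part}} + \underbrace{(I - Q Q^H) A}_{\text{residual}} $$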
With this, we can express our rank-deflated variation of GH, also called Hutch++
:
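Concretely (computing the diagonal of the deflated part from the sketch, and running GH only on the residual, with fresh probes $\gamma_i \sim \mathcal{D}$):

$$ diag(A) = diag(Q Q^H A) + \mathbb{E}_{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q Q^H) A \gamma \big] \approx diag(Q Q^H A) + \frac{1}{q} \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q Q^H) A \gamma_i $$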
Note that Hutch++ needs two separate sets of measurements: $A \Omega$ for the deflation, and fresh $A \gamma_i$ for the GH step on the residual. It would be great if we were able to somehow recycle the measurements for both the deflation and the residual estimation! This is, in a nutshell, the core idea behind XDiag [ETW2024].
To formalize this, let’s define $\Omega_i$ as the “downdated” matrix of all measurement vectors $\omega$ except $\omega_i$. Then we have the corresponding downdated orthogonal deflation matrix $Q_i = orth(A \Omega_i)$. And since $Q_i$ is now independent of $\omega_i$, we can still formulate a downdated, deflated Girard-Hutchinson estimator that is unbiased:

$$ \begin{align*} diag_i(A) &= \underbrace{diag(Q_i Q_i^H A)}_{\text{Deflation: }k-1~\text{measurements}} + \underbrace{\bar{\omega}_i \odot ([I - Q_i Q_i^H] A \omega_i)} _{\text{GH: 1 measurement}} \end{align*} $$

Unbiased, but not necessarily good: note how the GH part now relies on a single measurement. The good news is that we have $k$ such estimators (one for each left-out $\omega_i$), so we can average them, leveraging the exchangeability principle: $diag(A) \approx \frac{1}{k} \sum_i diag_i(A)$. This way, we fully recycle the deflation measurements for the GH estimation: no $\gamma$, just $\omega$ in the picture. This is still unbiased and leads to substantially better results, as we show at the end of this post (see also [E2025, Ch.16] for more exhaustive experiments).
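To make the leave-one-out idea concrete, here is a minimal (and deliberately naive) NumPy sketch of the averaged estimator above, which recomputes each $Q_i$ explicitly. It assumes a dense, real-valued $A$ and is not skerch's API; the efficient reformulation comes next.

```python
import numpy as np

def xdiag_loo_reference(A, k, rng=None):
    """Leave-one-out deflated GH estimator: diag(A) ~ (1/k) sum_i diag_i(A).

    Naive reference: one QR factorization per left-out column. Note that the
    same k measurements Y = A @ Omega serve both the deflation and the GH term.
    """
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    Omega = rng.standard_normal((N, k))   # measurement vectors omega_i
    Y = A @ Omega                         # k forward measurements
    est = np.zeros(N)
    for i in range(k):
        keep = np.delete(np.arange(k), i)
        Q_i, _ = np.linalg.qr(Y[:, keep])           # orth(A Omega_i)
        defl = (Q_i * (A.T @ Q_i)).sum(axis=1)      # diag(Q_i Q_i^T A)
        resid = Omega[:, i] * (Y[:, i] - Q_i @ (Q_i.T @ Y[:, i]))  # omega_i ⊙ (I - Q_i Q_i^T) A omega_i
        est += defl + resid
    return est / k
```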
Another great realization from [ETW2024] is that, if we know the full $Q$ (and the corresponding $R$ from the thin QR factorization $A \Omega = Q R$), then for each left-out $\omega_i$ it is possible to efficiently find a vector $s_i$ such that $Q_i Q_i^H = Q (I - s_i s_i^H) Q^H$. This drastically simplifies the downdated estimator (derivation from [E2025]; here $r_i$ denotes the $i$-th column of $R$, so that $A \omega_i = Q r_i$):

$$ \begin{align*} diag_i(A) &= diag(Q_i Q_i^H A) + \bar{\omega}_i \odot ([I - Q_i Q_i^H] A \omega_i) \\ &= diag(Q [I - s_i s_i^H] Q^H A) + \bar{\omega}_i \odot ([I - Q (I - s_i s_i^H) Q^H] A \omega_i) \\ &= diag(Q Q^H A) - Q s_i \odot A^T \bar{Q} \bar{s}_i \\ &\qquad \qquad + \bar{\omega}_i \odot \big[ \underbrace{(I - Q Q^H) A \omega_i} _{= 0} + Q s_i s_i^H \underbrace{Q^H A \omega_i} _{= Q^H Q r_i = r_i} \big] \\ &= diag(Q Q^H A) - Q s_i \odot A^T \bar{Q} \bar{s}_i + \bar{\omega}_i \odot Q s_i (s_i^H r_i) \\ &= diag(Q Q^H A) + Q s_i \odot \big[(s_i^H r_i) \bar{\omega}_i - A^T \bar{Q} \bar{s}_i \big] \end{align*} $$
And now we can average all $k$ estimators, yielding a computationally efficient expression for XDiag:

$$ \begin{align*} diag(A) &\approx \frac{1}{k} \sum_i diag_i(A) \\ &= diag(Q Q^H A) + \frac{1}{k} \big[ Q S \odot \big( \bar{\Omega}~diag(S^H R) - A^T \bar{Q} \bar{S} \big) \big] 1 \end{align*} $$

where $S$ is the matrix of all $s_i$ vectors (and $R$ the matrix of all $r_i$, columnwise). As mentioned, not only are the computations involved very succinct, but the same $\Omega$ measurements are used for both the deflation and the GH step, leading to increased sample efficiency. Great!
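Here is a minimal NumPy sketch of this expression (dense, real-valued $A$, Gaussian probes; an illustration, not skerch's actual API). The only ingredient not spelled out in this post is how to build the $s_i$: following [ETW2024], the sketch takes $s_i$ to be the normalized $i$-th column of $R^{-H}$ (i.e. $R^{-T}$ for real data), with $A \Omega = Q R$ the thin QR factorization.

```python
import numpy as np

def xdiag(A, k, rng=None):
    """XDiag-style diagonal estimator (real-valued, dense A), k measurements."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    Omega = rng.standard_normal((N, k))     # measurement vectors
    Y = A @ Omega                           # k forward measurements
    Q, R = np.linalg.qr(Y)                  # thin QR: A @ Omega = Q @ R
    S = np.linalg.inv(R).T                  # columns proportional to s_i (real case)
    S /= np.linalg.norm(S, axis=0)          # normalize each s_i
    QS = Q @ S
    AtQ = A.T @ Q                           # k adjoint measurements for the deflation term
    defl = (Q * AtQ).sum(axis=1)            # diag(Q Q^T A)
    sr = np.einsum("ij,ij->j", S, R)        # the scalars s_i^T r_i
    corr = (QS * (Omega * sr - AtQ @ S)).sum(axis=1) / k  # (1/k)[QS ⊙ (Ω diag(S^T R) - A^T Q S)] 1
    return defl + corr
```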
One drawback of exchangeability is that deflation and GH measurements are now coupled: if we obtained $Q$ from $k$ measurements $\omega_i$, we can only use those measurements for the GH estimator. Conversely, if we need $k$ GH measurements to achieve a certain accuracy, we are forced to keep $k$ vectors in memory. So if we need a larger $k$ but can’t store it, do we need to drop XDiag altogether and fall back to Hutch++?
In this section we observe that XDiag can be seen as a special case of Hutch++, where we get the first $k$ GH samples basically for free. This allows us to add more GH measurements on top of XDiag, without having to alter the deflation step, thus keeping memory within $\mathcal{O}(k)$.
Let $\Psi = (I - \frac{1}{k} S S^H)$ using $S$ and $Q = orth(A \Omega)$ from before. Then, the following estimator is unbiased:
$$ \begin{align*} diag(A) &= diag(Q \Psi Q^H A) + diag([I - Q \Psi Q^H] A) \\ &= \underbrace { diag(Q Q^H A) - \frac{1}{k}[QS \odot A^T \bar{Q} \bar{S}]1 } _{\text{Shared with XDiag!}} + \mathbb{E} _{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \\ \end{align*} $$
At first glance, this new formulation seems to be a step back: the fresh $\gamma$ measurements are back in the picture! But if we look more closely, this estimator shares two out of three components with XDiag. And since both are unbiased, we get:
$$
\begin{align*}
\underbrace{ \frac{1}{k} \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1 } _{\text{The non-shared component of XDiag}} \approx \mathbb{E} _{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big]
\end{align*}
$$
In other words, the “non-shared” component from XDiag estimates the diagonal of the deflated residual $(I - Q \Psi Q^H) A$, so we can interpret it as an initial estimate of that expectation.
And this also allows us to extend XDiag with $q$ more GH measurements, without having to alter or grow $Q$:
$$ \begin{align*} \mathbb{E}_{\gamma \sim \mathcal{D}} &\big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \\ \approx &\frac{1}{k + q} \Big( \underbrace{ \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1} _{\text{XDiag initialization}} + \underbrace{ \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q \Psi Q^H) A \gamma_i } _{\text{deflated GH}} \Big) \end{align*} $$
This yields a hybrid algorithm between XDiag and Hutch++:

$$ \begin{align*} diag(A) &\approx diag(Q Q^H A) - \frac{1}{k}\big[ Q S \odot A^T \bar{Q} \bar{S} \big] 1 \\ &\qquad + \frac{1}{k + q} \Big( \underbrace{ \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1} _{\text{XDiag initialization}} + \underbrace{ \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q \Psi Q^H) A \gamma_i } _{\text{deflated GH}} \Big) \end{align*} $$
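And a corresponding NumPy sketch of the hybrid (again dense, real-valued $A$, and not skerch's API; the $s_i$ construction is the same assumption as in the `xdiag` sketch above):

```python
import numpy as np

def xdiag_hybrid(A, k, q, rng=None):
    """Hybrid XDiag / Hutch++ estimator: k recycled probes + q extra GH probes."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    Omega = rng.standard_normal((N, k))
    Y = A @ Omega
    Q, R = np.linalg.qr(Y)                  # thin QR: A @ Omega = Q @ R
    S = np.linalg.inv(R).T
    S /= np.linalg.norm(S, axis=0)          # s_i: normalized columns of R^{-T}
    QS, AtQ = Q @ S, A.T @ Q
    # shared part: diag(Q Psi Q^T A) = diag(Q Q^T A) - (1/k) [QS ⊙ A^T Q S] 1
    shared = (Q * AtQ).sum(axis=1) - (QS * (AtQ @ S)).sum(axis=1) / k
    # XDiag initialization of the residual expectation (recycled measurements)
    sr = np.einsum("ij,ij->j", S, R)        # s_i^T r_i
    init = (QS * (Omega * sr)).sum(axis=1)  # [QS ⊙ (Ω diag(S^T R))] 1
    # q fresh GH probes on the residual (I - Q Psi Q^T) A
    resid = np.zeros(N)
    for _ in range(q):
        g = rng.standard_normal(N)
        Ag = A @ g                          # one extra forward measurement per probe
        QtAg = Q.T @ Ag
        resid += g * (Ag - Q @ (QtAg - S @ (S.T @ QtAg) / k))  # g ⊙ (I - Q Psi Q^T) A g
    return shared + (init + resid) / (k + q)
```

For $q = 0$ this reduces to the `xdiag` sketch above, while the deflation memory stays at $k$ vectors regardless of $q$.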
Note how this is equivalent to XDiag for $q=0$. Conversely, it is equivalent to Hutch++ if we remove the initialization component. This allows us to set $k$ and $q$ independently, regulating the memory/runtime tradeoff! But does this work?
To verify the above observations, we apply the different diagonal estimators to a random matrix and measure the recovery error (full source code in this gist and also here). Here is a fragment of the random matrix used in this study ($1000\times1000$, $rank=300$) and its singular values:
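As a minimal stand-in for the setup (the actual code is in the linked gist; the singular value profile below is my own assumption for illustration, and the snippet reuses the `gh_diag`, `xdiag` and `xdiag_hybrid` sketches from earlier in this post):

```python
import numpy as np

rng = np.random.default_rng(0)
N, rank = 1000, 300
# random 1000x1000 matrix of rank 300 with (assumed) decaying singular values
U, _ = np.linalg.qr(rng.standard_normal((N, rank)))
V, _ = np.linalg.qr(rng.standard_normal((N, rank)))
A = (U * np.exp(-np.linspace(0, 5, rank))) @ V.T

D = np.diag(A)
err2 = lambda D_hat: np.linalg.norm(D - D_hat) / np.linalg.norm(D)

print("GH, 400 probes:   ", err2(gh_diag(A, 400, rng=rng)))
print("XDiag, 200 probes:", err2(xdiag(A, 200, rng=rng)))
print("XDiag, 400 probes:", err2(xdiag(A, 400, rng=rng)))
print("Hybrid, 200 + 200:", err2(xdiag_hybrid(A, 200, 200, rng=rng)))
```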
And here are the obtained recovery errors. The error metric is the normalized residual $\ell_2$ error (lower is better, 0 means perfect recovery, $\geq 1$ means worse than doing nothing), where $D$ is the actual diagonal and $\hat{D}$ the corresponding approximation:
$$ err_2(\hat{D}) = \frac{\lVert D - \hat{D} \rVert_2}{\lVert D \rVert_2} \in \mathbb{R}_{\geq 0} $$
Some observations:

- GH with 400 measurements converges to around 50% error
- XDiag (the diagonal counterpart of XTrace) with just 200 measurements has about the same performance as $k=200$ deflation followed by 200 GH measurements (i.e. Hutch++). This shows that the measurements are effectively recycled
- With 400 measurements, XDiag dominates, because effectively we are using 400 for deflation and 400 for GH
- And we can add more GH measurements on top of XDiag, to get further improvements at no added cost in memory

And that’s it! Thanks for your attention, even if I’m pretty sure it was all diagonal reading ↖️ 👀 👍
Special thanks to Ethan N. Epperly, the main XDiag author, for helping me get a better understanding of it. Be sure to check out his other work!