This post has two aims:
- Present the XDiag algorithm (introduced in [ETW2024] and discussed in more detail in [E2025])
- Express XDiag in terms of the previously proposed Hutch++ [MMW2021], which leads to a combined algorithm giving us more control over the runtime/memory tradeoff

This is in a very practical spirit, as part of my own efforts developing the skerch library to perform sketched computations at scale. At the end I also include a brief experiment showing the above points, which is a matter of a few lines of code using skerch. For more rigorous discussion, see the sources listed below.
Consider a square, linear operator $A \in \mathbb{C}^{N \times N}$, of (potentially large) dimension $N$, where our only way to access its contents is via (potentially slow) linear measurements $v \to Av$. The task at hand is to estimate $diag(A)$.
The naïve approach is to perform $N$ measurements of the form $A_{ii} = e_i^T A e_i$, for $i \in \{1, 2, \dots, N\}$. There are two substantial problems with this approach:

- Each measurement is potentially slow and $N$ is potentially large, so performing all $N$ of them can be prohibitively expensive.
- Each measurement reveals exactly one entry: if we stop after $k < N$ measurements, we have no information at all about the remaining $N - k$ entries.
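For concreteness, here is a minimal NumPy sketch of the naïve approach, under the assumption that we can only touch $A$ through a black-box matvec function (names are illustrative, not skerch API):

```python
import numpy as np

def naive_diag(matvec, N):
    """Recover diag(A) exactly with N canonical-basis measurements e_i -> A e_i."""
    diag = np.empty(N)
    for i in range(N):
        e_i = np.zeros(N)
        e_i[i] = 1.0
        diag[i] = matvec(e_i)[i]  # A_ii = e_i^T (A e_i)
    return diag

# Toy usage: A is only accessible via the (potentially slow) map v -> A v
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
assert np.allclose(naive_diag(lambda v: A @ v, 100), np.diag(A))
```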
Can we do better than this?
The plan here is to do $k < N$ random measurements that globally capture the diagonal, yielding an estimate that gets better and better as we increase the number of measurements, hopefully becoming good enough well before reaching $N$. This is the essence of the Girard-Hutchinson (GH) diagonal estimator:

$$ diag(A) = \mathbb{E}_{\omega \sim \mathcal{D}} \big[ \bar{\omega} \odot A \omega \big] \approx \frac{1}{k} \sum_{i=1}^{k} \bar{\omega}_i \odot A \omega_i $$

Here the test vectors $\omega_i$ are drawn i.i.d. from an isotropic distribution $\mathcal{D}$ (e.g. Rademacher or Gaussian), and $\odot$ denotes the elementwise product.
The advantage of the GH estimator is that it converges towards the actual diagonal values, getting better and better as $k$ increases. The problem is that it still suffers from slow convergence: for an error tolerance of $\varepsilon$, we need $\mathcal{O}(1 / \varepsilon^2)$ measurements which, for a reasonably small $\varepsilon$, may very well be larger than $N$ and thus worse than naïve!
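As a reference point, here is a minimal GH sketch for a real matrix, using Rademacher test vectors as one common choice for $\mathcal{D}$ (again, illustrative code rather than the skerch implementation):

```python
import numpy as np

def gh_diag(matvec, N, k, rng):
    """Girard-Hutchinson diagonal estimate from k random measurements v -> A v."""
    est = np.zeros(N)
    for _ in range(k):
        w = rng.choice([-1.0, 1.0], size=N)  # Rademacher test vector
        est += w * matvec(w)                 # elementwise w ⊙ (A w)
    return est / k

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
d_hat = gh_diag(lambda v: A @ v, N=500, k=100, rng=rng)
```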
The good news is that this convergence can be drastically improved to $\mathcal{O}(1 / \varepsilon)$ if we perform rank deflation (see [MMW2021] and [BN2022] for probabilistic bounds). Sketched rank-$k$ deflation works as follows:

- Draw a random test matrix $\Omega \in \mathbb{C}^{N \times k}$ and measure $A \Omega$ ($k$ measurements).
- Orthogonalize the result: $Q = orth(A \Omega)$.
- Then $Q Q^H A$ is a rank-$k$ approximation capturing the dominant components of $A$, and $(I - Q Q^H) A$ is the (hopefully small) residual.
With this, we can express our rank-deflated variation of GH, also called Hutch++:

$$ diag(A) = \underbrace{diag(Q Q^H A)}_{\text{Deflation}} + \underbrace{\mathbb{E}_{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q Q^H) A \gamma \big]}_{\text{GH on the residual}} $$

The deflated term is computed exactly, and the residual term is estimated with $q$ fresh GH measurements $\gamma \sim \mathcal{D}$. Note that this requires two independent sets of test vectors: the $k$ columns of $\Omega$ for the deflation, plus $q$ more vectors $\gamma$ for the residual estimation.
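A minimal sketch of this deflate-then-GH recipe (real-valued, with a dense $A$ for clarity; in a matrix-free setting the products with $A$ and $A^T$ below would be measurements):

```python
import numpy as np

def hutchpp_diag(A, k, q, rng):
    """Hutch++-style diagonal estimate: rank-k deflation plus q GH measurements
    on the deflated residual (real case)."""
    N = A.shape[0]
    Omega = rng.standard_normal((N, k))
    Q, _ = np.linalg.qr(A @ Omega)            # Q = orth(A @ Omega), k measurements
    # Exact diagonal of the deflated part: diag(Q Q^T A) = rowsum(Q ⊙ (A^T Q))
    d_defl = np.sum(Q * (A.T @ Q), axis=1)
    # GH estimate of diag((I - Q Q^T) A), with q fresh test vectors gamma
    d_res = np.zeros(N)
    for _ in range(q):
        g = rng.choice([-1.0, 1.0], size=N)
        Ag = A @ g
        d_res += g * (Ag - Q @ (Q.T @ Ag))
    return d_defl + d_res / max(q, 1)

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
d_hat = hutchpp_diag(A, k=100, q=100, rng=rng)
```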
Let's focus on that last point: wouldn't it be great if we were able to somehow recycle the same measurements for both the deflation and the residual estimation? This is, in a nutshell, the core idea behind XDiag [ETW2024].
To formalize this, let's define $\Omega_i$ as the "downdated" matrix of all measurement vectors $\omega$ except $\omega_i$, with corresponding downdated orthogonal deflation matrix $Q_i = orth(A \Omega_i)$. Since $Q_i$ is now independent from $\omega_i$, we can still formulate a downdated, deflated Girard-Hutchinson estimator that is unbiased:

$$ \begin{align*} diag_i(A) &= \underbrace{diag(Q_i Q_i^H A)}_{\text{Deflation: }k-1~\text{measurements}} + \underbrace{\bar{\omega}_i \odot ([I - Q_i Q_i^H] A \omega_i)} _{\text{GH: 1 measurement}} \end{align*} $$

Unbiased, but not necessarily good: note how the GH part now relies on a single measurement. The good news is that we have $k$ such estimators (one for each leave-one-out index $i$), so we can average them, leveraging the exchangeability principle: $diag(A) = \frac{1}{k} \sum_i diag_i(A)$. This way, we fully recycle the deflation measurements for the GH estimation: no $\gamma$, just $\omega$ in the picture. The result is still unbiased and leads to substantially better results, as we show at the end of this post (see also [E2025, Ch.16] for more exhaustive experiments).
Another great realization from [ETW2024] is that, if we know the full $Q$, then for each downdated $\omega_i$ it is possible to efficiently find a vector $s_i$ such that $Q_i Q_i^H = Q (I - s_i s_i^H) Q^H$. This helps to drastically simplify the downdated estimator (derivation from [E2025]; here $R$ is the triangular factor of the thin QR decomposition $A \Omega = Q R$, and $r_i$ its $i$-th column, so that $A \omega_i = Q r_i$): $$ \begin{align*} diag_i(A) &= diag(Q_i Q_i^H A) + \bar{\omega}_i \odot ([I - Q_i Q_i^H] A \omega_i) \\ &= diag(Q [I - s_i s_i^H] Q^H A) + \bar{\omega}_i \odot ([I - Q (I - s_i s_i^H) Q^H] A \omega_i) \\ &= diag(Q Q^H A) - Q s_i \odot A^T \bar{Q} \bar{s}_i \\ &\qquad \qquad + \bar{\omega}_i \odot \big[ \underbrace{(I - Q Q^H) A \omega_i} _{= 0} + Q s_i s_i^H \underbrace{Q^H A \omega_i} _{= Q^H Q r_i = r_i} \big] \\ &= diag(Q Q^H A) - Q s_i \odot A^T \bar{Q} \bar{s}_i + \bar{\omega}_i \odot Q s_i (s_i^H r_i) \\ &= diag(Q Q^H A) + Q s_i \odot \big[(s_i^H r_i) \bar{\omega}_i - A^T \bar{Q} \bar{s}_i \big] \end{align*} $$
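The post does not spell out how the $s_i$ are computed; one explicit construction that satisfies the identity above (assuming the triangular factor $R$ is invertible) is to take $s_i$ as the $i$-th column of $R^{-H}$, normalized to unit length. This matches my reading of [ETW2024]/[E2025], and the snippet below numerically checks the projector identity under that assumption, in the real case:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 200, 20
A = rng.standard_normal((N, N))
Omega = rng.standard_normal((N, k))

Q, R = np.linalg.qr(A @ Omega)       # thin QR of the sketch: A @ Omega = Q @ R
S = np.linalg.inv(R).T               # candidate s_i: columns of R^{-T} (real case)...
S /= np.linalg.norm(S, axis=0)       # ...normalized to unit length

for i in range(k):
    Omega_i = np.delete(Omega, i, axis=1)   # downdated measurement matrix
    Q_i, _ = np.linalg.qr(A @ Omega_i)      # downdated deflation basis
    s_i = S[:, i:i + 1]
    lhs = Q_i @ Q_i.T                       # downdated projector
    rhs = Q @ (np.eye(k) - s_i @ s_i.T) @ Q.T
    assert np.allclose(lhs, rhs)
```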
And now we can average all $k$ estimators, yielding a computationally efficient expression for XDiag:

$$ \begin{align*} diag(A) \approx diag(Q Q^H A) + \frac{1}{k} \Big[ Q S \odot \big( \bar{\Omega}~diag(S^H R) - A^T \bar{Q} \bar{S} \big) \Big] 1 \end{align*} $$
Here, $S$ is the matrix whose columns are the $s_i$ vectors, and $R$ is the triangular QR factor from before. As mentioned, not only are the computations involved very succinct, but the same $\Omega$ measurements are also used for both the deflation and the GH step, leading to increased sample efficiency. Great!
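Putting it all together, here is a sketch of the resulting estimator for a real matrix, with $s_i$ obtained as above (dense $A$ for clarity; this is my own compact rendering of the formula, not a reference implementation):

```python
import numpy as np

def xdiag(A, k, rng):
    """XDiag-style diagonal estimate from k (recycled) measurements, real case."""
    N = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(N, k))   # test vectors
    Q, R = np.linalg.qr(A @ Omega)                 # deflation basis from k measurements
    S = np.linalg.inv(R).T
    S /= np.linalg.norm(S, axis=0)                 # leave-one-out vectors s_i
    QS = Q @ S
    d_defl = np.sum(Q * (A.T @ Q), axis=1)         # diag(Q Q^T A)
    # averaged, recycled GH part: (1/k) [QS ⊙ (Omega diag(S^T R) - A^T Q S)] 1
    d_gh = np.sum(QS * (Omega * np.sum(S * R, axis=0) - A.T @ QS), axis=1) / k
    return d_defl + d_gh

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
d_hat = xdiag(A, k=200, rng=rng)
```

Note that, besides the $k$ products with $A$, this also uses products with $A^T$ (for $A^T Q$ and $A^T Q S$), just as in the formulas above.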
One drawback of exchangeability is that the deflation and GH measurements are now coupled: if we obtained $Q$ from $k$ measurements $\omega_i$, we can only use those same $k$ measurements for the GH estimator. Conversely, if we need $k$ GH measurements to achieve a certain accuracy, we are forced to keep $k$ vectors in memory. So if we need a larger $k$ but can't afford to store it, do we need to drop XDiag altogether and fall back to Hutch++?
In this section we observe that XDiag can be seen as a special case of Hutch++, where we get the first $k$ GH samples for basically free. This allows us to add more GH measurements on top of XDiag, without having to alter the deflation step and thus keeping memory within $\mathcal{O}(k)$.
Let $\Psi = (I - \frac{1}{k} S S^H)$ using $S$ and $Q = orth(A \Omega)$ from before. Then, the following estimator is unbiased:
$$ \begin{align*} diag(A) &= diag(Q \Psi Q^H A) + diag([I - Q \Psi Q^H] A) \\ &= \underbrace { diag(Q Q^H A) - \frac{1}{k}[QS \odot A^T \bar{Q} \bar{S}]1 } _{\text{Shared with XDiag!}} + \mathbb{E} _{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \\ \end{align*} $$
At first sight this new formulation seems to be a step back: the fresh measurements $\gamma$ are back in the picture, on top of the $k$ measurements in $\Omega$. But if we look more closely, this estimator shares two out of its three components with XDiag. And since both estimators are unbiased, we get:
$$
\begin{align*}
\underbrace{ \frac{1}{k} \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1 } _{\text{The non-shared component of XDiag}} \approx \mathbb{E} _{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big]
\end{align*}
$$
In other words, the “non-shared” component of XDiag estimates the diagonal of the deflated residual $(I - Q \Psi Q^H) A$, so we can interpret it as an initial estimate of that expectation.
And this also allows us to extend XDiag with $q$ more GH measurements without having to alter or increase $Q$:
$$ \begin{align*} \mathbb{E}_{\gamma \sim \mathcal{D}} &\big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \\ \approx &\frac{1}{k + q} \Big( \underbrace{ \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1} _{\text{XDiag initialization}} + \underbrace{ \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q \Psi Q^H) A \gamma_i } _{\text{deflated GH}} \Big) \end{align*} $$
Yielding a hybrid algorithm between XDiag and Hutch++:

$$ \begin{align*} diag(A) \approx{} &diag(Q Q^H A) - \frac{1}{k} \big[ Q S \odot A^T \bar{Q} \bar{S} \big] 1 \\ &+ \frac{1}{k + q} \Big( \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1 + \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q \Psi Q^H) A \gamma_i \Big) \end{align*} $$
Note how this is equivalent to XDiag for $q=0$. Conversely, it is equivalent to Hutch++ if we remove the initialization component. This allows us to set $k$ and $q$ independently, regulating the memory/runtime tradeoff! But does this work?
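Before turning to the experiment, here is a sketch of the hybrid estimator under the same assumptions as before ($s_i$ from normalized columns of $R^{-T}$, Rademacher test vectors, real dense $A$); this is my reading of the formula above rather than an official implementation:

```python
import numpy as np

def xdiag_hybrid(A, k, q, rng):
    """Hybrid diagonal estimate: XDiag with k recycled measurements, plus q extra
    GH measurements on the residual deflated by Q Psi Q^T (real case)."""
    N = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(N, k))
    Q, R = np.linalg.qr(A @ Omega)
    S = np.linalg.inv(R).T
    S /= np.linalg.norm(S, axis=0)
    QS = Q @ S
    # shared component: diag(Q Psi Q^T A), with Psi = I - (1/k) S S^T
    d_shared = np.sum(Q * (A.T @ Q), axis=1) - np.sum(QS * (A.T @ QS), axis=1) / k
    # XDiag initialization of the residual expectation (k recycled samples)
    resid = np.sum(QS * (Omega * np.sum(S * R, axis=0)), axis=1)
    # q additional deflated GH samples
    for _ in range(q):
        g = rng.choice([-1.0, 1.0], size=N)
        Ag = A @ g
        QtAg = Q.T @ Ag
        resid += g * (Ag - Q @ QtAg + QS @ (S.T @ QtAg) / k)  # (I - Q Psi Q^T) A g
    return d_shared + resid / (k + q)

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
d_hat = xdiag_hybrid(A, k=150, q=100, rng=rng)   # memory ~ O(k), runtime grows with q
```

For $q=0$ this reduces exactly to the XDiag sketch given earlier, as expected.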
To verify the above observations, we apply the different diagonal estimators to a random matrix and measure the recovery error (full source code in this gist and also here). Here is a fragment of the random matrix used in this study ($1000\times1000$, $rank=300$) and its singular values:
And here are the obtained recovery errors. The error metric is the normalized residual $\ell_2$ error (lower is better: 0 means perfect recovery, $\geq 1$ means worse than doing nothing), where $D$ is the actual diagonal and $\hat{D}$ the corresponding approximation:
$$ err_2(\hat{D}) = \frac{\lVert D - \hat{D} \rVert_2}{\lVert D \rVert_2} \in \mathbb{R}_{\geq 0} $$
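In code, the metric is simply the following (a minimal NumPy version; the actual experiment code lives in the linked gist):

```python
import numpy as np

def err2(D, D_hat):
    """Normalized l2 recovery error of a diagonal estimate."""
    return np.linalg.norm(D - D_hat) / np.linalg.norm(D)
```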
- GH with 400 measurements converges to around 50% error.
- Hutch++ and XTrace
- XDiag with just 200 measurements has about the same performance as $k=200$ deflation followed by 200 GH measurements. This shows that the measurements are effectively recycled.
- With 400 measurements, XDiag dominates, because effectively we are using 400 for deflation and 400 for GH.
- The hybrid estimator lets us add more GH measurements on top of XDiag, to get further improvements at no added cost in memory.

And that’s it! Thanks for your attention, even if I’m pretty sure it was all diagonal reading ↖️ 👀 👍
Special thanks to Ethan N. Epperly, the main XDiag author, for helping me get a better understanding of it. Be sure to check out his other work!