Combining XDiag with Hutch++

PUBLISHED ON OCT 21, 2025 — CATEGORIES: proofs

This post has two aims:

  1. Providing a linear and condensed review of scalable algorithms to estimate the diagonal of a linear operator, with particular focus on the XDiag algorithm (introduced in [ETW2024] and discussed in more detail in [E2025])
  2. Deriving XDiag in terms of the previously proposed Hutch++ [MMW2021], which leads to a combined algorithm giving us more control over the runtime/memory tradeoff

This is in a very practical spirit, as part of my own efforts developing the skerch library to perform sketched computations at scale. At the end I also include a brief experiment showing the above points, which is a matter of a few lines of code using skerch. For more rigorous discussion, see the sources listed below.

  • [HMT2010]: “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”. Halko, Martinsson, Tropp. 2010.
  • [MMW2021]: “Hutch++: Optimal Stochastic Trace Estimation”. Meyer, Musco, Musco, Woodruff. 2021.
  • [BN2022]: “Stochastic diagonal estimation: probabilistic bounds and an improved algorithm”. Baston, Nakatsukasa. 2022.
  • [ETW2024]: “XTrace: Making The Most Of Every Sample In Stochastic Trace Estimation”. Epperly, Tropp, Webber. 2024.
  • [E2025]: “Make the Most of What You Have: Resource-Efficient Randomized Algorithms for Matrix Computations”. Epperly. 2025.

Getting the diagonal from a large/slow linear operator

Consider a square, linear operator $A \in \mathbb{C}^{N \times N}$, of (potentially large) dimension $N$, where our only way to access its contents is via (potentially slow) linear measurements $v \to Av$. The task at hand is to estimate $diag(A)$.

The naïve approach is to perform $N$ measurements in the form $A_{ii} = e_i^T A e_i$, for $i \in \{1, 2, \dots, N\}$. There are two substantial problems with this approach:

  • It becomes intractable for large $N$, especially if measurements are slow.
  • If we try to cut corners by doing fewer measurements, the error can already be arbitrarily bad for $N-1$ of them (consider a diagonal that is all zeros except for the one entry we missed: the relative error is 100%).

Can we do better than this?


Stochastic diagonal estimation

Assumptions about the noise
For a simplified discussion, let’s assume isotropic random vectors $\omega \sim \mathcal{D}$ with iid entries of unit modulus, i.e. we have $\mathbb{E} _{\omega \sim \mathcal{D}} \big[ \omega \omega^H \big] = I$ and $\omega_i \bar{\omega}_i = 1$ for every entry (so $\mathcal{D}$ is iid Rademacher or phase noise).

The plan here is to do $k < N$ random measurements that globally capture the diagonal, yielding an estimate that gets better and better as we increase the number of measurements, hopefully becoming good enough well before reaching $N$. This is the essence of the Girard-Hutchinson (GH) diagonal estimator:

Girard-Hutchinson (GH)
$$ \mathbb{E} \big[ \bar{\omega} \odot A \omega \big]_i = \mathbb{E} \big[ \bar{\omega}_i \sum_j A _{ij} \omega_j \big] = \sum_j A _{ij} \underbrace{\mathbb{E} [\bar{\omega}_i \omega_j]} _{𝟙 _{i=j}} = A _{ii} $$

The advantage of the GH estimator is that it converges towards the actual diagonal values as $k$ increases. The problem is that convergence is slow: for an error tolerance of $\varepsilon$, we need $\mathcal{O}(1 / \varepsilon^2)$ measurements which, for a reasonably small $\varepsilon$, may very well be larger than $N$ and thus worse than the naïve approach!
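To make this concrete, here is a minimal NumPy sketch of the plain GH diagonal estimator for a real operator accessed only through matrix-vector products (the function and argument names are mine for illustration, not the skerch API):

```python
import numpy as np

def gh_diag(matvec, n, k, seed=None):
    """Plain Girard-Hutchinson diagonal estimate from k Rademacher measurements."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n)
    for _ in range(k):
        w = rng.choice([-1.0, 1.0], size=n)  # Rademacher test vector (real case)
        acc += w * matvec(w)                 # omega ⊙ (A omega); conjugation is a no-op here
    return acc / k

# Usage: diag_estimate = gh_diag(lambda v: A @ v, A.shape[0], k=100)
```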

The good news is that this convergence can be drastically improved to $\mathcal{O}(1 / \varepsilon)$ if we perform rank deflation (see [MMW2021] and [BN2022] for probabilistic bounds). Sketched rank-$k$ deflation works as follows:

  1. Sample $k$ random vectors $\Omega = \{\omega_1, \dots, \omega_k\}$
  2. Perform measurements $Y = A \Omega$
  3. Perform a QR decomposition: $Y = QR$, yielding matrix $Q = orth(A\Omega)$ with orthonormal columns
  4. Now $Q Q^H A$ is a pretty good rank-$k$ approximation of $A$ [HMT2010]
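As a minimal reference for steps 1–4, here is a NumPy sketch assuming a real matrix and Rademacher sketching vectors (deflation_basis is a hypothetical name, not part of skerch):

```python
import numpy as np

def deflation_basis(A, k, seed=None):
    """Steps 1-4: sample Omega, measure Y = A @ Omega, and orthonormalize via thin QR."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(n, k))  # k Rademacher sketching vectors
    Y = A @ Omega                                  # k measurements
    Q, R = np.linalg.qr(Y)                         # Q = orth(A @ Omega), Y = Q R
    return Q, R, Omega
```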

With this, we can express our rank-deflated variation of GH, also called Hutch++:

Hutch++
$$ \begin{align*} diag(A) &= diag(Q Q^H A) + diag([I - Q Q^H] A) \\ &= \underbrace{diag(Q Q^H A)}_{\text{exact}} + \underbrace{\mathbb{E} _{\gamma \sim \mathcal{D}} \big[\bar{\gamma} \odot (I - Q Q^H) A \gamma \big]} _{\text{Girard-Hutchinson}} \end{align*} $$
Thanks to the exact computation, the GH estimator is confined to the rank-deflated part, thus reducing the estimation error. But note how we are using new samples $\gamma$: Since $Q$ has the same range as $A \Omega$, we have $(I - Q Q^H) A \Omega = 0$ and thus we have to do two sets of measurements: one for deflation, and one for GH.
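Here is a minimal dense-matrix sketch of this two-stage estimator for a real $A$, with $k$ deflation measurements and $q$ fresh GH measurements $\gamma$ (again, hypothetical names and a simplified interface, not skerch's):

```python
import numpy as np

def hutchpp_diag(A, k, q, seed=None):
    """Hutch++-style diagonal estimate (real case): exact deflated part + GH on the residual."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(A @ Omega)                  # Q = orth(A @ Omega)
    exact = np.einsum("ij,ij->i", Q, A.T @ Q)       # diag(Q Q^T A), computed exactly
    Gamma = rng.choice([-1.0, 1.0], size=(n, q))    # q fresh GH samples gamma
    AG = A @ Gamma
    resid = AG - Q @ (Q.T @ AG)                     # (I - Q Q^T) A Gamma
    gh = np.einsum("ij,ij->i", Gamma, resid) / q    # GH estimate of the deflated diagonal
    return exact + gh
```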


Exchangeable estimators

Actually, let’s focus on that last point: it would be great if we could somehow recycle the measurements for both the deflation and the residual estimation! This is, in a nutshell, the core idea behind XDiag [ETW2024].

To formalize this, let’s define $\Omega_i$ as the “downdated” matrix of all measurement vectors $\omega$ except $\omega_i$. Then, we have the corresponding downdated orthogonal deflation matrix $Q_i = orth(A \Omega_i)$. And since $Q_i$ is now independent from $\omega_i$, we can still formulate a downdated and deflated Girard-Hutchinson estimator that is unbiased: $$ \begin{align*} diag_i(A) &= \underbrace{diag(Q_i Q_i^H A)}_{\text{Deflation: }k-1~\text{measurements}} + \underbrace{\bar{\omega}_i \odot ([I - Q_i Q_i^H] A \omega_i)} _{\text{GH: 1 measurement}} \end{align*} $$ Unbiased, but not necessarily good: note how the GH estimate of the deflated part now relies on a single measurement. The good news is that we have $k$ such estimators (one for each left-out index), so we can average them, leveraging the exchangeability principle: $diag(A) \approx \frac{1}{k} \sum _i diag _i(A)$. This way, we can fully recycle the deflation measurements for the GH estimation: no $\gamma$, just $\omega$ in the picture. This is still unbiased and leads to substantially better results, as we show at the end of this post (see also [E2025, Ch.16] for more exhaustive experiments).
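Before the efficient formulation derived below, here is a deliberately naive reference sketch of this leave-one-out average for a real $A$ (one QR per left-out column, so it is slow, but it shows the measurement recycling explicitly; names are hypothetical):

```python
import numpy as np

def xdiag_naive(A, k, seed=None):
    """Leave-one-out deflated GH average, recycling the same k measurements (reference only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(n, k))
    Y = A @ Omega                                        # the only k measurements we take
    est = np.zeros(n)
    for i in range(k):
        Qi, _ = np.linalg.qr(np.delete(Y, i, axis=1))    # Q_i = orth(A @ Omega_i), downdated
        defl = np.einsum("ij,ij->i", Qi, A.T @ Qi)       # diag(Q_i Q_i^T A)
        resid = Y[:, i] - Qi @ (Qi.T @ Y[:, i])          # (I - Q_i Q_i^T) A omega_i
        est += defl + Omega[:, i] * resid                # single-sample GH on the residual
    return est / k                                       # exchangeability: average the k estimators
```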

Another great realization from [ETW2024] is that, if we know the full $Q$, then for each downdated $\omega_i$ it is possible to efficiently find a vector $s_i$ such that $Q_i Q_i^H = Q (I - s_i s_i^H) Q^H$. This drastically simplifies the downdated estimator (derivation from [E2025], writing $r_i$ for the $i$-th column of $R$, so that $A \omega_i = Q r_i$): $$ \begin{align*} diag_i(A) &= diag(Q_i Q_i^H A) + \bar{\omega}_i \odot ([I - Q_i Q_i^H] A \omega_i) \\ &= diag(Q [I - s_i s_i^H] Q^H A) + \bar{\omega}_i \odot ([I - Q (I - s_i s_i^H) Q^H] A \omega_i) \\ &= diag(Q Q^H A) - Q s_i \odot A^T \bar{Q} \bar{s}_i \\ &\qquad \qquad + \bar{\omega}_i \odot \big[ \underbrace{(I - Q Q^H) A \omega_i} _{= 0} + Q s_i s_i^H \underbrace{Q^H A \omega_i} _{= Q^H Q r_i = r_i} \big] \\ &= diag(Q Q^H A) - Q s_i \odot A^T \bar{Q} \bar{s}_i + \bar{\omega}_i \odot Q s_i (s_i^H r_i) \\ &= diag(Q Q^H A) + Q s_i \odot \big[(s_i^H r_i) \bar{\omega}_i - A^T \bar{Q} \bar{s}_i \big] \end{align*} $$

And now we can average all estimators, yielding a computationally efficient expression for XDiag:

XDiag
$$ \begin{align*} diag(A) &\approx \frac{1}{k} \sum_i diag_i(A)\\ &= diag(Q Q^H A) + \frac{1}{k} \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) - A^T \bar{Q} \bar{S} \big) \big] 1 \end{align*} $$

Where $S$ is the matrix whose columns are the $s_i$ vectors. As mentioned, not only are the involved computations very succinct, but the same $\Omega$ measurements are used for both the deflation and the GH step, leading to increased sample efficiency. Great!
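Here is a compact NumPy sketch of this XDiag expression for a real $A$. The $s_i$ are taken as the normalized columns of $R^{-T}$, which satisfies the defining property $Q_i Q_i^H = Q(I - s_i s_i^H)Q^H$ in the real case. This is a dense, illustrative version, not the skerch implementation (which is more careful numerically):

```python
import numpy as np

def xdiag(A, k, seed=None):
    """XDiag diagonal estimate for a real matrix A (dense, illustrative version)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(n, k))
    Q, R = np.linalg.qr(A @ Omega)                  # A @ Omega = Q R
    S = np.linalg.inv(R).T                          # columns proportional to R^{-T} e_i
    S /= np.linalg.norm(S, axis=0)                  # normalize so I - s_i s_i^T is a projector
    QS = Q @ S
    dSR = np.einsum("ij,ij->j", S, R)               # entries s_i^T r_i
    exact = np.einsum("ij,ij->i", Q, A.T @ Q)       # diag(Q Q^T A)
    correction = (QS * (Omega * dSR - A.T @ QS)).sum(axis=1) / k
    return exact + correction
```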


Trading off runtime and memory

One drawback of exchangeability is that the deflation and GH measurements are now coupled: if we obtained $Q$ from $k$ measurements $\omega_i$, we can only use those measurements for the GH estimator. Conversely: if we need $k$ GH measurements to achieve a certain accuracy, we are forced to keep $k$ vectors in memory. So if we need a larger $k$ but we can’t store it, do we need to drop XDiag altogether and fall back to Hutch++?

In this section we observe that XDiag can be seen as a special case of Hutch++, where we get the first $k$ GH samples for basically free. This allows us to add more GH measurements on top of XDiag, without having to alter the deflation step and thus keeping memory within $\mathcal{O}(k)$.

Let $\Psi = (I - \frac{1}{k} S S^H)$ using $S$ and $Q = orth(A \Omega)$ from before. Then, the following estimator is unbiased:

$$ \begin{align*} diag(A) &= diag(Q \Psi Q^H A) + diag([I - Q \Psi Q^H] A) \\ &= \underbrace { diag(Q Q^H A) - \frac{1}{k}[QS \odot A^T \bar{Q} \bar{S}]1 } _{\text{Shared with XDiag!}} + \mathbb{E} _{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \\ \end{align*} $$

But this new formulation seems to be a step back!

  • Since $Q$ and $\Psi$ are correlated with $\Omega$, recycling the samples $\omega$ would introduce a bias: we are back to needing fresh, uncorrelated samples $\gamma$
  • For the same reason, the residual component $(I - Q \Psi Q^H) A \Gamma$ is not zero anymore and needs to be computed

But if we look more closely, this estimator shares 2 out of 3 components with XDiag. And since both are unbiased, we get: $$ \begin{align*} \underbrace{ \frac{1}{k} \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1 } _{\text{The non-shared component of XDiag}} \approx \mathbb{E} _{\gamma \sim \mathcal{D}} \big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \end{align*} $$

In other words, the “non-shared” component of XDiag estimates the diagonal of the deflated operator $(I - Q \Psi Q^H) A$, so we can interpret it as an initial estimate of that expectation. This also allows us to extend XDiag with $q$ more GH measurements without having to alter or enlarge $Q$:

$$ \begin{align*} \mathbb{E}_{\gamma \sim \mathcal{D}} &\big[ \bar{\gamma} \odot (I - Q \Psi Q^H) A \gamma \big] \\ \approx &\frac{1}{k + q} \Big( \underbrace{ \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1} _{\text{XDiag initialization}} + \underbrace{ \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q \Psi Q^H) A \gamma_i } _{\text{deflated GH}} \Big) \end{align*} $$

Yielding a hybrid algorithm between XDiag and Hutch++:

XDiag++
$$ \begin{align*} diag(A) \approx~ &diag(Q \Psi Q^H A) + \\ \frac{1}{k + q} &\Big( \big[ Q S \odot \big(\bar{\Omega}~diag(S^H R) \big) \big] 1 + \sum_{i=1}^{q} \bar{\gamma}_i \odot (I - Q \Psi Q^H) A \gamma_i \Big) \end{align*} $$

Note how this is equivalent to XDiag for $q=0$. Conversely, it is equivalent to Hutch++ if we remove the initialization component. This allows us to set $k$ and $q$ independently, regulating the memory/runtime tradeoff! But does this work?
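And here is the corresponding sketch of the hybrid, again for a real, dense $A$ (xdiagpp is a made-up name; the interface and the Rademacher choice are my own simplifications, not skerch's):

```python
import numpy as np

def xdiagpp(A, k, q, seed=None):
    """Hybrid XDiag / Hutch++ diagonal estimate: k coupled measurements + q extra GH samples."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.choice([-1.0, 1.0], size=(n, k))
    Q, R = np.linalg.qr(A @ Omega)
    S = np.linalg.inv(R).T                          # s_i: normalized columns of R^{-T}
    S /= np.linalg.norm(S, axis=0)
    QS = Q @ S
    dSR = np.einsum("ij,ij->j", S, R)               # entries s_i^T r_i
    # Shared part: diag(Q Psi Q^T A), with Psi = I - (1/k) S S^T
    exact = np.einsum("ij,ij->i", Q, A.T @ Q) - (QS * (A.T @ QS)).sum(axis=1) / k
    # XDiag initialization: worth k single-sample estimates of the deflated GH expectation
    init = (QS * (Omega * dSR)).sum(axis=1)
    # q extra GH measurements on the deflated operator (I - Q Psi Q^T) A
    Gamma = rng.choice([-1.0, 1.0], size=(n, q))
    AG = A @ Gamma
    defl = AG - Q @ (Q.T @ AG) + QS @ (S.T @ (Q.T @ AG)) / k
    gh = np.einsum("ij,ij->i", Gamma, defl)
    return exact + (init + gh) / (k + q)
```

For $q = 0$ this reduces to the XDiag sketch above, while dropping the init term gives back a Hutch++-style estimator on the $\Psi$-deflated operator.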


A small experiment

To verify the above observations, we apply the different diagonal estimators to a random matrix and measure the recovery error (full source code in this gist and also here). Here is a fragment of the random matrix used in this study ($1000\times1000$, $rank=300$) and its singular values:

And here are the obtained recovery errors. The error metric is the normalized residual $\ell_2$ error (lower is better; 0 means perfect recovery, $\geq 1$ means worse than doing nothing), where $D$ is the actual diagonal and $\hat{D}$ the corresponding approximation:

$$ err_2(\hat{D}) = \frac{\lVert D - \hat{D} \rVert_2}{\lVert D \rVert_2} \in \mathbb{R}_{\geq 0} $$
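For reference, here is the metric in code, together with a toy setup in the same spirit (note that the matrix below is a synthetic stand-in with an arbitrary spectrum that I made up, not the one from the gist):

```python
import numpy as np

def err2(d_true, d_est):
    """Normalized residual l2 error: 0 is perfect, >= 1 is worse than the all-zeros guess."""
    return np.linalg.norm(d_true - d_est) / np.linalg.norm(d_true)

# Synthetic 1000 x 1000 matrix of rank 300 (placeholder spectrum, not the post's matrix):
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((1000, 300)))
V, _ = np.linalg.qr(rng.standard_normal((1000, 300)))
A = (U * np.linspace(100.0, 1.0, 300)) @ V.T
# Any of the estimators sketched above can now be scored, e.g.:
# err2(np.diag(A), xdiag(A, k=200))
```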

Main takeaways
  • Plain GH with 400 measurements converges to around 50% error
  • The diagonal of the top-$k$ deflation is worse than GH for $k=200$, but better for $k=400$
  • We see how adding 200 GH measurements on top of the $k=200$ deflation helps, although in this case plain GH turns out to be better: In [E2025] it is mentioned that for problems with sufficiently slow spectral decay, plain GH can be more accurate than Hutch++ and XTrace
  • We also see that spending all 400 measurements on deflation and then recycling them for GH is the same as using just the deflation, since $(I - Q Q^H) A \Omega = 0$
  • XDiag with just 200 measurements has about the same performance as a $k=200$ deflation followed by 200 fresh GH measurements. This shows that the measurements are effectively recycled
  • If we now grant XDiag the baseline budget of 400 measurements, it dominates, because we are effectively using 400 for deflation and 400 for GH
  • Finally, we see how we can run $q$ additional GH measurements on top of the $k$ used for XDiag, to get further improvements at no added cost in memory

And that’s it! Thanks for your attention, even if I’m pretty sure it was all diagonal reading ↖️ 👀 👍

Special thanks to Ethan N. Epperly, the main XDiag author, for helping me get a better understanding of it. Be sure to check out his other work!

TAGS: algebra, probability, proof, random matrices, skerch, sketching