Automatic differentiation rules for linear algebra

Author

Jutho Haegeman

Published

October 27, 2023

In this document we derive the forward and reverse chain rule for a number of common matrix factorisations. In particular, we also focus on partial or incomplete factorisations, as they appear in Krylov algorithms, as well as on truncated (low-rank) factorisations. We first introduce some important concepts that will be used in several of those rules.

1 Preliminaries

1.1 Complex variables and chain rules

Consider a scalar function \(f\), depending on complex parameters \(z^k = x^k + {\mathrm{i}}y^k\), that is real-valued (and can thus not be holomorphic). Given a trajectory \(z^k(t)\) with derivatives \(\dot{z}^k = \dot{x}^k + {\mathrm{i}}\dot{y}^k\), we can write

\[\begin{align} \dot{f} &= \sum_{k} \frac{\partial f}{\partial x^k} \dot{x}^k + \frac{\partial f}{\partial y^k} \dot{y}^k \\ &= \sum_{k,l} \frac{\partial f}{\partial z^l} \left(\frac{\partial z^l}{\partial x^k} \dot{x}^k + \frac{\partial z^l}{\partial y^k} \dot{y}^k\right) + \frac{\partial f}{\partial \bar{z}^l} \left(\frac{\partial \bar{z}^l}{\partial x^k} \dot{x}^k + \frac{\partial \bar{z}^l}{\partial y^k} \dot{y}^k\right)\\ &= \sum_{k} \frac{\partial f}{\partial z^k} \dot{z}^k + \frac{\partial f}{\partial \bar{z}^k} \dot{\bar{z}}^k \end{align}\] Here, we have that \[\begin{align} \frac{\partial f}{\partial z^k} &= \frac{1}{2}\left(\frac{\partial f}{\partial x^k} - {\mathrm{i}}\frac{\partial f}{\partial y^k}\right)\\ \frac{\partial f}{\partial \bar{z}^k} &= \frac{1}{2}\left(\frac{\partial f}{\partial x^k} + {\mathrm{i}}\frac{\partial f}{\partial y^k}\right) = \overline{\frac{\partial f}{\partial z^k}} \end{align}\] where the final equality only holds because \(f\) is assumed real. Typically, we can compute \(\frac{\partial f}{\partial z^k}\) easily without explicitly going to real and imaginary components. We thus find \[\begin{align} \dot{f} = \sum_{k} \frac{\partial f}{\partial z^k} \dot{z}^k + \overline{\frac{\partial f}{\partial z^k}} \overline{\dot{z}^k} = 2 \mathop{\mathrm{Re}}\left(\sum_{k} \frac{\partial f}{\partial z^k} \dot{z}^k\right) = 2 \mathop{\mathrm{Re}}\left(\sum_{k} \overline{\frac{\partial f}{\partial \bar{z}^k}} \dot{z}^k\right) = 2 \mathop{\mathrm{Re}}\left(\boldsymbol{\breve{z}}^\dagger \boldsymbol{\dot{z}}\right) \end{align}\] with thus \(\boldsymbol{\dot{z}}\) the vector with components \(\dot{z}^k\) and \(\boldsymbol{\breve{z}}\) the vector with components \(\breve{z}_k=\frac{\partial f}{\partial \bar{z}^k}\).

When the coefficients \(z^k \cong Z^{(i,j)}\) would be the components of a matrix \(Z\), we would rather write this as \[\begin{align} \dot{f} = 2 \mathop{\mathrm{Re}}\left(\sum_{k} \overline{\frac{\partial f}{\partial \bar{z}^k}} \dot{z}^k\right) = 2 \mathop{\mathrm{Re}}\left(\sum_{i,j} \overline{\frac{\partial f}{\partial \bar{Z}^{i,j}}} \dot{Z}^{i,j}\right) = 2 \mathop{\mathrm{Re}}\left(\mathop{\mathrm{Tr}}\left[\breve{Z}^\dagger \dot{Z}\right]\right) = 2\mathop{\mathrm{RTr}}\left[\breve{Z}^\dagger \dot{Z}\right] \end{align}\] where I have introduced the new symbol \(\mathop{\mathrm{RTr}}\) for the real part of the trace.
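Relations like \(\dot{f} = 2\mathop{\mathrm{RTr}}\left[\breve{Z}^\dagger \dot{Z}\right]\) are easy to verify numerically. The following is a minimal sketch in Python with NumPy; the helper `rtr` and the illustrative choice \(f(Z) = \mathop{\mathrm{Tr}}[Z^\dagger Z]\), for which \(\breve{Z} = Z\), are our own and not part of any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(Z):
    # real-valued, non-holomorphic function of a complex matrix: tr(Z† Z)
    return np.linalg.norm(Z)**2

def rtr(M):
    # real part of the trace, the RTr symbol from the text
    return np.trace(M).real

Z = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
Zdot = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))

# cotangent: Z̆_{ij} = ∂f/∂Z̄_{ij} = Z_{ij} for f = tr(Z† Z)
Zbreve = Z

# chain rule prediction: ḟ = 2 RTr(Z̆† Ż)
fdot_formula = 2 * rtr(Zbreve.conj().T @ Zdot)

# central finite difference along the trajectory Z + t Ż
h = 1e-6
fdot_numeric = (f(Z + h * Zdot) - f(Z - h * Zdot)) / (2 * h)

assert abs(fdot_formula - fdot_numeric) < 1e-4
```

The same template, with `f` and its cotangent replaced, applies to all rules derived below.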

Now consider the composition where \(f\) is itself a function of complex parameters \(\boldsymbol{w}\), which are themselves complex (not necessarily holomorphic) functions of other complex parameters \(\boldsymbol{z}\). We can then write

\[\begin{align} \dot{f} = \sum_{k,l} \begin{bmatrix} \overline{\breve{w}_k} & \breve{w}_k\end{bmatrix}\begin{bmatrix} \frac{\partial w^k}{\partial z^l} & \frac{\partial w^k}{\partial \bar{z}^l}\\ \frac{\partial \bar{w}^k}{\partial z^l} & \frac{\partial \bar{w}^k}{\partial \bar{z}^l} \end{bmatrix} \begin{bmatrix} \dot{z}^l \\ \overline{\dot{z}^l} \end{bmatrix} = \sum_{k,l} \begin{bmatrix} \breve{w}_k \\ \overline{\breve{w}_k}\end{bmatrix}^\dagger \begin{bmatrix} \frac{\partial w^k}{\partial z^l} & \frac{\partial w^k}{\partial \bar{z}^l}\\ \frac{\partial \bar{w}^k}{\partial z^l} & \frac{\partial \bar{w}^k}{\partial \bar{z}^l} \end{bmatrix} \begin{bmatrix} \dot{z}^l \\ \overline{\dot{z}^l} \end{bmatrix} \end{align}\] In this case (where \(w\) is complex), the entries of the Jacobian matrix are related via \(\overline{\frac{\partial w^k}{\partial z^l}} = \frac{\partial \bar{w}^k}{\partial \bar{z}^l}\) and \(\overline{\frac{\partial w^k}{\partial \bar{z}^l}} = \frac{\partial \bar{w}^k}{\partial z^l}\). Trying to rewrite the above expression in the standard form \(2\mathop{\mathrm{Re}}(\boldsymbol{\breve{z}}^\dagger \boldsymbol{\dot{z}})\), we find

\[\begin{align} \dot{f} &= \sum_{l} \begin{bmatrix} \left(\sum_k \overline{\breve{w}_k} \frac{\partial w^k}{\partial z^l} + \breve{w}_k \frac{\partial \bar{w}^k}{\partial z^l}\right) & \left(\sum_k \overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l} + \breve{w}_k \frac{\partial \bar{w}^k}{\partial \bar{z}^l}\right) \end{bmatrix} \begin{bmatrix} \dot{z}^l \\ \overline{\dot{z}^l} \end{bmatrix} \\ &= \sum_{l} \begin{bmatrix} \sum_k \breve{w}_k \overline{\frac{\partial w^k}{\partial z^l}} + \overline{\breve{w}_k} \overline{\frac{\partial \bar{w}^k}{\partial z^l}} \\ \sum_k \breve{w}_k \overline{\frac{\partial w^k}{\partial \bar{z}^l}} + \overline{\breve{w}_k} \overline{\frac{\partial \bar{w}^k}{\partial \bar{z}^l}} \end{bmatrix}^\dagger \begin{bmatrix} \dot{z}^l \\ \overline{\dot{z}^l} \end{bmatrix} \\ &= \sum_{l} \begin{bmatrix} \sum_k \breve{w}_k \frac{\partial \bar{w}^k}{\partial \bar{z}^l} + \overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l} \\ \sum_k \overline{\breve{w}_k} \overline{\frac{\partial \bar{w}^k}{\partial \bar{z}^l}} + \breve{w}_k \overline{\frac{\partial w^k}{\partial \bar{z}^l}}\end{bmatrix}^\dagger \begin{bmatrix} \dot{z}^l \\ \overline{\dot{z}^l} \end{bmatrix} \qquad \begin{matrix}\text{(by substituting ${\overline{\frac{\partial \bar{w}^k}{\partial z^l}}}\to\frac{\partial w^k}{\partial \bar{z}^l}$)}\\ \text{(by switching the order of the two terms)}\end{matrix}\\ &= 2\mathop{\mathrm{Re}}\left(\sum_l \left[\sum_{k} \breve{w}_k \frac{\partial \bar{w}^k}{\partial \bar{z}^l} + \overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l}\right]^\dagger \dot{z}^l\right) = 2\mathop{\mathrm{Re}}(\boldsymbol{\breve{z}}^\dagger\dot{z}) \end{align}\]

with thus \(\breve{z}_l = \sum_k \left[\breve{w}_k \frac{\partial \bar{w}^k}{\partial \bar{z}^l} + \overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l}\right]\). The same result would have been obtained from inserting \(\dot{w}^k = \frac{\partial w^k}{\partial z^l}\dot{z}^l + \frac{\partial w^k}{\partial \bar{z}^l}\dot{\bar{z}}^l\) in \(\dot{f} = 2\mathop{\mathrm{Re}}(\boldsymbol{\breve{w}}^\dagger \boldsymbol{\dot{w}})\) as

\[\begin{align} \dot{f} &= 2\mathop{\mathrm{Re}}(\boldsymbol{\breve{w}}^\dagger \boldsymbol{\dot{w}}) = 2\mathop{\mathrm{Re}}\left(\sum_k \overline{\breve{w}_k} \dot{w}^k\right)\\ &= 2\mathop{\mathrm{Re}}\left(\sum_{k,l} \overline{\breve{w}_k} \frac{\partial w^k}{\partial z^l} \dot{z}^l\right) + 2\mathop{\mathrm{Re}}\left(\sum_{k,l} \overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l} \overline{\dot{z}^l}\right)\\ &= 2\mathop{\mathrm{Re}}\left(\sum_{k,l} \overline{\breve{w}_k \overline{\frac{\partial w^k}{\partial z^l}}} \dot{z}^l\right) + 2\mathop{\mathrm{Re}}\left(\sum_{k,l} \overline{\overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l}} \dot{z}^l\right) \end{align}\] where in the second term we have used that we can conjugate the whole argument of the \(\mathop{\mathrm{Re}}\) function. Hence, we again obtain

\[\begin{align} \dot{f} = 2\mathop{\mathrm{Re}}\left(\sum_{k,l} \overline{\left[\breve{w}_k \overline{\frac{\partial w^k}{\partial z^l}}+\overline{\breve{w}_k} \frac{\partial w^k}{\partial \bar{z}^l}\right]} \dot{z}^l\right) \end{align}\]

which (not unexpectedly) yields the same result for \(\breve{z}_l\), upon substituting \(\overline{\frac{\partial w^k}{\partial z^l}} = \frac{\partial \bar{w}^k}{\partial \bar{z}^l}\). The reverse rules required by automatic differentiation are exactly of the form that relate \(\breve{z}_l\) to \(\breve{w}_k\) (and possibly its conjugate). The way we derive them in practice is by starting from \(\mathop{\mathrm{Re}}\left(\sum_k \overline{\breve{w}_k} \dot{w}^k\right)\) and inserting the (easier) forward rule that expresses \(\dot{w}^k\) in terms of \(\dot{z}^l\) (and possibly its conjugate), and then using the fact that we can conjugate within the argument of \(\mathop{\mathrm{Re}}\) or \(\mathop{\mathrm{RTr}}\) in order to isolate the contribution that is contracted with \(\dot{z}^l\).
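As a small numerical illustration of this recipe, consider the hypothetical non-holomorphic inner function \(w(z) = z^2 + c\bar{z}\) with the objective \(f = |w|^2\), so that \(\breve{w} = w\); the resulting \(\breve{z}\) can be checked against finite differences:

```python
import numpy as np

c = 0.7 + 0.3j
z = 0.4 - 1.1j

def w(z):
    # non-holomorphic inner function
    return z**2 + c * np.conj(z)

def f(z):
    # real-valued objective
    return abs(w(z))**2

# cotangent with respect to w: w̆ = ∂f/∂w̄ = w, and the reverse chain rule
# z̆ = w̆ ∂w̄/∂z̄ + conj(w̆) ∂w/∂z̄, with ∂w/∂z = 2z (so ∂w̄/∂z̄ = 2z̄) and ∂w/∂z̄ = c
wb = w(z)
zb = wb * 2 * np.conj(z) + np.conj(wb) * c

# check ḟ = 2 Re(conj(z̆) ż) against central differences for ż = 1 and ż = i
h = 1e-6
for zdot in (1.0, 1j):
    fd = (f(z + h * zdot) - f(z - h * zdot)) / (2 * h)
    assert abs(fd - 2 * (np.conj(zb) * zdot).real) < 1e-6
```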

1.2 Matrices and their tangents and cotangents

We will be working with complex matrices \(A \in {\mathbb{C}}^{m \times n}\). Sometimes these matrices satisfy additional constraints, which also constrains their tangents \(\dot{A}\). Given that the constraints on the tangent vectors are always linear (in contrast to those on the variables themselves), we can typically satisfy them explicitly by choosing a proper parameterisation.

Constraints on the parameters do not directly impose constraints on the adjoint/dual variables \(\breve{A}\) (cotangents). However, in the way they couple to the tangents, they will be automatically projected into the relevant subspace.

Note that the coupling between tangents and cotangents takes the form \(\mathop{\mathrm{RTr}}(\breve{A}^\dagger \dot{A})\), which can be read as an inner product \({\left\langle A,B\right\rangle} = \mathop{\mathrm{RTr}}(A^\dagger B)\). However, because we are taking the real part, this corresponds to treating the vector space \({\mathbb{C}}^{m \times n}\) of complex matrices as a real vector space. Indeed, the inner product can be rewritten as the standard (real) Euclidean inner product

\[{\left\langle A,B\right\rangle} = \sum_{i=1}^m\sum_{j=1}^n \mathop{\mathrm{Re}}(A_{ij}) \mathop{\mathrm{Re}}(B_{ij}) + \mathop{\mathrm{Im}}(A_{ij})\mathop{\mathrm{Im}}(B_{ij}).\]

This has a number of interesting implications.

1.2.1 Diagonal and triangular matrices

The easiest classes of special matrices are those where some of the entries are zero, such as diagonal or upper triangular matrices. Because of the structure of the inner product, only the corresponding entries of the dual variables will contribute. In fact, because of the structure of the real-valued inner product that we work with, we can even impose the real or imaginary component of an entry to be zero separately, and this information will also propagate onto the dual variables.

We can make this explicit by introducing specific projection (super)operators \({\mathbb{C}}^{m\times n} \to {\mathbb{C}}^{m\times n}\) onto the relevant subspaces that we need below, namely \[\begin{align} \mathcal{P}_L(A)_{k,l} &= \begin{cases} A_{k,l},& k > l\\ 0,& k\leq l\end{cases}\\ \mathcal{P}_U(A)_{k,l} &= \begin{cases} A_{k,l},& k < l\\ 0,& k\geq l\end{cases}\\ \mathcal{P}_{rD}(A)_{k,l} &= \begin{cases} \mathop{\mathrm{Re}}(A_{k,k}),& k = l\\ 0,& k\neq l\end{cases}\\ \mathcal{P}_{iD}(A)_{k,l} &= \begin{cases} {\mathrm{i}}\mathop{\mathrm{Im}}(A_{k,k}),& k = l\\ 0,& k\neq l\end{cases}\\ \mathcal{P}_{D}(A)_{k,l} &= (\mathcal{P}_{rD}+\mathcal{P}_{iD})(A)_{k,l} \\ &= \begin{cases} A_{k,k},& k = l\\ 0,& k\neq l\end{cases} \end{align}\] All of those operators are self-adjoint with respect to the (real) inner product, i.e.

\[\mathop{\mathrm{RTr}}\left[A^\dagger \mathcal{P}(B)\right] = \mathop{\mathrm{RTr}}\left[\mathcal{P}(A)^\dagger B\right]\]

This will be important to derive the reverse rules, and it shows that the cotangents will be automatically projected onto the subspace where the tangents are defined.
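These projectors and their self-adjointness are straightforward to check numerically. A minimal NumPy sketch follows; the function names `P_L`, `P_U`, `P_rD`, `P_iD` and `rtr` mirror the symbols of the text but are otherwise our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def P_L(A):
    # strictly lower triangular part
    return np.tril(A, -1)

def P_U(A):
    # strictly upper triangular part
    return np.triu(A, 1)

def P_rD(A):
    # real part of the diagonal
    return np.diag(np.diag(A).real)

def P_iD(A):
    # imaginary part of the diagonal
    return np.diag(1j * np.diag(A).imag)

def rtr(M):
    # real part of the trace
    return np.trace(M).real

A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# self-adjointness with respect to the real inner product:
# RTr(A† P(B)) = RTr(P(A)† B) for each projector P
for P in (P_L, P_U, P_rD, P_iD):
    assert abs(rtr(A.conj().T @ P(B)) - rtr(P(A).conj().T @ B)) < 1e-12
```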

Closely related to this is the remark that we made above about the dagger. We can define a dagger superoperator \(\mathcal{D}:{\mathbb{C}}^{m \times n} \to {\mathbb{C}}^{n\times m}: A\mapsto \mathcal{D}(A)=A^\dagger\). Note that this is not a single (super)operator, but one for every matrix size, and in case of rectangular matrices, it maps to matrices of the transposed size. It is a complex antilinear operator, but with respect to the real vector space structure, it is real linear and we can write, for all \(A\in {\mathbb{C}}^{m \times n}\) and \(B\in {\mathbb{C}}^{n \times m}\),

\[\mathop{\mathrm{RTr}}(B^\dagger \mathcal{D}(A))= \mathop{\mathrm{RTr}}(B^\dagger A^\dagger) = \mathop{\mathrm{RTr}}(BA) = \mathop{\mathrm{RTr}}(\mathcal{D}(B)^\dagger A).\]

We can then define two more projection superoperators on the spaces of square matrices \(A\in {\mathbb{C}}^{n\times n}\), namely those that project onto the hermitian and antihermitian part as \[\begin{align} \mathcal{P}_{H}(A) &= \frac{\mathcal{I}+\mathcal{D}}{2}(A) = \frac{A+A^\dagger}{2}\\ \mathcal{P}_{A}(A) &= \frac{\mathcal{I}-\mathcal{D}}{2}(A) = \frac{A-A^\dagger}{2} \end{align}\] which also satisfy

\[\mathop{\mathrm{RTr}}(B^\dagger \mathcal{P}_{H/A}(A)) = \mathop{\mathrm{RTr}}(\mathcal{P}_{H/A}(B)^\dagger A)\]

1.2.2 Unitary matrices

We now turn to matrix constraints which are nonlinear, and give rise to known manifolds of special matrices. The first relevant case is a unitarity constraint, which requires \(m=n\): \[U^\dagger U = I_n = U U^\dagger\]

For the tangent directions, we find in this case that \[U^\dagger \dot{U} + \dot{U}^\dagger U=O.\]

For the tangent directions, the constraint has thus become linear and can easily be satisfied by parameterising \(\dot{U} = U\dot{K}\) where \(\dot{K}=-\dot{K}^\dagger\), i.e. \(\dot{K}\) is an antihermitian matrix, parameterised by \(n(n-1)/2\) complex parameters below the diagonal (and their conjugates above the diagonal) and \(n\) purely imaginary parameters on the diagonal.
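This parameterisation is easy to validate numerically. In the following sketch, a random unitary is obtained as the Q-factor of a random complex matrix, and \(\dot{U} = U\dot{K}\) with antihermitian \(\dot{K}\) is checked against the linearised constraint:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4

# a random unitary U (Q-factor of a square random complex matrix)
U, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

# an antihermitian K̇ built from an arbitrary complex matrix
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Kdot = (X - X.conj().T) / 2

# the tangent U̇ = U K̇ satisfies the linearised unitarity constraint
Udot = U @ Kdot
constraint = U.conj().T @ Udot + Udot.conj().T @ U
assert np.linalg.norm(constraint) < 1e-12
```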

1.2.3 Isometric matrices

Isometric matrices are rectangular matrices with \(m \geq n\) that satisfy

\[Q^\dagger Q = I_n\]

In this case, \(P=QQ^\dagger\) is not the identity, but an orthogonal projector (\(P^2 =P=P^\dagger\)).

We can think of \(Q\) as the first \(n\) columns of a unitary \((m \times m)\) matrix

\[U = \begin{bmatrix} Q & Q_\perp \end{bmatrix}\]

where \(Q_\perp\) are an additional \((m-n)\) orthonormal columns that complete the unitary matrix, and thus satisfy \(Q^\dagger Q_\perp = O\) and \(Q_\perp^\dagger Q_\perp = I_{m-n}\). It furthermore holds that \(Q_\perp Q_\perp^\dagger = I_m - QQ^\dagger\).

If we parameterise the tangent of \(U\) as \[\begin{align} \begin{bmatrix} \dot{Q} & \dot{Q}_\perp\end{bmatrix} = \begin{bmatrix} Q & Q_\perp\end{bmatrix}\begin{bmatrix} \dot{K} & -\dot{K}_\perp^\dagger \\ \dot{K}_\perp & \dot{K}_{\perp\perp}\end{bmatrix} \end{align}\] with \(\dot{K}=-\dot{K}^\dagger\) (and also \(\dot{K}_{\perp\perp}=-\dot{K}_{\perp\perp}^\dagger\)), then we find \[\begin{align} \dot{Q} = Q \dot{K} + Q_{\perp} \dot{K}_\perp = Q \dot{K} + \dot{L} \end{align}\] where \(\dot{L}\) only needs to satisfy \(Q^\dagger \dot{L}=O\). In practice, both representations of \(\dot{Q}\) can be useful, depending on the situation:

  • If we have \(n \approx m\) (e.g. \(n=m/2\)), it can be useful to explicitly determine an \(m \times (m-n)\) matrix \(Q_\perp\) that completes \(Q\) to a unitary matrix, and to use the parameterisation \(\dot{Q} = Q \dot{K} + Q_{\perp} \dot{K}_\perp\), where the \((m-n) \times n\) matrix \(\dot{K}_\perp\) is completely unconstrained.

  • On the other hand, if \(n \ll m\), then it might be undesirable or even infeasible to compute and manipulate a matrix \(Q_\perp\) of size \(m \times (m-n)\). In that case, it is more efficient to use the parameterisation \(Q \dot{K} + \dot{L}\), where \(\dot{L}\) also only has size \((m\times n)\) (like \(Q\) and \(\dot{Q}\)), but does need to satisfy the requirement that \(Q^\dagger \dot{L} = 0\). Even though we can exploit this condition on \(\dot{L}\) in order to derive or simplify the equations that it needs to satisfy, we have to make sure that the final equation that determines \(\dot{L}\) (which will always be a linear equation) is such that it either explicitly imposes the condition \(Q^\dagger \dot{L} = 0\), or is solved in such a way that the solution does satisfy this condition. Failure to do so will typically result in the linear system for \(\dot{L}\) having a nontrivial kernel and thus being singular.
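The second parameterisation can be checked with a few lines. In this NumPy sketch, an arbitrary matrix is projected onto the orthogonal complement of \(Q\) to produce a valid \(\dot{L}\), and the linearised isometry constraint is verified for \(\dot{Q} = Q\dot{K} + \dot{L}\):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 2

# a random isometry Q (Q† Q = I_n) from a thin QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n)))

# an antihermitian K̇
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Kdot = (X - X.conj().T) / 2

# L̇: an arbitrary m×n matrix projected onto the complement of Q, so Q† L̇ = 0
Y = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Ldot = Y - Q @ (Q.conj().T @ Y)

Qdot = Q @ Kdot + Ldot
# linearised isometry constraint: Q† Q̇ + Q̇† Q = 0
assert np.linalg.norm(Q.conj().T @ Ldot) < 1e-12
assert np.linalg.norm(Q.conj().T @ Qdot + Qdot.conj().T @ Q) < 1e-12
```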

1.3 Linear problems with matrices

Another common situation that we encounter is that the matrix of interest, for example the tangent \(\dot{A}\) of some matrix, is only specified implicitly as the solution of a linear system. Indeed, the matrix factorisations are written down as equations for the final solution (rather than as the algorithm that gave rise to them). Differentiating such defining equations gives rise to (sometimes coupled) linear equations that need to be solved.

1.3.1 Sylvester equation and Hadamard product

The most common type of linear system for a matrix quantity \(X\) that we encounter takes the form of a Sylvester equation

\[ A X - X B = C\]

where thus \(X\) is the unknown. Let us denote the left hand side as the (Sylvester) superoperator \(\mathcal{S}_{A,B}(X) = AX - XB\). For \(X\in{\mathbb{C}}^{m \times n}\), it is assumed that \(A\in{\mathbb{C}}^{m\times m}\) and \(B \in {\mathbb{C}}^{n \times n}\), so that \(\mathcal{S}_{A,B}\) is indeed a linear (super)operator \({\mathbb{C}}^{m\times n} \to {\mathbb{C}}^{m\times n}\), i.e. it maps matrices to matrices of the same size.

The solution of the linear system is then given by

\[X = \mathcal{S}_{A,B}^{-1}(C)\]

provided the inverse of the Sylvester superoperator exists. The Sylvester superoperator has a nontrivial kernel (and is thus not invertible) whenever \(A\) and \(B\) have a common eigenvalue \(\lambda\). Indeed, with \(A\boldsymbol{v}=\lambda \boldsymbol{v}\) and \(\boldsymbol{w}^\dagger B = \lambda \boldsymbol{w}^\dagger\), it is clear that \(\mathcal{S}_{A,B}(\boldsymbol{v}\boldsymbol{w}^\dagger) = O\).
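The rank-one kernel element is easy to exhibit explicitly. In the following sketch, \(A\) and \(B\) are chosen to share the eigenvalue \(2\), and the corresponding \(\boldsymbol{v}\boldsymbol{w}^\dagger\) is annihilated by \(\mathcal{S}_{A,B}\):

```python
import numpy as np

# A and B share the eigenvalue 2 (first diagonal entry of each)
A = np.diag([2.0, 3.0])
B = np.diag([2.0, 5.0])

def S(X):
    # the Sylvester superoperator S_{A,B}
    return A @ X - X @ B

v = np.array([[1.0], [0.0]])  # eigenvector of A for eigenvalue 2
w = np.array([[1.0], [0.0]])  # left eigenvector of B for eigenvalue 2

K = v @ w.conj().T  # rank-one element of the kernel
assert np.linalg.norm(S(K)) < 1e-14
```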

If both \(A\) and \(B\) can be diagonalised as \(A=V_AD_AV_{A}^{-1}\) and \(B=V_BD_BV_{B}^{-1}\), a simple way to obtain the solution of the Sylvester equation is by transforming it to \[\begin{align} D_{A} \tilde{X} - \tilde{X}D_{B} = \tilde{C} \end{align}\] with \(\tilde{X} = V_{A}^{-1}XV_{B}\) and analogously for \(\tilde{C}\). In this case, the \(i,j\) component of this equation reads

\[(\lambda_i - \mu_j) \tilde{X}_{ij} = \tilde{C}_{ij}\]

with \((D_A)_{ii} = \lambda_i\) and \((D_{B})_{jj}=\mu_j\) the respective eigenvalues. Here too, the importance of \(A\) and \(B\) not having coinciding eigenvalues is clear, unless the corresponding \(\tilde{C}_{ij} = 0\) as well. In the above form, we can solve the equation and find

\[ \tilde{X}_{ij} = \frac{\tilde{C}_{ij}}{\lambda_i - \mu_j}\]

We will typically denote this

\[\tilde{X}=\tilde{C} \odot F\]

where \(F_{ij} = \frac{1}{\lambda_i - \mu_j}\) and \(\odot\) is the pointwise or Hadamard product, satisfying

\[(A\odot B)_{ij} = (B\odot A)_{ij} = A_{ij} B_{ij}.\]
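The diagonalisation-based solution can be sketched in a few lines of NumPy; this is an illustrative brute-force implementation, not a production algorithm (which would rather use a Bartels-Stewart-type approach):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 3

A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

# diagonalise A and B
lamA, VA = np.linalg.eig(A)
muB, VB = np.linalg.eig(B)

# transform C, solve entrywise via the Hadamard product, transform back
Ctil = np.linalg.solve(VA, C @ VB)
F = 1.0 / (lamA[:, None] - muB[None, :])
Xtil = Ctil * F
X = VA @ Xtil @ np.linalg.inv(VB)

# X solves the Sylvester equation A X - X B = C
assert np.linalg.norm(A @ X - X @ B - C) < 1e-8
```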

Finally, we investigate how these operations behave with respect to the real inner product. Note that the Sylvester superoperator acts as

\[\begin{align} \mathop{\mathrm{RTr}}(X^\dagger \mathcal{S}_{A,B}(Y)) &= \mathop{\mathrm{RTr}}(X^\dagger (AY-YB)) \\ &= \mathop{\mathrm{RTr}}((A^\dagger X - XB^\dagger)^\dagger Y)\\ &= \mathop{\mathrm{RTr}}(\mathcal{S}_{A^\dagger,B^\dagger}(X)^\dagger Y) \end{align}\]

As a consequence, when we have a variable \(\dot{X} = \mathcal{S}_{A,B}^{-1}(\dot{Y})\), it must hold that

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{X}^\dagger \dot{X})&= \mathop{\mathrm{RTr}}(\breve{X}^\dagger \mathcal{S}_{A,B}^{-1}(\dot{Y}))\\ &= \mathop{\mathrm{RTr}}(\mathcal{S}_{A^\dagger,B^\dagger}^{-1}(\breve{X})^\dagger \dot{Y}) \end{align}\]

and thus \(\breve{Y} = \mathcal{S}_{A^\dagger,B^\dagger}^{-1}(\breve{X})\).
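This adjoint relation can be verified numerically. In the sketch below, the Sylvester equation is solved by vectorisation with Kronecker products (an illustrative brute-force `sylvester_solve` of our own), and the pairing \(\mathop{\mathrm{RTr}}(\breve{X}^\dagger \dot{X}) = \mathop{\mathrm{RTr}}(\breve{Y}^\dagger \dot{Y})\) is checked for random inputs:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 3, 2

A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

def sylvester_solve(A, B, C):
    # brute-force solve of A X - X B = C via vectorisation:
    # vec(A X - X B) = (I ⊗ A - Bᵀ ⊗ I) vec(X), with column-major vec
    m, n = C.shape
    L = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))
    x = np.linalg.solve(L, C.reshape(-1, order="F"))
    return x.reshape((m, n), order="F")

def rtr(M):
    return np.trace(M).real

Ydot = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Xbreve = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

Xdot = sylvester_solve(A, B, Ydot)                        # forward rule
Ybreve = sylvester_solve(A.conj().T, B.conj().T, Xbreve)  # reverse rule

# pairing: RTr(X̆† Ẋ) = RTr(Y̆† Ẏ)
assert abs(rtr(Xbreve.conj().T @ Xdot) - rtr(Ybreve.conj().T @ Ydot)) < 1e-10
```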

Related to this is the behaviour of the Hadamard product with respect to this inner product, for which we obtain

\[\mathop{\mathrm{RTr}}(A^\dagger (B \odot C)) = \mathop{\mathrm{RTr}}((A\odot \overline{B})^\dagger C).\]
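This identity follows directly by writing out the traces entrywise, and a short numerical check confirms it:

```python
import numpy as np

rng = np.random.default_rng(5)
shape = (3, 4)
A, B, C = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
           for _ in range(3))

def rtr(M):
    # real part of the trace
    return np.trace(M).real

# RTr(A† (B ⊙ C)) = RTr((A ⊙ B̄)† C)
lhs = rtr(A.conj().T @ (B * C))
rhs = rtr((A * B.conj()).conj().T @ C)
assert abs(lhs - rhs) < 1e-12
```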

All of these results will prove useful below.

1.4 Gauge freedom

A final issue that is in some sense “orthogonal” to constraints is that of gauge freedom. This arises when certain variables are not uniquely determined, such as the phase of eigenvectors or singular vectors. For example, it could happen that an output matrix \(A\) of a certain step of the computation is only determined up to an overall multiplication with another matrix \(D\) (for example diagonal, or unitary, …), so that we could also have obtained \(A' = AD\).

If we now consider a one-parameter family of such equivalent matrices, this gives rise to tangent vectors \(\dot{A}=A\dot{D}\). However, for objective functions \(f\) that are insensitive to such gauge changes, the partial derivatives \(\breve{A}_{ij} = \frac{\partial f}{\partial \overline{A}_{ij}}\) should be such that

\[\mathop{\mathrm{RTr}}(\breve{A}^\dagger A\dot{D}) = 0\]

for all \(\dot{D}\) that are allowed. Such a condition can easily be checked at the beginning of a reverse rule, and a warning can be issued if it is not satisfied, as this would be indicative of an objective function that uses \(A\) in a way that depends on implementation details, i.e. an objective function that is not “gauge invariant”.
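As an illustration, consider the hypothetical gauge-invariant objective \(f(A) = \|AA^\dagger\|_F^2\), which is insensitive to \(A \to AU\) for unitary \(U\) and has cotangent \(\breve{A} = 2AA^\dagger A\). The pairing with any gauge direction \(A\dot{D}\), with \(\dot{D}\) antihermitian, then vanishes:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 4, 3

A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

# cotangent of the gauge-invariant objective f(A) = ||A A†||_F^2,
# for which ḟ = 2 RTr(Ă† Ȧ) with Ă = 2 A A† A
Abreve = 2 * A @ A.conj().T @ A

# a gauge direction: Ȧ = A Ḋ with Ḋ antihermitian
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Ddot = (X - X.conj().T) / 2

# RTr(Ă† A Ḋ) = 2 Re Tr(H Ḋ) with H = (A†A)² hermitian, hence zero
pairing = np.trace(Abreve.conj().T @ (A @ Ddot)).real
assert abs(pairing) < 1e-10
```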

2 QR decomposition

2.1 Full rank case

Consider a rectangular matrix \(A\in {\mathbb{C}}^{m \times n}\). The typical scenario is when \(m \geq n\) and \(\mathop{\mathrm{rank}}(A)=n\), i.e. \(A\) has full rank. In this case, there exists a decomposition

\[A =Q R\]

with \(Q\in {\mathbb{C}}^{m\times n}\) an isometric matrix (\(Q^\dagger Q=I_n\)) and \(R\in{\mathbb{C}}^{n \times n}\) an upper-triangular matrix (\(R_{k,l}=0\) for \(k > l\)). This decomposition can be made unique by choosing the diagonal elements \(R_{kk} > 0\). For \(\mathop{\mathrm{rank}}(A)=\mathop{\mathrm{rank}}(R)=n\), none of those elements can be zero and \(R\) is invertible, with its inverse \(R^{-1}\) also being an upper triangular matrix.

For the forward rule, we find

\[\dot{A} = \dot{Q} R + Q \dot{R}\]

where we now insert \(\dot{Q} = Q\dot{K}+\dot{L}\) as described above. Projecting those equations onto the column space of \(Q\) and onto the orthogonal complement thereof, we obtain

\[\begin{align} Q^\dagger \dot{A} R^{-1} &= \dot{K} + \dot{R}R^{-1}\\ (I- QQ^\dagger)\dot{A}R^{-1} &= \dot{L} \end{align}\]

using \((I- QQ^\dagger)\dot{L} =\dot{L}\) by definition. For the first equation, the second term in the right hand side is also upper triangular with a real diagonal, whereas the first term is antihermitian. The first term is thus determined by the part below the diagonal on the left hand side, as well as the imaginary part of its diagonal. We can then write the forward rule as

\[\begin{align} \dot{K} &= \mathcal{P}_L(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_{iD}(Q^\dagger \dot{A} R^{-1}) - \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})^\dagger\\ \dot{R} &= \left[\mathcal{P}_{rD}(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_U(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})^\dagger \right]R\\ \dot{L} &= (I- QQ^\dagger)\dot{A}R^{-1}\\ \Rightarrow \dot{Q} &= Q \dot{K} + \dot{L} \\ &= \dot{A}R^{-1} + Q\left[-Q^\dagger \dot{A}R^{-1} + \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})+ \mathcal{P}_{iD}(Q^\dagger \dot{A} R^{-1}) - \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})^\dagger\right]\\ &= \dot{A}R^{-1} - Q\left[\mathcal{P}_U(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_{rD}(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})^\dagger\right] = \dot{A}R^{-1} - Q\dot{R}R^{-1} \end{align}\]
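The forward rule can be validated against finite differences of a QR routine that enforces the positive-diagonal convention. A NumPy sketch follows; the helpers `posqr` and `qr_forward` are our own names:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 3

def posqr(A):
    # QR decomposition with positive diagonal of R (the unique convention above)
    Q, R = np.linalg.qr(A)
    ph = np.diag(R) / np.abs(np.diag(R))
    return Q * ph[None, :], R * ph.conj()[:, None]

def qr_forward(Q, R, Adot):
    # M = Q† Ȧ R⁻¹;  Ṙ = [P_rD(M) + P_U(M) + P_L(M)†] R;  Q̇ = (Ȧ - Q Ṙ) R⁻¹
    M = Q.conj().T @ Adot @ np.linalg.inv(R)
    T = np.diag(np.diag(M).real) + np.triu(M, 1) + np.tril(M, -1).conj().T
    Rdot = T @ R
    Qdot = (Adot - Q @ Rdot) @ np.linalg.inv(R)
    return Qdot, Rdot

A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Adot = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Q, R = posqr(A)
Qdot, Rdot = qr_forward(Q, R, Adot)

# compare with central finite differences of the (unique) decomposition
h = 1e-6
Qp, Rp = posqr(A + h * Adot)
Qm, Rm = posqr(A - h * Adot)
assert np.linalg.norm((Qp - Qm) / (2 * h) - Qdot) < 1e-4
assert np.linalg.norm((Rp - Rm) / (2 * h) - Rdot) < 1e-4
```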

To derive the reverse rule, we then start from \(\mathop{\mathrm{RTr}}(\breve{Q}^\dagger\dot{Q}) + \mathop{\mathrm{RTr}}(\breve{R}^\dagger\dot{R})\) and insert the expressions above, which we then simplify by making use of the cyclic invariance of the trace and the self-adjointness of the \(\mathcal{P}\) operators with respect to the real \(\mathop{\mathrm{RTr}}\) inner product. We find

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{Q}^\dagger\dot{Q}) &+ \mathop{\mathrm{RTr}}(\breve{R}^\dagger\dot{R}) \\ &= \mathop{\mathrm{RTr}}(\breve{Q}^\dagger \dot{A}R^{-1}) + \mathop{\mathrm{RTr}}([\breve{R}^\dagger - R^{-1} \breve{Q}^\dagger Q]\dot{R})\\ &= \mathop{\mathrm{RTr}}([\breve{Q}R^{-\dagger}]^\dagger \dot{A}) + \mathop{\mathrm{RTr}}([\breve{R} - Q^\dagger\breve{Q}R^{-\dagger}]^\dagger\dot{R})\\ &= \mathop{\mathrm{RTr}}([\breve{Q}R^{-\dagger}]^\dagger \dot{A}) + \mathop{\mathrm{RTr}}([\breve{R} - Q^\dagger\breve{Q}R^{-\dagger}]^\dagger [\mathcal{P}_{rD}(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_U(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})^\dagger ]R)\\ &= \mathop{\mathrm{RTr}}([\breve{Q}R^{-\dagger}]^\dagger \dot{A}) + \mathop{\mathrm{RTr}}([\breve{R}R^\dagger - Q^\dagger\breve{Q}]^\dagger [\mathcal{P}_{rD}(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_U(Q^\dagger \dot{A} R^{-1}) + \mathcal{P}_L(Q^\dagger \dot{A} R^{-1})^\dagger ])\\ &= \mathop{\mathrm{RTr}}([\breve{Q}R^{-\dagger}]^\dagger \dot{A}) + \\ &\qquad \mathop{\mathrm{RTr}}([ \mathcal{P}_{rD}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{L}((\breve{R}R^\dagger - Q^\dagger\breve{Q})^\dagger)]^\dagger Q^\dagger \dot{A} R^{-1}) \end{align}\]

Note that, for the last term in the square brackets, we first had to use \(\mathop{\mathrm{RTr}}(A^\dagger B^\dagger) = \mathop{\mathrm{RTr}}(AB) = \mathop{\mathrm{RTr}}((A^\dagger)^\dagger B)\), before we could use the self-adjointness property of \(\mathcal{P}_L\). Halfway in the computation, we have also introduced the short-hand notation \(R^{-\dagger} = (R^{-1})^\dagger = (R^\dagger)^{-1}\). If we furthermore use that \(\mathcal{P}_L(A^\dagger) = \mathcal{P}_U(A)^\dagger\), we can further rewrite this result as

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{Q}^\dagger\dot{Q}) &+ \mathop{\mathrm{RTr}}(\breve{R}^\dagger\dot{R}) \\ &= \mathop{\mathrm{RTr}}([\breve{Q}R^{-\dagger}]^\dagger \dot{A}) + \\ &\qquad \mathop{\mathrm{RTr}}([ \mathcal{P}_{rD}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})^\dagger]^\dagger Q^\dagger \dot{A} R^{-1})\\ &= \mathop{\mathrm{RTr}}([\breve{Q}R^{-\dagger} + Q(\mathcal{P}_{rD}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})^\dagger)R^{-\dagger}]^\dagger \dot{A}) \end{align}\]

from which we obtain the final result for the reverse rule

\[\begin{align} \breve{A} = \breve{Q}R^{-\dagger} + Q(\mathcal{P}_{rD}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}R^\dagger - Q^\dagger\breve{Q})^\dagger)R^{-\dagger} \end{align}\]
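Since the reverse rule is the exact adjoint of the forward rule with respect to the \(\mathop{\mathrm{RTr}}\) inner product, the pairing \(\mathop{\mathrm{RTr}}(\breve{A}^\dagger\dot{A}) = \mathop{\mathrm{RTr}}(\breve{Q}^\dagger\dot{Q}) + \mathop{\mathrm{RTr}}(\breve{R}^\dagger\dot{R})\) must hold identically. The following sketch checks this for random tangents and cotangents (helper names are again our own):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 5, 3

def rtr(M):
    return np.trace(M).real

def qr_forward(Q, R, Adot):
    # forward rule: M = Q† Ȧ R⁻¹, Ṙ = [P_rD(M) + P_U(M) + P_L(M)†] R
    M = Q.conj().T @ Adot @ np.linalg.inv(R)
    T = np.diag(np.diag(M).real) + np.triu(M, 1) + np.tril(M, -1).conj().T
    Rdot = T @ R
    Qdot = (Adot - Q @ Rdot) @ np.linalg.inv(R)
    return Qdot, Rdot

def qr_reverse(Q, R, Qbreve, Rbreve):
    # reverse rule with N = R̆ R† - Q† Q̆
    N = Rbreve @ R.conj().T - Q.conj().T @ Qbreve
    W = np.diag(np.diag(N).real) + np.triu(N, 1) + np.triu(N, 1).conj().T
    return (Qbreve + Q @ W) @ np.linalg.inv(R).conj().T

A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Q, R = np.linalg.qr(A)

Adot = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Qbreve = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
Rbreve = np.triu(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

Qdot, Rdot = qr_forward(Q, R, Adot)
Abreve = qr_reverse(Q, R, Qbreve, Rbreve)

lhs = rtr(Qbreve.conj().T @ Qdot) + rtr(Rbreve.conj().T @ Rdot)
rhs = rtr(Abreve.conj().T @ Adot)
assert abs(lhs - rhs) < 1e-10
```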

2.2 General result

Now suppose that we have a matrix \(A\in{\mathbb{C}}^{m \times n}\), where we do not require that \(m \geq n\), and furthermore can have \(\mathop{\mathrm{rank}}(A) = r \leq \min(m,n)\). Let us split \(m=m_1 + m_2 + m_3\) and \(n=n_1 + n_2\), with \(n_1 = m_1 = r\), \(m_2 = \min(m,n)-r\), \(n_2 = n-r\) and \(m_3=m-\min(m,n)\). We still have to make one assumption, which is that \(A\) is such that its first \(n_1=r\) columns already span its image (column space), so that a full rank QR decomposition of size \(m \times r\) could be obtained from these columns.

We now write the QR decomposition using a square (and thus unitary) matrix \(Q\), in the form

\[\begin{bmatrix} A_1 & A_2\end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 & Q_3\end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ O& O\\ O& O\end{bmatrix}\]

Here, \(A_j \in {\mathbb{C}}^{m \times n_j}\), \(Q_i \in {\mathbb{C}}^{m \times m_i}\) and \(R_{ij} \in {\mathbb{C}}^{m_i \times n_j}\) for \(i=1,2,3\) and \(j=1,2\). Note that \(R_{21}\), \(R_{31}\) and \(R_{32}\) are always zero because of the upper triangularity of \(R\). A compact QR decomposition would not return \(Q_3\) and the bottom row of \(R\), since already \(m_1 + m_2 = \min(m,n)\). We include it here for completeness.

Note that the partitioning of \(m\) and \(n\) is dependent on the current matrix \(A\), and we only have \(m_2>0\) and \(n_2>0\) when both \(m\) and \(n\) exceed the rank \(r\) of the current matrix. That the block \(R_{22}\) is zero expresses precisely this rank deficiency, but it will become nonzero as soon as we perturb away from it. A consequence of this is that \(\dot{R}_{22}\) will be nonzero. To assess the effect of \(R_{22}=0\), we will actually replace it with \(R_{22} = \epsilon S_{22}\) and study the limit \(\epsilon \to 0\).

Note that \(Q_3\) (present for \(m > n\)) can never be uniquely defined, whereas \(Q_2\) becomes ambiguous in the limit \(\epsilon \to 0\) and will change abruptly under small variations. Hence, we can expect that \(\dot{Q}_{2}\) will scale as \(\frac{1}{\epsilon}\). We first try to derive a forward rule by writing

\[\begin{bmatrix} \dot{A}_1 & \dot{A}_2\end{bmatrix} = \begin{bmatrix} \dot{Q}_1 & \dot{Q}_2 & \dot{Q}_3 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ O& \epsilon S_{22}\\ O& O\end{bmatrix} + \begin{bmatrix} Q_1 & Q_2 & Q_3\end{bmatrix} \begin{bmatrix} \dot{R}_{11} & \dot{R}_{12} \\ O& \dot{R}_{22} \\ O& O\end{bmatrix}\]

which gives rise to

\[\begin{align} \dot{A}_1 &= \dot{Q}_1 R_{11} + Q_1 \dot{R}_{11}\\ \dot{A}_2 &= \dot{Q}_1 R_{12} + \epsilon \dot{Q}_2 S_{22} + Q_1 \dot{R}_{12} + Q_2 \dot{R}_{22} \end{align}\]

The first equation fixes \(\dot{Q}_1\) and \(\dot{R}_{11}\) completely, using the forward rule of the full rank QR derived above. Suppose that the resulting \(\dot{Q}_1\) takes the form \(\dot{Q}_1 = Q_1 \dot{K}_{11} + \dot{L}_1\). From the structure of \(\dot{Q}\), we know that \(\dot{L}_1 = Q_2 \dot{K}_{21} + Q_3 \dot{K}_{31}\), and then \(\dot{Q}_2 = -Q_1 \dot{K}_{21}^\dagger + Q_2 \dot{K}_{22}+Q_3 \dot{K}_{32}\), where thus \(\dot{K}_{21} = Q_2^\dagger \dot{L}_1 = Q_2^\dagger \dot{Q}_1\) and \(\dot{K}_{22}\) and \(\dot{K}_{32}\) are new variables (with \(\dot{K}_{22}\) antihermitian). Inserting this expression for \(\dot{Q}_2\) into the second equation and projecting onto \(Q_1\), \(Q_2\) and \(Q_3\) separately, we obtain

\[\begin{align} Q_1^\dagger (\dot{A}_2-\dot{Q}_1 R_{12}) &= \epsilon Q_1^\dagger\dot{Q}_2 S_{22} + \dot{R}_{12} = -\epsilon \dot{Q}_1^\dagger Q_2 S_{22} + \dot{R}_{12}\\ Q_2^\dagger (\dot{A}_2-\dot{Q}_1 R_{12}) &= \epsilon \dot{K}_{22} S_{22} + \dot{R}_{22} \\ Q_3^\dagger (\dot{A}_2-\dot{Q}_1 R_{12}) &= \epsilon \dot{K}_{32} S_{22} \end{align}\]

We see that the divergent contribution in \(\dot{Q}_2\) disappears along the component \(Q_1^\dagger\dot{Q}_2\), because this component equals \(-\dot{Q}_1^\dagger Q_2\). Here, we can safely take the limit \(\epsilon \to 0\), which yields

\[\begin{align} \dot{R}_{12} = Q_1^\dagger (\dot{A}_2-\dot{Q}_1 R_{12}). \end{align}\]

The divergent contributions will thus appear in \(\dot{K}_{22}=Q_2^\dagger \dot{Q}_2\) and \(\dot{K}_{32}=Q_3^\dagger \dot{Q}_2\). As the original matrix \(A\) only fixes \(Q_1\), but not \(Q_2\) and \(Q_3\) (except for the fact that together they span the orthogonal complement of \(Q_1\)), it follows naturally that the objective function should be such that \(Q_2^\dagger \breve{Q}_2\) and \(Q_3^\dagger \breve{Q}_3\) are approximately zero. The fact that we know that \(Q_2\) needs to be orthogonal to \(Q_1\) does give meaning to the projection \(Q_1^\dagger \breve{Q}_2\), which thus does not need to be zero. As \(Q_3\) is typically not part of the return result, we do not consider its tangent and cotangent at all.

PS: In practice, this doesn’t work. For any small update \(A \to A +\epsilon \dot{A}\), we know that the change to \(Q_2\) will be of order 1, i.e. writing this as \(Q_2 + \epsilon \dot{Q}_2\) implies that \(\dot{Q}_2\) is \(\mathop{\mathrm{\mathscr{O}}}(\epsilon^{-1})\). While it is true that only \(Q_2^\dagger \dot{Q}_2\) is of order \(\epsilon^{-1}\), and \(Q_1^\dagger \dot{Q}_2\) is indeed of order 1, the latter does not correspond to \(-(Q_2^\dagger \dot{Q}_1)^\dagger\) up to order \(\epsilon\), because there are corrections of order \(\epsilon \cdot \epsilon^{-1} = 1\). As a consequence, the physically meaningful contribution to \(Q_1^\dagger \breve{Q}_2\) is polluted, and it only makes sense to check that \(\breve{Q}_2\) as a whole is zero as a condition for a gauge-invariant cost function.

What remains to be discussed is the fate of \(\dot{R}_{22}\), and thus, whether there can be a meaningful cotangent \(\breve{R}_{22}\) associated to it. As it seems from these equations, it couples to \(Q_2\), which is not well defined, which seems to imply that \(\dot{R}_{22}\) can also not be defined unambiguously and would potentially also be divergent. It turns out that the truth is more subtle. Consider thereto

\[\begin{align} (I-Q_1Q_1^\dagger)(\dot{A}_2 - \dot{Q}_1 R_{12})=\dot{A}_2 - \dot{Q}_1 R_{12} - Q_1\dot{R}_{12} = \epsilon \dot{Q}_2 S_{22} + Q_2 \dot{R}_{22}. \end{align}\]

The left hand side is by construction orthogonal to \(Q_1\), and thus supported in its orthogonal complement. If it happens that the QR decomposition of the left hand side is exactly \(Q_2 \dot{R}_{22}\), there is no need for divergent contributions in \(\dot{Q}_{2}\), i.e. we would have \(\dot{K}_{22} =O\) and \(\dot{K}_{32}=O\). Note also that, as \(\dot{R}_{22}\) is a tangent to \(R_{22}=O\), we want the diagonal elements of \(\dot{R}_{22}\) to be positive instead of just real, if the whole decomposition is to represent a QR decomposition with positive diagonal elements. Any deviation between the value of \(Q_2\) that happens to be chosen in the QR decomposition of \(A\), and the Q-factor in the QR decomposition of \(\dot{A}_2 - \dot{Q}_1 R_{12} - Q_1\dot{R}_{12}\), gives rise to divergent contributions in \(\dot{Q}_2\), but does not affect \(\dot{R}_{22}\). Hence, \(\dot{R}_{22}\) is well defined and nondivergent as the R-factor in the (positive) QR decomposition of \(\dot{A}_2 - \dot{Q}_1 R_{12} - Q_1\dot{R}_{12}\). However, while the R-factor in a QR decomposition is homogeneous (rescaling the matrix rescales the R-factor), it is not additive (the R-factor of a sum of two matrices is not the sum of their R-factors). Hence, \(\dot{R}_{22}\), while being non-divergent, does not satisfy the linearity properties associated with a well defined tangent. We must therefore also assume (or verify) that the associated cotangent \(\breve{R}_{22}\) approximates zero for a gauge-invariant cost function.

For the reverse rule, we now start from \[\begin{align} \mathop{\mathrm{RTr}}(\breve{Q}_1^\dagger\dot{Q}_1) &+ \mathop{\mathrm{RTr}}(\breve{Q}_2^\dagger\dot{Q}_2)+ \mathop{\mathrm{RTr}}(\breve{R}_{11}^\dagger\dot{R}_{11})+ \mathop{\mathrm{RTr}}(\breve{R}_{12}^\dagger\dot{R}_{12})+ \mathop{\mathrm{RTr}}(\breve{R}_{22}^\dagger\dot{R}_{22})\end{align}\]

and use our assumptions that \(\breve{R}_{22}=0\) and \(\breve{Q}_2 = (Q_1Q_1^\dagger + Q_2Q_2^\dagger + Q_3Q_3^\dagger)\breve{Q}_2 = Q_1Q_1^\dagger\breve{Q}_2\) together with \(Q_1^\dagger\dot{Q}_2 = -\dot{Q}_1^\dagger Q_2\) to write

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{Q}_1^\dagger\dot{Q}_1) &- \mathop{\mathrm{RTr}}([Q_1^\dagger \breve{Q}_2]^\dagger\dot{Q}_1^\dagger Q_2)+ \mathop{\mathrm{RTr}}(\breve{R}_{11}^\dagger\dot{R}_{11})+ \mathop{\mathrm{RTr}}(\breve{R}_{12}^\dagger\dot{R}_{12}) \\ &= \mathop{\mathrm{RTr}}([\breve{Q}_1-Q_2\breve{Q}_2^\dagger Q_1]^\dagger\dot{Q}_1) + \mathop{\mathrm{RTr}}(\breve{R}_{11}^\dagger\dot{R}_{11})+ \mathop{\mathrm{RTr}}(\breve{R}_{12}^\dagger Q_1^\dagger(\dot{A}_2 - \dot{Q}_1R_{12}))\\ &= \mathop{\mathrm{RTr}}([\breve{Q}_1-Q_2\breve{Q}_2^\dagger Q_1 - Q_1\breve{R}_{12}R_{12}^\dagger]^\dagger\dot{Q}_1) + \mathop{\mathrm{RTr}}(\breve{R}_{11}^\dagger\dot{R}_{11})+ \mathop{\mathrm{RTr}}([Q_1\breve{R}_{12}]^\dagger \dot{A}_2) \end{align}\]

From this point on, we can follow the derivation of the full rank case to reduce \(\dot{Q}_1\) and \(\dot{R}_{11}\) to \(\dot{A}_1\), which then defines the cotangent \(\breve{A}_1\), while we can immediately read off \(\breve{A}_2\). We thus find

\[\begin{align} \breve{A}_1 &= \breve{Q}R_{11}^{-\dagger} + Q_1(\mathcal{P}_{rD}(\breve{R}_{11}R_{11}^\dagger - Q_1^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}_{11}R_{11}^\dagger - Q_1^\dagger\breve{Q})+\mathcal{P}_{U}(\breve{R}_{11}R_{11}^\dagger - Q_1^\dagger\breve{Q})^\dagger)R_{11}^{-\dagger}\\ \breve{A}_2 &= Q_1\breve{R}_{12} \end{align}\] where the value \[\begin{align} \breve{Q} = \breve{Q}_1-Q_2\breve{Q}_2^\dagger Q_1 - Q_1\breve{R}_{12}R_{12}^\dagger \end{align}\] should be inserted.

3 Eigenvalue decomposition

3.1 Hermitian eigenvalue decomposition

A Hermitian matrix \(A=A^\dagger\) with \(A\in{\mathbb{C}}^{n\times n}\) admits an eigenvalue decomposition of the form

\[A = U D U^{\dagger}\]

with \(U\in{\mathbb{C}}^{n\times n}\) unitary and \(D\) a real diagonal matrix containing the eigenvalues. This decomposition is not unique, as we can multiply \(U\) with another unitary matrix \(V\) that commutes with \(D\). If all eigenvalues in \(D\) are mutually distinct, this amounts to \(V\) being diagonal and unitary, or thus, a matrix with pure complex phases. If some eigenvalues are repeated, \(V\) can take a block diagonal form.

Let us first derive the forward rule from writing the decomposition as \(AU =UD\)

\[\dot{A} U + A \dot{U} = \dot{U} D + U\dot{D}\]

and inserting \(\dot{U}=U\dot{K}\) with \(K=-K^\dagger\), we obtain

\[U^\dagger \dot{A}U= \dot{K}D - D\dot{K} + \dot{D}\]

or, in components, where we denote the \(i\)th column of \(U\) as \(\boldsymbol{u}_i\), the \(i\)th eigenvector, and the \(i\)th diagonal element of \(D\) as \(\lambda_i\), the corresponding eigenvalue:

\[\boldsymbol{u}_i^\dagger \dot{A} \boldsymbol{u}_j = \dot{K}_{ij} (\lambda_j - \lambda_i) + \dot{\lambda}_i \delta_{ij}\]

Note that \(\dot{K}_{ij}\) cannot be determined if \(\lambda_j = \lambda_i\), which includes in particular the diagonal elements \(i=j\). This corresponds exactly to the indeterminacy of \(U\) itself in these cases. Any adjoint variables originating from a “gauge-invariant” objective function should thus satisfy

\[\mathop{\mathrm{RTr}}(\breve{U}^\dagger U K) = 0\]

for any antihermitian \(K\) that contains nonzero entries \((i,j)\) only where \(\lambda_i = \lambda_j\). This leads to

\[(U^\dagger \breve{U})_{ij} - \overline{(U^\dagger \breve{U})_{ji}} = 0,\quad \text{for all $1\leq i\leq j\leq n$ for which $\lambda_i = \lambda_j$}\]

We can then solve the forward rule as

\[\begin{align} \dot{K}_{ij} &= \begin{cases} 0,& i = j\text{ or } \lambda_i=\lambda_j \\ \frac{\boldsymbol{u}_i^\dagger \dot{A}\boldsymbol{u}_j}{\lambda_j-\lambda_i},&\text{otherwise}\end{cases}\\ \dot{\lambda}_i &= \boldsymbol{u}_i^\dagger \dot{A}\boldsymbol{u}_i \end{align}\]

The rule for \(\dot{K}\) is often written as \(\dot{K} = F \odot (U^\dagger \dot{A}U)\) with \(\odot\) the Hadamard product (see above) and \(F_{ij} = \begin{cases} 0, &\lambda_i = \lambda_j\\ \frac{1}{\lambda_j-\lambda_i},&\text{otherwise}\end{cases}\).

It furthermore matters whether we consider \(D\), and thus also its adjoint \(\breve{D}\), to be a matrix, or only a vector of its diagonal elements. Here, we make the former choice. Indeed, we can write the forward rules as

\[\begin{align} \dot{K} &= F \odot (U^\dagger \dot{A}U)\\ \dot{D} &= \mathcal{P}_{rD}(U^\dagger \dot{A}U) \end{align}\] with \(\mathcal{P}_{rD}\) the superoperator from before, that projects away everything but the real part of the diagonal. Note that the diagonal elements are anyway real in this case. If the adjoint \(\breve{D}\) would be a full matrix, all its non-diagonal elements as well as possible imaginary contributions on the diagonal are projected away, because of the self-adjointness property of this superoperator. Indeed, we can now obtain the backward rule as

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{U}^\dagger \dot{U})+ \mathop{\mathrm{RTr}}(\breve{D}^\dagger \dot{D}) &= \mathop{\mathrm{RTr}}(\breve{U}^\dagger U\dot{K})+ \mathop{\mathrm{RTr}}(\breve{D}^\dagger \dot{D})\\ &= \mathop{\mathrm{RTr}}\left[(U^\dagger \breve{U})^\dagger (F \odot (U^\dagger\dot{A}U))\right] + \mathop{\mathrm{RTr}}\left[\breve{D}^\dagger \mathcal{P}_{rD}(U^\dagger \dot{A}U)\right]\\ &= \mathop{\mathrm{RTr}}\left[(F\odot(U^\dagger \breve{U}))^\dagger (U^\dagger\dot{A}U)\right] + \mathop{\mathrm{RTr}}\left[\mathcal{P}_{rD}(\breve{D})^\dagger (U^\dagger \dot{A}U)\right]\\ &=\mathop{\mathrm{RTr}}\left[(U(F\odot (U^\dagger \breve{U}) + \mathcal{P}_{rD}(\breve{D}))U^\dagger)^\dagger \dot{A}\right] \end{align}\]

and thus obtain

\[\breve{A} = U(F\odot (U^\dagger \breve{U}) + \mathcal{P}_{rD}(\breve{D}))U^\dagger\]

with thus \(F_{ij} = \begin{cases} 0, &\lambda_i = \lambda_j\\ \frac{1}{\lambda_j-\lambda_i},&\text{otherwise}\end{cases}\). As \(F\) is real, no complex conjugations are introduced when \(F\) is moved from the tangent to the cotangent variables.
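As a numerical sanity check of the forward rule for the eigenvalues (a minimal numpy sketch, not from the original text; variable names are ad hoc), one can compare \(\dot{\lambda}_i = \boldsymbol{u}_i^\dagger \dot{A}\boldsymbol{u}_i\) against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

def random_hermitian(n):
    X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (X + X.conj().T) / 2

A = random_hermitian(n)
Adot = random_hermitian(n)

lam, U = np.linalg.eigh(A)
# forward rule: the eigenvalue derivatives are the diagonal of U^dagger Adot U
lam_dot = np.real(np.diag(U.conj().T @ Adot @ U))

# central finite differences; eigh returns eigenvalues in ascending order,
# which matches lam as long as the eigenvalues stay well separated
eps = 1e-6
fd = (np.linalg.eigh(A + eps * Adot)[0] - np.linalg.eigh(A - eps * Adot)[0]) / (2 * eps)
err = np.max(np.abs(lam_dot - fd))
```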

3.2 Nonhermitian eigenvalue decomposition

Most of this analysis can be repeated for the nonhermitian case, where the eigenvalue decomposition reads

\[A =VDV^{-1} \iff AV=VD\]

with \(D\) still diagonal, but \(V\) no longer unitary. Still parameterising \(\dot{V}=V\dot{K}\), where \(\dot{K}\) can now be a general matrix, we find

\[\begin{align} \dot{K}&= F \odot (V^{-1}\dot{A}V)\\ \dot{D}&= \mathcal{P}_D (V^{-1}\dot{A}V) \end{align}\]

with \(\mathcal{P}_{D} = \mathcal{P}_{rD}+\mathcal{P}_{iD}\) the superoperator that selects the diagonal entries (real + imaginary part) and removes all non-diagonal entries. The matrix \(F\) is still given by

\[F_{ij} = \begin{cases} 0, &\lambda_i = \lambda_j\\ \frac{1}{\lambda_j-\lambda_i},&\text{otherwise}\end{cases}\]

and still uses the fact that \(\dot{K}_{ij}\) cannot be determined if \(\lambda_i = \lambda_j\), in which case we simply put these components equal to zero. The major difference with the Hermitian case is that the eigenvalues \(\lambda_i = D_{ii}\) can no longer be assumed to be real.

From this result, we obtain the reverse rule

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{V}^\dagger \dot{V})+ \mathop{\mathrm{RTr}}(\breve{D}^\dagger \dot{D})&= \mathop{\mathrm{RTr}}(\breve{V}^\dagger V\dot{K})+ \mathop{\mathrm{RTr}}(\breve{D}^\dagger \dot{D})\\ &= \mathop{\mathrm{RTr}}\left[(V^\dagger\breve{V})^\dagger (F \odot (V^{-1}\dot{A}V))\right] + \mathop{\mathrm{RTr}}\left[\breve{D}^\dagger \mathcal{P}_{D}(V^{-1} \dot{A}V)\right]\\ &= \mathop{\mathrm{RTr}}\left[(\overline{F}\odot (V^\dagger\breve{V}))^\dagger (V^{-1}\dot{A}V)\right] + \mathop{\mathrm{RTr}}\left[\mathcal{P}_{D}(\breve{D})^\dagger (V^{-1} \dot{A}V)\right]\\ &=\mathop{\mathrm{RTr}}\left[(V^{-\dagger}(\overline{F}\odot (V^\dagger\breve{V}) + \mathcal{P}_{D}(\breve{D}))V^{\dagger})^\dagger \dot{A}\right] \end{align}\]

and thus obtain

\[\breve{A} = V^{-\dagger}(\overline{F}\odot (V^\dagger \breve{V}) + \mathcal{P}_{D}(\breve{D}))V^{\dagger}\]
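Again a small numerical check (a numpy sketch with ad hoc names, not from the original text): construct a diagonalisable \(A\) with known, well-separated eigenvalues, and compare the diagonal of \(V^{-1}\dot{A}V\) with finite differences of the sorted eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# diagonalisable A with known, well-separated (real) eigenvalues
D0 = np.diag([1.0, 2.0, 3.0, 4.0])
V0 = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = V0 @ D0 @ np.linalg.inv(V0)
Adot = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

def sorted_eig(M):
    w, V = np.linalg.eig(M)
    idx = np.argsort(w.real)
    return w[idx], V[:, idx]

lam, V = sorted_eig(A)
# forward rule: eigenvalue derivatives are the diagonal of V^{-1} Adot V
lam_dot = np.diag(np.linalg.solve(V, Adot @ V))

eps = 1e-6
fd = (sorted_eig(A + eps * Adot)[0] - sorted_eig(A - eps * Adot)[0]) / (2 * eps)
err = np.max(np.abs(lam_dot - fd))
```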

3.3 Schur decomposition

Another decomposition that is closely related to the eigenvalue decomposition of general matrices \(A\in{\mathbb{C}}^{n \times n}\) is the Schur decomposition, namely

\[A = UTU^\dagger\]

where now \(U\in {\mathbb{C}}^{n\times n}\) is unitary, and \(T\in{\mathbb{C}}^{n\times n}\) is upper triangular. In the case of a hermitian matrix \(A\), the hermiticity of \(T\) requires that also its entries above the diagonal vanish, so that the Schur decomposition then coincides with the eigenvalue decomposition. As in the hermitian eigenvalue decomposition, the phase of the columns of \(U\) is not fixed by this equation. However, in this case, a phase change in the columns of \(U\) also affects the off-diagonal (=above diagonal) elements of \(T\).

For the forward rule, we again parameterise \(\dot{U}=U\dot{K}\) with \(\dot{K}=-\dot{K}^\dagger\) in order to find

\[\dot{A}U + A\dot{U} = \dot{U}T+ U\dot{T}\]

or thus

\[U^\dagger \dot{A}U = \dot{K}T-T\dot{K} + \dot{T}\]

From the gauge freedom, it follows that the diagonal elements of \(\dot{K}\) do not appear in this equation, and can be put to zero (though the imaginary parts of the diagonal of \(U^\dagger \breve{U}\) should then be zero for the adjoint variables associated with a gauge-invariant objective function). We can interpret the equation above as a black-box linear system for the below-diagonal entries of \(\dot{K}\) (which also fix the above-diagonal entries via antihermiticity) and the entries of \(\dot{T}\), and feed it into a linear solver. Note that we should then actually separate everything into real and imaginary components.

Let us instead try to find some more structure in these equations. We first focus on the entries \(1 \leq j < i \leq n\), for which \(\dot{T}_{ij} = 0\). We then find

\[(U^\dagger\dot{A}U)_{ij} = \sum_{k=1}^j \dot{K}_{ik} T_{kj} - \sum_{k=i}^n T_{ik} \dot{K}_{kj}\]

… to be continued …

4 Partial eigenvalue and Schur factorisation

4.1 Invariant subspaces

Consider a matrix \(A\in{\mathbb{C}}^{n \times n}\), and a matrix \(V \in {\mathbb{C}}^{n \times r}\) of which the columns span an invariant subspace of \(A\). This means that it is such that there exists a matrix \(B \in {\mathbb{C}}^{r \times r}\) such that we have an exact equality

\[AV = VB\]

The matrix \(V\) could be composed out of a number of eigenvectors of \(A\), in which case \(B\) would be diagonal and contain the corresponding eigenvalues. Alternatively, we can choose the columns of \(V\) to be orthonormal, i.e. \(V\) isometric, and at the same time \(B\) upper triangular. We then have a partial Schur factorisation. As always, both choices coincide in the case where \(A\) is hermitian.

4.2 Single eigenvector

We start with the case where \(r=1\). A one-dimensional invariant subspace is necessarily spanned by an eigenvector. A single eigenvector \(\boldsymbol{u}\) and associated eigenvalue \(\lambda\) satisfy

\[A\boldsymbol{u} = \lambda \boldsymbol{u}\]

from which we obtain

\[\dot{A}\boldsymbol{u} + A\boldsymbol{\dot{u}} = \dot{\lambda} \boldsymbol{u} + \lambda \boldsymbol{\dot{u}}\]

Since a single eigenvector can always be chosen normalised, we now set \(\boldsymbol{\dot{u}} = \boldsymbol{u} \dot{k} + \boldsymbol{\dot{l}}\) where \(\dot{k}\) is purely imaginary and \(\boldsymbol{\dot{l}}\) is orthogonal to \(\boldsymbol{u}\). The scalar \(\dot{k}\) corresponds to changes in the phase of the eigenvector, and will disappear from the equations. It cannot be uniquely determined and so it should also be annihilated by the adjoint associated with any gauge-invariant objective function, i.e. it should hold for any purely imaginary \(k\) that

\[\mathop{\mathrm{Re}}(\boldsymbol{\breve{u}}^\dagger \boldsymbol{u} k) = 0\quad \Rightarrow\quad \mathop{\mathrm{Im}}(\boldsymbol{\breve{u}}^\dagger \boldsymbol{u}) = 0\]

If we thus choose \(\dot{k}=0\), we find that \(\boldsymbol{\dot{u}}=\boldsymbol{\dot{l}}\) lives in the orthogonal complement of \(\boldsymbol{u}\), i.e. \(\boldsymbol{u}^\dagger \boldsymbol{\dot{u}}=0\). By projecting the equation above onto \(\boldsymbol{u}\) and its orthogonal complement, we then find \[\begin{align} \dot{\lambda} &= \boldsymbol{u}^\dagger \dot{A}\boldsymbol{u} + \boldsymbol{u}^\dagger A\boldsymbol{\dot{u}} \\ \lambda \boldsymbol{\dot{u}} - (I- \boldsymbol{u}\boldsymbol{u}^\dagger)A\boldsymbol{\dot{u}} &= (\lambda I- A)\boldsymbol{\dot{u}} + \boldsymbol{u}\boldsymbol{u}^\dagger A\boldsymbol{\dot{u}}= (I- \boldsymbol{u}\boldsymbol{u}^\dagger) \dot{A} \boldsymbol{u}\\ \end{align}\] When \(A\) is hermitian, \(\boldsymbol{u}^\dagger\) is also a left eigenvector and we can omit the terms containing \(\boldsymbol{u}^\dagger A\boldsymbol{\dot{u}}\) in the first and second equation. In particular, we recover the well-known result \(\dot{\lambda}=\boldsymbol{u}^\dagger \dot{A}\boldsymbol{u}\). However, we here deal with the more general case.

Another way to write this linear system is as \[\begin{align} \begin{bmatrix} \lambda I- A & \boldsymbol{u}\\ \boldsymbol{u}^\dagger & 0 \end{bmatrix}\begin{bmatrix} \boldsymbol{\dot{u}} \\ \dot{\lambda}\end{bmatrix} = \begin{bmatrix} \dot{A}\boldsymbol{u}\\ 0 \end{bmatrix} \end{align}\]

Despite the appearance of \((\lambda I-A)\), which has an eigenvalue zero, the coefficient matrix in this system can be inverted (provided the multiplicity of eigenvalue \(\lambda\) of \(A\) is one). This follows from writing it as

\[\begin{align} \begin{bmatrix} \lambda I- A & \boldsymbol{u}\\ \boldsymbol{u}^\dagger & 0 \end{bmatrix}=\begin{bmatrix} \lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -a \end{bmatrix} + \frac{1}{a} \begin{bmatrix} \boldsymbol{u} \\ a\end{bmatrix}\begin{bmatrix} \boldsymbol{u}^\dagger & a \end{bmatrix} \end{align}\] where \(a\) is an arbitrary nonzero number. When \(A\) is hermitian, it is useful to also restrict \(a\) to be real, so that both terms in this sum are hermitian.

Assuming that the eigenvalue \(\lambda\) has multiplicity one, the matrix \(\lambda I- A-\frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger\) is invertible for any finite \(a \neq 0\), since the spectrum of \(A+\frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger\) is given by the spectrum of \(A\) with eigenvalue \(\lambda\) replaced by \(\lambda+\frac{1}{a}\). Indeed, we find that \(\boldsymbol{u}\) is still a right eigenvector as

\[ (A+\frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)\boldsymbol{u} = \lambda \boldsymbol{u} + \frac{1}{a} \boldsymbol{u}(\boldsymbol{u}^\dagger \boldsymbol{u})\]

and \(\boldsymbol{u}\) was assumed normalised to one. Note that this does not require that \(A\) is hermitian. That the other eigenvalues \(\tilde{\lambda}\neq \lambda\) are not affected by this extra term, follows from applying the left eigenvector \(\boldsymbol{\tilde{v}}^\dagger\) associated with those eigenvalues, which satisfy \(\boldsymbol{\tilde{v}}^\dagger\boldsymbol{u}=0\). However, in the general non-Hermitian case, both the left eigenvector for eigenvalue \(\lambda\) and the right eigenvectors associated with eigenvalues \(\tilde{\lambda}\neq \lambda\) are affected by this extra term.

We can then compute the inverse of this matrix using the Sherman–Morrison formula, as the second term is a rank-1 update. We find

\[\begin{align} &\begin{bmatrix} \lambda I- A & \boldsymbol{u}\\ \boldsymbol{u}^\dagger & 0 \end{bmatrix}^{-1} \\ &=\begin{bmatrix} (\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -\frac{1}{a} \end{bmatrix} - \frac{1}{a} \frac{\begin{bmatrix} (\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -\frac{1}{a} \end{bmatrix} \begin{bmatrix} \boldsymbol{u} \\ a\end{bmatrix}\begin{bmatrix} \boldsymbol{u}^\dagger & a \end{bmatrix} \begin{bmatrix} (\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -\frac{1}{a} \end{bmatrix}}{1+ \frac{1}{a} \begin{bmatrix} \boldsymbol{u}^\dagger & a \end{bmatrix} \begin{bmatrix} (\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -\frac{1}{a} \end{bmatrix} \begin{bmatrix} \boldsymbol{u} \\ a\end{bmatrix}}\\ &=\begin{bmatrix} (\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -\frac{1}{a} \end{bmatrix} - \frac{1}{a} \frac{ \begin{bmatrix} -a\boldsymbol{u} \\ -1\end{bmatrix}\begin{bmatrix} \boldsymbol{u}^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & -1 \end{bmatrix}}{1+ \frac{1}{a} \begin{bmatrix} \boldsymbol{u}^\dagger & a \end{bmatrix} \begin{bmatrix} -a\boldsymbol{u} \\ -1\end{bmatrix}}\\ &=\begin{bmatrix} (\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{o}\\ \boldsymbol{o}^\dagger & -\frac{1}{a} \end{bmatrix} - \begin{bmatrix} \boldsymbol{u} \\ \frac{1}{a}\end{bmatrix}\begin{bmatrix} \boldsymbol{u}^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & -1 \end{bmatrix}\\ &=\begin{bmatrix} (I-\boldsymbol{u}\boldsymbol{u}^\dagger)(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & \boldsymbol{u}\\ 
-\frac{1}{a}\boldsymbol{u}^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} & 0 \end{bmatrix} \end{align}\]

Here, we have used that \((\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}\boldsymbol{u} = -a \boldsymbol{u}\), but we have not assumed that \(A\) is hermitian. If that is the case, then also the entry in the bottom left corner simplifies down to \(\boldsymbol{u}^\dagger\).
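The closed-form inverse can be verified directly against a numerical inverse of the bordered matrix; below is a minimal numpy sketch (not from the original text; variable names are ad hoc, and \(a\) is an arbitrary nonzero parameter as in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
a = 0.7  # arbitrary nonzero parameter

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
w, V = np.linalg.eig(A)
i = np.argmax(w.real)
lam, u = w[i], V[:, i] / np.linalg.norm(V[:, i])

# bordered matrix and its numerical inverse
B = np.block([[lam * np.eye(n) - A, u[:, None]],
              [u.conj()[None, :], np.zeros((1, 1))]])
Binv = np.linalg.inv(B)

# closed-form inverse built from the rank-1-shifted resolvent
Minv = np.linalg.inv(lam * np.eye(n) - A - np.outer(u, u.conj()) / a)
P = np.eye(n) - np.outer(u, u.conj())
bottom_left = -(u.conj() @ Minv) / a
closed = np.block([[P @ Minv, u[:, None]],
                   [bottom_left[None, :], np.zeros((1, 1))]])
err = np.max(np.abs(Binv - closed))
```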

Inserting this inverse in the linear system, we now find

\[\begin{align} \begin{bmatrix} \boldsymbol{\dot{u}}\\ \dot{\lambda} \end{bmatrix} = \begin{bmatrix} (I-\boldsymbol{u}\boldsymbol{u}^\dagger)(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1} \dot{A}\boldsymbol{u}\\ -\frac{1}{a}\boldsymbol{u}^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}\dot{A}\boldsymbol{u} \end{bmatrix} \end{align}\]

Note that the equation

\[\dot{\lambda} = -\frac{1}{a}\boldsymbol{u}^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}\dot{A}\boldsymbol{u}\]

reduces to its well-known form \(\dot{\lambda}=\boldsymbol{u}^\dagger \dot{A}\boldsymbol{u}\) in the case where \(A\) is Hermitian. One might have expected the nonhermitian generalisation \(\dot{\lambda}=\boldsymbol{v}^\dagger \dot{A}\boldsymbol{u}\), where \(\boldsymbol{v}^\dagger\) is the left eigenvector of \(A\) corresponding to eigenvalue \(\lambda\). In fact, it turns out that it indeed holds that

\[\boldsymbol{v}^\dagger = -\frac{1}{a}\boldsymbol{u}^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}\]

as can be verified by writing it as

\[\boldsymbol{v}^\dagger(\lambda I- A) - \frac{1}{a} (\boldsymbol{v}^\dagger \boldsymbol{u})\boldsymbol{u^\dagger} = -\frac{1}{a}\boldsymbol{u}^\dagger\]

Indeed, it is exactly the left eigenvector \(\boldsymbol{v}\), normalised such that \(\boldsymbol{v}^\dagger \boldsymbol{u}=1\), that satisfies this equation. However, instead of explicitly needing to know this left eigenvector, either by separately solving the eigenvalue equation for the left eigenvector, or by solving the linear system above, we only need to solve one linear system, namely

\[(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}\dot{A}\boldsymbol{u}\]

in order to determine both \(\boldsymbol{\dot{u}}\) and \(\dot{\lambda}\). Note, finally, that while the precise choice of \(a\) was irrelevant in all of the above, in practice it might affect the condition number of the matrix that needs to be inverted, and it can be beneficial to make a careful choice, based on what properties of \(A\) are known.
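The claim that a single shifted solve yields both \(\boldsymbol{\dot{u}}\) and \(\dot{\lambda}\) can be checked against finite differences; a numpy sketch (ad hoc names, not from the original text), where the finite-difference eigenvector is brought into the gauge \(\boldsymbol{u}^\dagger\boldsymbol{\dot{u}}=0\) by a phase rotation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
a = 1.0  # arbitrary nonzero parameter

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Adot = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

def top_eigpair(M):
    # normalised eigenvector of the eigenvalue with largest real part
    w, V = np.linalg.eig(M)
    i = np.argmax(w.real)
    return w[i], V[:, i] / np.linalg.norm(V[:, i])

lam, u = top_eigpair(A)

# one shifted solve gives both udot and lamdot
s = np.linalg.solve(lam * np.eye(n) - A - np.outer(u, u.conj()) / a, Adot @ u)
udot = s - u * (u.conj() @ s)      # (I - u u^dagger) s
lamdot = -(u.conj() @ s) / a

# finite differences, phase-fixed so that u^dagger udot = 0
def gauged(M):
    lam_, u_ = top_eigpair(M)
    phase = u.conj() @ u_
    return lam_, u_ * (phase.conjugate() / abs(phase))

eps = 1e-6
lp, up = gauged(A + eps * Adot)
lm, um = gauged(A - eps * Adot)
err_lam = abs(lamdot - (lp - lm) / (2 * eps))
err_u = np.max(np.abs(udot - (up - um) / (2 * eps)))
```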

Finally, we derive the backward rule, via

\[\begin{align} \mathop{\mathrm{Re}}(\boldsymbol{\breve{u}}^\dagger \boldsymbol{\dot{u}})+\mathop{\mathrm{Re}}(\breve{\lambda}^\dagger \dot{\lambda}) &=\mathop{\mathrm{Re}}([(I-\boldsymbol{u}\boldsymbol{u}^\dagger)\boldsymbol{\breve{u}} - \frac{1}{\overline{a}}\breve{\lambda}\boldsymbol{u}]^\dagger(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}\dot{A}\boldsymbol{u})\\ &=\mathop{\mathrm{RTr}}(((\overline{\lambda}I-A^\dagger -\frac{1}{\overline{a}}\boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}[(I-\boldsymbol{u}\boldsymbol{u}^\dagger)\boldsymbol{\breve{u}} - \frac{1}{\overline{a}}\breve{\lambda}\boldsymbol{u}]\boldsymbol{u}^\dagger)^\dagger \dot{A}) \end{align}\]

from which we obtain

\[\breve{A} = (\overline{\lambda}I-A^\dagger -\frac{1}{\overline{a}}\boldsymbol{u}\boldsymbol{u}^\dagger)^{-1}[(I-\boldsymbol{u}\boldsymbol{u}^\dagger)\boldsymbol{\breve{u}} - \frac{1}{\overline{a}}\breve{\lambda}\boldsymbol{u}]\boldsymbol{u}^\dagger\]
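The identity expressing the left eigenvector through the shifted resolvent, used in the derivation above, is also easy to check numerically; a small numpy sketch (ad hoc names, arbitrary nonzero \(a\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
a = 1.3  # arbitrary nonzero parameter

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
w, V = np.linalg.eig(A)
i = np.argmax(w.real)
lam, u = w[i], V[:, i] / np.linalg.norm(V[:, i])

# candidate left eigenvector from the shifted resolvent
vH = -(u.conj() @ np.linalg.inv(lam * np.eye(n) - A - np.outer(u, u.conj()) / a)) / a

residual = np.max(np.abs(vH @ A - lam * vH))  # left eigenvector equation
norm_err = abs(vH @ u - 1.0)                  # normalisation v^dagger u = 1
```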

4.3 Single eigenvector (alternative)

The previous scheme of course hints towards an alternative scheme, probably better known or more expected, for the case of a nonhermitian matrix \(A\) for which both the right eigenvector \(\boldsymbol{u}\) and the left eigenvector \(\boldsymbol{v}^\dagger\) associated with an eigenvalue \(\lambda\) are known. It is furthermore assumed that \(\lambda\) has (algebraic and geometric) multiplicity one, so that there are no Jordan blocks and \(\boldsymbol{v}^\dagger \boldsymbol{u}\neq 0\). We then normalise \(\boldsymbol{v}^\dagger\) such that \(\boldsymbol{v}^\dagger \boldsymbol{u}=1\). If we are only interested in the chain rule with respect to \(\boldsymbol{u}\), we can now gauge the tangent direction so that \(\boldsymbol{v}^\dagger \boldsymbol{\dot{u}} =0\), giving rise to the linear system

\[\begin{align} \begin{bmatrix} \lambda I- A & \boldsymbol{u}\\ \boldsymbol{v}^\dagger & 0 \end{bmatrix}\begin{bmatrix} \boldsymbol{\dot{u}} \\ \dot{\lambda}\end{bmatrix} = \begin{bmatrix} \dot{A}\boldsymbol{u}\\ 0 \end{bmatrix} \end{align}\]

Most of the analysis from the previous subsection can easily be generalised by substituting \(\boldsymbol{u}^\dagger\) with \(\boldsymbol{v}^\dagger\). But since now \(\boldsymbol{v}^\dagger\) is a left eigenvector of \(A\), and also of \(A + \frac{1}{a} \boldsymbol{u}\boldsymbol{v}^\dagger\), some simplification occurs. In this case, the spectral (or Jordan) decomposition of \(A + \frac{1}{a} \boldsymbol{u}\boldsymbol{v}^\dagger\) coincides completely with that of \(A\) (both eigenvalues and left and right eigenvectors), except for eigenvalue \(\lambda\) being shifted to \(\lambda + \frac{1}{a}\). We obtain as final result

\[\begin{align} \begin{bmatrix} \boldsymbol{\dot{u}}\\ \dot{\lambda} \end{bmatrix} = \begin{bmatrix} (I-\boldsymbol{u}\boldsymbol{v}^\dagger)(\lambda I- A - \frac{1}{a} \boldsymbol{u}\boldsymbol{v}^\dagger)^{-1} \dot{A}\boldsymbol{u}\\ \boldsymbol{v}^\dagger\dot{A}\boldsymbol{u} \end{bmatrix} \end{align}\]

For the reverse rule, we now obtain

\[\breve{A} = (\overline{\lambda}I-A^\dagger -\frac{1}{\overline{a}}\boldsymbol{v}\boldsymbol{u}^\dagger)^{-1}(I-\boldsymbol{v}\boldsymbol{u}^\dagger)\boldsymbol{\breve{u}}\boldsymbol{u}^\dagger + \breve{\lambda}\boldsymbol{v}\boldsymbol{u}^\dagger\]

Note that this alternative method is useful if the left eigenvector is already known. If, however, it still needs to be computed, the previous approach is expected to be computationally cheaper, as it only requires solving one linear system. In the case where \(A\) is hermitian and thus \(\boldsymbol{v}=\boldsymbol{u}\), the two approaches coincide.
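The final result of this subsection can again be checked against finite differences; the numpy sketch below (ad hoc names) fixes the gauge by rescaling the perturbed eigenvector so that \(\boldsymbol{v}^\dagger\boldsymbol{u}(t)=1\) stays exact:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
a = 1.0  # arbitrary nonzero parameter

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Adot = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

def top_right_eigvec(M):
    w, V = np.linalg.eig(M)
    i = np.argmax(w.real)
    return w[i], V[:, i]

lam, u = top_right_eigvec(A)
# left eigenvector: eigenvector of A^dagger for the conjugate eigenvalue
wl, Vl = np.linalg.eig(A.conj().T)
v = Vl[:, np.argmin(np.abs(wl - lam.conjugate()))]
v = v / (v.conj() @ u).conjugate()   # normalise so that v^dagger u = 1

lamdot = v.conj() @ Adot @ u         # forward rule for the eigenvalue
s = np.linalg.solve(lam * np.eye(n) - A - np.outer(u, v.conj()) / a, Adot @ u)
udot = s - u * (v.conj() @ s)        # (I - u v^dagger) s

# finite differences in the gauge v^dagger u(t) = 1
def gauged(M):
    lam_, u_ = top_right_eigvec(M)
    return lam_, u_ / (v.conj() @ u_)

eps = 1e-6
lp, up = gauged(A + eps * Adot)
lm, um = gauged(A - eps * Adot)
err_lam = abs(lamdot - (lp - lm) / (2 * eps))
err_u = np.max(np.abs(udot - (up - um) / (2 * eps)))
```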

4.4 Higher-dimensional invariant subspace

Now let us consider the case where \(V\in{\mathbb{C}}^{n \times p}\) contains \(p\) eigenvectors \(\boldsymbol{v}_i\) and associated eigenvalues \(\lambda_i\) for \(i=1,\ldots,p\). Of course, we can just reuse the result from the previous subsection for each pair \((\lambda_i,\boldsymbol{v}_i)\) separately. For the reverse rule, because of linearity, we can then simply add the results, i.e.

\[\begin{align} \breve{A} = \sum_{i=1}^{p}(\overline{\lambda}_iI-A^\dagger -\frac{1}{\overline{a}}\boldsymbol{v}_i\boldsymbol{v}_i^\dagger)^{-1}[(I-\boldsymbol{v}_i\boldsymbol{v}_i^\dagger)\boldsymbol{\breve{v}_i} - \frac{1}{\overline{a}}\breve{\lambda}_i\boldsymbol{v}_i]\boldsymbol{v}_i^\dagger \end{align}\]

However, it might be that there are alternative approaches.

4.4.1 Hermitian case

Let us first focus on the case where \(A\) is hermitian, as in this case the matrix \(V\) satisfies the additional constraint that it is isometric, so that its columns are not independent.

We can then parameterise \(\dot{V}=V\dot{K}+\dot{L}\) where \(\dot{K}\) is antihermitian and \(V^\dagger \dot{L} = O\). We insert this decomposition in

\[\dot{A}V + A\dot{V}=\dot{V}D+V\dot{D}\]

and project the resulting equation onto \(V\) and its orthogonal complement, in order to obtain

\[\begin{align} V^\dagger \dot{A}V &= \dot{K}D-D\dot{K} + \dot{D}\\ (I- VV^\dagger)\dot{A}V &= \dot{L}D -(I- VV^\dagger)A\dot{L}=-\mathcal{S}_{(I- VV^\dagger)A,D}(\dot{L}) \end{align}\]

The first equation is identical to that of a full eigenvalue decomposition, so that we can recycle the results from that section. For the Sylvester equation on the second line, one might be inclined to omit the projector \((I- VV^\dagger)\) in front of \(A\dot{L}\), as \(V^\dagger A\dot{L}=DV^\dagger\dot{L}\), and \(V^\dagger\dot{L}\) is supposed to be zero. However, it is exactly the presence of this factor that enforces that \(\dot{L}\) satisfies this constraint. A different perspective is that the projector is necessary to ensure that \((I- VV^\dagger)A\) and \(D\) do not have common eigenvalues, which ensures that the Sylvester equation has a unique solution. In fact, this requires that all eigenvalues with higher multiplicity are either completely contained in the subspace \(V\) or in its orthogonal complement. Furthermore, it is also important that none of the eigenvalues in \(V\), and thus none of the diagonal elements of \(D\), are zero, since \((I- VV^\dagger)A\) annihilates the subspace \(V\) and thus has zero eigenvalues there. If \(D\) does contain zero diagonal elements, it can help to regularise the second equation as

\[\begin{align} (I- VV^\dagger)\dot{A}V &= \dot{L}D -A\dot{L} - VD'V^\dagger \dot{L} \end{align}\]

where \(D'\) is a diagonal matrix that is chosen such that \(D\) and \(D+D'\) do not have any eigenvalues (diagonal elements) in common. If \(D\) does not contain zero eigenvalues, then the choice \(D'=-D\) is admissible and reduces this equation to its original form.

We now find the solution

\[\begin{align} \dot{K} &= F \odot (V^\dagger \dot{A}V)\\ \dot{D} &= \mathcal{P}_{rD}(V^\dagger \dot{A}V)\\ \dot{L} &= - \mathcal{S}_{A+VD'V^\dagger , D}^{-1}((I- VV^\dagger)\dot{A}V) \end{align}\] with \(F_{ij}=\begin{cases} \frac{1}{\lambda_j-\lambda_i},& i\neq j\\ 0,&i=j\end{cases}\) and \(\lambda_i = D_{ii} \in {\mathbb{R}}\).
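The equation for \(\dot{L}\) can be solved directly with `scipy.linalg.solve_sylvester`, and the result checked against finite differences of the spectral projector \(VV^\dagger\), which is gauge invariant (the \(V\dot{K}\) contributions to \(\dot{(VV^\dagger)}\) cancel by antihermiticity of \(\dot{K}\)). A real-symmetric numpy/scipy sketch, assuming the \(p\) lowest eigenvalues are simple and nonzero (all names are ad hoc):

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(6)
n, p = 8, 3

X = rng.standard_normal((n, n))
A = (X + X.T) / 2 + 10 * np.eye(n)   # shifted so that no eigenvalue is zero
Xd = rng.standard_normal((n, n))
Adot = (Xd + Xd.T) / 2

lam, U = np.linalg.eigh(A)
V, D = U[:, :p], np.diag(lam[:p])    # invariant subspace of the p lowest eigenvalues

# Sylvester equation  Ldot D - (I - V V^T) A Ldot = (I - V V^T) Adot V,
# using (I - V V^T) A = A - V D V^T;  solve_sylvester solves  a x + x b = q
C = (np.eye(n) - V @ V.T) @ Adot @ V
Ldot = solve_sylvester(-(A - V @ D @ V.T), D, C)

# projector derivative: Pdot = Ldot V^T + V Ldot^T, compared to finite differences
def proj(M):
    W = np.linalg.eigh(M)[1][:, :p]
    return W @ W.T

eps = 1e-6
Pdot_fd = (proj(A + eps * Adot) - proj(A - eps * Adot)) / (2 * eps)
err = np.max(np.abs(Ldot @ V.T + V @ Ldot.T - Pdot_fd))
```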

For the reverse rule, we insert \(\dot{V}=V\dot{K} + (I- VV^\dagger)\dot{L}\), where the explicit projector onto the orthogonal complement of \(V\) in the second term does not affect \(\dot{L}\), but will help in projecting the corresponding component out of the adjoint variable \(\breve{V}\). We then find \[\begin{align} \mathop{\mathrm{RTr}}(\breve{V}^\dagger \dot{V})&+\mathop{\mathrm{RTr}}(\breve{D}^\dagger \dot{D})\\ &=\mathop{\mathrm{RTr}}([(V^\dagger\breve{V})\odot F + \mathcal{P}_{rD}(\breve{D})]^\dagger V^\dagger\dot{A}V) + \mathop{\mathrm{RTr}}([-\mathcal{S}_{A+VD'V^\dagger , D}^{-1}((I- VV^\dagger)\breve{V})]^\dagger (I- VV^\dagger)\dot{A}V)\\ &= \mathop{\mathrm{RTr}}([V\{(V^\dagger\breve{V})\odot F + \mathcal{P}_{rD}(\breve{D})\}V^\dagger - (I- VV^\dagger)\mathcal{S}_{A+VD'V^\dagger , D}^{-1}((I- VV^\dagger)\breve{V})V^\dagger]^\dagger\dot{A}) \end{align}\]

where we have used the hermiticity of \(A\) (and \(D\) and \(D'\)) in order to bring the (inverse) Sylvester superoperator to the dual variables. We thus find

\[\breve{A}=V[(V^\dagger\breve{V})\odot F + \mathcal{P}_{rD}(\breve{D})]V^\dagger - (I- VV^\dagger)\mathcal{S}_{A+VD'V^\dagger , D}^{-1}((I- VV^\dagger)\breve{V})V^\dagger\]

where, as mentioned above, \(D'\) is an arbitrary (real) diagonal matrix that is chosen such that \(D\) and \(D+D'\) do not have any eigenvalues (diagonal elements) in common.

Alternatively, if we can afford to explicitly compute a complement \(V_\perp\) and parameterise \(\dot{L} = V_\perp \dot{K}_\perp\) and thus \(\dot{V}=V\dot{K}+V_\perp \dot{K}_\perp\), we would find

\[\begin{align} \dot{K} &= F \odot (V^\dagger \dot{A}V)\\ \dot{D} &= \mathcal{P}_{rD}(V^\dagger \dot{A}V)\\ \dot{K}_\perp &= - \mathcal{S}_{V_\perp^\dagger AV_\perp, D}^{-1}(V_\perp^\dagger\dot{A}V) \end{align}\]

where again we assume that \(A\) does not have common eigenvalues in the subspace \(V\) (diagonal elements of \(D\)) and in the subspace \(V_\perp\). This description translates to a reverse rule

\[\breve{A}=V[(V^\dagger\breve{V})\odot F + \mathcal{P}_{rD}(\breve{D})]V^\dagger - V_\perp\mathcal{S}_{V_\perp^\dagger AV_\perp , D}^{-1}(V_\perp^\dagger\breve{V})V^\dagger\]

4.4.2 Non-hermitian case

For a non-hermitian matrix \(A\), we can still consider a subspace spanned by the columns of \(V\), where we assume these columns to be different eigenvectors, such that \(AV=VD\).

As these columns are now not necessarily orthogonal, we first compute their Gram matrix \(G=V^\dagger V\), in order to build an orthogonal projector

\[P_{V} = V G^{-1} V^\dagger\]

Note that \(G^{-1}V^\dagger\) is often referred to as the (Moore-Penrose) pseudoinverse of \(V\) and is then denoted as \(V^+\). In the case where \(V\in{\mathbb{C}}^{n \times n}\) is invertible, it coincides with \(V^{-1}\). In the case where the columns of \(V\) are orthonormal, \(G=I\) and it coincides with \(V^\dagger\).
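In numpy terms, this is a trivial but useful check (names are ad hoc): the projector built from the Gram matrix is idempotent and hermitian, and \(G^{-1}V^\dagger\) agrees with the numerical pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 6, 3

# a non-orthonormal basis of a p-dimensional subspace
V = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
G = V.conj().T @ V                       # Gram matrix
Vplus = np.linalg.solve(G, V.conj().T)   # G^{-1} V^dagger
P = V @ Vplus                            # orthogonal projector P_V

idem_err = np.max(np.abs(P @ P - P))                  # P is idempotent
herm_err = np.max(np.abs(P - P.conj().T))             # and hermitian
pinv_err = np.max(np.abs(Vplus - np.linalg.pinv(V)))  # G^{-1} V^dagger = V^+
```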

We can now again parameterise \(\dot{V} = V \dot{K} + \dot{L}\), which is simply a way to decompose \(\dot{V}\) into a component along \(V\) and a component in the orthogonal complement. Put differently, we have \(V\dot{K} = P_{V}\dot{V}\) and \(\dot{L} = (I- P_{V})\dot{V}\), which shows that \(V^\dagger \dot{L} =O\). In this case, however, there is no constraint on \(\dot{K}\). Nonetheless, as the individual columns of \(V\) are only determined up to phases and normalisation, we can still impose a condition on the diagonal of \(\dot{K}\), exactly as was done in the full eigenvalue decomposition.

Now inserting \(\dot{V} = V \dot{K} + \dot{L}\) in

\[\dot{A}V + A\dot{V} = \dot{V}D + V\dot{D}\]

and projecting onto \(V\) and its orthogonal complement, we find

\[\begin{align} V^\dagger\dot{A}V + GD\dot{K} + V^\dagger A \dot{L} &= G\dot{K} D + G \dot{D}\\ (I- VG^{-1}V^\dagger)\dot{A}V + (I- VG^{-1}V^\dagger)A\dot{L} &= \dot{L} D \end{align}\]

or thus

\[\begin{align} G^{-1}V^\dagger\left(\dot{A}V +A \dot{L}\right) &= \dot{K} D - D\dot{K} + \dot{D}\\ (I- VG^{-1}V^\dagger)\dot{A}V &= \dot{L} D - (I- VG^{-1}V^\dagger)A\dot{L} = - \mathcal{S}_{(I- VG^{-1}V^\dagger)A,D}(\dot{L}) \end{align}\]

The first equation defines \(\dot{D}\) and \(\dot{K}\) in terms of the left hand side, which still includes \(\dot{L}\). The latter can be completely determined from the second equation. This second equation, as before, contains a Sylvester operator involving the matrix \(D\) (the restriction of \(A\) onto the subspace \(V\)) and \((I-VG^{-1}V^\dagger)A\). The latter matrix has the same spectrum as \(A\), except that all eigenvalues contained in \(V\) have been mapped to zero. More generally, we could again regularise this Sylvester equation by replacing \((I-VG^{-1}V^\dagger)A\) with \(A + VD'G^{-1}V^\dagger = A+V D'V^+\), where \(D'\) is chosen such that \(D\) and \(D+D'\) have no eigenvalues in common. Note that the two choices \((I-VG^{-1}V^\dagger)A\) and \(A + VD'G^{-1}V^\dagger = A+V D'V^+\) now do not become equal for the choice \(D'=-D\), even though the effect on the spectrum is the same (right eigenvectors in \(V\) are preserved and have eigenvalue zero, complementary eigenvalues are preserved together with their left eigenvectors). To streamline the rest of the discussion, we will denote either of these choices as \(A'\).

We now find the solution

\[\begin{align} \dot{L} &= - \mathcal{S}_{A', D}^{-1}((I- VV^+)\dot{A}V)=- \mathcal{S}_{A', D}^{-1}((I- P_V)\dot{A}V)\\ \dot{K} &= F \odot (V^+ \dot{A}V + V^+ A\dot{L}) = F \odot (V^+ \dot{A}V - V^+ A\mathcal{S}_{A', D}^{-1}((I- P_V)\dot{A}V))\\ \dot{D} &= \mathcal{P}_{D}(V^+ \dot{A}V + V^+ A\dot{L})=\mathcal{P}_{D}(V^+ \dot{A}V - V^+ A\mathcal{S}_{A', D}^{-1}((I- P_V)\dot{A}V))\\ \end{align}\]

with, as before, \(F_{ij}=\begin{cases} \frac{1}{\lambda_j-\lambda_i},& i\neq j\\ 0,&i=j\end{cases}\) and \(\lambda_i = D_{ii}\).
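This forward rule can again be checked against finite differences. The sketch below (NumPy; all names and the dense, vectorised Sylvester solve are our own choices) matches the perturbed eigenvalues to the original ones and rescales each perturbed eigenvector so that its coefficient along \(v_j\) stays equal to one, which realises the gauge \(\dot{K}_{jj}=0\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
A = randc(n, n)                          # generic non-hermitian matrix
Adot = randc(n, n); Adot /= np.linalg.norm(Adot)

w, Q = np.linalg.eig(A)
sel = np.argsort(w.real)[:p]             # an arbitrary choice of p eigenpairs
d, V = w[sel], Q[:, sel]
Vplus = np.linalg.pinv(V)                # V^+ = G^{-1} V'
PV = V @ Vplus                           # orthogonal projector onto span(V)
Aprime = (np.eye(n) - PV) @ A            # the choice A' = (I - P_V) A

# Ldot solves A' Ldot - Ldot D = -(I - P_V) Adot V (dense, vectorised solve)
R = (np.eye(n) - PV) @ Adot @ V
op = np.kron(np.eye(p), Aprime) - np.kron(np.diag(d), np.eye(n))
Ldot = np.linalg.solve(op, -R.ravel(order='F')).reshape(n, p, order='F')
M = Vplus @ (Adot @ V + A @ Ldot)
F = np.array([[0.0 if i == j else 1 / (d[j] - d[i]) for j in range(p)]
              for i in range(p)])
Kdot, Ddot = F * M, np.diag(M)
Vdot = V @ Kdot + Ldot

# finite differences: match eigenvalues, then rescale each new eigenvector so
# that its coefficient along v_j stays one (the gauge with zero diagonal of K)
t = 1e-6
w2, Q2 = np.linalg.eig(A + t * Adot)
match = [int(np.argmin(np.abs(w2 - lam))) for lam in d]
d2, V2 = w2[match], Q2[:, match]
V2 = V2 / np.diag(Vplus @ V2)
err_d = np.max(np.abs((d2 - d) / t - Ddot))
err_V = np.max(np.abs((V2 - V) / t - Vdot))
```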

For the reverse rule, we then start from

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{V}^\dagger \dot{V}) &+ \mathop{\mathrm{RTr}}(\breve{D}\dot{D}) \\ &= \mathop{\mathrm{RTr}}([(V^\dagger \breve{V}) \odot {\overline{F}} + \mathcal{P}_{D}(\breve{D})]^\dagger [ V^+ \dot{A}V + V^+ A\dot{L}]) + \mathop{\mathrm{RTr}}(((I- P_V)\breve{V})^\dagger\dot{L}) \end{align}\]

and using the shorthand \[\begin{align} Z &= (V^+)^\dagger[(V^\dagger \breve{V}) \odot {\overline{F}} + \mathcal{P}_{D}(\breve{D})] = V G^{-1} [(V^\dagger \breve{V}) \odot {\overline{F}} + \mathcal{P}_{D}(\breve{D})] \end{align}\]

we find

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{V}^\dagger \dot{V}) &+ \mathop{\mathrm{RTr}}(\breve{D}\dot{D}) \\ &= \mathop{\mathrm{RTr}}(Z^\dagger \dot{A}V) + \mathop{\mathrm{RTr}}([(I- P_V)\breve{V} + A^\dagger Z]^\dagger \dot{L})\\ &= \mathop{\mathrm{RTr}}(Z^\dagger \dot{A}V) - \mathop{\mathrm{RTr}}([(I- P_V)\breve{V} + A^\dagger Z]^\dagger\mathcal{S}_{A', D}^{-1}((I- P_V)\dot{A}V))\\ &= \mathop{\mathrm{RTr}}(Z^\dagger \dot{A}V) - \mathop{\mathrm{RTr}}((\mathcal{S}_{A'^\dagger,D^\dagger}^{-1}((I- P_V)\breve{V} + A^\dagger Z))^\dagger((I- P_V)\dot{A}V))\\ &= \mathop{\mathrm{RTr}}([Z V^\dagger - (I- P_V) \mathcal{S}^{-1}_{A'^\dagger,D^\dagger}((I-P_V)\breve{V}+A^\dagger Z)V^\dagger]^\dagger \dot{A}) \end{align}\]

in order to find

\[\begin{align} \breve{A}=Z V^\dagger - (I- P_V) \mathcal{S}^{-1}_{A'^\dagger,D^\dagger}((I-P_V)\breve{V}+A^\dagger Z)V^\dagger \end{align}\]

5 Singular value decomposition

A matrix \(A\in {\mathbb{C}}^{m\times n}\) can be decomposed as

\[A=USV^\dagger\]

where \(S \in {\mathbb{C}}^{k \times k}\) is diagonal with real non-negative elements, which are typically sorted such that \(S_{ii} = s_i \geq s_{i+1}\). Furthermore, \(U\in {\mathbb{C}}^{m \times k}\) and \(V\in{\mathbb{C}}^{n \times k}\) are isometric. Here, \(k=\min(m,n)\), but \(k\) can also be chosen equal to some smaller value \(r = \mathop{\mathrm{rank}}(A)\), as \(s_{r+1}=s_{r+2}=\ldots=s_{\min(m,n)}=0\).

Often, we are interested in a truncated singular value decomposition, where only the \(p\) largest singular values are kept, but some smaller nonzero singular values are discarded. In this case, the decomposition is not exact. We however still have the exact equality

\[\begin{align} A V &= U S\\ A^\dagger U &= V S \end{align}\]

Note that we always have the gauge freedom \(U\to UD\) and \(V\to VD\) with \(D\) a unitary matrix that satisfies \(DS = SD\). When all the singular values are distinct, this amounts to a diagonal matrix with pure phases.

5.1 Full rank decomposition

We start with the case of a square matrix (\(m=n\)) that has full rank (\(r=m\)), for which we consider the full singular value decomposition.

To derive the forward rule, we write

\[\dot{A} V + A\dot{V} = \dot{U}S + U\dot{S}\]

and parameterise \(\dot{U}= U\dot{K}\) and \(\dot{V}=V\dot{M}\), with \(\dot{K}\) and \(\dot{M}\) antihermitian. We then find

\[U^\dagger \dot{A} V = \dot{K} S - S\dot{M} + \dot{S}\]

Writing the components of these equations for the diagonal and off-diagonal separately, we obtain

\[\begin{align} (U^\dagger \dot{A} V)_{ii} &= s_i(\dot{K}_{ii} -\dot{M}_{ii}) + \dot{S}_{ii}\\ (U^\dagger \dot{A} V)_{ij} &= \dot{K}_{ij} s_j -\dot{M}_{ij} s_i, \qquad i\neq j \end{align}\]

Because the antihermitian matrices \(\dot{K}\) and \(\dot{M}\) only have imaginary diagonal entries, we immediately obtain \(\dot{S}=\mathcal{P}_{rD}(U^\dagger \dot{A} V)\). For the imaginary parts of the diagonal components, we cannot unambiguously distribute them over \(\dot{K}_{ii}\) and \(\dot{M}_{ii}\) because of the gauge invariance, which means that only their difference is fixed. This implies that any adjoint variables resulting from a gauge invariant objective function should satisfy \(\mathop{\mathrm{RTr}}(\breve{U}^\dagger \dot{U} + \breve{V}^\dagger \dot{V})=0\) whenever \(\dot{U} = U\dot{D}\) and \(\dot{V}=V\dot{D}\) for the same purely imaginary and diagonal matrix \(\dot{D}\), which we could express as the condition

\[\mathcal{P}_{iD}(U^\dagger \breve{U} + V^\dagger\breve{V})=0\]

for the adjoint variables associated with a gauge-invariant cost function. We can generalise this constraint to the case where some singular values coincide.

We now make an arbitrary gauge choice that

\[\dot{K}_{ii} = - \dot{M}_{ii}= \frac{{\mathrm{i}}}{2} \frac{\mathop{\mathrm{Im}}((U^\dagger \dot{A}V)_{ii})}{s_i} = \frac{(U^\dagger \dot{A}V)_{ii} - (V^\dagger \dot{A}^\dagger U)_{ii}}{4 s_i}\]

For the off-diagonal components of \(\dot{K}\) and \(\dot{M}\), we parameterise the components above the diagonal in terms of those below the diagonal, in order to find

\[\begin{align} (U^\dagger \dot{A} V)_{ij} &= \dot{K}_{ij} s_j -\dot{M}_{ij} s_i, \qquad i > j\\ (U^\dagger \dot{A} V)_{ij} &= -\overline{\dot{K}_{ji}} s_j +\overline{\dot{M}_{ji}} s_i, \qquad i< j \end{align}\]

or thus, for all \(i > j\): \[\begin{align} (U^\dagger \dot{A} V)_{ij} &= \dot{K}_{ij} s_j -\dot{M}_{ij} s_i\\ -(V^\dagger \dot{A}^\dagger U)_{ij} &= \dot{K}_{ij} s_i -\dot{M}_{ij} s_j \end{align}\] from which we can solve \[\begin{align} \dot{K}_{ij} &= \frac{s_j (U^\dagger \dot{A} V)_{ij} + s_i (V^\dagger \dot{A}^\dagger U)_{ij}}{s_j^2 - s_i^2} = \frac{1}{2} \frac{(U^\dagger \dot{A} V)_{ij} + (V^\dagger \dot{A}^\dagger U)_{ij}}{s_j - s_i} + \frac{1}{2} \frac{(U^\dagger \dot{A} V)_{ij} - (V^\dagger \dot{A}^\dagger U)_{ij}}{s_j + s_i} \\ \dot{M}_{ij} &= \frac{s_j (V^\dagger \dot{A}^\dagger U)_{ij} + s_i (U^\dagger \dot{A} V)_{ij}}{s_j^2 - s_i^2}=\frac{1}{2} \frac{(U^\dagger \dot{A} V)_{ij} + (V^\dagger \dot{A}^\dagger U)_{ij}}{s_j - s_i} - \frac{1}{2} \frac{(U^\dagger \dot{A} V)_{ij} - (V^\dagger \dot{A}^\dagger U)_{ij}}{s_j + s_i} \end{align}\] While these equations were derived for the entries of \(\dot{K}\) and \(\dot{M}\) with \(i > j\), it can easily be verified that the same equations hold for \(i < j\). Indeed, the first term on the last right hand side of both equations can be recognised as the Hermitian part \(\mathcal{P}_{H}(U^\dagger \dot{A}V)\), which is then made antihermitian by the Hadamard product with \(F\), where \(F_{ij}=\begin{cases}\frac{1}{s_j - s_i},& i\neq j\\0,&\text{otherwise}\end{cases}\). The second term of both right hand sides corresponds to the antihermitian part \(\mathcal{P}_{A}(U^\dagger \dot{A}V)\), which remains antihermitian upon taking the Hadamard product with \(G\), where \(G_{ij} = \frac{1}{s_i+s_j}\). In fact, this last term is even compatible with the specific gauge we have chosen for the diagonal entries of \(\dot{K}\) and \(\dot{M}\), and can thus be taken to hold for all \(i,j=1,\ldots,n\). We then obtain the final result for the forward rules

\[\begin{align} \dot{S} &= \mathcal{P}_{rD}(U^\dagger \dot{A}V) = \mathcal{P}_{D}(\mathcal{P}_{H}(U^\dagger \dot{A}V))\\ \dot{U} &= U (F\odot \mathcal{P}_{H}(U^\dagger \dot{A}V) + G\odot \mathcal{P}_{A}(U^\dagger \dot{A}V))\\ \dot{V} &= V (F\odot \mathcal{P}_{H}(U^\dagger \dot{A}V) - G\odot \mathcal{P}_{A}(U^\dagger \dot{A}V)) \end{align}\]
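These forward rules can be verified numerically against finite differences; the only subtlety is aligning the joint phase of each pair \((u_j, v_j)\) returned by the SVD routine with the gauge chosen above. A minimal NumPy sketch (names ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
A = randc(n, n)
Adot = randc(n, n); Adot /= np.linalg.norm(Adot)

U, s, Vh = np.linalg.svd(A)
V = Vh.conj().T
M = U.conj().T @ Adot @ V
PH = (M + M.conj().T) / 2                      # hermitian part
PA = (M - M.conj().T) / 2                      # antihermitian part
F = np.array([[0.0 if i == j else 1 / (s[j] - s[i]) for j in range(n)]
              for i in range(n)])
G = 1 / (s[:, None] + s[None, :])
Sdot = np.real(np.diag(M))
Udot = U @ (F * PH + G * PA)
Vdot = V @ (F * PH - G * PA)

# finite differences; align the joint phase of each pair (u_j, v_j)
t = 1e-6
U2, s2, V2h = np.linalg.svd(A + t * Adot)
V2 = V2h.conj().T
a = np.diag(U.conj().T @ U2)
b = np.diag(V.conj().T @ V2)
sgn = np.where(a.real < 0, -1.0, 1.0)          # undo possible joint sign flips
U2, V2, a, b = U2 * sgn, V2 * sgn, a * sgn, b * sgn
ph = np.sqrt(a * b)                            # residual joint phase per column
U2, V2 = U2 / ph, V2 / ph
err_s = np.max(np.abs((s2 - s) / t - Sdot))
err_U = np.max(np.abs((U2 - U) / t - Udot))
err_V = np.max(np.abs((V2 - V) / t - Vdot))
```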

We can now readily obtain the reverse rule using \[\begin{align} \mathop{\mathrm{RTr}}(\breve{U}^\dagger \dot{U})&+\mathop{\mathrm{RTr}}(\breve{V}^\dagger \dot{V})+\mathop{\mathrm{RTr}}(\breve{S}^\dagger \dot{S})\\ &= \mathop{\mathrm{RTr}}(((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F)^\dagger \mathcal{P}_H(U^\dagger \dot{A}V)) \\ &\qquad + \mathop{\mathrm{RTr}}(((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)^\dagger \mathcal{P}_A(U^\dagger \dot{A}V))\\ &\qquad+ \mathop{\mathrm{RTr}}(\mathcal{P}_{rD}(\breve{S})^\dagger (U^\dagger\dot{A}V))\\ &= \mathop{\mathrm{RTr}}((U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_H((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F) + \mathcal{P}_A((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)]V^\dagger)^\dagger\dot{A}) \end{align}\] from which we conclude that

\[\begin{align} \breve{A} &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_H((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F) + \mathcal{P}_A((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)]V^\dagger\\ &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_A(U^\dagger\breve{U} + V^\dagger\breve{V})\odot F + \mathcal{P}_A(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ \end{align}\]

where thus

\[\begin{align} F_{ij}&=\begin{cases}\frac{1}{s_j - s_i},& i\neq j\\0,&\text{otherwise}\end{cases},\\ G_{ij} &= \frac{1}{s_i+s_j}. \end{align}\]

Using \[\begin{align} (F\odot G)_{ij}=\begin{cases}\frac{1}{s_j^2 - s_i^2},& i\neq j\\0,&\text{otherwise}\end{cases} \end{align}\] we could rewrite this as \[\begin{align} \breve{A} &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_A(U^\dagger\breve{U} + V^\dagger\breve{V})\odot F + \mathcal{P}_A(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ &= U [ \mathcal{P}_{rD}(\breve{S}) + (\{\mathcal{P}_A(U^\dagger\breve{U} + V^\dagger\breve{V}), S\} + [\mathcal{P}_A(U^\dagger\breve{U} - V^\dagger\breve{V}),S])\odot (F \odot G) + \mathcal{P}_{iD}(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ &= U [ \mathcal{P}_{rD}(\breve{S}) + (2\mathcal{P}_A(U^\dagger\breve{U})\cdot S + 2 S \cdot \mathcal{P}_A(V^\dagger\breve{V}))\odot (F \odot G)+ \mathcal{P}_{iD}(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ &= U [ \mathcal{P}_{rD}(\breve{S}) + 2\mathcal{P}_H(U^\dagger\breve{U}\odot (F \odot G))\cdot S + S \cdot 2\mathcal{P}_H(V^\dagger\breve{V}\odot (F \odot G))+ \mathcal{P}_{iD}(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ \end{align}\]

It is this last form that appears for example in https://arxiv.org/abs/1909.02659, except that a different gauge for the diagonal elements was chosen in the forward rule, which also alters the final term of the reverse rule. Indeed, in that reference, the gauge for the diagonal elements was chosen such that

\[\dot{K}_{ii} =0, \qquad \dot{M}_{ii}= \frac{(U^\dagger \dot{A}V)_{ii} - (V^\dagger \dot{A}^\dagger U)_{ii}}{2 s_i}\]

While it might seem suspicious that this affects the reverse rule, note that dual variables associated with gauge invariant objective functions should satisfy

\[\mathcal{P}_{iD}(U^\dagger \breve{U} + V^\dagger\breve{V})=0\]

as mentioned above, so that we could then reexpress

\[\mathcal{P}_{iD}(U^\dagger\breve{U} - V^\dagger\breve{V}) = -2 \mathcal{P}_{iD}(V^\dagger\breve{V})\]

Using this substitution, our reverse rule equals Equation 1 in https://arxiv.org/abs/1909.02659.
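Because the derivation of the reverse rule above is a chain of exact identities, it can also be checked without finite differences: for arbitrary cotangents, \(\mathop{\mathrm{RTr}}(\breve{A}^\dagger\dot{A})\) must reproduce the pairing of the cotangents with the forward-rule tangents. A small NumPy sketch (names ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
RTr = lambda X, Y: float(np.real(np.sum(np.conj(X) * Y)))
PH = lambda X: (X + X.conj().T) / 2
PA = lambda X: (X - X.conj().T) / 2

A, Adot = randc(n, n), randc(n, n)             # arbitrary point and tangent
Ubrev, Vbrev = randc(n, n), randc(n, n)        # arbitrary cotangents
Sbrev = np.diag(rng.standard_normal(n))        # real diagonal cotangent of S

U, s, Vh = np.linalg.svd(A)
V = Vh.conj().T
M = U.conj().T @ Adot @ V
F = np.array([[0.0 if i == j else 1 / (s[j] - s[i]) for j in range(n)]
              for i in range(n)])
G = 1 / (s[:, None] + s[None, :])

# forward rule (with the same gauge as in the text)
Sdot = np.diag(np.real(np.diag(M)))
Udot = U @ (F * PH(M) + G * PA(M))
Vdot = V @ (F * PH(M) - G * PA(M))

# reverse rule
core = (Sbrev + PH((U.conj().T @ Ubrev + V.conj().T @ Vbrev) * F)
              + PA((U.conj().T @ Ubrev - V.conj().T @ Vbrev) * G))
Abrev = U @ core @ V.conj().T

err = abs(RTr(Abrev, Adot)
          - RTr(Ubrev, Udot) - RTr(Vbrev, Vdot) - RTr(Sbrev, Sdot))
```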

5.2 General result

We now discuss the general result for the singular value decomposition of a rectangular matrix \(A \in {\mathbb{C}}^{m\times n}\), which potentially has a rank \(r \leq \min(m,n)\), and which is possibly truncated, i.e. for which we only compute \(p\leq r\) (typically the largest) singular values and corresponding vectors. We start from the exact relations

\[\begin{align} AV &= US,&A^\dagger U &= VS \end{align}\]

with thus \(U \in {\mathbb{C}}^{m \times p}\), \(V \in {\mathbb{C}}^{n \times p}\) and \(S \in {\mathbb{R}}_{\geq 0}^{p \times p}\). For the forward derivatives, we obtain

\[\begin{align} \dot{A}V &= \dot{U}S - A\dot{V} + U\dot{S}\\ \dot{A}^\dagger U &= \dot{V}S - A^\dagger\dot{U} + V\dot{S} \end{align}\] where we now insert \[\begin{align} \dot{U}&=U\dot{K}+\dot{L},\qquad\dot{K}=-\dot{K}^\dagger,\qquad U^\dagger \dot{L}=O\\ \dot{V}&=V\dot{M}+\dot{N},\qquad\dot{M}=-\dot{M}^\dagger,\qquad V^\dagger \dot{N}=O \end{align}\]

Projecting the first equation onto \(U^\dagger\) and the second onto \(V^\dagger\), we obtain

\[\begin{align} U^\dagger \dot{A}V &= \dot{K}S - S\dot{M} + \dot{S}\\ V^\dagger \dot{A}^\dagger U &= \dot{M}S - S\dot{K} + \dot{S} \end{align}\] which are the equations we had solved in the full rank case in the previous subsection.

Instead, we now project the first equation onto \((I- UU^\dagger)\) and the second on \((I- VV^\dagger)\) in order to find

\[\begin{align} (I- UU^\dagger)\dot{A}V &= \dot{L}S - (I- UU^\dagger)A\dot{N}\\ (I- VV^\dagger)\dot{A}^\dagger U &= \dot{N}S - (I- VV^\dagger)A^\dagger\dot{L} \end{align}\] The best way to solve these equations is to treat them as a single Sylvester equation by rewriting them in block matrix form as

\[\begin{align} \begin{bmatrix} (I- UU^\dagger)\dot{A}V\\ (I- VV^\dagger)\dot{A}^\dagger U \end{bmatrix} = \begin{bmatrix} \dot{L} \\ \dot{N} \end{bmatrix} S - \begin{bmatrix} O& (I- UU^\dagger)A \\ (I- VV^\dagger)A^\dagger & O\end{bmatrix} \begin{bmatrix} \dot{L} \\ \dot{N} \end{bmatrix} = -\mathcal{S}_{\begin{bmatrix} O& (I- UU^\dagger)A \\ (I- VV^\dagger)A^\dagger & O\end{bmatrix}, S}\left(\begin{bmatrix} \dot{L} \\ \dot{N} \end{bmatrix}\right) \end{align}\]

As in the case of the partial Hermitian eigenvalue factorisation, the terms \((I- UU^\dagger)A\dot{N}\) and \((I- VV^\dagger)A^\dagger\dot{L}\) could be simplified to \(A\dot{N}\) and \(A^\dagger\dot{L}\) by exploiting that \(V^\dagger \dot{N}=O\) and \(U^\dagger \dot{L}=O\). However, it is exactly the presence of these extra projectors that will enforce this condition, and ensure that the Sylvester operator has a trivial null space and is thus invertible. While we could consider more general types of regularisation as in the eigenvalue case, we now assume that \(S\) does not contain zeros on the diagonal, as we are considering a truncated SVD, which is used in cases where zero singular values are meant to be truncated away. Furthermore, note that \((I- UU^\dagger)A=A(I- VV^\dagger) = A-USV^\dagger\), so that the first matrix appearing in the Sylvester operator is Hermitian, and let us introduce the notation

\[A_\perp = (I- UU^\dagger)A=A(I- VV^\dagger) = A-USV^\dagger = (I- UU^\dagger)A(I- VV^\dagger)\]

We thus obtain

\[\begin{align} \dot{S} &= \mathcal{P}_{rD}(U^\dagger \dot{A}V) = \mathcal{P}_{D}(\mathcal{P}_{H}(U^\dagger \dot{A}V))\\ \dot{U} &= U \dot{K} + \dot{L}\\ \dot{V} &= V \dot{M} + \dot{N}\\ \dot{K} &= F\odot \mathcal{P}_{H}(U^\dagger \dot{A}V) + G\odot \mathcal{P}_{A}(U^\dagger \dot{A}V)\\ \dot{M} &= F\odot \mathcal{P}_{H}(U^\dagger \dot{A}V) - G\odot \mathcal{P}_{A}(U^\dagger \dot{A}V)\\ \begin{bmatrix} \dot{L} \\ \dot{N} \end{bmatrix} &= -\mathcal{S}^{-1}_{\begin{bmatrix} O& A_\perp \\ A_\perp^\dagger & O\end{bmatrix},S}\left(\begin{bmatrix} (I- UU^\dagger)\dot{A}V\\ (I- VV^\dagger)\dot{A}^\dagger U \end{bmatrix}\right) \end{align}\]

with

\[\begin{align} F_{ij}&=\begin{cases}\frac{1}{s_j - s_i},& i\neq j\\0,&\text{otherwise}\end{cases},\\ G_{ij} &= \frac{1}{s_i+s_j}. \end{align}\]

as before. To formulate the reverse rule, we will need the quantities

\[\begin{align} \begin{bmatrix} \breve{X} \\ \breve{Y} \end{bmatrix} &= -\mathcal{S}^{-1}_{\begin{bmatrix} O& A_\perp \\ A_\perp^\dagger & O\end{bmatrix},S}\left(\begin{bmatrix} (I- UU^\dagger)\breve{U}\\ (I- VV^\dagger)\breve{V} \end{bmatrix}\right) \end{align}\]

so that we can now write

\[\begin{align} \mathop{\mathrm{RTr}}(\breve{U}^\dagger \dot{U})&+\mathop{\mathrm{RTr}}(\breve{V}^\dagger \dot{V})+\mathop{\mathrm{RTr}}(\breve{S}^\dagger \dot{S})\\ &= \mathop{\mathrm{RTr}}(((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F)^\dagger \mathcal{P}_H(U^\dagger \dot{A}V)) \\ &\qquad + \mathop{\mathrm{RTr}}(((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)^\dagger \mathcal{P}_A(U^\dagger \dot{A}V))\\ &\qquad+ \mathop{\mathrm{RTr}}(\mathcal{P}_{rD}(\breve{S})^\dagger (U^\dagger\dot{A}V))\\ &\qquad +\mathop{\mathrm{RTr}}(\breve{X}^\dagger (I-UU^\dagger) \dot{A}V)\\ &\qquad +\mathop{\mathrm{RTr}}(\breve{Y}^\dagger (I-VV^\dagger) \dot{A}^\dagger U)\\ &= \mathop{\mathrm{RTr}}((U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_H((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F) + \mathcal{P}_A((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)]V^\dagger)^\dagger\dot{A})\\ &\qquad+\mathop{\mathrm{RTr}}([(I- UU^\dagger)\breve{X}V^\dagger]^\dagger \dot{A}) + \mathop{\mathrm{RTr}}([(I- VV^\dagger)\breve{Y}U^\dagger]^\dagger \dot{A}^\dagger) \end{align}\] from which we conclude that

\[\begin{align} \breve{A} &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_H((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F) + \mathcal{P}_A((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)]V^\dagger\\ &\qquad + (I- UU^\dagger)\breve{X}V^\dagger + U\breve{Y}^\dagger (I- VV^\dagger)\\ &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_A(U^\dagger\breve{U} + V^\dagger\breve{V})\odot F + \mathcal{P}_A(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ &\qquad + (I- UU^\dagger)\breve{X}V^\dagger + U\breve{Y}^\dagger (I- VV^\dagger). \end{align}\]

Note that \(U^\dagger \breve{X}=O\) and \(V^\dagger\breve{Y}=O\) should automatically be satisfied by the solutions \((\breve{X},\breve{Y})\) of the Sylvester equation, and the inclusion of the extra projectors onto the orthogonal complement of \(U\) and \(V\) in the terms containing \(\breve{X}\) and \(\breve{Y}\) is therefore redundant.
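As a sanity check of the truncated forward rule, the following NumPy sketch (names ours; the block Sylvester equation is solved densely via vectorisation, which is only feasible for small sizes) compares against finite differences of a truncated SVD of a rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 5, 4, 2
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
A = randc(m, n)
Adot = randc(m, n); Adot /= np.linalg.norm(Adot)

def tsvd(B):                                   # rank-p truncated SVD
    U, s, Vh = np.linalg.svd(B, full_matrices=False)
    return U[:, :p], s[:p], Vh[:p, :].conj().T

U, s, V = tsvd(A)
M = U.conj().T @ Adot @ V
PH = (M + M.conj().T) / 2
PA = (M - M.conj().T) / 2
F = np.array([[0.0 if i == j else 1 / (s[j] - s[i]) for j in range(p)]
              for i in range(p)])
G = 1 / (s[:, None] + s[None, :])
Kdot = F * PH + G * PA
Mdot = F * PH - G * PA
Sdot = np.real(np.diag(M))

# block Sylvester for (Ldot, Ndot):  Z S - B Z = rhs, B = [[0, Ap], [Ap', 0]]
Aperp = A - U @ np.diag(s) @ V.conj().T
B = np.block([[np.zeros((m, m)), Aperp],
              [Aperp.conj().T, np.zeros((n, n))]])
rhs = np.vstack([(np.eye(m) - U @ U.conj().T) @ Adot @ V,
                 (np.eye(n) - V @ V.conj().T) @ Adot.conj().T @ U])
op = np.kron(np.diag(s), np.eye(m + n)) - np.kron(np.eye(p), B)
Z = np.linalg.solve(op, rhs.ravel(order='F')).reshape(m + n, p, order='F')
Ldot, Ndot = Z[:m], Z[m:]
Udot = U @ Kdot + Ldot
Vdot = V @ Mdot + Ndot

# finite differences with the same joint-phase alignment as in the square case
t = 1e-6
U2, s2, V2 = tsvd(A + t * Adot)
a, b = np.diag(U.conj().T @ U2), np.diag(V.conj().T @ V2)
sgn = np.where(a.real < 0, -1.0, 1.0)
U2, V2, a, b = U2 * sgn, V2 * sgn, a * sgn, b * sgn
ph = np.sqrt(a * b)
U2, V2 = U2 / ph, V2 / ph
err_s = np.max(np.abs((s2 - s) / t - Sdot))
err_U = np.max(np.abs((U2 - U) / t - Udot))
err_V = np.max(np.abs((V2 - V) / t - Vdot))
```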

If we obtained the truncated SVD from first performing a full SVD, and thus have the orthogonal complements \(U_\perp\) and \(V_\perp\) available, we can rewrite this as

\[\begin{align} \dot{S} &= \mathcal{P}_{rD}(U^\dagger \dot{A}V) = \mathcal{P}_{D}(\mathcal{P}_{H}(U^\dagger \dot{A}V))\\ \dot{U} &= U \dot{K} + U_\perp \dot{K}_\perp\\ \dot{V} &= V \dot{M} + V_\perp \dot{M}_\perp\\ \dot{K} &= F\odot \mathcal{P}_{H}(U^\dagger \dot{A}V) + G\odot \mathcal{P}_{A}(U^\dagger \dot{A}V)\\ \dot{M} &= F\odot \mathcal{P}_{H}(U^\dagger \dot{A}V) - G\odot \mathcal{P}_{A}(U^\dagger \dot{A}V)\\ \begin{bmatrix} \dot{K}_\perp \\ \dot{M}_\perp \end{bmatrix} &= -\mathcal{S}^{-1}_{\begin{bmatrix} O& U_\perp^\dagger AV_\perp \\ V_\perp^\dagger A^\dagger U_\perp & O\end{bmatrix},S}\left(\begin{bmatrix} U_\perp^\dagger\dot{A}V\\ V_\perp^\dagger\dot{A}^\dagger U \end{bmatrix}\right) \end{align}\]

and thus for the reverse rule

\[\begin{align} \breve{A} &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_H((U^\dagger\breve{U} + V^\dagger\breve{V})\odot F) + \mathcal{P}_A((U^\dagger\breve{U} - V^\dagger\breve{V})\odot G)]V^\dagger\\ &\qquad + U_\perp \breve{X}'V^\dagger + U\breve{Y}'^\dagger V_\perp^\dagger\\ &= U [ \mathcal{P}_{rD}(\breve{S}) + \mathcal{P}_A(U^\dagger\breve{U} + V^\dagger\breve{V})\odot F + \mathcal{P}_A(U^\dagger\breve{U} - V^\dagger\breve{V})\odot G]V^\dagger\\ &\qquad + U_\perp \breve{X}'V^\dagger + U\breve{Y}'^\dagger V_\perp^\dagger\\ \end{align}\]

where now \(\breve{X}'=U_\perp^\dagger \breve{X}\) and \(\breve{Y}' =V_\perp^\dagger \breve{Y}\) (running out of letters) can directly be obtained from

\[\begin{align} \begin{bmatrix} \breve{X}' \\ \breve{Y}' \end{bmatrix} &= -\mathcal{S}^{-1}_{\begin{bmatrix} O& U_\perp^\dagger AV_\perp \\ V_\perp^\dagger A^\dagger U_\perp & O\end{bmatrix},S}\left(\begin{bmatrix} U_\perp^\dagger\breve{U}\\ V_\perp^\dagger\breve{V} \end{bmatrix}\right) \end{align}\]

or thus, writing out the Sylvester equation explicitly,

\[\begin{align} \breve{X}' S - U_\perp^\dagger AV_\perp \breve{Y}' &= U_\perp^\dagger\breve{U}\\ \breve{Y}'S - V_\perp^\dagger A^\dagger U_\perp \breve{X}' &= V_\perp^\dagger\breve{V} \end{align}\]

If \(U_\perp\) and \(V_\perp\) are not just any orthogonal complement of \(U\) and \(V\), but truly come from a full SVD computation, it furthermore holds that \(U_\perp^\dagger AV_\perp\) is also diagonal. Note that \(U_\perp \in {\mathbb{C}}^{m \times (m-p)}\) and \(V_\perp \in {\mathbb{C}}^{n \times (n-p)}\), whereas \(\breve{X}' \in {\mathbb{C}}^{(m-p) \times p}\) and \(\breve{Y}' \in {\mathbb{C}}^{(n-p)\times p}\). The matrix \(U_\perp^\dagger AV_\perp\) is thus an \((m-p) \times (n-p)\) matrix with nonzero elements only on the diagonal, namely \((U_\perp^\dagger AV_\perp)_{ij} = s_{i+p} \delta_{ij}\), with \(\{s_i, i=1,\ldots,\min(m,n) \}\) the list of all singular values of \(A\) and \(\{s_i,i=1,\ldots,p\}\) the singular values contained in \(S\).

We now find the componentwise equations

\[\begin{align} \breve{X}'_{ik} s_k - s_{i+p} \breve{Y}'_{ik} &= (U_\perp^\dagger \breve{U})_{ik}\\ \breve{Y}'_{jk} s_k - s_{j+p} \breve{X}'_{jk} &= (V_\perp^\dagger \breve{V})_{jk} \end{align}\] with \(i = 1,\ldots, m-p\), \(j=1,\ldots,n-p\) and \(k=1,\ldots,p\). Note that if \(m \neq n\), we are already stretching the notation here, since the index of \(s\) might be going out of bounds in the second term (which we interpret as the corresponding \(s\) being zero). To solve this in a way similar to the full SVD case for the square matrix, we first have to define a common square region for \(i\) and \(j\). Now, if there are only \(r \leq \min(m,n)\) nonzero singular values (\(r\) being the rank of \(A\)), we can separate these equations into

\[\begin{align} \breve{X}'_{ik} s_k - s_{i+p} \breve{Y}'_{ik} &= (U_\perp^\dagger \breve{U})_{ik}\\ \breve{Y}'_{jk} s_k - s_{j+p} \breve{X}'_{jk} &= (V_\perp^\dagger \breve{V})_{jk} \end{align}\]

for the square region \(i,j = 1,\ldots, r-p\) and \(k=1,\ldots,p\), whereas

\[\begin{align} \breve{X}'_{ik} s_k &= (U_\perp^\dagger \breve{U})_{ik}\\ \breve{Y}'_{jk} s_k &= (V_\perp^\dagger \breve{V})_{jk} \end{align}\]

for the remaining values \(i=r-p+1,\ldots,m-p\) and \(j=r-p+1,\ldots,n-p\), again for all \(k=1,\ldots,p\).

This can now easily be solved to give

\[\begin{align} \breve{X}'_{ik} &= \frac{(U_\perp^\dagger \breve{U})_{ik} s_k + s_{i+p} (V_\perp^\dagger \breve{V})_{ik}}{s_k^2 - s_{i+p}^2}= \frac{1}{2} \frac{(U_\perp^\dagger \breve{U})_{ik} + (V_\perp^\dagger \breve{V})_{ik}}{s_k - s_{i+p}} + \frac{1}{2}\frac{(U_\perp^\dagger \breve{U})_{ik} - (V_\perp^\dagger \breve{V})_{ik}}{s_k + s_{i+p}}\\ \breve{Y}'_{ik} &= \frac{(V_\perp^\dagger \breve{V})_{ik} s_k + s_{i+p} (U_\perp^\dagger \breve{U})_{ik}}{s_k^2 - s_{i+p}^2}=\frac{1}{2} \frac{(U_\perp^\dagger \breve{U})_{ik} + (V_\perp^\dagger \breve{V})_{ik}}{s_k - s_{i+p}} - \frac{1}{2}\frac{(U_\perp^\dagger \breve{U})_{ik} - (V_\perp^\dagger \breve{V})_{ik}}{s_k + s_{i+p}}\\ \end{align}\] for \(i = 1,\ldots, r-p\) and \(k=1,\ldots,p\), and \[\begin{align} \breve{X}'_{ik} &= \frac{(U_\perp^\dagger \breve{U})_{ik}}{s_k}\\ \breve{Y}'_{jk} &= \frac{(V_\perp^\dagger \breve{V})_{jk}}{s_k} \end{align}\]

for the remaining values \(i=r-p+1,\ldots,m-p\) and \(j=r-p+1,\ldots,n-p\), again for all \(k=1,\ldots,p\).
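As a small consistency check, the componentwise expressions can be compared against a dense solve of the block Sylvester equation for a rank-deficient rectangular example (NumPy sketch; names ours, and indices are zero-based in code):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, r, p = 6, 5, 4, 2
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
A = randc(m, r) @ randc(r, n) / np.sqrt(r)     # a rank-r matrix

Ufull, s, Vh = np.linalg.svd(A)
Vfull = Vh.conj().T
U, Up = Ufull[:, :p], Ufull[:, p:]
V, Vp = Vfull[:, :p], Vfull[:, p:]
sk = s[:p]
Ubrev, Vbrev = randc(m, p), randc(n, p)        # arbitrary cotangents
RU = Up.conj().T @ Ubrev                       # (m-p) x p right hand side
RV = Vp.conj().T @ Vbrev                       # (n-p) x p right hand side

# dense solve of the block Sylvester equation Z S - B Z = rhs
B = np.block([[np.zeros((m - p, m - p)), Up.conj().T @ A @ Vp],
              [Vp.conj().T @ A.conj().T @ Up, np.zeros((n - p, n - p))]])
q = (m - p) + (n - p)
op = np.kron(np.diag(sk), np.eye(q)) - np.kron(np.eye(p), B)
rhs = np.vstack([RU, RV])
Z = np.linalg.solve(op, rhs.ravel(order='F')).reshape(q, p, order='F')
Xd, Yd = Z[:m - p], Z[m - p:]

# componentwise solution (s beyond the rank counts as zero)
X = np.zeros((m - p, p), dtype=complex)
Y = np.zeros((n - p, p), dtype=complex)
for k in range(p):
    for i in range(m - p):
        if i < r - p:                          # common square region
            si = s[i + p]
            X[i, k] = (RU[i, k] * sk[k] + si * RV[i, k]) / (sk[k]**2 - si**2)
        else:                                  # decoupled remainder
            X[i, k] = RU[i, k] / sk[k]
    for j in range(n - p):
        if j < r - p:
            sj = s[j + p]
            Y[j, k] = (RV[j, k] * sk[k] + sj * RU[j, k]) / (sk[k]**2 - sj**2)
        else:
            Y[j, k] = RV[j, k] / sk[k]
err_X = np.max(np.abs(X - Xd))
err_Y = np.max(np.abs(Y - Yd))
```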

6 Polar decomposition

We consider the case of a matrix \(A\in{\mathbb{C}}^{m \times n}\) with \(m \geq n\), and write the “left” polar decomposition as \(A = W P\), where \(W \in {\mathbb{C}}^{m\times n}\) with \(W^\dagger W = I\) and \(P \in {\mathbb{C}}^{n \times n}\) with \(P = P^\dagger\) and \(P > 0\) (positive definite). Here, we also assume that \(A\) is not rank-deficient, i.e. \(A\) and \(P\) have rank \(n\). The case with \(m \leq n\), where we want to write \(A = P W^\dagger\), can be obtained by simply taking appropriate hermitian conjugates in all the equations.

For the tangent directions, we find \[\begin{align} \dot{A} = \dot{W} P + W \dot{P} \end{align}\] where we again parameterise \(\dot{W} = W \dot{K} + \dot{L}\) with \(\dot{K}\) antihermitian and \(W^\dagger \dot{L} = O\).

We thus find \[\begin{align} W^\dagger \dot{A} &= \dot{K} P + \dot{P}\\ (I- WW^\dagger)\dot{A} &= \dot{L}P \end{align}\] and by also considering the Hermitian conjugate of the first equation and using the antihermiticity of \(\dot{K}\) and the hermiticity of \(P\) and \(\dot{P}\): \[\begin{align} W^\dagger \dot{A} &= \dot{K} P + \dot{P}\\ \dot{A}^\dagger W &= -P \dot{K} + \dot{P} \end{align}\] We then find \[\begin{align} W^\dagger \dot{A} - \dot{A}^\dagger W &= \dot{K} P + P \dot{K} \end{align}\] and thus \[\begin{align} \dot{K} &= 2 \mathcal{S}_{P,-P}^{-1}(\mathcal{P}_A(W^\dagger \dot{A}))\\ \dot{L} &= (I- WW^\dagger) \dot{A} P^{-1} \end{align}\] We can then compute \[\begin{align} \dot{P} &= W^\dagger \dot{A} - \dot{K} P \end{align}\] or, more explicitly, as \[\begin{align} \dot{P} &= \mathcal{S}_{P^{-1},-P^{-1}}^{-1} (2\mathcal{P}_H(W^\dagger \dot{A}P^{-1})). \end{align}\]
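As the polar decomposition has no gauge freedom, the forward rule can be compared to finite differences directly. The NumPy sketch below (names ours, with a dense vectorised Sylvester helper) also verifies that \(\dot{P}\) agrees with the Sylvester form \(\mathcal{S}_{P^{-1},-P^{-1}}^{-1}(2\mathcal{P}_H(W^\dagger \dot{A}P^{-1}))\):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 3
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
A = randc(m, n)
Adot = randc(m, n); Adot /= np.linalg.norm(Adot)

def polar(B):                                  # left polar decomposition via SVD
    U, s, Vh = np.linalg.svd(B, full_matrices=False)
    return U @ Vh, Vh.conj().T @ np.diag(s) @ Vh

def sylv(H, C):
    """Dense solve of H Y + Y H = C for Y (H hermitian positive definite)."""
    op = np.kron(np.eye(n), H) + np.kron(H.T, np.eye(n))
    return np.linalg.solve(op, C.ravel(order='F')).reshape(n, n, order='F')

W, P = polar(A)
Mh = W.conj().T @ Adot
Kdot = sylv(P, Mh - Mh.conj().T)               # solves P K + K P = 2 P_A(W'A)
Pinv = np.linalg.inv(P)
Ldot = (np.eye(m) - W @ W.conj().T) @ Adot @ Pinv
Wdot = W @ Kdot + Ldot
Pdot = Mh - Kdot @ P

# the same tangent of P from the explicit Sylvester form
Rh = W.conj().T @ Adot @ Pinv
Pdot2 = sylv(Pinv, Rh + Rh.conj().T)
err_sylv = np.max(np.abs(Pdot2 - Pdot))

# finite differences (no gauge alignment needed)
t = 1e-6
W2, P2 = polar(A + t * Adot)
err_W = np.max(np.abs((W2 - W) / t - Wdot))
err_P = np.max(np.abs((P2 - P) / t - Pdot))
```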

To obtain the reverse rule, we start from \[\begin{align} \mathop{\mathrm{RTr}}(\breve{W}^\dagger \dot{W}) + \mathop{\mathrm{RTr}}(\breve{P}^\dagger \dot{P}) \end{align}\] where we should remember that \(\dot{P}\) is hermitian, and we can thus add a hermitian projector around \(\breve{P}\), to further find

\[\begin{align} \mathop{\mathrm{RTr}}&((W^\dagger \breve{W})^\dagger \dot{K}) + \mathop{\mathrm{RTr}}(\breve{W}^\dagger \dot{L}) + \mathop{\mathrm{RTr}}(\mathcal{P}_H(\breve{P})^\dagger (W^\dagger \dot{A}-\dot{K}P))=\\ &\mathop{\mathrm{RTr}}((W^\dagger \breve{W} - \mathcal{P}_H(\breve{P}) P)^\dagger \dot{K}) + \mathop{\mathrm{RTr}}(\breve{W}^\dagger (I- WW^\dagger)\dot{A}P^{-1}) + \mathop{\mathrm{RTr}}((W\mathcal{P}_H(\breve{P}))^\dagger \dot{A}) \end{align}\] We can now further expand the first term, remembering that \(\dot{K}\) is antihermitian, so that only the antihermitian part of \(W^\dagger \breve{W} - \mathcal{P}_H(\breve{P})P\) contributes. Alternatively, we can say that \[\begin{align} \mathcal{S}_{P,-P}^{-1}\circ \mathcal{P}_A = \mathcal{P}_A \circ \mathcal{S}_{P,-P}^{-1} \end{align}\] i.e. solutions of this specific Sylvester equation nicely separate into a hermitian and an antihermitian part (due to \(P\) being hermitian).

We thus find for the first term \[\begin{align} \mathop{\mathrm{RTr}}&((W^\dagger \breve{W} - \mathcal{P}_H(\breve{P}) P)^\dagger \dot{K}) =\\ &\mathop{\mathrm{RTr}}(\mathcal{S}_{P,-P}^{-1}(2 \mathcal{P}_A(W^\dagger \breve{W} - \mathcal{P}_H(\breve{P})P))^\dagger (W^\dagger \dot{A})) \end{align}\] such that, collecting everything together as \(\mathop{\mathrm{RTr}}(\breve{A}^\dagger \dot{A})\), we obtain the final result

\[\begin{align} \breve{A} = W \left[\mathcal{S}_{P,-P}^{-1}(2 \mathcal{P}_A(W^\dagger \breve{W} - \mathcal{P}_H(\breve{P})P)) + \mathcal{P}_H(\breve{P})\right] + (I- WW^\dagger) \breve{W}P^{-1}. \end{align}\]
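Again, this reverse rule can be checked against the forward rule as an exact pairing identity, for arbitrary cotangents \(\breve{W}\) and \(\breve{P}\) (only \(\mathcal{P}_H(\breve{P})\) can contribute). A NumPy sketch (names ours):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 3
randc = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
RTr = lambda X, Y: float(np.real(np.sum(np.conj(X) * Y)))

A, Adot = randc(m, n), randc(m, n)
Wbrev = randc(m, n)                            # arbitrary cotangents
Pbrev = randc(n, n)

U, s, Vh = np.linalg.svd(A, full_matrices=False)
W, P = U @ Vh, Vh.conj().T @ np.diag(s) @ Vh
Pinv = np.linalg.inv(P)

def sylv(H, C):                                # dense solve of H Y + Y H = C
    op = np.kron(np.eye(n), H) + np.kron(H.T, np.eye(n))
    return np.linalg.solve(op, C.ravel(order='F')).reshape(n, n, order='F')

# forward rule
Mh = W.conj().T @ Adot
Kdot = sylv(P, Mh - Mh.conj().T)
Wdot = W @ Kdot + (np.eye(m) - W @ W.conj().T) @ Adot @ Pinv
Pdot = Mh - Kdot @ P

# reverse rule
PHPb = (Pbrev + Pbrev.conj().T) / 2            # hermitian part of the cotangent
X = W.conj().T @ Wbrev - PHPb @ P
Abrev = (W @ (sylv(P, X - X.conj().T) + PHPb)
         + (np.eye(m) - W @ W.conj().T) @ Wbrev @ Pinv)

err = abs(RTr(Abrev, Adot) - RTr(Wbrev, Wdot) - RTr(Pbrev, Pdot))
```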

6.1 Alternative from SVD

Given that the polar decomposition is often computed from the SVD as \[\begin{align} W &= U V^\dagger & P &= V S V^\dagger \end{align}\] we also find \[\begin{align} \dot{W} &= \dot{U} V^\dagger + U \dot{V}^\dagger\\ \dot{P} &= \dot{V} S V^\dagger + V \dot{S} V^\dagger + V S \dot{V}^\dagger \end{align}\]

We can then write \[\begin{align} \mathop{\mathrm{RTr}}&(\breve{W}^\dagger \dot{W}) + \mathop{\mathrm{RTr}}(\mathcal{P}_H(\breve{P})^\dagger \dot{P}) =\\ &\mathop{\mathrm{RTr}}((\breve{W}V)^\dagger \dot{U}) + \mathop{\mathrm{RTr}}((U^\dagger \breve{W})^\dagger \dot{V}^\dagger) +\\ & \mathop{\mathrm{RTr}}((\mathcal{P}_H(\breve{P})V S)^\dagger \dot{V}) + \mathop{\mathrm{RTr}}((V^\dagger \mathcal{P}_H(\breve{P})V )^\dagger \dot{S}) + \mathop{\mathrm{RTr}}((SV^\dagger \mathcal{P}_H(\breve{P}) )^\dagger\dot{V}^\dagger) \end{align}\] which can be rewritten as \[\begin{align} \mathop{\mathrm{RTr}}&((\breve{W}V)^\dagger \dot{U}) + \mathop{\mathrm{RTr}}(\mathcal{P}_D(V^\dagger \mathcal{P}_H(\breve{P})V )^\dagger \dot{S}) + \mathop{\mathrm{RTr}}((2\mathcal{P}_H(\breve{P})VS+ \breve{W}^\dagger U)^\dagger \dot{V}). \end{align}\] Indeed, this is exactly what AD would do to pull back the polar cotangents \((\breve{W},\breve{P})\) to SVD cotangents \[\begin{align} \breve{U} &= \breve{W} V\\ \breve{S} &= \mathcal{P}_D(V^\dagger \mathcal{P}_H(\breve{P})V )\\ \breve{V} &= 2\mathcal{P}_H(\breve{P})VS+ \breve{W}^\dagger U \end{align}\]