Efficient representations of binarized health deficit data: the frailty index and beyond

Appendix A: Binary PCA

We seek an efficient representation for binary health data: normal (0) or deficit (1). Equivalently, we are seeking (1) a basis set, (2) a set of features, or (3) a set of composite health measures. Representation efficiency is commonly measured by compression. Compression is the ability to take a set of p input variables, reduce (“compress”) them into k < p latent variables, and be able to reconstruct the original p input variables from these k latent variables. Compression fidelity is measured by a loss function that compares the original input variables to the reconstructed inputs.

Each dimensionality reduction technique is the optimal solution to a particular choice of loss function. If we require independent (orthogonal) features and minimize the mean-squared error then the solution is given by PCA (see below). The mean-squared error is not ideal for binary data, however, since we only need to know whether a variable is larger or smaller than 0.5. More appropriate choices for binary data lead to LPCA [33] and LSVD [33, 34]. The performance gains of these methods over linear PCA have been modest [33, 34], and may not justify the increased algorithmic complexity. When we refer to PCA, we mean conventional, linear PCA.

In finding the directions of maximum variance, PCA decomposes the covariance matrix into its eigenvectors. For binary healthy/deficit data, the covariance matrix takes on a special meaning. The (uncentered) covariance matrix is equivalent to the 2D pairwise joint deficit frequencies, with the diagonal corresponding to the individual variable’s marginal probability (i.e. deficit frequency). Binary PCA effectively compresses all of the marginal and pairwise deficit probabilities into a set of features of decreasing importance. We refer to each feature as a latent dimension.
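As a concrete illustration of this equivalence, the following minimal sketch (Python with NumPy, using simulated binary data rather than the study data) checks that the uncentered covariance matrix \(X^{t}X/N\) is exactly the table of marginal and pairwise deficit frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binarized health data: N individuals x p deficits (0 = normal, 1 = deficit).
N, p = 1000, 5
X = (rng.random((N, p)) < 0.2).astype(float)

# Uncentered covariance = 2D histogram of joint deficit frequencies.
H = X.T @ X / N

# Diagonal entries are the marginal deficit frequencies...
assert np.allclose(np.diag(H), X.mean(axis=0))
# ...and off-diagonal entries are the pairwise joint deficit frequencies.
assert np.isclose(H[0, 1], np.mean(X[:, 0] * X[:, 1]))
print(np.round(H, 3))
```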

A.1 Problem formalization

We seek an orthogonal basis set of features to efficiently represent the data. Orthogonality ensures that features are independent (uncorrelated) and that each individual has a unique representation in terms of the basis [70].

Let \(\vec{\phi}_{i}\) be the \(i\)th basis feature; we want to minimize the reconstruction error between the N × p data matrix, X, in its original state and after bottlenecking through the (latent) feature space of size k ≤ p. The representation of the \(i\)th individual and \(j\)th deficit from our data in the new space is given by:

$$ \begin{array}{rcl} \hat{X}_{ij} &\equiv& \sum\limits_{n=1}^{k} Z_{in}\phi_{nj}, \end{array} $$

(A1)

where \(Z_{in}\) represents an individual’s feature score and \(\hat{X}_{ij}\) is our best estimate of the reconstructed input data, \(X_{ij}\).

Using the orthogonality of the \(\vec{\phi}_{n}\), we estimate the feature scores using the inner product,

$$ \begin{array}{rcl} Z_{in} &=& \sum\limits_{j=1}^{p} \phi_{nj} X_{ij} \end{array} $$

(A2)

thus,

$$ \begin{array}{rcl} \hat{X}_{ij} &=& \sum\limits_{n=1}^{k} \sum\limits_{l=1}^{p} \phi_{nl} X_{il} \phi_{nj} \\ &=& \sum\limits_{l=1}^{p} X_{il} \sum\limits_{n=1}^{k} \phi_{nl} \phi_{nj} \\ &=& \sum\limits_{l=1}^{p} X_{il} \sum\limits_{n=1}^{k} U_{ln}U^{t}_{nj} \\ \implies \hat{X} &=& XUU^{t} \end{array} $$

(A3)

where U is the p × k matrix formed by setting the \(i\)th column equal to the \(i\)th basis, \(\vec{\phi}_{i}\), and \(U^{t}\) is the transpose of U.
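The algebra of Eqs. A1–A3 can be checked directly. The sketch below uses hypothetical data and an arbitrary orthonormal basis (obtained by QR decomposition rather than the PCA basis derived next) to compute feature scores and the reconstruction \(\hat{X} = XUU^{t}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary data matrix, N x p, compressed to k latent features.
N, p, k = 200, 8, 3
X = (rng.random((N, p)) < 0.3).astype(float)

# Any p x k matrix with orthonormal columns can play the role of U in Eq. A3;
# the optimal (PCA) choice is derived below via Eq. A5.
U, _ = np.linalg.qr(rng.standard_normal((p, k)))   # columns are phi_1 ... phi_k

Z = X @ U            # feature scores, Eq. A2 (N x k)
X_hat = Z @ U.T      # reconstruction, Eq. A3: X_hat = X U U^t

# Sanity check: the columns of U are orthonormal.
assert np.allclose(U.T @ U, np.eye(k))
print("per-entry squared reconstruction error:", np.mean((X - X_hat) ** 2))
```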

For simplicity, convexity, and robustness, we assume the mean-squared error function; hence, we have:

$$ \begin{array}{rcl} &&\underset{\{\vec{\phi}_{i}\}}{\text{argmin}} \sum\limits_{i=1}^{N}\sum\limits_{j=1}^{p} \big(X_{ij} - [UU^{t}\vec{X}_{i}]_{j} \big)^{2}~\text{ such that } \sum\limits_{l=1}^{p} U_{li}U_{lj} = \delta_{ij}. \end{array} $$

(A4)

This is the Pearson formalism of PCA (where the mean has not been subtracted) [33]. \(Z \equiv XU\) is the PC score matrix and U is the rotation matrix. This formalism can be solved sequentially for each \(\vec{\phi}_{i}\) and is equivalent to picking the rotation of the data such that the first direction, \(Z_{i1}\), has the maximum second moment (eigenvalue), the second direction has the second largest, and so forth [18].

The solution to Eq. A4 is found by eigen-decomposition of \(X^{t}X\) [18]. Each of the columns of U, \(\vec{\phi}_{i}\), satisfies

$$ \begin{array}{rcl} \frac{1}{N} X^{t}X \vec{\phi}_{i} &=& \lambda_{i}\vec{\phi}_{i}, \qquad \text{where}~\vec{\phi}_{i} \equiv U_{\cdot i} \end{array} $$

(A5)

where \(\lambda_{i}\) is the \(i\)th eigenvalue and \(X^{t}X/N\) is the 2D histogram of joint frequencies of the binary input variables, with the diagonal equal to the 1D frequencies. This implies (using \(X^{t}X \approx X^{t}\hat{X}\), Eqs. A3 and A5),

$$ \begin{array}{rcl} \frac{1}{N} X^{t}X &\approx& \sum\limits_{i=1}^{k} \lambda_{i} \big(\vec{\phi}_{i}\otimes\vec{\phi}_{i}\big) \end{array} $$

(A6)

with equality when k = p. ⊗ denotes the outer/tensor product and the terms are sorted by decreasing strength. Intuitively, we are forming the 2D histogram, \(X^{t}X/N\), then decomposing it into a set of rank 1 matrices (square blocks) sorted by relative contribution; Fig. 2 illustrates the process for our dataset.
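A minimal sketch of Eqs. A5–A6, assuming simulated binary data: eigen-decompose the joint frequency matrix and rebuild it from the leading rank-1 outer products.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated binary deficit matrix (N individuals x p deficits).
N, p = 1000, 12
X = (rng.random((N, p)) < 0.25).astype(float)

# 2D joint deficit frequency histogram (Eq. A5).
H = X.T @ X / N

# Eigen-decomposition, sorted by decreasing eigenvalue.
lam, U = np.linalg.eigh(H)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Rank-k approximation of the histogram from outer products (Eq. A6);
# equality is recovered at k = p.
for k in (1, 3, p):
    H_k = sum(lam[n] * np.outer(U[:, n], U[:, n]) for n in range(k))
    rel_err = np.linalg.norm(H - H_k) / np.linalg.norm(H)
    print(f"k = {k:2d}: relative error {rel_err:.4f}")
```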

The principal components (PCs), P ≡ Z, are defined as the initial data transformed (“rotated”) into the latent space,

$$ \begin{array}{rcl} P_{in} &\equiv& \sum\limits_{j=1}^{p} X_{ij}U_{jn} \end{array} $$

(A7)

using the eigen-decomposition, Eq. A5, we can show that the norm of each PC is determined by its eigenvalue (substituting U for ϕ),

$$ \begin{array}{rcl} &&\frac{1}{N} \sum\limits_{i}\sum\limits_{j} X^{t}_{li}X_{ij}U_{jn} = \lambda_{n}U_{ln} \\ &\implies& \frac{1}{N} \sum\limits_{l} \sum\limits_{i} \sum\limits_{j} X^{t}_{li}X_{ij}U_{jn}U_{lm} = \lambda_{n}\sum\limits_{l} U_{ln}U_{lm} \\ &\implies& \frac{1}{N} \sum\limits_{i} P_{in}P_{im} = \lambda_{n}\delta_{nm} \end{array} $$

(A8)

hence the second moment of each PC determines its eigenvalue, λ, and therefore its order and relative importance. The sum of the second moments is conserved because U is an isometry [70].
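Eq. A8 is easy to verify numerically. The sketch below (again with simulated binary data) checks that the uncentered second-moment matrix of the PC scores is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated binary deficit data.
N, p = 2000, 6
X = (rng.random((N, p)) < 0.3).astype(float)

# Eigen-decomposition of the joint frequency matrix (Eq. A5), sorted descending.
lam, U = np.linalg.eigh(X.T @ X / N)
lam, U = lam[::-1], U[:, ::-1]

# Principal components (Eq. A7).
P = X @ U

# Eq. A8: (1/N) P^t P is diagonal, with the eigenvalues on the diagonal.
second_moments = P.T @ P / N
assert np.allclose(second_moments, np.diag(lam))
print(np.round(np.diag(second_moments) - lam, 12))
```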

A.2 Block histogram

There is a special 2D joint histogram pattern for which the first PC is equal to the FI, for both logistic [33] and linear PCA (scaled by an irrelevant constant). When a uniform diagonal sits on top of dense, uniform off-diagonals, the FI direction is the dominant eigenvector and the FI is therefore the first PC.

More precisely, suppose the 2D joint frequency histogram, \(X^{t}X/N\), is given by:

$$ \frac{1}{N} X^{t}X =\left[\begin{array}{cccc} a & b & \cdots & b\\ b & a & \ddots & \vdots \\ \vdots & \ddots & \ddots & b\\ b & \cdots & b & a \end{array}\right] $$

(A9)

that is, the diagonal is constant, a, and the off-diagonals are also constant, b. This is a circulant matrix [71]. Note that a ≥ 0, b ≥ 0, and a ≥ b, because they are deficit frequencies (\((X^{t}X)_{ij} = N\langle x_{i}x_{j}\rangle\) for binary variables \(x_{i}\) and \(x_{j}\); clearly \(\langle x_{i}^{2} \rangle \geq \langle x_{i}x_{j} \rangle\), so a ≥ b because \(a=\langle x_{i}^{2} \rangle\) and \(b = \langle x_{i}x_{j}\rangle\), where \(\langle x_{i} \rangle\) is the mean of \(x_{i}\)). The eigenvalues of this circulant matrix are [71]:

$$ \begin{array}{rcl} \lambda_{k} &=& a - b + b\sum\limits_{j=0}^{p-1} \left( \exp{\left( \frac{2\pi}{p} k i\right)}\right)^{j} \end{array} $$

(A10)

where k ∈ [1,p] is an integer and p is the number of columns in X (i.e. the number of variables); \(i\equiv \sqrt{-1}\). If k≠p, the sum is a finite geometric series which sums to [72],

$$ \begin{array}{rcl} \lambda_{k} &=& (a-b) + b\left( \frac{1-\exp{\left( \frac{2\pi}{p} p k i\right)}}{1-\exp{\left( \frac{2\pi}{p} k i\right)}}\right) \\ \lambda_{k} &=& a-b \quad \text{for } k \neq p \end{array} $$

(A11)

If k = p, we instead have,

$$ \begin{array}{rcl} \lambda_{p} &=& a - b + b\sum\limits_{j=0}^{p-1} \left( \exp{\left(2\pi i\right)}\right)^{j} \\ \lambda_{p} &=& a + (p-1)b \end{array} $$

(A12)

because a,b ≥ 0 and a ≥ b, we have that \(\lambda_{p}\) must be the first (largest) eigenvalue (assuming b > 0, otherwise it will be a tie).

The associated eigenvectors are given by [71],

$$ \begin{array}{rcl} U_{lk} &=& \frac{1}{\sqrt{p}} e^{\frac{2\pi}{p} ikl} \end{array} $$

(A13)

where k and l are integers. From Eq. A12, we know the first eigenvector is,

$$ \begin{array}{rcl} U_{lp} &=& \frac{1}{\sqrt{p}}. \end{array} $$

(A14)

Using Eq. A7, we can calculate the first principal component,

$$ \begin{array}{rcl} P_{ip} &=& \sum\limits_{j} \frac{1}{\sqrt{p}}X_{ij} \\ &=& \sqrt{p}~ \frac{1}{p}\sum\limits_{j}X_{ij} \\ &=& \sqrt{p} \cdot \text{FI}_{i}, \end{array} $$

(A15)

which is a constant times the FI. Hence if the joint histogram has the form of Eq. A9, the FI will coincide with the first PC. In the next section, we show the conditions under which the first PC is sufficient.
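The sketch below constructs the idealized block histogram of Eq. A9 for hypothetical values of a, b, and p, and confirms Eqs. A11–A15: the largest eigenvalue is a + (p − 1)b, its eigenvector is constant, and the corresponding PC is \(\sqrt{p}\) times the FI.

```python
import numpy as np

# Idealized block joint-frequency histogram of Eq. A9 (hypothetical a, b, p).
p, a, b = 30, 0.20, 0.10
H = np.full((p, p), b)
np.fill_diagonal(H, a)

lam, U = np.linalg.eigh(H)
lam, U = lam[::-1], U[:, ::-1]

# Eqs. A11-A12: largest eigenvalue is a + (p - 1) b; the rest are a - b.
assert np.isclose(lam[0], a + (p - 1) * b)
assert np.allclose(lam[1:], a - b)

# Eq. A14: the leading eigenvector is constant, 1/sqrt(p) (up to sign).
u1 = U[:, 0] * np.sign(U[0, 0])
assert np.allclose(u1, 1 / np.sqrt(p))

# Eq. A15: projecting any binary data onto u1 gives sqrt(p) times the FI.
rng = np.random.default_rng(3)
X = (rng.random((500, p)) < 0.2).astype(float)
assert np.allclose(X @ u1, np.sqrt(p) * X.mean(axis=1))
print("first PC = sqrt(p) * FI verified")
```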

A.3 How well can we approximate the histogram?

The 2D histogram contains all pairwise frequencies (off-diagonals) and individual frequencies, making it an important summary of the information we know about the deficit statistics. How well does the first eigenvalue/eigenvector pair approximate the complete 2D histogram, given it has the special structure of Eq. A9?

From Eq. A6, we know that the eigenvalues/eigenvectors approximate the 2D histogram as:

$$ \begin{array}{rcl} \frac{1}{N}X^{t}X &\approx& \sum\limits_{i=1}^{k} \lambda_{i} \big(\vec{\phi}_{i}\otimes\vec{\phi}_{i}\big) \end{array} $$

(A16)

with equality when k is equal to the number of variables, p (equal to the number of columns of X). Since the model is linear, we can summarize the mean-squared error using the coefficient of determination, R², and expect R² = 0 for a useless reconstruction and R² = 1 for a perfect reconstruction. Specifically,

$$ \begin{array}{rcl} R^{2} &=& 1 - \frac{\sum_{i}\sum_{j} \left( (X^{t}X)_{ij}/N - \sum_{n=1}^{k} \lambda_{n} \big(\vec{\phi}_{n}\otimes\vec{\phi}_{n}\big)_{ij} \right)^{2}}{\sum_{i}\sum_{j} \big((X^{t}X)_{ij}/N\big)^{2}}. \end{array} $$

(A17)

Using this, we compute the accuracy of the first eigenvalue/eigenvector pair in approximating the full 2D histogram,

$$ \begin{array}{rcl} R^{2} &=& 1 - \frac{\sum_{i}\sum_{j} \big((X^{t}X)_{ij}/N- \sum_{n=1}^{1} \lambda_{n} \big(\vec{\phi}_{n}\otimes\vec{\phi}_{n}\big)_{ij} \big)^{2}}{\sum_{i}\sum_{j} \big((X^{t}X)_{ij}/N\big)^{2}} \\ &=& 1 - \frac{\sum_{i}\sum_{j} \big((X^{t}X)_{ij}/N - (a+(p-1)b) (1/p) \big)^{2}}{\sum_{i}\sum_{j} \big((X^{t}X)_{ij}/N\big)^{2}}, \end{array} $$

(A18)

and substitute in the special form for \(X^{t}X/N\),

$$ \begin{array}{rcl} R^{2} &=& 1 - \frac{\sum_{i}\sum_{j\neq i} \big(b - (a+(p-1)b) (1/p) \big)^{2}}{\sum_{i}a^{2} + \sum_{i}\sum_{j\neq i}b^{2}} - \frac{\sum_{i} \big(a - (a+(p-1)b) (1/p) \big)^{2}}{\sum_{i}a^{2} + \sum_{i}\sum_{j\neq i}b^{2}} \\ &=& 1 - \frac{(1-b/a)^{2}}{\frac{p}{p-1}\left( p\,b^{2}/a^{2}+(1-b^{2}/a^{2})\right)}, \end{array} $$

(A19)

where in the last line we emphasize there are only two tunable parameters: b/a is a measure of correlation strength and p is the number of variables. Both 0 ≤ a ≤ 1 and 0 ≤ b ≤ a are constrained because X is composed of binary variables.

There are two limits of interest. First, for b > 0 if we take b → a,

$$ \begin{array}{rcl} \underset{b\to a}{\lim} R^{2} &=& 1 - \frac{(1-1)^{2}}{\frac{p}{p-1}\left( p+(1-1)\right)} \\ &=& 1 \end{array} $$

(A20)

this corresponds to a 2D histogram of perfectly dependent variables (which would be a rank 1 matrix). The other limit is taking a large number of variables with b > 0,

$$ \begin{array}{rcl} \underset{p\to\infty}{\lim} R^{2} &=& 1 - \underset{p\to\infty}{\lim}\frac{(1-b/a)^{2}}{\frac{p}{p-1}\left( p\,b^{2}/a^{2}+(1-b^{2}/a^{2})\right)} \\ &=& 1 \end{array} $$

(A21)

which corresponds to an infinitely large 2D histogram. In both cases, R² = 1 and the first eigenvector, which yields the FI as its PC, is sufficient to perfectly estimate the 2D histogram and hence to completely describe the first- and second-order statistics. The two limits are compatible: a large but finite p, together with b close to, but not equal to, a, is likely to give R² ≈ 1.

In Fig. 13, we plot Eq. A19 for several values of the two free parameters, b/a and p. Nearly perfect R² is achieved for fairly modest values of b/a when p is sufficiently large. Interestingly, there is an apparent diminishing return for increasing p, with an elbow at p ≈ 25; this is comparable to the 30+ deficit rule for the FI [6]. The 2D joint histogram in this study had a median diagonal value of a = 0.20 and median off-diagonal value of b = 0.04, giving b/a = 0.22 (p = 55; Fig. 2). We would then expect the ideal case to have R² = 0.84; the fit for our data yielded R² = 0.50, which is large but smaller than the ideal case.
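For reference, here is a small sketch that implements the closed form of Eq. A19, cross-checks it against a direct rank-1 reconstruction of the block histogram (Eq. A18) at arbitrary test values, and evaluates it at the ratio b/a = 0.22 and p = 55 quoted above; the study data themselves are not used.

```python
import numpy as np

def r2_closed_form(b_over_a: float, p: int) -> float:
    """Eq. A19: R^2 of the rank-1 (FI/PC1) approximation to the block histogram."""
    r = b_over_a
    return 1 - (1 - r) ** 2 / ((p / (p - 1)) * (p * r ** 2 + (1 - r ** 2)))

def r2_direct(a: float, b: float, p: int) -> float:
    """Eq. A18: build the block histogram and approximate it with its top eigenpair."""
    H = np.full((p, p), b)
    np.fill_diagonal(H, a)
    lam, U = np.linalg.eigh(H)
    H1 = lam[-1] * np.outer(U[:, -1], U[:, -1])     # rank-1 approximation
    return 1 - np.sum((H - H1) ** 2) / np.sum(H ** 2)

# Cross-check the closed form against the direct computation at test values.
a, b, p = 0.20, 0.05, 40
assert np.isclose(r2_closed_form(b / a, p), r2_direct(a, b, p))

# Ideal-case value at the ratio and deficit count quoted above.
print(r2_closed_form(0.22, 55))   # about 0.83-0.84, the ideal case discussed above
```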

Fig. 13 Special joint histogram approximation (Eq. A19). Fill is the R² fit quality for PC1 approximating the full histogram, given the histogram has the special structure of Eq. A9. p is the number of features, a is the deficit frequency, and b is the joint deficit frequency

This idealized “toy” model explains the approximate equivalence of the FI and PC1. What is more, it allows us to estimate how dominant the FI/PC1 is. In the limit of a large number of variables and/or b ≈ a, we find that the FI/PC1 becomes a better approximation for the information in the 2D histogram. This is consistent with the observation that the FI is best used to describe a large number of correlated variables.

A.4 PCA approximates logistic PCA

Logistic PCA [33] minimizes the Bernoulli deviance, in analogy to the Gaussian formulation of linear (normal) PCA. The optimization problem is not convex but Landgraf and Lee [33] derive an iterative majorization-minimization scheme for solving the problem. We follow their approach and show that the first iteration of their loss function reduces to the same loss function as linear PCA. As a result, the estimated transformation, U, will be the same for either PCA or logistic PCA after the first iteration.

There are four steps to our adaptation of their approach:

1. Initialize \(U^{(0)}\) to be an orthogonal matrix. Pick k = p. Then, \(U^{(0)}(U^{(0)})^{t} = I\).

2. Initialize the mean, \(\mu = \text{logit}(\epsilon)\), where \(\epsilon \to 0^{+}\) is a small, positive number. This is akin to not subtracting the mean when we perform PCA.

3. Fix \(m \equiv -\mu\). This is the main assumption. m should be a large, positive number [33]. Our definition of μ ensures that m is a large positive number.

4. Iterate the majorization-minimization algorithm [33] exactly once.

The initial \(\theta_{ij}^{(0)}=\tilde{\theta}_{ij}\), due to the orthogonality of the initial \(U^{(0)}\). Note that \(\tilde{\theta}_{ij}\equiv m(2X_{ij}-1)\) [33]. The loss function, Eq. (9) of [33], is then

$$ \begin{array}{rcl} &&\underset{U}{\text{argmin}} \sum\limits_{i}\sum\limits_{j} \left( \left[UU^{t}\left( \tilde{\vec{\theta}}_{i}-\vec{\mu}\right)\right]_{j} - \left( \tilde{\theta}_{ij}-\mu\right) -4\left( X_{ij}-\sigma\left( \tilde{\theta}_{ij}\right)\right) \right)^{2} \\ &=& \underset{U}{\text{argmin}} \sum\limits_{i}\sum\limits_{j} \left( \left[UU^{t}\left( 2m\vec{X}_{i}-m\vec{1}-\vec{\mu}\right)\right]_{j} - \left( 2mX_{ij}-m-\mu\right) -4\left( X_{ij}-\sigma\left( \tilde{\theta}_{ij}\right)\right) \right)^{2} \\ &=& \underset{U}{\text{argmin}} \sum\limits_{i}\sum\limits_{j} \left( 2m\left( \left[UU^{t}\vec{X}_{i}\right]_{j}-X_{ij}\right) - \left( \left[UU^{t}\left( m\vec{1}+\vec{\mu}\right)\right]_{j}-m-\mu\right) -4\left( X_{ij}-\sigma\left( \tilde{\theta}_{ij}\right)\right) \right)^{2} \\ &=& \underset{U}{\text{argmin}}~(2m)^{2} \sum\limits_{i}\sum\limits_{j} \left( \left[UU^{t}\vec{X}_{i}\right]_{j}-X_{ij} \right)^{2}, \end{array} $$

(A22)

where we use \(m \equiv -\mu\) and \(\mu \to -\infty\) in the last line, with \(m\to \infty\) ensuring \(\sigma(\tilde{\theta}_{ij}) \to X_{ij}\) (σ is the inverse logit). The positive factor of \((2m)^{2}\) does not affect the position of the minimum and hence Eq. A22 finds the same optimal U as the PCA loss function, Eq. A4 (recall that U is constructed out of the set of \(\vec{\phi}_{i}\)).
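This limiting argument can also be checked numerically. The sketch below uses hypothetical data and an arbitrary orthonormal U, writes out the per-entry majorized loss as reconstructed in Eq. A22, and shows that as m grows (with μ = −m) it approaches (2m)² times the linear-PCA squared error of Eq. A4, so the two objectives share the same minimizer.

```python
import numpy as np

def sigma(t):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)

# Hypothetical binary data and an arbitrary orthonormal basis U (p x k).
N, p, k = 300, 10, 4
X = (rng.random((N, p)) < 0.3).astype(float)
U, _ = np.linalg.qr(rng.standard_normal((p, k)))
proj = U @ U.T                                   # U U^t

pca_loss = np.sum((X @ proj - X) ** 2)           # linear-PCA loss, Eq. A4

for m in (5.0, 10.0, 20.0):
    mu = -m                                      # steps 2-3: mu = logit(eps) = -m
    theta = m * (2 * X - 1)                      # theta_tilde, per [33]
    # Per-entry majorized loss, first line of Eq. A22.
    inner = (theta - mu) @ proj - (theta - mu) - 4 * (X - sigma(theta))
    majorized_loss = np.sum(inner ** 2)
    print(m, majorized_loss / ((2 * m) ** 2 * pca_loss))   # ratio -> 1 as m grows
```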
