Methodologies underpinning polygenic risk score estimation: a comprehensive overview

Genome-wide PRS methods

GWAS have been successful at identifying common SNPs associated with complex phenotypes such as, but not limited to, cardiovascular diseases, cancers, and neurodegenerative diseases, as well as infectious diseases like tuberculosis (Abdellaoui et al., 2023; Andrews et al., 2023; Bashinskaya et al., 2015; Blauwendraat et al., 2020; Hameed et al., 2024; Liang et al., 2020; Ndong Sima et al., 2023; Nott and Holtman, 2023; Walsh et al., 2023). Although significant associations have been found, each SNP on its own has a small effect on the disease outcome. This can be explained by the common-disease-common-variant (CDCV) hypothesis, which states that if a disease is common in a population (prevalence > 1–5%), then its genetic contributors will also be common in that population, each with a small individual effect size. The trait or disease outcome is therefore believed to result from the cumulative effect of multiple variants (International Schizophrenia Consortium et al., 2009; Lango Allen et al., 2010; Pereira et al., 2021).

PRS construction and analysis can be divided into three main steps: (1) quality control of the data, (2) calculation of the scores, and (3) assessment of PRS performance. The first two steps typically require two independent input datasets: (i) a discovery dataset (i.e., SNP summary statistics), from which SNPs and their corresponding effect sizes are obtained, and (ii) a target dataset (i.e., the cohort under study), from which the genotype dosages of each SNP included in the calculation are obtained. For the third step, the performance of the predictive models is assessed either using cross-validation methods (generally when the target sample size is too small to be split) or an independent dataset (i.e., a testing set). Each step is outlined as follows:

Quality control of discovery and target data

The basis of a PRS is SNP summary statistics, which can be obtained either from GWAS or from candidate gene association studies. Therefore, the initial step is quality control of the input data, which is performed similarly to standard GWAS quality control (as described by Marees et al., 2018). In addition, to avoid systematic errors, users must confirm that all datasets are aligned to the same genome build. This ensures that the datasets not only use the same genomic positions and SNP identifiers (i.e., rsIDs), but also share the same SNPs. Furthermore, it is important to remove ambiguous SNPs (A/T and/or C/G). When SNPs have complementary alleles, it becomes challenging to distinguish which strand is being measured and therefore which is the effect allele (Choi et al., 2020). This can be a major source of systematic error. Sometimes it is possible to resolve this ambiguity using information on allele frequency, but this becomes unreliable if the allele frequencies of those SNPs are close to 0.5 (Chen et al., 2018). Therefore, general practice is to remove these ambiguous SNPs. It should also be ensured that the discovery and target samples are independent from one another, to avoid the issue of overfitting that is inherent in machine learning (Choi et al., 2020).
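
As a minimal illustration of this QC step, the sketch below (in Python with pandas; the column names A1/A2 for the effect and non-effect alleles are assumptions, since header conventions differ between GWAS files) drops strand-ambiguous A/T and C/G SNPs from a summary statistics table:

```python
import pandas as pd

# Strand-ambiguous allele pairs: the same pair appears on both strands,
# so the effect allele cannot be resolved without extra information.
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def drop_ambiguous_snps(sumstats: pd.DataFrame) -> pd.DataFrame:
    """Remove A/T and C/G SNPs from a summary statistics table.

    Assumes columns 'A1' (effect allele) and 'A2' (other allele).
    """
    alleles = zip(sumstats["A1"].str.upper(), sumstats["A2"].str.upper())
    keep = [pair not in AMBIGUOUS for pair in alleles]
    return sumstats[keep].reset_index(drop=True)

# Example usage (file name is illustrative):
# sumstats = pd.read_csv("gwas_sumstats.txt", sep="\t")
# sumstats_qc = drop_ambiguous_snps(sumstats)
```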

Calculation of traditional polygenic risk scores: which variants to include and how to account for linkage disequilibrium?

PRS are calculated as the sum of an individual’s risk alleles, weighted by their effect sizes (Chatterjee et al., 2013; Choi et al., 2020; Lango Allen et al., 2010; Pharoah et al., 2008). This can generally be summarized as in Eq. (1) and illustrated in Fig. 1 below:

$$PRS_{i}=\sum_{j=1}^{m}G_{ij}\beta_{j}$$

(1)

where the PRS for individual i is the sum, over SNPs j = 1 to m, of the genotype dosage Gij multiplied by the effect size βj of the effect allele at that SNP.
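
For concreteness, a minimal sketch of Eq. (1) in Python/NumPy is given below, assuming a genotype dosage matrix G (individuals × SNPs) and a vector of effect sizes beta taken from the discovery GWAS; the toy values are purely illustrative:

```python
import numpy as np

def polygenic_score(G: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Eq. (1): PRS_i = sum_j G_ij * beta_j, one score per individual.

    G holds genotype dosages (0, 1, 2, or imputed values in between);
    beta holds the per-SNP effect sizes from the discovery GWAS.
    """
    return G @ beta

# Toy example: 3 individuals, 4 SNPs (values are illustrative only)
G = np.array([[0, 1, 2, 1],
              [1, 1, 0, 2],
              [2, 0, 1, 0]], dtype=float)
beta = np.array([0.05, -0.02, 0.10, 0.01])
print(polygenic_score(G, beta))  # -> 0.19, 0.05, 0.20
```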

Ideally, all SNPs should be summed across all loci. However, there are two factors to consider: (1) many GWAS are underpowered, which means that there could be more true associations than those discovered (i.e., those reaching genome-wide significance); and (2) the inherent issue of linkage disequilibrium (LD), which creates a correlation structure among nearby variants and can inflate the PRS and result in poor generalizability across populations (Dudbridge, 2013; Duncan et al., 2019; Meuwissen et al., 2001). To curb this inflation of the PRS, only (approximately) independent markers should be used (Meuwissen et al., 2001; Vilhjálmsson et al., 2015).

PRS model performance assessment

The assessment of PRS performance is crucial for their application in clinical settings. The evaluation typically involves measures of predictive performance such as the squared correlation (R2) or Nagelkerke’s pseudo-R2 and the area under the receiver operating characteristic curve (AUC) (Choi et al., 2020). The former two quantify how much of the variation in a trait among individuals can be explained by their genetic makeup (referred to as the proportion of phenotypic variance explained), whereas the latter is a composite of sensitivity and specificity (with a maximum value of 1.0) that indicates how well a model can distinguish between individuals at high risk and those at low risk for a particular disease. Of equal importance is calibration of the scores, which ensures that the predicted risks align with observed outcomes in the population. The necessity of calibrating PRS before clinical implementation has been highlighted, as miscalibrated scores could lead to inappropriate risk assessments (Vilhjálmsson et al., 2015; Wei et al., 2022). For instance, a study by Wei and colleagues showed a systematic bias between estimated and observed risks for prostate cancer, breast cancer, and colorectal cancer in three incident cohorts from the UK Biobank (β (95% CI) of 0.67 (0.58–0.76), 0.74 (0.65–0.84), and 0.82 (0.75–0.89), respectively), all significantly lower than the value of 1.00 expected under perfect calibration (Wei et al., 2022).
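
As a rough sketch of how these metrics can be computed for a binary trait, the Python snippet below (using scikit-learn and statsmodels; the variable names prs and y are assumptions for standardized scores and case-control status) obtains the AUC and Nagelkerke’s pseudo-R2 by comparing a logistic model containing the PRS against an intercept-only model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm

def nagelkerke_r2(y: np.ndarray, prs: np.ndarray) -> float:
    """Nagelkerke pseudo-R2: logistic model with PRS vs. intercept-only model."""
    full = sm.Logit(y, sm.add_constant(prs)).fit(disp=0)
    null = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)
    n = len(y)
    cox_snell = 1 - np.exp((2 / n) * (null.llf - full.llf))
    return cox_snell / (1 - np.exp((2 / n) * null.llf))

# Assuming y (0/1 case-control status) and prs (scores) are available:
# auc = roc_auc_score(y, prs)   # discrimination: 0.5 = random, 1.0 = perfect
# r2 = nagelkerke_r2(y, prs)    # proportion of variance explained (pseudo-R2)
```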

Fig. 1

Graphical illustration of polygenic risk score calculation, where variant weights are obtained either directly from GWAS summary statistics or corrected by accounting for the LD between variants. Regardless of how the SNP weights are estimated, for each selected SNP the weight is multiplied by the number of effect (risk) alleles (in red) and summed over all variants to obtain a polygenic score

Traditional approach to PRS calculation: clumping and thresholding

The traditional approach to PRS calculation identifies nearly independent SNPs using a method called clumping and thresholding (C + T). Briefly, clumping involves sorting SNPs by importance (from the most significantly associated markers to the least), where the most important SNP is tagged as the “index” of a predefined non-overlapping window and the other correlated SNPs within that window are removed. As opposed to pruning, this procedure ensures that the index SNP is never removed, keeping at least one representative SNP for each region of the genome (Privé et al., 2019). The analysis proceeds with the next most significant SNP that has not yet been removed. In this fashion, the SNPs with the strongest signals (lowest p-values) are retained, allowing for the construction of a more predictive genetic score. Consequently, the user-defined correlation (r2) cut-off values will influence the prediction accuracy of the scores.
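
The sketch below illustrates the greedy logic of clumping in Python under simplifying assumptions (a precomputed pairwise r2 matrix and GWAS p-values, with the physical distance window omitted); in practice this step is performed by tools such as PLINK or PRSice on genotype data:

```python
import numpy as np

def clump(pvals: np.ndarray, r2: np.ndarray, r2_threshold: float = 0.1) -> list:
    """Return indices of index SNPs, keeping the most significant SNP per LD clump.

    pvals: GWAS p-values; r2: pairwise LD matrix (both illustrative inputs).
    A real implementation also restricts clumps to a physical window (e.g. 250 kb).
    """
    order = np.argsort(pvals)            # most significant first
    removed = np.zeros(len(pvals), dtype=bool)
    index_snps = []
    for j in order:
        if removed[j]:
            continue
        index_snps.append(j)             # the index SNP itself is always kept
        removed |= r2[j] > r2_threshold  # drop SNPs correlated with the index SNP
    return index_snps
```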

After clumping, the genetic variants are approximately independent. However, the question of how significant an association needs to be for inclusion in the PRS calculation remains unanswered. One proposed solution is p-value thresholding, which calculates PRS at various p-value thresholds. This reflects the fact that the parameters yielding the most predictive PRS model are a priori unknown (Choi et al., 2020). Therefore, PRS are calculated over a range of p-value thresholds and the most predictive one is chosen using a measure of predictive performance such as Nagelkerke’s pseudo-R2 or R2 (Choi et al., 2020). The C + T approach is implemented in various software packages such as PLINK, PLINK2, PRSice, and PRSice-2 (Chang et al., 2015; Choi and O’Reilly, 2019; Euesden et al., 2015; Purcell et al., 2007). While PRSice must be run several times for different p-value thresholds, PRSice-2 automates the thresholding procedure. The C + T approach is attractive for PRS calculation because it is user friendly, not computationally intensive, and produces results that are easy to interpret. However, the C + T method has several limitations: by applying a single optimal cut-off value, the approach assumes similar LD patterns throughout the genome, which misrepresents SNPs within long-range LD regions (Privé et al., 2021). Additionally, this method does not take into consideration the intrinsic genetic differences that can exist between the discovery and target datasets, even if both are subsets of the same population. This is especially true for African populations or admixed individuals, who exhibit high levels of genetic heterogeneity across and within populations. This phenomenon often results in missed association signals and therefore skewed polygenic scores.
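
A minimal sketch of the thresholding step is shown below, assuming clumped SNP effects and p-values together with validation genotypes and phenotypes (all variable names are illustrative); because many thresholds are evaluated on the same data, the chosen threshold should ideally be confirmed in an independent test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_threshold(G_val, y_val, beta, pvals,
                   grid=(5e-8, 1e-6, 1e-4, 1e-3, 0.01, 0.05, 0.1, 0.5, 1.0)):
    """Compute a PRS at each p-value threshold and keep the most predictive one.

    G_val: validation genotype dosages; y_val: 0/1 phenotype;
    beta, pvals: effect sizes and p-values of the clumped SNPs.
    """
    results = {}
    for t in grid:
        mask = pvals <= t                       # SNPs passing the threshold
        if mask.sum() == 0:
            continue
        prs = G_val[:, mask] @ beta[mask]       # Eq. (1) restricted to those SNPs
        results[t] = roc_auc_score(y_val, prs)  # or R2 for a continuous trait
    return max(results, key=results.get), results
```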

Non-traditional approaches to PRS calculation

In recent years, many approaches to PRS calculation have investigated the inclusion of all SNPs while accounting for the LD between them, thereby calculating LD-corrected variant weights (see Table 1). Unlike the C + T approach, which uses the marginal effect sizes of variants (i.e., the effect size of each SNP without considering the correlation between them), these methods intend to model the joint effect of all SNPs. Two approaches have mainly been explored: (1) Bayesian penalized regression and (2) frequentist penalized regression. Although the two approaches differ fundamentally in their underlying assumptions (e.g., prior distributions, L1 regularization, etc.) and their interpretation of uncertainty, they share common goals and methodologies in improving model performance through regularization techniques. These approaches are particularly useful for variable selection and for addressing issues such as overfitting, which can arise when the number of predictors exceeds the number of observations.

Bayesian penalized regression approach

The Bayesian framework leverages prior distributions and advanced modelling techniques to enhance prediction accuracy while maintaining interpretability (Ge et al., 2019; Hu et al., 2017; Lloyd‑Jones et al., 2019; Vilhjálmsson et al., 2015); these priors are selected in a way that best captures the genetic architecture of the trait of interest.

The older method, LDpred, utilizes a point-normal prior for modelling the SNP effect sizes (Vilhjálmsson et al., 2015). This assumes that a proportion (p) of variants are causal for a given trait of interest and that their joint effect sizes are normally distributed with mean zero and variance proportional to the heritability (h2) of the trait. Importantly, p and h2 are estimated independently. For p, a grid of values is explored, similarly to the p-value thresholding used in C + T, whereas h2 (or SNP-based h2) is estimated using LD score regression. A Bayesian Gibbs sampler is then used to estimate the joint effect sizes of the SNPs in the GWAS summary statistics (i.e., the discovery dataset) by modelling the LD pattern with a matrix derived from an external reference dataset. The LDpred model has been shown to outperform the traditional C + T approach (Khera et al., 2018; Vilhjálmsson et al., 2015). One example is a study by Vilhjálmsson and colleagues, who compared the performance of LDpred to C + T across a range of diseases and traits (schizophrenia (SCZ), multiple sclerosis (MS), breast cancer (BC), type 2 diabetes (T2D), coronary artery disease (CAD), and height). The results showed that LDpred provided significantly better predictions than the other approaches, with the relative increase in Nagelkerke R2 ranging from 11% for T2D to 25% for SCZ, and a 30% increase in prediction R2 for height; the p-values were 6.3 × 10−47 for SCZ, 2.0 × 10−14 for MS, 0.020 for BC, 0.004 for T2D, 0.017 for CAD, and 1.5 × 10−10 for height (Vilhjálmsson et al., 2015). Similar trends were consistently replicated in different benchmarking studies (Lloyd-Jones et al., 2019; Khera et al., 2018).
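
To give a flavour of the shrinkage involved, the sketch below implements only the closed-form special case reported for LDpred under the infinitesimal model with no LD, where the posterior mean effect is approximately h2 / (h2 + M/N) times the marginal estimate; this is a simplification for illustration and not the full point-normal Gibbs sampler that LDpred actually uses:

```python
import numpy as np

def ldpred_inf_no_ld(beta_hat: np.ndarray, h2: float, n_gwas: int) -> np.ndarray:
    """Shrink marginal effects under the infinitesimal model, ignoring LD.

    Simplified illustration only: E[beta_j | beta_hat_j] ~= h2 / (h2 + M/N) * beta_hat_j,
    with h2 = SNP heritability, M = number of SNPs, N = GWAS sample size.
    """
    m = len(beta_hat)
    shrinkage = h2 / (h2 + m / n_gwas)
    return shrinkage * beta_hat

# Example: with h2 = 0.5, M = 1e6 SNPs and N = 1e5 samples, the shrinkage
# factor is ~0.048, i.e. marginal effects are pulled strongly toward zero.
```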

Similarly to LDpred, the SBayesR software programme (Lloyd‑Jones et al., 2019) also builds on the point-normal idea. However, it combines a likelihood function that connects the joint effect sizes with the GWAS summary statistics with a finite mixture of normal priors on the variant effects. This means that the SNP effect sizes are modelled as a mixture of normal distributions with mean zero and different variances, typically four components. The first component has variance zero, capturing all the SNPs with zero effect, while the remaining components allow increasingly large effect sizes in the model. Using summary statistics from the largest GWAS meta-analysis on height and BMI (n = 700,000), Lloyd-Jones and colleagues demonstrated that, on average across traits and two independent datasets, SBayesR improves prediction R2 by 5.2% compared to LDpred and by 26.5% compared to C + T (Lloyd-Jones et al., 2019). Although SBayesR performs better than the LDpred model and the C + T approach, it shares the same pitfalls as the former: both remain unstable in long-range LD regions of the genome and therefore severely underperform for autoimmune diseases such as type 1 diabetes (T1D) and rheumatoid arthritis (RA) (Lloyd-Jones et al., 2019; Privé et al., 2021; Márquez-Luna et al., 2021). Additionally, neither LDpred nor SBayesR models functional enrichment of causal variant effect sizes, which, when accounted for, has been shown to improve polygenic prediction accuracy (Márquez‑Luna et al., 2021).
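
The snippet below simply draws effect sizes from a four-component mixture-of-normals prior of the kind SBayesR assumes (first component with zero variance); the mixture proportions and relative variances are placeholders chosen for illustration, not SBayesR’s estimated values, which the method learns from the data via MCMC:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_prior(m: int, pis=(0.95, 0.03, 0.015, 0.005),
                         rel_var=(0.0, 0.01, 0.1, 1.0), sigma2=1e-4):
    """Draw m effect sizes from a zero-mean, four-component normal mixture.

    pis: mixture proportions (placeholders); rel_var: relative variances,
    with the first component fixed at zero variance (null SNPs).
    """
    components = rng.choice(len(pis), size=m, p=pis)
    sd = np.sqrt(np.array(rel_var)[components] * sigma2)
    return rng.normal(loc=0.0, scale=sd, size=m)  # most effects are exactly zero

betas = sample_mixture_prior(10_000)
print(np.mean(betas == 0.0))  # roughly 0.95: the proportion in the null component
```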

To that effect, Márquez-Luna and colleagues proposed LDpred-funct, an extension of LDpred, which estimates the posterior distribution of causal effect sizes by accounting for LD between SNPs and leveraging trait-specific functional priors designed around the biological relevance of genetic variants (Márquez‑Luna et al., 2021). This approach recognizes that variants located in functional regions (e.g., coding, regulatory, or conserved regions) are more likely to contribute to complex phenotypes. The results showed a 4.6% relative improvement in average prediction accuracy (average R2 = 0.144 for 21 highly heritable traits in the UK Biobank; highest R2 = 0.413 for height) compared to SBayesR (Márquez‑Luna et al., 2021). This highlights the relative gain in predictive power obtained by incorporating functional priors consistent with the functional architecture of the complex trait under study.

Another software tool, PRS-CS, employs a continuous shrinkage (CS) prior on SNP effect sizes, which allows for adaptive shrinkage, where the amount of shrinkage can vary across coefficients based on the strength of the association signals from GWAS (Ge et al., 2019). This adaptability makes PRS-CS robust to varying genetic architectures across different traits and populations. Additionally, the method utilizes summary statistics from existing GWAS rather than requiring individual-level genotype data, making it feasible to apply in large-scale studies where individual-level data may not be available. Using data from the Partners Biobank, Ge and colleagues observed an average improvement of 18.17% and 11.41% across all traits when using PRS-CS and PRS-CS-auto, respectively, compared to LDpred. This improvement corresponded to around a 3-fold increase relative to C + T (Ge et al., 2019).

A newer method, LDpred2 (Privé et al., 2021), is an update to LDpred with two new modes. The first, LDpred2-auto, estimates p and h2 directly from the model, instead of testing several values and using LD score regression (Privé et al., 2021). The other mode, LDpred2-sparse (an option within the grid model), allows effect sizes to be exactly zero, similarly to the first mixture component of SBayesR. Additionally, LDpred2 addresses the instability issues present in earlier methods by providing a more stable workflow and by modelling long-range LD regions such as the HLA region of chromosome 6 (Evseeva et al., 2010; Privé et al., 2021). Some of the earlier methods rely on removing these regions to circumvent the problem (Lloyd‑Jones et al., 2019; Márquez-Luna et al., 2021; Vilhjálmsson et al., 2015), which has been shown to reduce prediction accuracy since these regions harbour many immunity-related genes (Evseeva et al., 2010; Privé et al., 2021). LDpred2 has been benchmarked against other polygenic score methods, demonstrating a slight improvement in predictive performance (Privé et al., 2021). For instance, it achieved a mean AUC of 65.1% across multiple traits from the UK Biobank, against 63.8% for lassosum, 62.9% for PRS-CS, and 61.5% for SBayesR.

Frequentist penalized regression approach

In the frequentist penalized regression framework, the goal is to minimize a penalized loss function that combines the model’s error with a penalty that discourages large coefficients. Unlike in the Bayesian framework, SNP weights are shrunk based on a penalty term, with no notion of prior beliefs about the parameters. This approach applies a penalty directly to the regression coefficients, which helps to mitigate issues such as collinearity and overfitting, thereby enhancing model interpretation and prediction accuracy. For example, ridge regression is known for its ability to handle multicollinearity by imposing a penalty on the size of the coefficients, while the Least Absolute Shrinkage and Selection Operator (LASSO) can shrink some coefficients to zero, effectively performing variable selection (Cule and De Iorio, 2013). The elastic net combines both LASSO and ridge penalties, providing a balanced approach that can be particularly useful when predictors are highly correlated (Kohannim et al., 2012; Tessier et al., 2015).
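
The toy example below contrasts these three penalties on simulated individual-level genotype data using scikit-learn; it is only meant to illustrate the behaviour of the penalties (LASSO zeroing out most weights, ridge shrinking all of them) and does not reproduce summary-statistic-based tools such as lassosum:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Simulate a sparse genetic architecture with more SNPs than individuals.
rng = np.random.default_rng(1)
n, m = 500, 2_000
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)  # genotype dosages 0/1/2
true_beta = np.zeros(m)
true_beta[:20] = rng.normal(0, 0.2, 20)               # a handful of causal SNPs
y = G @ true_beta + rng.normal(0, 1.0, n)             # continuous phenotype

lasso = Lasso(alpha=0.05, max_iter=10_000).fit(G, y)  # L1: many weights exactly zero
ridge = Ridge(alpha=10.0).fit(G, y)                   # L2: all weights shrunk, none zero
enet = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000).fit(G, y)  # L1 + L2 mix

print("non-zero LASSO weights:", np.sum(lasso.coef_ != 0))
```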

The software programme Lassosum was the first to use LASSO regression to shrink marker effects (Mak et al., 2017). This approach is robust against the issue of nonconvergence that is inherent in programmes relying on Bayesian sampling methods (Mak et al., 2017). In addition, Lassosum has been reported not only to surpass the performance of both LDpred and the C + T method, but also to demonstrate superior computational speed. The latter is a desirable attribute, especially as GWAS sample sizes increase. A recent update to Lassosum is Lassosum2 (Privé, Arbel, Aschard, et al., 2021), which has been re-implemented in the R package bigsnpr and uses the same input data as LDpred2, with no additional coding or computational overhead. Lassosum2 can therefore be run alongside LDpred2 with no loss of predictive performance. As PRS gain momentum due to the rise of GWAS, algorithmic developments are crucial to ensure that genetic prediction achieves its full potential.

Pathway-based PRS method

Evidently, a great amount of effort is being put into the development and optimisation of PRS calculation tools. Ideally, these approaches could help stratify individuals based on their relative risk of developing a particular disease. However, genome-wide PRS methods cannot always provide great insights into the heterogeneity of complex disease (Broekema et al., 2020; Choi et al., 
