Adoption and Optimization of Genomic Selection To Sustain Breeding for Apricot Fruit Quality

Abstract

Genomic selection (GS) is a breeding approach which exploits genome-wide information and whose unprecedented success has shaped several animal and plant breeding schemes through delivering their genetic progress. This is the first study assessing the potential of GS in apricot (Prunus armeniaca) to enhance postharvest fruit quality attributes. Genomic predictions were based on a F1 pseudo-testcross population, comprising 153 individuals with contrasting fruit quality traits. They were phenotyped for physical and biochemical fruit metrics in contrasting climatic conditions over two years. Prediction accuracy (PA) varied from 0.31 for glucose content with the Bayesian LASSO (BL) to 0.78 for ethylene production with RR-BLUP, which yielded the most accurate predictions in comparison to Bayesian models and only 10% out of 61,030 SNPs were sufficient to reach accurate predictions. Useful insights were provided on the genetic architecture of apricot fruit quality whose integration in prediction models improved their performance, notably for traits governed by major QTL. Furthermore, multivariate modeling yielded promising outcomes in terms of PA within training partitions partially phenotyped for target traits. This provides a useful framework for the implementation of indirect selection based on easy-to-measure traits. Thus, we highlighted the main levers to take into account for the implementation of GS for fruit quality in apricot, but also to improve the genetic gain in perennial species.

Apricot (Prunus armeniaca) is a perennial fruit crop pertaining to Rosaceae family and Prunus genus, which encompasses several economically important species such as peach, almond, cherry and plum. It is one of the leading stone fruit species due to its economic contribution to the fruit industry. From a biological standpoint, apricot is characterized by its diploid genome (2n = 2x = 16) of 294 Mb/1n and its high heterozygosity (Arumuganathan and Earle 1991). The availability of a high-quality genome sequence in peach, defined as a reference Prunus species highly genetically characterized (Infante et al. 2008; Verde et al. 2013), as well as the high level of synteny between the Prunus species, have paved the way for elucidating the genetics of key commercial traits in Prunus species (Aranzana et al. 2019). They provide both a powerful framework for apricot genetic improvement and valuable tools to elucidate the genetic architecture of traits of interest.

Since the sixties, apricot breeding programs have been geared toward conventional breeding based on mass field selection, a time-consuming and labor-intensive process, which might reach 15 to 20 years from pre-breeding to the release of a new variety. Besides the length of apricot breeding cycle, several biological features inherent to this species impede genetic progress such as its wide range heterozygosity and a preferential self-incompatibility regime that induces uneven production according to climatic conditions. Recently, a particular focus has been projected toward fruit quality, a dynamic concept which encompasses a broad range of attributes linked to attractiveness, flavor, taste and texture with reference to fruit color, balance of sugars and acids and shelf-life. The burgeoning interest in fruit quality aims at shaping a sustainable fruit industry taking into account consumer preference trends that are expressed in a competitive landscape faced with climate change. Furthermore, commercial depreciation due to ripeness deficiencies resulting from early harvest and susceptibility to flesh mealiness incented the breeders to circumvent the issues linked to postharvest quality and thus contribute to the enhancement of fruit quality metrics (Gatti et al. 2009). In the scope of intrinsic challenges of apricot, fruit quality-oriented selection schemes aim to meet consumers’ needs for improved quality attributes and address stakeholders’ demands in the apricot sector. Therefore, controlled cross-pollination schemes allow recombination of desirable characteristics according to the integrated concept of ideotype, which is likely to guide biological designs of improved varieties through the identification and the integration of causal variants for high-value traits as well as the elimination of deficiencies (Donald 1968; Ramstein et al. 2019).

Here emerges one of the prominent impetuses of marker-based breeding approaches, a breeding strategy whose feasibility strongly tailored by the genetic architecture of target traits. Indeed, marker-assisted selection (MAS), is particularly relevant for monogenic inheritance, while genomic selection (GS), a promising breeding approach that has revolutionized animal and plant breeding communities, is favored for oligogenic and polygenic inheritance (Kumar et al. 2012). GS is likely to capture the missing heritability of complex traits by modeling thousands of single nucleotide polymorphisms concomitantly (Makowsky et al. 2011; Resende et al. 2012a). Meuwissen et al.’s landmark article (2001) laid the foundation for predicting genetic merit in plant and animal breeding and thus identifying superior genotypes among selection candidates according to their whole-genome sequence information (de los Campos et al. 2013). Unlike MAS, which pinpoints putative genes underlying the traits of interest, GS potentially considers all markers’ effects without prior selection (Makowsky et al. 2011; Resende et al. 2012a). This breeding approach is at its outset for crop plants and notably for perennial trees that are characterized by long breeding cycles due to the length of juvenile phase and generation time. Therefore, the recourse to GS for perennial species arises from the need to accelerate the pace of the breeding process. The relevance of this breeding approach has been assessed in forest trees such as eucalyptus (Resende et al. 2012a), black spruce (Lenz et al. 2017), white spruce, loblolly pine (Resende et al. 2012b), maritime pine (Bartholomé et al. 2016) as well as in perennial fruit crops such as grapevine (Fodor et al. 2014), apple (Muranty et al. 2015), citrus (Minamikawa et al. 2017), cranberry (Covarrubias-Pazaran et al. 2018) and kiwifruit (Testolin 2011).

Large-scale genomic information against limited phenotypic records leads to an ascertainment bias due to the number of predictors (p), which is higher than the number of observations (n), resulting in multicollinearity and overfitting and accordingly low prediction performance (Desta and Ortiz 2014). To alleviate this statistical challenge due to dimensionality, a wide range of mathematical models are intended to infer linear combinations of the original predictors in order to reduce through shrinking regression coefficients back toward zero (Whittaker et al. 2000; Gianola et al. 2003; Solberg et al. 2009). The extent of shrinkage of the marker effects differ across prediction models. For instance, in ridge regression shrinkage is performed equally across markers. However, this assumption is likely to be unreal because some markers are in linkage disequilibrium (LD) with loci with no genetic variance (Goddard and Hayes 2007). Conversely, in models designed under the Bayesian framework, shrinkage of effects is marker-specific (Crossa et al. 2017). Further, the performance of GS is markedly influenced by several factors including marker density (Grattapaglia and Resende 2011; Lenz et al. 2017), training population size (Grattapaglia and Resende 2011), genetic relationship between training population and breeding population (Rincent et al. 2012; Isidro et al. 2015; Lenz et al. 2017), population structure (Zhong et al. 2009; Rincent et al. 2017), the extent of LD (Daetwyler et al. 2008; Wientjes et al. 2013; Liu et al. 2015), statistical models (Lorenzana and Bernardo 2009; Heslot et al. 2012; Resende et al. 2012b; Onogi 2020), trait heritability (Calus et al. 2008) as well as the genetic architecture of target traits (Daetwyler et al. 2010; Morgante et al. 2018). Along with the ideotype concept, multiple traits of interest can also be considered simultaneously through multivariate models to achieve more accurate predictions in comparison to single-trait models. Several simulation and empirical studies shed light upon the significant potential of multiple trait genomic prediction in optimizing prediction performance (Calus and Veerkamp 2011; Guo et al. 2014; Covarrubias-Pazaran et al. 2018; Karaman et al. 2018; Michel et al. 2018). In this regard, the selection index strategy permits breeders to obtain genotypes that concomitantly incorporate several desirable characteristics. However, the efficiency of selection for multiple traits simultaneously depends considerably on the genetic correlation between these traits, that reflects the extent to which selection for a focal trait triggers an indirect response to selection for a secondary trait (Akdemir et al. 2019; Rana et al. 2019). In conjunction with the optimization of prediction model design, the integration of the insights gained by elucidating the genetic architecture of traits under selection might be of great interest. Several studies have emphasized the potential of including genomic information underlying the variation of target traits in prediction models (Spindel et al. 2016; Fang et al. 2017; Lopes et al. 2017; Liu et al. 2019).

Therefore, the main objectives of our study were to (1) gain further insights into key fruit quality traits which are difficult to access, (2) evaluate the performance of GS prediction model applied to breeding for apricot fruit quality and (3) optimize GS accuracy by accounting for QTL mapping findings in prediction models and performing predictions under a multivariate framework.

Materials and MethodsPlant material

The plant material used in this study is a F1 pseudo-testcross progeny of 153 individuals issued from a cross between ‘Goldrich’ and ‘Moniqui’ cultivars, which exhibit contrasted fruit quality traits. ‘Goldrich’ cv., used as female parent, is a North American early-season apricot cultivar. Self-incompatible, it is characterized by large, firm, orange fruit without blush and with a high level of acidity (Muñoz-Sanz et al. 2017). ‘Moniqui’ cv., used as male parent, is a Spanish season apricot cultivar. Self-incompatible, it is characterized by large, soft and tasty white flesh fruit. ‘Moniqui’ is characterized by a high ethylene production, which results in a fast evolution at maturity and post-harvest, while ‘Goldrich’ presents a lower ethylene production, which results in an average fruit evolution at maturity and post-harvest. The F1 progeny was grown at the INRAE experimental field of Amarine in southern France. Seedlings were randomly planted on their own roots in 2005. Trees were managed under integrated management system which implies that orchards are geared toward a sustainable production system with a trend to reduce the use of phytosanitary products.

Phenotyping for fruit quality

The phenotypic characterization of the Goldrich × Moniqui (Go×Mo) progeny was carried out over two consecutive years 2006 and 2007, which showed contrasted climatic conditions for 10 quality parameters of agronomic interest. These traits refer to the criteria which underpin consumers’ perception of apricot fruit and meet the exigencies of stakeholders in the apricot sector.

A total of 40 fruits per genotype were randomly collected close to physiological maturity stage. Fruits were sorted according to their global firmness, determined by the pressure (kPa) required to achieve 3% deformation of fruit height with a multipurpose texture analyzer (Pénélaup, Serisud, Montpellier, France). Fruits were subdivided into three homogenous lots of four fruits per genotype of contrasting maturity: commercial maturity stage with pressure from 130 to 80 kPa, half ripe stage from 80 to 50 kPa and mature fruits with firmness less than 50 kPa. The physical, physiological and biochemical traits were measured on these three representative batches for all genotype. The fruit weight (g) was measured at the same time as firmness. The ground color (Hue.g) of the non-blushed side (unexposed to sunlight) was determined using a CR-400 chromameter (Minolta, Osaka, Japan) and expressed in the CIE 1976 L*a*b*color space (illuminant D65, 0° view angle, illumination area diameter 8 mm). Hue angle, was computed using the chromaticity coordinates Embedded ImageEmbedded Image and Embedded ImageEmbedded Image as follows:Embedded ImageEmbedded Image(1)The ethylene production rate was assessed as physiological parameter linked to maturity stage of climacteric fruits. Ethylene production, expressed in nmol Embedded ImageEmbedded Image, was measured by gas chromatography after 1 h of confinement in a hermetically closed jar (Chambroy et al. 1995; Bureau et al. 2009). Then, flesh color was measured and fruits were cut and frozen at -20° for further biochemical analyses. Fruit stones were weighed individually (kg). Fruit pieces were ground with an Ultra-Turrax T25 equipment (Ika Labortechnik, Staufen, Germany) to obtain a slurry. The refractive index (RI) which stands for the solid soluble content (SSC) was determined with a digital refractometer (PR-101 ATAGO, Norfolk, VA) and expressed in °Brix at 20°. Titratable acidity (TA) was determined by neutralization up to pH 8.1 with 0.1 N NaOH and expressed in meq 100 g-1 of fresh weight using an autotitrator (Methrom, Herisau, Switzerland). Soluble sugars (glucose, fructose, sucrose) and organic acids (malic acid and citric acid) were quantified using an enzymatic method using kits for food analysis (Boehringer Mannheim Co., Mannheim, Germany) and expressed in g 100Embedded ImageEmbedded Imageof fresh weight for sugars and meq 100Embedded ImageEmbedded Imageof fresh weight for acids. These measurements were performed with an automatic analyzer BM-704 (Hitachi, Tokyo, Japan).

Statistical modeling of the phenotypic data

Statistical modeling of the fruit quality attributes was performed using R software version 4.0.2 (R Core Team 2020). Significance assessment of variance components was carried out using ANOVA tests to determine the significant factors contributing to the phenotypic variation intended to be included in the adjustment model. In light of significance tests outcome, phenotypic data were adjusted using ‘lmer’ function provided in lme4 package (Bates et al. 2015) within a mixed model framework:Embedded ImageEmbedded Image(2)In equation 2, Embedded ImageEmbedded Image is the phenotypic value of the genotype i for the year j and the maturity group k, µ is the overall mean, Embedded ImageEmbedded Image is the random effect of the genotype i, Embedded ImageEmbedded Image refers to the fixed effect of the year j, Embedded ImageEmbedded Image is the fixed effect of the maturity group k corresponding to the fruit lot, Embedded ImageEmbedded Image is the interaction effect of the genotype i and the year j and Embedded ImageEmbedded Image is the random residual effect.

Heritability computation

Broad-sense heritability Embedded ImageEmbedded Image for fruit quality traits, defined as the proportion of phenotypic variance attributed to additive, dominance and epistatic patterns, was computed using equation 3:Embedded ImageEmbedded Image(3)where Embedded ImageEmbedded Image is the genetic variance, Embedded ImageEmbedded Image is the variance attributed to the interaction between genotypes and years, Embedded ImageEmbedded Image refers to the number of years and Embedded ImageEmbedded Image is the number of fruit lots.

Genotyping data

Genotyping by sequencing, performed according to the protocol described by Elshire et al. (2011), was carried out within the FruitSelGen project. Regarding the high level of synteny between the apricot and peach genomes, the fastq sequences were aligned to the peach genome (Verde et al. 2013). Raw data filtering, sequence alignment and variant calling were performed using GATK software (Genome Analysis Toolkit) (McKenna et al. 2010). The outcome of a further filtering process using VariantAnnotation package (Obenchain et al. 2014) resulted in a set of 61,030 SNPs with a genotype quality score greater than 20 and a missing rate lower than 5%. Out of the 184 individuals, 31 individuals exhibiting spurious genotypic profile were discarded and thus 153 individuals were kept for downstream analysis. SNP markers were coded as 0, 1 and 2, according to the number of copies of the alternative allele and missing marker information was imputed as the mean of the genotypic scores of non-missing data at the level of each maker.

Construction of the linkage maps

The advent of genomic selection to breed for apricot quality traits requires a better understanding of their genetic architecture. However, up to now, the genetic determinism underlying fruit quality in apricot was scarcely investigated (Ruiz et al. 2010). Thus, linkage mapping was performed in order to uncover the genetic architecture of the 10 fruit quality traits. Prior to QTL identification, two genetic linkage maps were constructed for the full-sib progeny using a pseudo-test cross mapping strategy (Grattapaglia and Sederoff 1994). The whole set of 61,030 SNPs was filtered according to Mendelian inheritance, and those presenting strong deviation from Hardy-Weinberg equilibrium (p-value < 1 × Embedded ImageEmbedded Image) were discarded using the function filterSegreg provided by rutilstimflutre package (Flutre 2015). Afterward, the markers which depicted more than 1% of missing information and more than 1% of genotyping errors were eliminated and linkage group (LG) clustering, marker ordering and genetic distance calculations were achieved by means of mstmap.data.frame function under ASMap package (Taylor and Butler 2017). Maps construction was performed using Kosambi’s mapping function and a logarithm of the odds ratio (LOD) of 3.

QTL detection

In order to provide insights into the genetic architecture of the target traits, we performed a composite interval mapping strategy using R/qtl package (Broman et al. 2003) with the aim of identifying the genomic regions underpinning apricot fruit quality. In this respect, 1,000 permutations were undertaken with a significance level set at 0.01 in order to identify putative QTL and determine the threshold of LOD scores. Then, the part of phenotypic variance explained by SNPs significantly linked to target traits was estimated. Additionally, a joint QTL detection analysis was performed on two independent datasets recorded in 2006 and 2007 with the aim of assessing the stability of QTL associated to the adjusted means corresponding to the phenotypic records. The graphical representation of the two genetic maps as well as QTL-linked markers was drawn using MapChart 2.3 software (Voorrips 2002).

Univariate genomic prediction modeling

Prediction of the genomic estimated breeding values was performed using a baseline model where the genomic information as well as the phenotypic records were fitted in order to estimate marker effects and thus the breeding values:Embedded ImageEmbedded Image(4)In equation 4, y is the vector of the phenotypic records, X is an incidence matrix for fixed effects relating fruit quality to the vector of fixed effects β, β is a vector of fixed effects estimates, Z is an incidence matrix for random effects relating fruit quality to the vector of random additive genetic effects u. Z is inferred from SNPs’ allelic dosage for each individual coded as 0, 1 and 2 according to the number of copies of the minor allele. The term u is the vector of random effects and e denotes the vector of random errors.

Random SNP effects were assumed to follow a normal distribution u ∼ N (0, IEmbedded ImageEmbedded Image) and random residual effects a normal distribution e ∼ N (0, IEmbedded ImageEmbedded Image) with I is the identity matrix, Embedded ImageEmbedded Image is variance of random effects and Embedded ImageEmbedded Image residual variance. GEBVs were computed as the sum of estimated marker effects multiplied by the corresponding allelic doses, as follows:Embedded ImageEmbedded Image(5)In equation 5, Embedded ImageEmbedded Image denotes the matrix of allelic doses for the Embedded ImageEmbedded Image individual in the validation partition at the Embedded ImageEmbedded Image locus and Embedded ImageEmbedded Image is the estimated effect at the Embedded ImageEmbedded Image locus.

Cross-validation procedure

The performance of prediction, mirrored in the predictive accuracy for the 10 key quality traits, was assessed using a cross-validation strategy where data were randomly partitioned into two subsets: 75% of the reference set was assigned to the training set intended to calibrate the prediction model and the remaining 25% was used as the validation set whose phenotypes were assumed to be unknown. This cross-validation scheme was iterated 100 times where samples were drawn with replacement from the reference set. Pearson’s correlation between predicted phenotypes and the observed ones was used to determine the accuracy of the predicted phenotypes.

Factors controlling genomic prediction accuracy (PA)

As several parameters control the prediction performance such as statistical models, training population size and marker density, these parameters were investigated using randomly drawn subsets of the reference dataset in order to point out the factors governing the potential variation in PA, to assess their respective effect.

Impact of statistical prediction models:

Within the framework of genomic prediction, various statistical methods have been proposed in literature. These models share the same prediction equation for the estimation of the GEBVs while they are grounded on different assumptions concerning markers effects. Five Bayesian models were explored: Ridge regression best linear unbiased prediction (RR-BLUP) model implemented in the rrBLUP package (Endelman 2011) as well as Bayes A, Bayes B, Bayes C, Bayesian LASSO (BL) and Bayesian ridge regression (BRR) implemented in the package BGLR (Pérez and de los Campos 2014).

Genomic prediction models vary in their assumptions regarding marker effects. RR-BLUP model postulates that all SNPs have identical variance with small but non-zero effect. All marker effects are homogeneously shrunk toward zero but markers are allowed to have unequal effects (Desta and Ortiz 2014). Unlike RR-BLUP, Bayesian models posit that each SNP has its own variance. Under Bayes A model, markers effects follow a normal distribution and variances follow a scaled inverted Embedded ImageEmbedded Image distribution (Meuwissen et al. 2001). Similar to Bayes A, Bayes B yields a scaled inversed Embedded ImageEmbedded Image with π > 0 so that several SNPs have zero effect. Unlike Bayes A, Bayes B applies both shrinkage and variable selection methods. Bayes A and Bayes B are both characterized by a prior probability Embedded ImageEmbedded Image) that a SNP has zero effect and a probability (Embedded ImageEmbedded Image) of marker effects that are shrunk toward zero. In Bayes A, all markers have non-zero effect π = 0. On the contrary, Bayes C states that the prior of zero SNP effects π is considered as unknown. Similar to Bayes B model, Bayes C applies both shrinkage and variable selection. But, unlike Bayes B, Bayes C is characterized by a Gaussian distribution (Desta and Ortiz 2014). BRR model, a Bayesian version of Ridge regression, assumes non-zero and normally distributed marker effects and equal marker variances. Similar to RR-BLUP, BRR applies a homogenous shrinkage across markers (de los Campos et al. 2013). As for Bayesian LASSO (BL), the Bayesian version of LASSO, it posits a compromise between Lasso and ridge regression (Park and Casella 2008). BL assumes a double exponential distribution of marker variances (Desta and Ortiz 2014).

Impact of training population (TP) size:

In order to explore the impact of population size on PA, we used three randomly drawn subsets of 43, 76, and 115 individuals corresponding to 25%, 50% and 75% of the respective study population to elaborate the prediction model and thus compute breeding values of the remaining individuals.

Impact of marker density:

Furthermore, we assessed the extent to which randomly selected marker subsets of different sizes, from 1 to 100% could affect the accuracy of prediction.

Genomic prediction optimization

Herein, we assessed two prediction strategies with a view to the improvement of PA.

Accounting for genetic architecture:

The first optimization scenario made use of the information brought by QTL mapping. Thus, SNPs tightly linked to QTL with medium to large effects were included in the prediction models as fixed covariates in order to assess the PA. The genomic prediction model which accounts for prior information on genetic architecture is defined as:Embedded ImageEmbedded Image(6)where y is the vector of phenotypic observations, X’ is the design matrix for fixed effects linking phenotypes to allelic doses of SNPs tagging QTL, Embedded ImageEmbedded Image is the vector of allelic doses of SNPs closely linked to QTL, Z’ is the design matrix for random effects linking phenotypes to the remaining SNPs, Embedded ImageEmbedded Image is the vector of allelic doses of the remaining SNPs and e is the vector of residuals.

The first phase of analysis was dedicated to QTL detection within the training population that was randomly drawn using 75% of the study population. The identified QTL were subsequently included in genomic prediction models. This procedure was repeated 100 times. The baseline model that includes QTL as fixed-effect covariates attributed different weights to SNPs according to their linkage to QTL. Accordingly, we aimed at assessing the potential of prediction models that treated all SNPs as random and those that placed greater weights on QTL.

Multivariate genomic prediction:

A second scenario dedicated to optimizing genomic selection accuracy is the multi-trait prediction implemented using the R package ‘sommer’ (Covarrubias-Pazaran 2016), which provides a framework for fitting multivariate prediction models. Hence, this prediction strategy was performed using Genomic BLUP (GBLUP) model whose equation is provided in equation 7:Embedded ImageEmbedded Image(7)where Embedded ImageEmbedded Image is a N × t vector of the phenotypic records for trait i with i = 1, …, t, X is an incidence matrix for fixed effects relating fruit quality to the vector of fixed effects Embedded ImageEmbedded Image, which is a vector of fixed effects estimates, Z is an incidence matrix for random effects relating fruit quality to vectors of random additive genetic effect Embedded ImageEmbedded Image, Embedded ImageEmbedded Image is a vector of random effects and Embedded ImageEmbedded Image denotes the vector of random errors.

The implementation framework of multi-trait models follows that of (Maier et al. 2015) and (Covarrubias-Pazaran et al. 2018):Embedded ImageEmbedded Image(8)For trait i (i = 1…t), random effects Embedded ImageEmbedded Image and Embedded ImageEmbedded Image, fitted within multivariate response model, are assumed to be normally distributed with mean zero with Embedded ImageEmbedded Image) and Embedded ImageEmbedded Image).

Assuming A is the additive genomic relationship matrix computed according to (VanRaden 2008), as:Embedded ImageEmbedded Image(9)In equation 9, W is the matrix of marker alleles coded as -1, 0 and 1 for homozygote, heterozygote and homozygote individuals, respectively, Embedded ImageEmbedded Image is the allele frequency of marker i.

Within this context, components of prediction model follow multivariate normal distribution with y ∼ MVN (Embedded ImageEmbedded Image), Embedded ImageEmbedded Image ∼ MVN (Embedded ImageEmbedded Image) and Embedded ImageEmbedded Image ∼ MVN (Embedded ImageEmbedded Image).

where Embedded ImageEmbedded Imageis the variance-covariance matrix for marker effects, Embedded ImageEmbedded Image is the variance-covariance matrix for residual effects, ⊗ is Kronecker product of variance-covariance matrices.Embedded Image

留言 (0)

沒有登入
gif