Seven-chain adaptive immune receptor repertoire analysis in rheumatoid arthritis reveals novel features associated with disease and clinically relevant phenotypes

Patients and samples

A total of 86 RA patients and 104 healthy subjects were included in the present study (Additional file 1: Table S1). All patients were collected from the outpatient’s clinics of the rheumatology departments of 12 Spanish University Hospitals involved in the Immune-Mediated Inflammatory Disease Consortium [71]. Controls were recruited from healthy blood donors from Spanish hospitals in collaboration with the Spanish DNA Bank. Enrolled patients fulfilled the ACR/EULAR 2010 classification criteria for RA and had more than 2 years of follow-up since diagnosis [72]. All recruited patients had an erosive disease defined as erosions in more than one joint group including hands and/or feet. The clinical response to TNF inhibition therapy was measured using the European League Against Rheumatism treatment response criteria [73]. For all patients, the Disease Activity Score (DAS28) was measured at baseline and after 12 weeks of TNFi therapy. According to the DAS28 variation and the DAS28 at the endpoint, RA patients were categorized into good (n = 42), moderate (n = 25), and non-responders (n = 17). Good and moderate responders were grouped into a single and larger responder group. A DAS28 threshold of 5.1 at baseline was used to categorize RA patients into high (i.e., DAS28 >  = 5.1, n = 48) and moderate-low (i.e., DAS28 < 5.1, n = 38) disease activity. All RA patients included in this cohort were Caucasian European and born in Spain. The main clinical and epidemiological characteristics of this cohort of RA patients are summarized in Additional file 1: Table S1.

Whole blood RNA extraction

Peripheral whole blood from venipuncture was dispensed into PAXgene™ Blood RNA Tubes (Qiagen). Whole blood samples in PAXgene™ Blood RNA Tubes were stored at − 80 °C until RNA extraction. RNA was extracted from PAXgene™ Blood RNA Tubes via the PreAnalytiX PAXgene™ blood RNA kit (Qiagen) and automated using the QIAcube.

High-throughput sequencing of the seven receptor chains of the study population

Two independent next-generation sequencing libraries covering all TCR (i.e., TRA, TRB, TRD and TRG) and BCR (i.e., IGH, IGL and IGK) chains were generated from each sample using iR-RepSeq + 7-Chain service (iRepertoire, US). This technology allows the amplification of the seven receptors chains by addressing every known source of errors in AIRR-seq analyses [30]. In this procedure, the first set of primer pairs for each V-J combination is used, allowing the extension with tags that, in turn, enable a second set of primers to be utilized for global amplification of all seven chains. This essentially bias-free amplification was conducted in a single assay using UMI during the reverse transcription step. The introduction of UMI to distinguish between single RNA molecules during cDNA synthesis was essential to minimize the impact of PCR duplicates and sequencing errors. All necessary reagents for amplification and purification are preloaded into the cassette. Extracted RNA (1000 ng) with an appropriate volume of RT mix and nuclease-free water were added into the cassette, which was processed by the iR-Processor. The instrument can automatically set up and carry out all amplification and purification. Briefly, RT is performed using Qiagen OneStep RT-PCR mix (Qiagen). First-strand cDNA was selected, and remnant primers were removed by SPRIselect bead selection (Beckman Coulter) followed by the second round of binding and extension with the V-gene primer mix. After binding and extension, SPRIselect beads were used to purify the first and second-strand synthesis products. Library amplification was performed with a pair of primers that are specific for communal sites engineered onto the 5′ end of the C- and V- primers used in first- and second-strand synthesis. The final constructed library includes Illumina dual index sequencing adapters, a 10-nucleotide UMI, and an 8-nucleotide internal barcode associated with the C gene primer. Amplified libraries were multiplexed and pooled for sequencing on the Illumina NovaSeq platform with a 500-cycle kit (250 paired-end reads) at Hudson Alpha Institute for Biotechnology (Huntsville, AL, USA). The output of the immune receptor sequence covers within the first framework region through the beginning of the constant region, including the CDR3 hypervariable region.

AIRR-seq data processing

Illumina paired-end reads were demultiplexed and collapsed by 8 base pairs barcode and 10 base pairs UMI tag using the default MIG size threshold set by MiGEC v1.2 software [74]. In order to assemble the full TCR/BCR sequences, the resulting UMI-collapsed fastq files were mapped to the V, D, J, and C reference sequences from the International ImMunoGeneTics (IMGT) database using the MiXCR software (v3.0.14) with the clustering parameters turned off [75, 76]. From the two alignment algorithms that are implemented in MiXCR software which were developed for immune repertoire data based on a seed-and-vote strategy [77], the default Kaligner and its own built-in reference library (repseqio v1.7) were used. Sequences that were out-of-frame or contained stop codons were filtered out from the analysis. The resulting AIRR-seq data contained sequences from the seven chains that cover the receptor sequence spanning from the FR1-IMGT to FR4-IMGT region, as well as the constant region of the receptor. Sequencing information contained within the constant region of IGH was used to compute the isotype-specific features described in the “Isotype information contained within the constant region of IGH” section. In order to increase the statistical power of the present analysis, AIRR-seq data from each sample and its corresponding replicate were merged by the CDR3 nucleotide sequence. The total number of reads and UMIs obtained at each stage of the pipeline are summarized in Additional file 2: Table S2.

Genome-wide genotyping and HLA imputation

All individuals included in the case–control cohort from the Spanish population (n = 190) were genome-wide genotyped using Illumina Global Screening Array-24 Multi Disease version 2.0 (Illumina, USA) at Life and Brain Center (Bonn, Germany). Genotype calling and quality control analyses were performed using GenomeStudio (Illumina, USA) and PLINK software, respectively [78]. Two samples showing a call rate < 0.95 were excluded from the study and, therefore, samples from 84 RA patients and 104 healthy subjects were finally available for analysis. Using GWAS data from our study population, we performed the imputation of the HLA classical alleles and amino acid polymorphisms. HLA imputation was conducted using the SNP2HLA V.1.0.3 software [79].

Definition of AIRR variablesClone

In the present study, clones were defined by the CDR3 amino acid sequences. Accordingly, the number of unique clones detected within the AIRR of a particular individual corresponds to the total number of unique CDR3 amino acid sequences detected in this individual.

Clone publicity

In the present study, the publicity degree of each clone was defined as the number of individuals where the clone is detected.

Meta-clone

Groups of clones showing a high similarity at the CDR3 amino acid sequence level. More specifically, a meta-clone is a community of clones differing by at most one amino acid in their sequence that is identified within the general clone sequence similarity network (Fig. 4e).

K-mer

K-mers are groups of amino acids of length K contained within the CDR3 amino acid sequence [80]. Here, the CDR3 amino acid sequence of each clone was broken into overlapping k-mers of 4 amino acids of length.

Diversity

Diversity is one of the most commonly used metrics to characterize and quantify the vastness of the AIRR. Here, diversity was calculated at the chain level according to the observed frequency of clones within an individual’s AIRR. Given the lack of a single measure that efficiently captures all diversity features and the availability of different measures that provide complementary information on the AIRR diversity [81], we followed a recent TCR repertoire analyses on T cell subsets and we computed different diversity measures to characterize the diversity of the AIRR in RA: Diversity 20 (D20), Diversity 50 (D50), Gini-Simpson, Inverse-Simpson, and Shannon entropy [54]. D20 and D50 measures represent the number of dominant unique clones which make up 20 and 50% of the total clones, respectively. Gini-Simpson measure is computed as the unity minus the Simpson’s index and indicates if the repertoire is oligoclonal with a few highly frequent clones (i.e., low values) or polyclonal with a similar representation of each clone (i.e., high values). For the present analyses, the Gini-Simpson index was log-transformed. Inverse-Simpson index was obtained using the weighted arithmetic mean to quantify the average proportional abundance of clones, thereby indicating an even distribution of clones (i.e., high values) or clonal expansion (i.e., low values). Finally, the Shannon entropy was computed to account for both the unique clone number (i.e., repertoire richness) and the clone relative abundance (i.e., repertoire evenness).

Clone length

The length of each clone was determined by the number of amino acids that are present in its CDR3 amino acid sequence.

Chain usage

In order to determine the relative mRNA expression of each immune receptor chain, we computed the chain usage measure. For each individual, the usage of each immune receptor chain was defined as the proportion of chain UMIs.

V and J gene segment usage

The genomic rearrangement between V and J gene segments determines the V(D)J junction, which includes the CDR3 hypervariable region. The AIRR of an individual can be therefore described not only by the frequency of CDR3 sequences, but also by the frequencies of each V and J segments as well as the frequency of each VJ combination. Accordingly, we computed the usage of each V, J, and VJ segments as their frequency within the AIRR of each individual.

Isotype information contained within the constant region of IGH

In order to characterize the molecular signatures that characterize IGH isotypes in RA patients, we have computed four informative measures of the antibody functional role. Using sequencing information contained within the constant region of IGH, we have computed the isotype use (i.e., percentage of unique VDJ sequences per isotype), isotype class switching recombination index (i.e., frequency of unique VDJ regions expressed as two isotypes), and both the percentage of mutated IgD/M and degree of somatic hypermutations (i.e., average number of mutations per unique VDJ region sequence) as previously described [23]. In order to quantify these isotype-specific features, the exportAlignments function of MiXCR software was used to get maximum length sequences. Given the small number of occurrences in which an IgE primer picked up an IgG sequence, which were incorrectly called as IgE by the MiXCR pipeline, we conducted an ad hoc post-processing step that corrected this spurious isotype calling by realigning the reported IgE sequences with the Smith-Waterman algorithm [82]. More specifically, in order to compute the percentage of somatic hypermutations at the sample level, sequence information from CDR1 and CDR2 regions (i.e., starting from CDR1 until the end of the V region), which influence B cell biology, was also considered. These analytical criteria align with the standards utilized by other well-established methods for BCR analysis, such as IgBlast, in the context of somatic hypermutation estimation. Unlike IgBlast, the MiGEC/MiXCR pipeline employed in the present study excels at handling UMI data and accurately distinguishes somatic hypermutations from PCR errors. This particularity of the MiGEC/MiXCR pipeline provides a high degree of confidence in the estimation of somatic hypermutations.

Statistical analysisInfluence of technical factors and epidemiological variables on AIRR properties

We first investigated the impact that both technical factors (i.e., sequencing depth, sequencing plate, RNA integrity and RNA concentration) and epidemiological variables (i.e., age and gender) have on the main AIRR properties (Additional file 3: Fig S1). After detecting technical factors showing a high degree of collinearity, we tested the influence of non-redundant technical variables and epidemiological variables on each AIRR measure. To ensure that our analyses are not biased by technical aspects, technical factors influencing AIRR measures (r2 > 0.25, P < 0.05) were considered when the epidemiological variables were tested for association with these measures. The association between quantitative variables was assessed using the Pearson correlation. The association between categorical variables with two and more than two labels was determined using the Wilcoxon and Kruskal–Wallis tests, respectively. The complete list of associations is detailed in Additional file 3: Fig S1. The downstream AIRR-seq analyses presented in the present study were adjusted by age and gender as well as in accordance with the statistical association observed between each technical variable and AIRR measure.

Clone publicity analysis

Since the degree of clone sharing between individuals is strongly influenced by the number of individuals included in each group of interest, we performed a rarefaction analysis to compare the clone publicity between groups with an unbalanced sample size.

V and J gene segment usage analysis

In order to characterize the gene segment usage profile of our case–control cohort, gene segments showing a usage < 5% were excluded from the study. Subsequently, we conducted a principal component analysis to determine the potential of gene segment usage to differentiate RA patients from healthy individuals. The two components explaining the largest fraction of the gene segment usage variance were selected for the analysis.

Clone length comparative analysis

We conducted a multi-step statistical analysis to detect (i) shifts in the clone length distribution between two conditions and (ii) the amino acid positions at which the differential pattern starts. To identify shifts in the clone length distribution between the groups of interest, the frequency of clone abundance was compared using the Wilcoxon test. Differences in the abundance of clones with a particular length of the CDR3 amino acid sequence between two conditions were evaluated using Fisher’s exact test. To better characterize the amino acid positions at which the differential pattern starts, we computed both the empirical cumulative density distribution function and the quantile–quantile plots. These graphical representations were used to determine if the clone length frequencies observed in two different conditions follows the same probability distribution.

Baseline association analysis

In order to identify AIRR-seq variables associated with a trait of interest, we used logistic and negative binomial regression models according to the distribution followed by the AIRR-seq variable analyzed (Additional file 35: Table S28). Epidemiological variables and technical factors like sequencing depth were included as covariates in all statistical models.

Given the vastness of the AIRR and the complexity of translating repertoire information contained in a few milliliters of blood to the whole individual’s AIRR (i.e., sampling bias), discovering clones associated with complex traits is highly challenging. The little clone sharing (i.e., clone publicity) detected between our study samples leads to an extremely zero-inflated clone distribution (Additional file 36: Fig S8, Additional file 37: Fig S9). This particularity of the AIRR-seq data represents a major analytical challenge to identify disease-associated clones with high confidence [24, 60]. To identify clones that are over- or under-represented in disease conditions, in the present study we adapted the Hurdle model [61]. This model is commonly used in single-cell RNA-seq studies and allows the comparison of extremely zero-inflated clone abundance distributions [83]. It is a two-part generalized linear model that simultaneously models the rate of clone abundance over the background of various clones and the positive clone abundance mean. This modelling allows the inclusion of covariates in both the discrete and continuous parts of the model allowing the adjustment of the model for confounding factors. To adjust our model for important factors of AIRR-seq analyses, we designed and implemented the Clone Detection Rate (CDR) measure that acts as a proxy for both technical (e.g., sampling bias) and biological factors (e.g., RNA volume and extrinsic factors other than the clinical phenotype of interest) that globally influence clone abundances. The CDR was computed as the standardized proportion of clones that are detected in each sample. Together with the statistical properties of the Hurdle model, the adjustment for the variability of AIRR-seq data captured by the CDR makes the Hurdle model more effective at controlling the impact of the technical factors and reducing the false positive rate than other analytical strategies commonly used in AIRR-seq analyses (e.g., Fisher’s exact test). Before using the Hurdle model, clone counts were adjusted for the library size, multiplied per million and converted to the logarithmic scale. After excluding those clones detected in < 1% of samples (Additional file 38: Table S29), the association of each clone with disease conditions was tested using the Hurdle model adjusted for CDR, age, and gender. The false discovery rate (FDR) method was used to account for multiple testing [84].

The stringent multiple testing correction that must be applied to the association analysis at the single-clone level can reduce the power to identify disease-relevant clones. This statistical power limitation raises the need of developing new approaches to identify disease-associated clones. Given that clones with high sequence similarity are likely to recognize similar antigens [34], we exploited the similarity information across CDR3 amino acid sequences using network analysis to identify meta-clones associated with RA (Fig. 4e). Our meta-clone approach starts with the construction of the CDR3 sequence similarity network of millions of clones that follow an extremely zero-inflated frequency distribution. This is built using the mutation.network function of the tcR R package [85]. In this network, edges connect those clones that differ by at most one amino acid. Communities of highly connected clones (i.e., meta-clones or groups of clones with a high CDR3 amino acid sequence similarity) are subsequently identified by directly optimizing a modularity score with the cluster.fast.greedy function of the igraph R package [86]. The expression of each meta-clone is then computed by aggregating the expression values of the community of highly similar clones. Finally, the association of each meta-clone (i.e., detected in > 1% of samples) with disease conditions is tested using the Hurdle model adjusted for age, gender, and detection rate as previously described (Additional file 38: Table S29). The same analytical approach was used to test the association between k-mers and disease conditions (Additional file 38: Table S29).

Longitudinal association analysis

In order to evaluate the impact that TNFi drugs have on the main AIRR-seq features, we implemented a logistic and negative binomial regression (i.e., determined by the distribution of the AIRR-seq variable analyzed, Additional file 35: Table S28) using a mixed effect model. Like in the baseline association analysis, sequencing depth and epidemiological variables were included as covariates in the mixed effect models.

Similar to the baseline association analysis, studying the dynamics of the AIRR at the resolution of single clones must allow to distinguish true differences in clone frequencies from experimental noise. To address this analytical challenge in the study of the impact of TNFi therapy at the single-clone level, we used NoisET [31]. Briefly, NoisET is a newly developed Bayesian approach that exploits the information of each sample and replicate for learning a noise model that accounts for read counts variability due to sampling, library preparation, and expression noises. Following the method’s guidelines, we selected the negative binomial model for learning the experimental and technical noise. According to the AIRR-seq data required for the probabilistic method implemented in NoisET, we used the CDR3 nucleotide sequence of each clone to infer the noise model that allowed the unbiased identification of contracted and expanded clones after TNFi therapy.

Differences in the number of significantly contracted and expanded clones between BCR and TCR chains were determined using Fisher’s exact test. The effect of TNF inhibition on contracted and expanded clones was compared using the Wilcoxon test. The same statistical test was used to compare the effect of TNF inhibition on contracted and expanded clones between BCR and TCR chains.

Grouping lymphocyte interaction by paratope hotspots

To identify groups of TRB clones with shared specificity that are expanded in RA, we used the grouping of lymphocyte interactions by paratope hotspots (GLIPH) algorithm version 2 [36]. GLIPH2 method is designed to cluster clones from the TRB chain according to their probability of binding the same Major histocompatibility complex-restricted peptide antigen. In the three-step algorithm implemented in GLIPH2, global similarity and local similarity are first computed. While the former is based on differences on the CDR3 sequence of up to one amino acid, the latter is based on the sharing of amino acid motifs among CDR3 sequences (i.e., > tenfold enriched and < 0.001 probability of occurring at this level of enrichment in the naive repertoires used as reference). Global and local similarities are then used to construct clusters of TRB clones with shared specificity. The enrichment of each cluster of TRB clones in CDR3 length, clonal expansions, shared HLA alleles in subjects, motif significance, and cluster size is next tested. Finally, the resulting probabilities of each enrichment are combined and summarized in a single score. Given that TRB clones quantified in our study population were determined from bulk RNA and no information from T cell types was available, the enrichment of the TRB repertoire was tested against a naive repertoire of both CD4 + and CD8 + T cells. To avoid false positive results, TRB clusters composed of less than three clones and detected in less than three individuals were removed from the analysis.

Genetic association between HLA alleles and RA risk

The imputed HLA alleles were tested for association using a logistic regression model. In order to increase the power of identifying HLA alleles associated with RA risk in the Spanish population, we analyzed a case–control cohort of 1191 RA patients and 1558 healthy controls that we had previously collected [87]. The genetic association results were adjusted for multiple testing using the FDR method [84].

Estimating the AIRR from bulk RNA-Seq data

In order to obtain the immune repertoire from the seven receptor chains that are expressed in blood and synovial tissue of RA patients [32], we leveraged the information provided by TCR and BCR transcripts that are present in a publicly available bulk RNA-seq dataset using MiXCR software [75]. MiXCR aligns paired-end reads to a built-in library of V, D, J, and C gene sequences and, subsequently, information from the alignment of the two reads is aggregated to increase the gene assignment accuracy. After correcting for PCR and sequencing errors, identical and homologous reads are assembled into clones. In order to maximize the quantitative information available for analysis, low-quality reads are rescued by mapping them to previously assembled high-quality clones.

Clone characterizationAmino acid sequence similarity

In order to quantify the similarity between amino acid sequences, we implemented two analytical approaches. The sequence similarity among a given set of clones was determined using a network analysis strategy. This analytical strategy consisted on building a CDR3 amino acid sequence similarity network. Graphically, each clone is represented as a node and edges connect CDR3 amino acid sequences differing by at most one amino acid. The node size is proportional to the number of similarity connections with other clones and each node is divided into as many fractions as individuals have the clone. Nodes are displayed on the plane in accordance to the physical model of springs that is used by the Kamanda-Kawai layout algorithm implemented in igraph R package. The degree centrality value of the sequence similarity network was used to estimate the similarity among the set of clones. The similarity among a given set of smaller amino acid sequences (i.e., k-mers) was determined using the Levenshtein distance metric, which represents the minimum number of characters that differs between two strings of characters [88]. Unlike the commonly used Hamming distance, which is limited to count character substitutions between two sequences of the same length [89], the Levenshtein distance metric considers also the presence of character insertions/deletions and, therefore, it can provide additional insights into the characterization of the immune repertoire [90].

Biochemical profile of the CDR3 amino acid sequence

The sequence profile of a given set of CDR3 amino acid sequences was determined using the ClustalW alignment algorithm with the R package msa [91]. The sequence logo of the resulting alignment was built using the probability method provided by the R package ggseqlogo [92]. Each amino acid was colored according to its most characteristic biochemical feature (i.e., acidic, basic, hydrophobic, neutral or polar). In order to refine the biochemical characterization of a given set of CDR3 amino acid sequences, we studied additional biochemical properties using the Peptides R package. To determine whether the biochemical profile of expanded and contracted clones differs between responder and non-responder patients to TNFi therapy, the enrichment in each biochemical property was compared using a logistic regression model adjusted for the length of the CDR3 amino acid sequence. After weighting the biochemical profile of the significantly expanded or contracted clones by the magnitude of the frequency change, the same analytical approach was used to quantify the global effect that TNF inhibition exerts on the AIRR biochemical profile that characterizes responder and non-responder patients.

Comparison against a random repertoire

In order to determine whether a given set of CDR3 amino acid sequences are characterized by a particular biochemical property or a higher sequence similarity than expected by chance, we have used a resampling strategy. Using this analytical approach, a random repertoire was generated by sampling > 1000 times a random set of clones from the same chain and size as a group of clones of interest (e.g., contracted IGH clones). The random set of clones was sampled from the observed clones and the probability of being sampled was equally assigned to each one.

The statistical significance of the feature (e.g., sequence similarity or biochemical property) was computed by contrasting the observed metric in a given set of clones against the null distribution obtained after sampling multiple times (i.e., here > 1000 and 1000 for sequence similarity and biochemical properties, respectively) a random set of clones from the same size and receptor chain. After obtaining the statistical significance of enrichment and depletion in a particular biochemical feature, both statistics were log-transformed (i.e., − log10 and log10 for enrichment and depletion p-values, respectively) and subsequently aggregated into a single biochemical enrichment score (BES).

Development of the multi-chain AIRR predictor

To investigate whether AIRR molecular data could be translated into information of clinical utility, we built and tested predictor models. Multi-chain AIRR features (n = 4521 variables) and HLA genetic information (n = 6407 genetic variants) from 84 RA patients and 104 control individuals were evaluated as predictors (Additional file 39: Table S30). The predictive power of AIRR information was separately analyzed for each AIRR feature as well as aggregating information from different AIRR features. The analytical approach selected for the present predictive analysis was composed by the random forest algorithm and leave-one-out cross-validation strategy. Briefly, this analytical strategy consists of four consecutive steps. First, the dataset is split into a training set and a testing set, leaving only one observation “out” from the training set as a testing set. Second, the statistical model (i.e., here random forest) is built using only data from the training set. Third, the model is used to estimate the class probability of the left-out observation and compared to the real (observed) class. This process is repeated for all observations in the dataset to calculate the global classifier performance.

To avoid biased prediction resulting from the analysis of imbalanced classes (e.g., more TNFi responders than non-responders), we used the random forest algorithm with down-sampling (i.e., sampling the majority class to make their frequencies closer to the minority class). The predictive performance was determined using five discriminative measures: (i) accuracy (i.e., number of correct predictions out of the total number of predictions), (ii) sensitivity, (iii) specificity, (iv) precision, and (v) false positive rate. For the most powerful predictor both the area under the receiver operating characteristic and the area under the precision-recall curve were computed.

留言 (0)

沒有登入
gif