Human-specific protein-coding and lncRNA genes cast sex-biased genes in the brain and their relationships with brain diseases

Data collection

RNA-seq data were collected from multiple resources, including 510 human samples covering four developmental periods (fetal period, childhood, adolescence, and adulthood) and 16 brain regions from the psychENCODE website (http://development.psychencode.org) [16], 176 macaque samples covering two developmental periods (5 years old, 10 years old) and 14 brain regions from the GEO website (accession GSE128537) [17], and samples and controls from patients with brain diseases from public databases (Supplementary Table 1). The 16 human brain regions included the neocortex (A1C, DFC, IPC, ITC, M1C, MFC, OFC, S1C, STC, V1C, and VFC), amygdala (AMY), cerebellar cortex (CBC), hippocampus (HIP), thalamus (MD), and striatum (STR) (Supplementary Table 2), and the 14 macaque brain regions are their equivalent. These two macaque developmental periods correspond to adolescence and adulthood in humans [18]. On average, each region has 7.9 samples at each developmental period. Gene expression data are in the form of count matrices.

The brain disease data were obtained from patients with schizophrenia (SCZ) [19], autism spectrum disorder (ASD) [20], and brain tumors (low-grade glioma, LGG, and glioblastoma, GBM) (httts://xena.ucsc.edu). The UCSC Xena database is built upon The Cancer Genome Atlas (TCGA) database and the GTEx database [21, 22] by reprocessing data and removing batch effects (https://xenabrowser.net/datapages/) [23]. We used the Gene Expression Profiling Interactive Analysis (GEPIA2) platform, which establishes the tissue matching information between the above two databases [24], to obtain 152 GBM tumor samples, 508 LGG tumor samples, and 206 normal brain samples (derived from the cortex and frontal cortex regions). The SCZ RNA-seq data were obtained from the http://eqtl.brainseq.org/phase2/ website. The data cover DFC and HIP regions [19]; the DFC data comprises 149 diseased and 210 normal samples, and the HIP data comprises 130 diseased and 228 normal samples. The ASD RNA-seq data were obtained from the https://www.synapse.org/#!Synapse:syn4587609 website and included 45 diseased and 40 normal samples from the adult brain [20].

Identification of sex-biased genes and sex-biased gene expression

Multiple factors (e.g., sex and age) influence the regulation of tissue- and organ-specific gene expression. When examining some factors that exert “fixed effects”, other factors may become confounding factors. Thus, the effects of different kinds of factors should be properly estimated because confounding factors may generate misleading results. Among the multiple methods developed to handle different factors, linear mixed models (LMMs) can powerfully differentiate variations in gene expression caused by factors of interest and confounding factors [25,26,27]. We therefore used the LMM to identify four types of sex-biased genes with different factors of interest and confounding factors.

First, “spatiotemporal-specific sex-biased genes” are genes showing spatiotemporal-specific sex-biased expression. To use the LMM to detect these genes, we combined the “age” and “sex” factors in the LMM equation into an integrated categorical variable called “AgeSex” (which has 4*2 = 8 values, including fetus-female, fetus-male, childhood-female, childhood-male, adolescent-female, adolescent-male, adulthood-female, and adulthood-male). We subsequently applied this LMM to samples from one brain region using the voom function in the limma R package [28]:

$$\:\text\:\sim\:_+_\text\text\text\text\text\text+_\text\text\text+_\text\text\text+_\text\text\text\text+\epsilon\:.\:$$

Here, Y indicates the gene expression level, the postmortem interval (PMI) and the RNA integrity number (RIN) were treated as fixed effects, and the sequencing processing site (Site) was treated as a random effect. We used male samples from the same period and region as the reference to detect sex-biased genes for each period and region, yielding 16*4 = 64 gene sets. The union of the 64 gene sets showing sex-biased expression in at least one region or period contained 7647 genes (Supplementary Table 3). Second, “period-specific sex-biased genes” were defined as genes showing sex-biased expression in a specific period. To detect these genes, we fitted the following LMM to samples from all regions in the same period:

$$\begin \text\:\sim\:_ & +_\text\text\text+_\text\text\text\text\text\text\\ & \quad +_\text\text\text+_\text\text\text+_\text\text\text\text+\epsilon\:.\end $$

In this situation, since samples of a period come from all regions but only period-specific sex-biased expression was considered, “region” is a confounding factor. Thus, “Region,” “PMI,” and “RIN” were treated as fixed effects, and “Site” was treated as a random effect as before. This model yielded four sets of genes (Supplementary Table 4). Third, “region-specific sex-biased genes” were defined as genes showing region-specific sex-biased expression. To detect these genes, we fitted the following LMM to samples from one brain region:

$$\begin \text\:\sim\:_ & +_\text\text\text+_\text\text\text+_\text\text\text\\ & \quad +_\text\text\text+_\text\text\text\text+\epsilon\:.\end $$

In this situation, since samples in a region come from all periods, “age” is a confounding factor. Thus, “Age,” “PMI,” and “RIN” were treated as fixed effects, and “Site” was treated as a random effect. This model yielded 16 sets of genes (Supplementary Table 5). Fourth, “consistently sex-biased genes” were defined as genes showing consistent sex-biased expression across regions and periods. To detect these genes, we fitted the following LMM to samples from all regions and periods:

$$\begin \text\:\sim\:_ & +_\text\text\text+_\text\text\text+_\text\text\text\text\text\text\\ & \quad +_\text\text\text+_\text\text\text+_\text\text\text\text+\epsilon\:.\end $$

In this situation, since “consistently” means consistent across regions and periods, “region” and “age” were controlled as confounding factors. Thus, “Age,” “Region,” “PMI,” and “RIN” were treated as fixed effects, and “Site” was treated as a random effect as before. This model yielded one set of genes.

For multifactor (multivariable) systems, the false discovery rate (FDR) was proposed for correcting multiple testing [29]. The local FDR (LFDR) is an improved method that can measure the significance of specific observations (e.g., LFDR j denotes the probability that effect j would be a false discovery) [30]. Recently, the importance of estimating the size of effects has been acknowledged, and many measures of effect size, including the standard mean difference, odds ratio (OA), and Cohen’s d, have been developed. It is proposed that measuring confidence in the sign of each effect matters more than the confidence in each effect being nonzero, because being confident in the sign of an effect logically implies that we are confident it is nonzero [31]. With this notation, the local false sign rate (LFSR) method was developed, which takes two inputs—an effect size estimate and the corresponding standard error—rather than the usually used p value or z score [31]. The application of LFSR to genomic data analysis suggests that LFSR outperforms LFDR [32]. In theory, LFSR, which measures both effect size and confidence in each effect and combines the calculation of the two, is preferable to LFDR. Furthermore, both the fold change and the coefficients of the LMM reflect the effect size (i.e., estimation of the effect size). For these reasons, we used LFSR to compute both the effect size and significance under multiple conditions via the mashr R package developed by Stephens’ team [32]. “beta_matrix” and “se_matrix”, the coefficient matrix and standard error matrix derived from the fitted LMM, are two inputs to the mashr function mash_set_data() (i.e., mash_set_data(beta_matrix, se_matrix)).

For the sets of spatiotemporal-specific sex-biased genes, we performed a cross-region meta-analysis for each period via the mashr package, which allowed us to correct multiple testing across regions and periods. Genes with |log2FC|>1.0 and LFSR < 0.001 were defined as “spatiotemporal-specific sex-biased genes”. Owing to the small sample size from each brain region, we also performed a cross-region meta-analysis to identify region-specific sex-biased genes. Genes with |log2FC|>1.0 and LFSR < 0.001 were defined as region-specific sex-biased genes. Note that LFSR < 0.001 is much smaller than the normal LFSR threshold of 0.05 [31], which effectively reduces the occurrence of false positives. Moreover, |log2FC|>1.0 also reflects a large effect size.

We used |log2FC|>1.0 and FDR < 0.03 to identify “period-specific sex-biased genes” because age seems to be a more significant covariate of sex-biased gene expression than region and could make mashr generate a high false discovery rate. FDR = 0.03 (smaller than the popular threshold of 0.05) was used because the datasets from each period were much larger than those from each region. To identify “consistently sex-biased genes”, we (a) estimated a reasonable FDR threshold by searching the parameter space of “1.0 ≤|log2FC| ≤ 2.0 and 0.001 ≤ FDR ≤ 0.05” and (b) used Fisher’s exact test to ensure that the identified genes overlapped significantly with the union of region-, period-, and spatiotemporal-specific sex-biased genes. Genes with |log2FC|>1.0 with FDR < 0.045 were defined as consistently sex-biased genes.

Finally, the same LMM models and LFSR method were used to identify sex-biased genes in the macaque brain. Spatiotemporal-specific sex-biased genes were identified with thresholds of |log2FC|>1.0 and LFSR < 0.005 (Supplementary Table 11).

Identification of sex-related coexpression modules

Weighted gene coexpression network analysis (WGCNA) is a method and program widely used to analyze gene expression patterns across samples (especially small samples) [33]. Assuming that sex-related genes have correlated expression, we applied consensus network analysis via WGCNA to 16 sets and 4 sets of region- and period-specific sex-biased genes to detect sex-related coexpression modules shared across regions or periods (called consensus modules). Consensus coexpression network analysis (i.e., consensus module analysis) revealed the structural properties of the networks and modules. The following steps were used to detect period-specific sex-related modules (default parameters were used unless otherwise stated). First, we used the pickSoftThreshold function to determine the soft-thresholding power, and 9 was identified as the best threshold. Second, we used the TOMsimilarity function to calculate the topological overlap matrix (TOM) for each period and used the pmean function to calculate the consensus TOM across the four periods. Third, we used the cutreeDynamic function (minClusterSize = 80 and cutHeight = 0.995) to identify coexpression modules. Fourth, we used the multiSetME function to extract module eigengenes (MEs) from each module. Finally, a module was identified as a period-specific sex-related module if it contained > 5% sex-biased genes and met at least one of the two following criteria: (a) Fisher’s exact test indicated that for genes specific to the period, sex-biased genes were significantly more enriched in the module than non-sex-biased genes (P < 0.05), (b) linear regression analysis of module eigengenes via the following LMM model indicated that the coefficient of the “Sex” term was significant (P < 0.05 and R-squared > 0.4):

$$\:\text\sim_+_\text\text\text\text\text\text+_\text\text\text+_\text\text\text+_\text\text\text+\epsilon\:.$$

Here, R-squared is a statistical measure of fit that indicates how much variation in the dependent variable is explained by the independent variables. An R-squared of > 0.3 is assumed to be sufficient if there is extreme variability in the dataset; here, gene expression varies greatly across regions and sexes.

The same steps, parameters, and thresholds were used to detect region-specific sex-related modules.

Analysis of transcriptional regulation by sex hormones

Sex hormone receptors include estrogen receptor 1 (ESR1), estrogen receptor 2 (ESR2), and androgen receptor (AR). These receptors are also ligand-activated transcription factors. First, we examined the expression of AR, ESR1, and ESR2 in the human brain by using the gam function in the mgcv R package to fit their expression levels to developmental periods. Second, we validated their expression levels in brain regions using data from the GTEx project and publicly available scRNA-seq data [16]. Third, we identified these sex hormone receptors’ transcriptional target genes by using the CellOracle program to scan these sex hormone receptors’ DNA-binding sites (DBSs) in the promoter regions of genes in the module (threshold = 17, 1.5 kb upstream and downstream of the transcription start site, TSS) [34]. The DNA binding motifs of sex hormone receptors were extracted from the CIS-BP database (http://cisbp.ccbr.utoronto.ca/index.php) [35].

To examine whether lncRNA genes with sex-biased expression are regulated by sex hormone receptors, we used Fisher’s exact test (P < 0.05) to assess whether the promoter regions of lncRNA genes are enriched with DBSs of sex hormone receptors.

Analysis of transcriptional regulation by lncRNAs

On the basis of the WGCNA-identified coexpression modules, we jointly used the GENIE3 and LongTarget programs to analyze the transcriptional regulation of sex-biased genes by the lncRNAs in each module. Since some coexpression modules were very large, we first used the GENIE3 program to screen putative regulators (lncRNAs) and targets (sex-biased genes). GENIE3 predicts a regulatory network between regulators and targets using a tree ensemble-based gene network inference algorithm [36]. For genes in each module, GENIE3 identified a list of lncRNAs for each gene, with ranks indicating the probability that the former would regulate the latter. The likely regulatory relationships were identified on the basis of the following criteria: (a) the targets belonging to the top 30% for each lncRNA and (b) the lncRNAs belonging to the top 30% for each target, which has been popularly adopted by regulatory network analysis to ensure reliability [37].

Since lncRNAs can epigenetically regulate gene expression by binding to gene regulatory sequences (especially promoters) and recruiting DNA and histone modification enzymes to these binding sites to establish epigenetic modification markers, we predicted the DBS of lncRNAs in promoter regions (5 kb upstream and downstream of the TSS) of sex-biased genes using the LongTarget program [38, 39]. The likely regulatory relationship was defined as binding affinity greater than 60 (corresponding to DBS length > = 90 bp) (Supplementary Table 9).

Finally, we obtained the most likely regulatory relationships between the lncRNAs and sex-biased genes by integrating the results of GENIE3 and LongTarget. We also investigated whether sex-biased lncRNA genes are more likely to colocalize with sex-biased protein-coding genes than with non-sex-biased lncRNA genes and found that for most sex-biased genes, the likelihood is high (Fisher’s exact test, FDR < < 0.05) (Supplementary Table 10).

Gene set enrichment analysis

We performed overrepresentation analysis (ORA) using the enrichGO function in the ClusterProfiler package and the gProfiler program (the online version) and performed gene set enrichment analysis (GSEA) using the gseGO function in the ClusterProfiler package. The two kinds of analyses are called “gene set enrichment analysis” in the Results section but are called ORA and GSEA here to clearly describe their use.

First, to reveal the biological functions of spatiotemporal-specific sex-biased genes, we sorted genes according to log2FC values (female-biased genes have log2FC > 0, and male-biased genes have log2FC < 0). We then performed GSEA using the gseGO function and the Gene Ontology (GO) database [40]. GO terms with an FDR < 0.05 were considered significantly enriched. Enriched GO terms with a negative normalized enrichment score (NES) indicated enrichment of male-biased genes, and enriched GO terms with a positive NES indicated enrichment of female-biased genes (Supplementary Table 6).

Second, to reveal the functions of the consensus coexpression module of period-specific sex-biased genes and region-specific sex-biased genes (gene lists without weights), we performed ORA using the enrichGO function and the GO database (Supplementary Tables 7, 8). GO terms with p.adjust < 0.05 were considered significantly enriched. When identifying enriched GO terms, we also used the GO.db function in the R package to retrieve enriched GO term subtrees (subterms) in the GO database (Fig. 1D, E; Fig. 4B).

Fig. 1

Spatiotemporally specific expression of sex-biased genes in the human brain. (A) Numbers of sex-biased genes in the 16 regions and four periods. More genes show male-biased expression in all regions in the fetal and adult periods, and more lncRNA genes show male-biased expression in adulthood than in other periods. (B) Numbers of region-specific sex-biased genes in the fetal period. (C) Numbers of period-specific sex-biased genes in the STR region. (D) The enrichment of spatiotemporal-specific sex-biased genes in the subterms of the “neurogenesis” GO term. (E) The enrichment of spatiotemporal-specific sex-biased genes in terms of the “immune response” GO term. In (DE), the dot size indicates the number of subterms

The gProfiler program reports “driver terms in GO” [41]. We used it to identify the enriched GO terms of human-specific (in humans but not in macaques) sex-biased protein-coding genes (Fig. 4D), enriched GO terms of consensus coexpression modules of sex-biased genes in macaques (Supplementary Table 12), and enriched GO terms of sex-specific differentially expressed genes in the four brain diseases (Supplementary Table 14).

Identification of orthologous and species-specific genes in humans and macaques

We utilized the biomaRt R package to acquire orthologous genes in humans and macaques from the BioMart Ensembl website (https://mart.ensembl.org/index.html) [42]. A total of 22,887 one-to-one orthologous protein-coding genes were identified, but no orthologous lncRNA genes were found. Therefore, we manually examined homologous lncRNA genes. First, we extracted the coordinates of 14,709 human lncRNA genes from the GENCODE v21 annotation [43]. Second, we converted these coordinates from the human genome hg38 to the macaque genome rheMac10 using the LiftOver function on the UCSC Genome website (http://genome.ucsc.edu/cgi-bin/hgLiftOver). Third, we used the bedtools intersect function in the bedtools package to detect whether the transformed coordinates of human lncRNA genes overlapped with annotated lncRNAs in the macaque genome (https://hgdownload.soe.ucsc.edu/downloads.html). Finally, 1821 macaque lncRNA genes were identified as homologous to human lncRNA genes on the basis of the criterion of > 20% sequence overlap. We also examined whether human or macaque lncRNA genes are conserved in mammals or simian-specific (conserved in simians, including monkeys and apes) using the LongMan database [39].

A recent study identified orthologous genes in hundreds of placental mammals and birds (taking humans and mice as two references) [44]. We downloaded the orthologous genes between humans and macaques from the authors’ website (https://genome.senckenberg.de//download/TOGA/) and extracted the “many2zero” and “one2zero” genes, which exist exclusively in humans but not in macaques. There are no “many2zero” genes, and 185 “one2zero” genes.

Examination of sex differences in brain diseases

The TCGA and GTEx databases contain many RNA-seq datasets of brain tumors and normal brain tissues, respectively. The UCSC Xena website reprocessed GTEx and TCGA data by removing the batch effect (http://xena.ucsc.edu) [23]. The GEPIA2 website enables gene expression analysis of the TCGA and GTEx data at the transcript level and allows researchers to compare their data with those of the TCGA and GTEx samples (http://gepia2.cancer-pku.cn/) [24]. We used these resources to examine sex differences in brain disorders and brain tumors.

To examine the sex bias among patients with brain disorders and brain tumors, we utilized the limma-voom method (the voom function in the limma package) to identify differentially expressed genes (DEGs) in diseases (comparing gene expression in female patients and female controls and in male patients and male controls) in both females and males [28]. Genes with |log2FC|>2.0 and FDR < 0.05 were considered DEGs for GBM and LGG [45], and genes with FDR < 0.05 (|log2FC| for most genes are small) were considered DEGs for SCZ and ASD [46, 47].

We also obtained male- and female-specific DEGs by comparing gene expression differences between males and females (instead of between patients and controls) (Supplementary Table 13). Using the gProfiler program and the GO database (Benjamini‒Hochberg FDR < 0.05) [41], we applied gene set enrichment analysis to male- and female-specific DEGs.

Significance test, multiple testing correction, and effect size calculation

We used Fisher’s exact test to examine whether the difference between the two sexes was significant. We used the Benjamini‒Hochberg false discovery rate (FDR) with a threshold of 0.05 for multiple testing correction, especially in gene set enrichment analysis.

We used the effectsize R package to compute the effect size together with Fisher’s exact test. Because most Fisher exact tests examine whether the difference between sex-biased genes and non-sex-biased genes is significant, we used OR to measure the effect size. Most of the effectsize results agree with the Fisher exact test results, with large OA values in many cases (Figs. 3A and 5E; Supplementary Table 10; Supplementary Table 15).

Fig. 2

The regulation of sex-biased genes by sex hormones and lncRNAs. (A) AR gene expression in the 16 regions and four periods in males and females. (B) AR/ER expression ratios in the 16 regions during brain development in males and females. (C) Regulation of sex-biased genes enriched in “neurogenesis” (see Fig. 1D) by lncRNAs in the four periods. (D) Regulation of sex-biased genes enriched in the “immune response” (see Fig. 1E) by lncRNAs in the four periods

Fig. 3

Consensus coexpression modules of period- and region-specific genes. The colors with numbers indicate modules and gene numbers. (A) Modules of period-specific genes and enriched GO terms. Each module shows period-specific enrichment for sex-biased genes. For example, turquoise_1679 is enriched for sex-biased genes only in fetuses (FDR = 3.0E-56, OR = 6.32), and purple_337, cyan_175, pink_442, tan_229, brown_840, and lightcyan_153 are enriched for sex-biased genes in adulthood (FDR = 4.02E14, 0.0005, 2.71E44, 2.99E69, 2.93E168, 2.54E50, OR = 2.39, 1.78, 4.17, 15.11, 8.36, 18.01). This feature is reflected by links between periods and modules. (B) Modules of region-specific genes and the enriched GO terms

View original article

BIOLOGY OF SEX DIFFERENCES

分享书签

0 0 0 0 0 0 0

More from this channel

Human-specific protein-coding and lncRNA genes cast sex-biased genes in the brain and their relationships with brain diseases

留言 (0)