Pervasive tandem duplications and convergent evolution shape coral genomes

Coral material and species names

Porites evermanni, Porites lobata, and Pocillopora cf. effusa colonies were collected in French Polynesia by the CRIOBE at Moorea. The assembled genome of Pocillopora was assigned to SVD1 lineage based on genome-wide SNPs (Voolstra et al. in revision). The SVD1 lineage corresponds to GSH01 [94] and carries the mtORF marker type 2/11. Therefore, this genome is Pocillopora cf. effusa. Fifty grams of each coral colony was flash frozen using liquid nitrogen and stored at − 80 °C until further use. In addition, a small fragment of Porites lobata was placed in 2 ml of Lysing Matrix A beads (MP Biomedicals, Santa Ana, CA, USA) in presence of 1.5 ml of DNA/RNA shield (Zymo Research, Irvine, CA, USA) and stored at − 20 °C until RNA purification.

DNA extraction

DNA extractions from flash frozen tissues were based on a nuclei isolation approach to minimize contamination with Symbiodiniaceae DNA. Briefly, cells were harvested using a Waterpik in 50 ml of 0.2 MEDTA solution refrigerated at 4 °C. Extracts were passed sequentially through a 100-µm then a 40-µm cell strainer (Falcon) to eliminate most of the Symbiodiniaceae. Then extracts were centrifuged at 2000 g for 10 min at 4 °C and the supernatants were discarded. Two different protocols of DNA purification were used as follows. For Porites evermanni and Pocillopora cf. effusa, the resulting pellets were homogenized in lysis buffer (G2) of the Qiagen Genomic DNA isolation kit (Qiagen, Hilden, Germany). The DNA were then purified following the manufacturer’s instructions using genomic tip 100/G. For Porites lobata, the pellet was homogenized in CTAB lysis buffer followed by a Chloroform/isoamyl alcohol purification.

RNA extraction

Total RNA of Pocillopora cf. effusa was extracted from flash frozen tissue. For Porites lobata, the fragment placed in 2 ml of Lysing Matrix A beads (MP Biomedicals, Santa Ana, CA, USA) in presence of 1.5 ml of DNA/RNA shield (Zymo Research, Irvine, CA, USA) was thawed and disrupted by the simultaneous multidirectional striking using a high-speed homogenizer FastPrep-24 5G Instrument (MP Biomedicals, Santa Ana, CA, USA) (following conditions: speed 6.0 m/s, time 30 s, pause time 60 s, cycles 3). Total RNA was then purified from one aliquot of 500 µl of homogenized suspension, following the instruction of the commercial Quick-DNA/RNA Kit (Zymo Research, Irvine, CA, USA).

Approximately 5 μg of total purified RNA was then treated with the Turbo DNA-free kit (Thermo Fisher Scientific, Waltham, MA, USA), according to the manufacturer’s protocol. Quantity was assessed on a Qubit 2.0 Fluorometer using a Qubit RNA HS Assay kit, and the quality was checked by capillary electrophoresis on an Agilent Bioanalyzer using the RNA 6000 Pico LabChip kit (Agilent Technologies, Santa Clara, CA, USA). Purified RNA was stored at − 80 °C until further use.

Illumina library preparation and sequencing

Genome sequencing of Porites evermanni was performed using Illumina reads from both paired-end and mate-pair (MP) libraries of different insert sizes. The paired-end library was prepared using the NEBNext DNA Modules Products (New England Biolabs, MA, USA) with a “on beads” protocol developed at the Genoscope, as previously described [95] in Alberti et al. The library was quantified by qPCR (MxPro, Agilent Technologies) using the KAPA Library Quantification Kit for Illumina Libraries (Roche), and its profile was assessed using a DNA High Sensitivity LabChip kit on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The library was paired-end sequenced on the Illumina HiSeq2500 (Illumina, USA) sequencing platform (2 × 251 bp). The MP libraries were prepared using the Nextera Mate Pair Sample Preparation Kit (Illumina, San Diego, CA). Briefly, genomic DNA (4 µg) was simultaneously enzymatically fragmented and tagged with a biotinylated adaptor. Tagmented fragments were size-selected (3–5; 5–8; 8–11 and 11–15 kb) through regular gel electrophoresis, and circularized overnight with a ligase. Linear, non-circularized fragments were digested and circularized DNA was fragmented to 300–1000-bp size range using Covaris E220. Biotinylated DNA was immobilized on streptavidin beads, end-repaired, and 3′-adenylated. Subsequently, Illumina adapters were ligated. DNA fragments were PCR-amplified using Illumina adapter-specific primers and then purified. Finally, libraries were quantified by qPCR and library profiles were evaluated using an Agilent 2100 Bioanalyzer. Each library was sequenced using 151 base-length read chemistry on a paired-end flow cell on the Illumina HiSeq4000 sequencing platform (Illumina, USA).

The genomes of Porites lobata and Pocillopora cf. effusa were sequenced using a combined approach of short and long reads. The short reads were obtained by preparing Illumina PCR-free libraries using the Kapa Hyper Prep Kit (Roche). Briefly, DNA (1.5 μg) was sonicated to a 100- to 1500-bp size range using a Covaris E220 sonicator (Covaris, Woburn, MA, USA). Fragments were end-repaired, 3′-adenylated, and Illumina adapters were ligated according to the manufacturer’s instructions. Ligation products were purified with AMPure XP beads (Beckman Coulter Genomics, Danvers, MA, USA) and quantified by qPCR. Library profiles were assessed using an Agilent High Sensitivity DNA kit on the Agilent 2100 Bioanalyzer. Libraries were paired-end sequenced on an Illumina HiSeq2500 instrument (Illumina, San Diego, CA, USA) using 251 base-length read chemistry.

Illumina RNA-seq libraries were prepared for both Porites lobata and Pocillopora cf. effusa using the TruSeq Stranded mRNA kit (Illumina, San Diego, CA, USA), according to the manufacturer’s protocol starting with 500 ng total RNA. Briefly, poly(A) + RNA were selected with oligo(dT) beads, chemically fragmented and converted into single-stranded cDNA using random hexamer priming. Then, the second strand was generated to create double-stranded cDNA. The resulting cDNAs were subjected to A-tailing, adapter ligation, and PCR-amplification. Ready-to-sequence Illumina libraries were then quantified by qPCR and library profiles evaluated with an Agilent 2100 Bioanalyzer. Each library was sequenced using 151-bp paired-end reads chemistry on a HiSeq4000 Illumina sequencer. Short Illumina reads were bioinformatically post-processed sensu Alberti et al. [95] to filter out low-quality data. First, low-quality nucleotides (Q < 20) were discarded from both read ends. Then remaining Illumina sequencing adapters and primer sequences were removed and only reads ≥ 30 nucleotides were retained. These filtering steps were done using in-house-designed software based on the FastX package [96]. Finally, read pairs mapping to the phage phiX genome were identified and discarded using SOAP aligner [97] (default parameters) and the Enterobacteria phage PhiX174 reference sequence (GenBank: NC_001422.1).

MinION and PromethION library preparation and sequencing

Genomic DNA fragments of Pocillopora cf. effusa ranging from 20 to 80 kb were first selected using the Blue Pippin (Sage Science, Beverly, MA, USA) then repaired and 3′-adenylated with the NEBNext FFPE DNA Repair Mix and the NEBNext® Ultra™ II End Repair/dA-Tailing Module (New England Biolabs, Ipswich, MA, USA). Sequencing adapters provided by Oxford Nanopore Technologies (Oxford Nanopore Technologies Ltd, Oxford, UK) were then ligated using the NEBNext Quick Ligation Module (NEB). After purification with AMPure XP beads, the library was mixed with the Sequencing Buffer (ONT) and the Loading Beads (ONT). For Pocillopora cf. effusa, a first library was prepared using the Oxford Nanopore SQK-LSK108 kit and loaded onto a R9.5 MinION Mk1b flow cell. Three other libraries were prepared using the same kit and following the same protocol, but without the size selection. They were loaded onto R9.4.1 MinION Mk1b flow cells. An additional library was prepared using the Oxford Nanopore SQK-LSK109 kit and loaded onto a R9.4.1 PromethION flow cell. Reads were basecalled using Albacore version 2. For Porites lobata, eight libraries were prepared using the Oxford Nanopore SQK-LSK109 kit. Four libraries were loaded onto R9.4.1 MinION Mk1b flow cells and the other four onto PromethION R9.4.1 flow cells. Reads were basecalled using Guppy version 2. In both cases, the resulting raw nanopore long reads were directly used for the genome assembly.

Short-read-based genome assembly

The genome of Porites evermanni was assembled with megahit [98] using the filtered high-quality Illumina technology paired-end reads from shotgun libraries (Additional file 1: Table S1). The resulting assembly was scaffolded using mate-pair libraries and SSpace [99] and gap-filled with gapcloser [100] and the paired-end reads. This process generated an assembly of 604 Mb composed of 8186 scaffolds with an N50 of 171 Kb (Table 1).

Long-read-based genome assemblies

To generate long-read-based genome assemblies, we generated three samples of reads: (i) all reads, (ii) 30X coverage of the longest reads, and (iii) 30X coverage of the filtlong [101] highest-score reads that were used as input data for four different assemblers, Smartdenovo [102], Redbean [103], Flye [104], and Ra (Additional file 1: Table S2). Smartdenovo was launched with -k 17 and -c 1 to generate a consensus sequence. Redbean was launched with “-xont -X5000 -g450m” and Flye with “-g 450 m.” The resulting assemblies were evaluated based on the cumulative size and contiguity, with the Smartdenovo and all read combination producing the best assembly. This assembly was polished three times using Racon [105] with Nanopore reads, and two times with Hapo-G [106] and Illumina PCR-free reads.

Reconstruction of allelic relationships and haploid assembly

The cumulative size of both assemblies was higher than expected due to the high heterozygosity rate (Additional file 1: Figure S1), but also the presence of several organisms that can bias kmer distributions. Here, we found a few contigs that correspond to the mitochondrial genome of symbiotic algae (section “ Contamination removal”). As indicated by BUSCO [107] and KAT [108], we observed the two alleles for many genes and a significant proportion of homozygous kmers were present twice in the assembly. We used Haplomerger2 with default parameters and generated a haploid version of the two assemblies. Haplomerger2 detected allelic duplications through all-against-all alignments and chose for each alignment the longest genomic regions (parameter –selectLongHaplotype), which may generate haplotype switches but ensure to maximize the gene content. We obtained two haplotypes for each genome: a reference version composed of the longer haplotype (when two haplotypes are available for a genomic locus) and a second version, named alternative, with the corresponding other allele of each genomic locus. Consistently keeping the longest allele in the reference haplotype explains the larger size of the reference assembly. As an example for Porites lobata, assemblies had a cumulative size of 642 and 588 Mb for the reference and the alternative assembly version, respectively (Additional file 1: Table S3). At the end of the process, the P. lobata and P. cf. effusa haploid assemblies have a cumulative size of 650 and 350 Mb respectively, closer to the expected ones. BUSCO and KAT analysis showed a reduction of the allelic duplications (Additional file 1: Table S5, Additional file 1: Figure S2 and Additional file 1: Figure S3). Final assemblies were polished one last time with Hapo-G [106] and Illumina short reads to ensure that no allelic regions present twice in the diploid assembly have remained unpolished.

Additionally, we compared our haploid genome assemblies with the one obtained using the Purge Dups tool [51]. We sampled 50X of the longest Nanopore reads and launched Purge Dups on the raw Nanopore assemblies with default parameters (Additional file 1: Table S3). For Porites lobata, we obtained an assembly of 588 Mb composed of 1057 contigs. By comparison, the Pocillopora cf. effusa raw assembly was not altered by Purge Dups while the coverage thresholds were consistent. This may be due to the lower proportion of haplotypic duplications (Additional file 1: Figure S4). BUSCO scores and kmer distributions (Additional file 1: Table S5 and Additional file 1: Figure S4) were very similar for both Haplomerger2 and Purge Dups assemblies for Porites lobata which is a confirmation of the great work performed by haplomerger2.

Contamination removal

As we used a DNA extraction method based on a nuclei isolation approach, we minimized contamination with DNA from other organisms (most notably, Symbiodiniacaeae). We predicted coding fragments on both assemblies using metagene and aligned their corresponding protein sequences against the nr database. Contigs were classified based on their hits, and contigs with more than 50% of genes having a best hit to bacteria, archaea, or viruses were classified as non-eukaryotic and filtered out. In addition, contigs taxonomically assigned to Symbiodiniaceae were also filtered out. In the Porites lobata assembly, we filtered 38 contigs with an average size of 25 kb and totaling 961 kb, while in the Pocillopora cf. effusa assembly, only two contigs of 25 kb and 30 Kb were filtered out.

Transcriptome assembly

First, ribosomal RNA-like reads were detected using SortMeRNA (Kopylova et al., 2012) and filtered out. Illumina RNA-Seq short reads from Porites lobata and Pocillopora cf. effusa were assembled using Velvet [109] 1.2.07 and Oases [110] 0.2.08, using a kmer size of 89 and 81 bp respectively. Reads were mapped back to the contigs with BWA-mem, and only consistent paired-end reads were kept. Uncovered regions were detected and used to identify chimeric contigs. In addition, open reading frames (ORF) and domains were searched using respectively TransDecoder and CDDsearch. Contigs were broken in uncovered regions outside ORF and domains. At the end, the read strand information was used to correctly orient RNA-Seq contigs.

Repeat detection

Libraries of genomic repeats were first detected using RepeatModeler (v.2.0.1, default parameters) on both genomes. Then, these libraries were annotated with RepeatMasker [111] (v.4.1.0, default parameters) and RepBase (from RepeatMasker v4.0.5). Finally, the P. lobata and P. cf. effusa genomes were masked using their respective libraries. The numbers of bases were counted according to the classification of repeat overlapping each base. In case of overlapping repeated fragments, the longest annotated one was selected.

Gene prediction

Gene prediction was done using proteins from 18 Cnidarian species, Acropora digitifera, Acropora millepora, Aiptasia, Aurelia aurita from Atlantic, Aurelia aurita from Pacific, Clytia hemisphaerica, Fungia sp., Galaxea fascicularis, Goniastrea aspera, Hydra vulgaris, Montipora capitata, Morbakka virulenta, Nematostella vectensis, Orbicella faveolata, Pocillopora damicornis, Porites lutea, Porites rus, and Stylophora pistillata (Additional file 1: Table S5).

The proteomes were aligned against Porites lobata and Pocillopora cf. effusa genome assemblies in two steps. Firstly, BLAT [112] (default parameters) was used to quickly localize corresponding putative genes of the proteins on the genome. The best match and matches with a score ≥ 90% of the best match score were retained. Secondly, the alignments were refined using Genewise [113] (default parameters), which is more precise for intron/exon boundary detection. Alignments were kept if more than 50% of the length of the protein was aligned to the genome. In order to reduce mapping noise, for each proteome mapping alignments without introns are removed if they represent more than 40% of the number of alignments. Moreover, alignments containing at least one unique intron (i.e. intron detected using only one proteome alignements) are removed if they cover at least 10 exons detected in all alignments using all proteomes.

To allow the detection of expressed and/or specific genes, we also aligned the assembled transcriptomes of each species on their respective genome assembly using BLAT [112] (default parameters). For each transcript, the best match was selected based on the alignment score. Finally, alignments were recomputed in the previously identified genomic regions by Est2Genome [114] in order to define precisely intron boundaries. Alignments were kept if more than 80% of the length of the transcript was aligned to the genome with a minimal identity percent of 95%.

To proceed to the gene prediction for both species, we integrated the protein homologies and transcript mapping using a combiner called Gmove [115]. This tool can find CDSs based on genome located evidence without any calibration step. Briefly, putative exons and introns, extracted from the alignments, were used to build a simplified graph by removing redundancies. Then, Gmove extracted all paths from the graph and searched for open reading frames (ORFs) consistent with the protein evidence. Single-exon genes with a CDS length smaller or equal to 100 amino acids were filtered out. From the remaining genes, only genes with homologies against more than one species (Diamond [116] v0.9.24, blastp, e-value ≤ 10−10) or spliced genes with a ratio CDS length / UTR length greater or equal to 0.75 were kept. Then, putative transposable elements (TEs) set were removed from the predicted gene using three different approaches: (i) genes that contain a TE domain from Pfam [117]; (ii) transposon-like genes detected using TransposonPSI (http://transposonpsi.sourceforge.net/, default parameters); (iii) and genes overlapping repetitive elements detected using RepeatMasker [111] and RepeatModeler [118] (v2.0.1, default parameters, repetitive sequence detected by RepeatModeler were annotated using RepeatMasker) on the genome assembly. Also, InterProScan [119] (v5.41–78.0, default parameters) was used to detect conserved protein domains in predicted genes. So, predicted genes without conserved domain covered by at least 90% of their cumulative exonic length, or matching TransposonPSI criteria or selected Pfam domains, were removed from the gene set.

Completeness of the gene catalogs was assessed using BUSCO [107] version 4.0.2 (eukaryota dataset odb10 and default parameters).

The genes of Pocillopora cf. effusa were previously named using the prefix Pmea because we first thought the sampled species was Pocillopora meandrina, but recently discovered it was Pocillopora effusa instead.

Gene prediction of alternative haplotypes

Gene prediction of the alternative haplotypes was done using proteins annotated on the reference haplotypes. Proteins were aligned as described previously, using BLAT and Genewise aligners with the same parameters. These alignments were integrated using Gmove as described previously. Additional file 1: Table S4 reports gene catalog statistics for reference and alternative haplotypes. Alignments between genes from the reference and alternative assemblies were computed using DIAMOND and only genes with best reciprocal hits were considered as allelic copies.

Adaptation of gene prediction workflow for tandemly duplicated genes

The presence of highly conserved genes in the same genomic regions can hinder the gene prediction, mostly if based on the alignments of conserved proteins. Indeed, during the spliced alignment step, individual exons of a given protein sequence can be distributed over several genes. Therefore, in these specific genomic regions, alignments of proteins or RNA-Seq data generally span several genes (with larger introns), which lead to the prediction of chimeric genes and the underestimation of the gene number (Additional file 1: Figure S10). To avoid these gene fusions, we added a step in our workflow. Namely, large introns of spliced alignments obtained with BLAT were post-processed. For each intron having a size greater than 5 kb, the corresponding alignment was splitted in two inside the intron, and the query sequence was realigned on the two new genomic regions. If the sum of the two alignment scores was greater than the score of the previous alignment, then the two new alignments were kept in place of the alignment that contained the large intron. This process was recursively applied until the sum of the two alignment scores did not satisfy the previous condition. At the end, alignments were refined (with dedicated alignment tools) as described in the Gene Prediction section.

Telomeric sequences

The interstitial telomeric sequences (ITS) were specifically searched as follows. First, the motif (TTAGGG)4 was searched in the coral genomes using blastn92. The blast result was filtered to keep hits with an identity percentage above 75%, a minimum coverage of 75% and two mismatches maximum were allowed. Distant hits of less than 400 bp were gathered to form a single ITS. The 188-bp satellite sequence was searched in the different coral genomes and in the NT database using Blastn. All the hits had an e-value < 1e − 13 and an identity percentage > 80%. Then, the matching sequences were used to build a HMM profile, using hmmbuild from HMMER suite [120] (v3.3).

Detection of tandemly duplicated genes

Protein sets of Porites lobata, Pocillopora cf. effusa, Porites evermanni, Acropora millepora, Acropora digitifera, Montipora capitata, Galaxea fascicularis, Porites lutea, Pocillopora verrucosa, Pocillopora damicornis, Stylophora pistillata, Goniastrea aspera, and Orbicella faveolata (see references in “ Orthogroups and orthologous genes” section) were aligned against themselves using Diamond [121] (v0.9.24). Only matches with an e-value ≤ 10−20 and 80% of the smallest protein aligned were kept. Two genes were considered as tandemly duplicated if they were co-localized on the same genomic contig and not distant from more than 10 genes to each other. Then, all tandemly duplicated genes were clustered using a single linkage clustering approach.

Validation of tandemly duplicated genes

We validated the structure of the clusters of TDG, by comparing their overlap with Nanopore long reads. Considering a cluster with three tandemly duplicated genes A, B, and C, we first analyzed the two pairs of adjacent genes A/B and B/C. If at least one Nanopore read completely overlaps genes A and B, we classify the pair A/B as validated. Secondly, we analyzed the whole cluster, in our example, the cluster is validated if at least one Nanopore read overlaps the three genes A, B, and C (Additional file 1: Figure S6). Nanopore reads were mapped using minimap2 and following parameters “-t 36 –sam-hit-only -a -x map-ont” and secondary alignments were filtered using “-F 2308” from samtools. Overlaps between reads and gene positions were computed using bedtool intersect (Additional file 1: Table S9, Additional file 1: Figure S7 and Additional file 1: Figure S8).

Functional assignment of predicted genes

The derived proteins of Porites lobata and Pocillopora cf. effusa predicted genes were functionally assigned by aligning them against nr from the BLAST Databases distributed by NCBI (version 25/10/2019) using Diamond [121] (v0.9.24, e-value ≤ 10−5). Then, for each predicted protein, the best match based on bitscore against RefSeq proteins is selected. If there is no match against RefSeq proteins, then the best match is kept from other matches.

Orthogroups and orthologous genes

First, we selected the proteins of 25 Cnidarian species: Acropora millepora, Acropora digitifera, Aiptasia sp., Amplexidiscus fenestrafer, Aurelia aurita from Pacific, Aurelia aurita from Atlantic, Clytia hemisphaerica, Dendronephthya gigantea, Discosoma sp., Fungia sp., Galaxea fascicularis, Goniastrea aspera, Hydra vulgaris, Montipora capitata, Morbakka virulenta, Nematostella vectensis, Orbicella faveolata, Pocillopora damicornis, Pocillopora cf. effusa, Pocillopora verrucosa, Porites evermanni, Porites lobata, Porites lutea, Porites rus, and Stylophora pistillata (Additional file 1: Table S5). Based on quality metrics, we excluded Pocillopora acuta because its number of annotated genes was higher (Fig. 1D) than expected based on comparison to other corals and only a small proportion contained domains (Fig. 1E). The proteomes were aligned against each other using DIAMOND [121] (v0.9.24, e-value ≤ 10−10, -k 0). Matches were kept only if 50% of the smallest protein length of each pair is aligned. Then, orthogroups (OG) and orthologous genes were built with OrthoFinder [122] (v2.3.11, default parameters). Additionally, OrthoFinder built gene trees for each OG and used them to reconstruct a rooted species tree, that is in agreement with the currently accepted phylogeny of cnidarians (Figs. 1A and 4B). At this stage, we noticed that Acropora digitifera and Montipora capitata datasets were of lower quality and decided to exclude them from subsequent analyses.

For each orthogroup, we listed the 5 most abundant domains detected with InterProScan on proteins from 25 Cnidarian species (Additional file 10), the 5 most abundant BLASTP hits on nrprot for Pocillopora cf. effusa proteins (Additional file 11) and the 5 most abundant BLASTP hits for Porites lobata proteins (Additional file 12). We inspected these lists manually to assign the most likely function for the 192 orthogroups amplified in corals (Additional file 2).

Coral synteny

For each genome comparison, OG were used to build syntenic clusters using orthodotter (https://www.genoscope.cns.fr/orthodotter/). Only genomic contigs containing at least 5 genes with orthologs are selected. Co-localized orthologous genes less than 15 other orthologous genes apart are considered as belonging to the same syntenic cluster. A cluster has to be formed by at least 5 syntenic genes. Dotplots for the analysis of synteny in coral genomes were built using orthodotter, and circular views of syntenic regions were generated using Circos [123].

OG consensus construction

For each of the 27,826 orthogroups (OG) containing at least one coral species, the following steps were applied. Coral proteic sequences were extracted and aligned using Muscle [124] (version 3.8.1551, default parameters). Then, the multiple alignment was filtered using OD-Seq [125] (version 1.0) to remove outlier sequences, with parameter –score—i.e., threshold for outliers in numbers of standard deviations—set to 1.5. For large gene families, where the consensus obtained contained a high proportion of gaps—i.e., where (consensus length − median length of sequences in the input) > 15% of median length of sequences, a second run of OD-Seq was performed. After the filtering of outlier protein sequences, the consensus was extracted from the multiple alignment using hmmemit in the HMMER3 package [120] (version 3.1b1, default parameters).

Then, OG consensus were aligned against each other using Blat [112] (version 36, default parameters), in order to detect unspecific regions, i.e., regions of 30 amino acids having a hit (with ≥ 85% identity) with another consensus (they typically correspond to common domains). The threshold of 85% was chosen because it corresponds to the average %identity observed when mapping reads from each coral species on consensus proteic sequences. Two thousand thirty-nine OG that contain unspecific domains were tagged. Additionally, transposon-like domains were looked for in the consensus and interproscan outputs from all genes in each OG, and 577 OG that were likely to correspond to transposable elements were also tagged. OG tagged as TE or unspecific domain-containing were not used for subsequent analyses.

Construction of gene trees

Orthofinder provides gene trees for each OG, containing the 25 cnidarian species that were used as input. We also needed to generate trees for various subsets of species (for instance, only Pocillopora cf. effusa, only Porites lobata, only 11 coral species…). For each set of species needed, we extracted the corresponding proteic sequences after removal of outlier sequences (as described in “ OG consensus construction” paragraph), and aligned them with MAFFT [126] (v7.464). Then we used FastTree [127] (2.1.11) with default parameters for construction of approximately maximum-likelihood phylogenetic trees. Trees were edited and visualized using Itol [128] and R ggtree package [129].

Calculation of Ks between P. lobata and P. cf. effusa gene pairs

Ks (rates of synonymous substitutions) were respectively calculated between pairs of P. lobata and P. cf. effusa paralogous genes after aligning protein sequences with muscle [124] (default parameters), and generating codon alignments with pal2nal [130] (V13). Then codeml (PAML package [131] version 4.8) was used to calculate dS values (i.e., Ks). The same procedure was used to calculate Ks between the 2 allelic versions of the codings sequence (CDS) of each protein (Best Reciprocal Hits between haplotype 1 and haplotype 2) in both species.

Mapping of short reads on OG consensus to estimate gene copy numbers

In order to estimate gene family copy number independently from assembly and annotation processes, we downloaded short-read datasets (Illumina paired-end) for 14 coral and 4 sea anemone species (Additional file 1: Table S7). We extracted 50 million sequences from pair1 files, trimmed to 100nt for consistency between analyses, and mapped those using diamond [121] on orthogroup coral consensus sequences. Unique hits were retained for each read, and depth of coverage was calculated on each consensus OG (25,210 OG after filtering TE and unspecific regions). Then the depth obtained for each OG was normalized for each species by dividing by the depth obtained on a set of conserved single-copy genes, in order for the final value obtained to be representative of the gene copy number. Indeed, the ratio obtained for single-copy genes is close to 1 (Additional file 1: Figure S32A).

Identification of a set of single-copy genes present in all corals

In order to normalize depths of mapping of short reads on each OG consensus, we needed a set of single-copy genes. We made a first attempt with BUSCO [107] version 4.0.2 (metazoa odb10 ancestral genes: 877 among the 954 consensus sequences were used, after discarding the ancestral genes that were not present in at least 10 coral species) but due to the divergence of corals with BUSCO metazoa ancestral sequences, few reads could be mapped and the depths obtained were low. Alternatively, we generated a set of genes that are present in exactly 1 copy in at least 13 coral species (among the 14 species listed in Additional file 1: Table S5). This conservative threshold avoids discarding genes that may have been missed or duplicated in the annotation of only one of the 14 genomes. The coral monocopy gene set contains 705 genes after removing OG with dubious transposon-related domains. For each species, after discarding a few outlier OG (with unexpectedly high or low coverages), we calculated the overall depth on the remaining single-copy OG. The values obtained were used to normalize the depths obtained by mapping the reads on each consensus OG. As expected, depths obtained after normalization on the set of 705 coral monocopy genes are tightly grouped around a value of 1. Contrastively, depths obtained on the same set of OG normalized with BUSCO metazoa are higher than 1 (due to the low coverage of mapping on BUSCO consensus), more heterogeneous between species (probably reflecting their distance to the consensus), and more variable inside species (Additional file 1: Figure S32). The set of 705 coral monocopy OG consensus could be used for other applications in coral comparative genomics. The consensus sequences are available in Additional file 13.

Detection of amplified/reduced gene families between clades

To detect gene families with significantly expanded/reduced gene numbers between corals and sea anemones we calculated, for each OG (total = 25,210), the total number of genes in the groups of species to compare (11 coral species: Fungia sp., Orbicella faveolata, Goniastrea aspera, Stylophora pistillata, Pocillopora cf. effusa, Pocillopora verrucosa, Porites evermanni, Porites lutea, Porites lobata, Galaxea fascicularis, Acropora millepora) vs 3 non-coral hexacorallia (Aiptasia sp, Amplexidiscus fenestrafer, Discosoma sp). We then performed a binomial test with a parameter of 11/14 to identify OG that display a proportion of coral genes that is significantly different from the expected proportion (11/14, under the null hypothesis where there are equal gene numbers in all species). Benjamini–Hochberg FDR correction for multiple testing was then applied and we selected the OG with corrected P-values < 0.001. The same procedure was performed to compare corals within the complex and robust clades, keeping only one species per genus (3 complex species: Porites lobata, Galaxea fascicularis, Acropora millepora, vs 5 robust species: Fungia sp, Orbicella faveolata, Goniastrea aspera, Stylophora pistillata, Pocillopora cf. effusa) with a parameter of 3/8 for the binomial test. Finally, to compare 4 massive (Orbicella faveolata, Goniastrea aspera, Porites lobata, Galaxea fascicularis) and 3 branched (Acropora millepora, Pocillopora cf. effusa, Stylophora pistillata) corals species, we used a parameter of 4/7. All calculations were performed with R (version 4.1.0).

History reconstruction of gene family copy number variations

The CAFE5 software [58] was used to estimate gene copy numbers in internal nodes of the phylogenetic tree and identify branches with significant amplification or reductions of gene families. It was applied on the OG detected as amplified in corals in comparison to sea anemones, on the 11 corals and 3 sea anemone species. Among the 192 OG detected, 120 were present in the common ancestor of all species and could thus be analyzed with CAFE, but one (OG0000004) was removed because CAFE failed to identify parameters, probably because of the very large number of genes (n = 3038). After testing several sets of parameters, the following parameters were used: -p (poisson distribution for the root frequency distribution) -k 2 (number of gamma rate categories to use). NB: using the parameter -e (to estimate the global error m

留言 (0)

沒有登入
gif