A novel approach to exploring the dark genome and its application to mapping of the vertebrate virus fossil record

A database-integrated approach to exploring the dark genome

We developed a robust, database-integrated approach to systematic in silico genome screening, referred to as database-integrated genome screening (DIGS). This approach integrates a similarity search-based screening pipeline with a relational database management system (RDBMS) to enable efficient exploration of the dark genome. The rationale for this integration is twofold: it not only provides a solid foundation for conducting large-scale, automated screens in an efficient and non-redundant manner but also allows for the structured querying of screening output using SQL, a powerful and well-established tool for database interrogation [41]. Additionally, an RDBMS offers advantages such as data recoverability, multi-user support, and networked data access.

The DIGS process comprises three key input data components:

Target database (TDb): A collection of whole genome sequence assemblies (or other large sequence datasets such as transcriptomes) that will serve as the target for sequence similarity searches.

Query sequences (Probes): A set of sequences to be used as input for similarity searches of the TDb.

Reference sequence library (RSL): The RSL represents the broad range of genetic diversity associated with the genome feature(s) under investigation. Its composition varies according to the analysis context (see Table 1). It should always include sequences representing diversity within the genome feature under investigation. It may also include genetic marker sequences and potentially cross-matching genome features. Probes are typically a subset of sequences contained in the RSL.

As illustrated in Fig. 2, the DIGS process involves systematic searching of a user-defined TDb with user-defined probes, merging fragmented hits, and classifying merged sequences through BLAST-based comparison to the RSL. The output—a set of non-redundant, defragmented hits—is captured in a project-specific relational database. Importantly, this integration allows database queries to be employed in real time, with SQL queries referencing any information captured by the database schema. SQL-based querying of screening databases facilitates the identification of loci of interest, which can then be explored further using comparative approaches (see Fig. 1b).

Fig. 2figure 2

The database-integrated genome screening (DIGS) process as implemented in the DIGS tool. (i) Screening. a On initiation of screening a list of searches, composed of each query sequence versus each target database (TDb) file is composed based on the probe and TDb paths supplied to the DIGS program. Subsequently, screening proceeds systematically as follows: b the status table of the project-associated screening database is queried to determine which searches have yet to be performed. if there are no outstanding searches then screening ends, otherwise it proceeds to step b wherein the next outstanding search of the TDb is performed using the selected probe and the appropriate BLAST+ program. Results are recorded in the data processing table (“active set”); c results in the processing table are compared to those (if any) obtained previously to derive a non-redundant set of non-overlapping loci, and an updated set of non-redundant hits is created, with each hit being represented by a single results table row. To create this non-redundant set, hits that overlap, or occur within a given range of one another, are merged to create a single entry. d Nucleotide sequences associated with results table rows are extracted from TDb files and stored in the results table; e extracted sequences are classified via BLAST-based comparison to the RSL using the appropriate BLAST program. f The header-encoded details of the best-matching sequence (species name, gene name) are recorded in the results table. g The status table is updated to create a record of the search having been performed, and the next round of screening is initiated. (ii) Reclassification: hits in the results table can be reclassified following an update to the reference sequence library

It is important to note that screening is usually an iterative discovery process, wherein initial results inform the development of subsequent screens. For instance, novel diversity detected by an initial screen can subsequently be incorporated into the RSL and hits within the screening database can be reclassified using the updated library (Fig. 2). Additionally, probe sets used in initial searches can be expanded to incorporate sequences identified during screening, broadening the range of sequences detected in subsequent screens [42]. However, care must be taken when using this approach, since it can potentially produce misleading results, or generate excessive hits (e.g., if highly repetitive sequences are contained within the new probes). Database integration allows screening results to be observed and interrogated in real time—as they are being generated. This means that configuration issues (e.g., badly composed RSL, inappropriate choice of probes) can be detected early on—potentially saving a significant amount of time and effort. Furthermore, it facilitates the implementation of agile, heuristic screening strategies, in which approaches are adjusted in line with results.

An open software framework for implementing DIGS

We constructed a software framework for implementing DIGS, called “the DIGS tool”. The DIGS tool is implemented using the PERL scripting language. It uses the BLAST + program suite [24] to perform similarity searches and the MySQL RDBMS (to capture their output). Accessible through a text-based console interface, it simplifies the complex process of large-scale genome screening and provides a versatile basis for implementing screens.

To initiate screening using the DIGS tool, researchers provide a project-specific command file (Additional file 1: Fig. S1) that serves as the blueprint for the screening process. This command file specifies parameters for BLAST searches, the user-defined name of the screening database, and file paths to the TDb, RSL, and probe sequences. When a screen is initiated, a project-specific database is created. This core schema (Additional file 2: Fig. S2) can subsequently be extended to include any relevant “side data”—e.g., taxonomic information related to the species and sequences included in the screen—increasing the power of SQL queries to reveal informative patterns (Additional file 3: Fig. S3).

Systematic screening proceeds automatically until all searches have been completed. If the process is interrupted at any point, or if novel probe/target sequences are incorporated into the project, screening will proceed in a non-redundant way on restarting. Thus, screening projects can expand as required to incorporate new TDb files (e.g., recently published WGS assemblies) or novel probe/reference sequences. The DIGS tool console allows reclassification of sequences held in the results table (e.g., following an RSL update). To increase efficiency, this process can be tailored to specific subsets of database sequences by supplying SQL constraints via the DIGS tool console.

BLAST algorithms emphasize local similarity and consequently tend to fragment contiguous matches into several separate hits if similarity across some internal regions of the match is low. The DIGS tool allows screening pipelines to be configured with respect to how overlapping/adjacent hits are handled, so that the screening process can be tailored to the specific needs of diverse projects. The DIGS tool also provides a “consolidation” function that concatenates, rather than merges, adjacent hits and stores concatenated results, along with information about their structure, in a new screening database table.

For program validation, we mined mammalian genomes for sequences disclosing similarity to the antiviral restriction factor tetherin [43, 44]. Tetherin provides a useful test case as it is a relatively distinctive gene and its evolution, distribution and diversity have previously been examined [43, 44]. Results were compared with those provided by two alternative, widely used genome mining pipelines: OrthoDB [45] and Ensembl [46] and found to overlap by > 99% (Additional file 4: Fig. S4).

The DIGS tool provides functionality for exporting FASTA-formatted sequences and managing screening database tables (e.g., add/drop tables, import table data). Further information regarding program installation and usage is provided online, in a repository associated website [47]. In the sections below, we illustrate the application of the DIGS tool to cataloging of EVEs in vertebrate genomes, focussing on both high and low copy number elements.

Use of DIGS to catalog RT-encoding endogenous retroviruses

Unusually among vertebrate viruses, retroviruses (family Retroviridae) integrate their genome into the nuclear genome of infected cells as an obligate part of their life cycle. As a result, retroviruses gain more opportunities to become a permanent part of the host germline. Furthermore, the initial integrated form of a retrovirus genome, called a provirus, is typically replication competent. ERVs can therefore increase their germline copy number through reinfection of germ line cells or (after adaptation) by intracellular retrotransposition [48, 49]. Accordingly, “endogenous retroviruses” (ERVs) are by far the most common type of EVE found in vertebrate genomes [7, 50].

Retrovirus genomes contain a pol coding domain that encodes a reverse transcriptase (RT) gene. The RT gene can be used to reconstruct phylogenetic relationships across the entire Retroviridae and hence provides the lynchpin for unraveling the evolutionary history and origins of ERV loci [51, 52]. We therefore implemented a screening procedure to detect RT-encoding ERV loci, based on an RSL comprised of previously classified exogenous retrovirus and ERV RT sequences (see “Materials and methods”). Screening involved more than 1.5 million discrete tBLASTn searches and resulted in the identification of 1,073,137 ERV RT hits. This set was filtered based on higher BLAST bitscore cutoff to obtain a high confidence set of 702,167 loci (Table 2).

Table 2 ERV RT loci identified via in silico screening

High confidence ERV RT hits were identified in all vertebrate classes. However, the frequency among classes was found to vary dramatically (Fig. 3). ERVs occur most frequently in mammals (class Mammalia) and amphibians (class Amphibia), and at relatively similar, intermediate frequencies in the genomes of reptiles (class Squamata) and birds (class Aves). By contrast, RT-encoding ERVs are relatively rare in the genomes of fish, including ray-finned fish (class Actinopterygii) and jawless fish (class Agnatha). Cartilaginous fish (class Chondrichthyes) represent a possible exception, although only a few genomes were available for this group (Fig. 3). These findings are broadly consistent with previous studies, conducted using a smaller number of species genomes [50, 53,54,55].

Fig. 3figure 3

Counts of ERV RT loci identified by identified via database integrated genome screening of 874 vertebrate species. Box plots show the distribution of endogenous retrovirus (ERV) reverse transcriptase (RT) counts in distinct vertebrate classes. Median and range of values are indicated. Circles indicate counts for individual species. Counts are shown against a log scale. Figure created in R using ggplot2 and geom_boxplot. RT hits identified as likely contaminants are not shown

ERVs have been taxonomically grouped into three clades (I, II, and III) based on their phylogenetic relatedness in the RT gene to the exogenous Gammaretrovirus, Betaretrovirus, and Spumavirus genera, respectively [1, 2]. We incorporated into our RT screening database taxonomic information for (i) host species examined in our screen and (ii) RSL RT sequences. We then used an SQL query referencing these tables to summarize the frequency of clade I, II and III ERVs in distinct vertebrate classes (Additional file 3: Fig. S3). Whereas clade I and III ERVs are found in all vertebrate groups, clade II ERVs appear to have a more restricted distribution, occurring only at low frequency in amphibians, and being completely absent from agnathans and cartilaginous fish (Table 2). A few clade II ERVs were identified in ray-finned fish, but these were very closely related to mammalian ERVs and likely represent contamination of WGS builds with mammalian genomic DNA. While RT-encoding ERV copy number is quite high in cartilaginous fish, RT diversity is relatively low, with the majority of ERV RT sequences belonging to clade III.

Use of DIGS to catalog non-retroviral EVEs vertebrate genomes

To identify non-retroviral EVEs, we first obtained an RSL representing all known viruses [56]. From this library, a set of representative probes was selected. Probes comprised representative proteomes of all known vertebrate viruses except retroviruses. Screening entailed > 1.5 million discrete tBLASTn searches, and initial results comprised 33,654 hits. However, many of these represented matches to host genes and TEs. We identified these spurious matches by interrogating screening results with a combination of SQL queries, BLAST-based comparisons to curated sequence databases, and ad hoc phylogenetic analysis.

We excluded hits that contained intact coding regions and lacked evidence of integration into host DNA, since these may be derived from contaminating exogenous viruses (Additional file 5: Table S1). We also excluded other virus-derived DNA sequences that appeared likely to represent diet-related contamination of WGS data. For example, SQL-generated summaries of our initial screen results revealed several sequences disclosing similarity plant viruses, including geminiviruses (family Geminiviridae) and potyviruses (family Potyviridae) (Additional file 3: Fig. S3). These sequences contained multiple stop codons and frameshifts, suggesting they might represent EVEs embedded within contaminating DNA, particularly since EVEs derived from both these virus groups are known occur in plant genomes [57, 58]. Other unexpected matches to plant virus groups were contained within large contigs and thus could not be dismissed as contaminating DNA. For example, a sequence identified in the genome of the pig-nosed turtle (Carettochelys insculpta) disclosed similarity to caulimoviruses (family Caulimoviridae). However, genomic analysis revealed this sequence in fact represents an unusual ERV (Additional file 6: Fig. S5).

Next we removed matches to recognized transposons that are wholly or partly comprised of virus-derived DNA, such as polintons/mavericks [59,60,61] and teratorns [62] (Additional file 3: Fig. S3). Once these EVE-like TEs had been removed, results comprised 6038 putative non-retroviral EVE sequences, representing 10 virus families (Table 3, [63]). We did not identify any EVEs derived from vertebrate viruses with genomes comprised of double-stranded RNA (e.g., order Reovirales) or circular single-stranded RNA (e.g., genus Deltavirus). However, all other virus genome “classes” were represented including reverse-transcribing DNA (DNArt) viruses, double-stranded DNA (DNAds) viruses, single-stranded DNA (DNAss) viruses, single-stranded negative sense RNA (RNAss-ve) viruses, and single-stranded positive sense RNA (RNAss + ve) viruses. Plotting the distribution of EVEs and exogenous viruses from distinct virus families and genera across vertebrate phyla revealed that many virus groups have had a broader distribution across vertebrate hosts than recognized on the basis of previously identified exogenous viruses (Fig. 4).

Table 3 Number of non-retroviral EVE sequence identified and estimated number of germline incorporation events in distinct vertebrate classesFig. 4figure 4

Exogenous versus endogenous distribution of virus families that have been incorporated into the vertebrate germline. Circles indicate the known presence of exogenous viruses in vertebrate groups, determined through reference to the NCBI virus genomes resource [56], supplemented with information obtained from recently published papers [64,65,66,67,68,69,70]. Shaded boxes indicate the presence of endogenous viral elements, as determined in the present study. RT retroviruses, DNArt reverse transcribing DNA viruses, DNAss single-stranded DNA viruses, DNAds double-stranded DNA viruses, RNAds double-stranded RNA viruses, RNAss-ve single-stranded negative sense RNA viruses, RNAss + ve single-stranded positive sense RNA viruses

We examined all EVE loci identified in our study to determine their coding potential. We identified numerous EVE loci encoding open reading frames (ORFs) > 300 amino acids (aa) in length (Additional file 7: Fig. S6). Among these, 4 encoded ORFs longer than 1000 aa. One of these—a 1718aa ORF encoded by an endogenous borna-like L-protein (EBLL) element in bats (EBLL-Cultervirus.29-EptFus) —has been reported previously [71]. However, we also identified an endogenous chuvirus-like L-protein (ECLL) element encoding an ~ 1400 aa ORF in livebearers (subfamily Poeciliinae). This element encodes long ORFs in two distinct livebearer species (P. formosa and P. latapina), indicating its coding capacity has been conserved for > 10 million years [72]. We also detected herpesvirus and alloherpesvirus EVEs encoding ORFs > 1000 aa, but as discussed below, the integration status of these sequences remains unclear.

Diversity of non-retroviral EVEs in vertebrate genomesEVEs derived from viruses with double-stranded DNA genomes

We detected DNA derived from herpesviruses (family Herpesviridae) in mammalian and reptilian genomes (Fig. 4, Table 3, [63]). DNA sequences derived from betaherpesviruses (subfamily Betaherpesvirinae) and gammaherpesviruses (subfamily Gammaherpesvirinae) have previously been reported in WGS assemblies of the tarsier (Carlito syrichta) and aye-aye (Daubentonia madagascensis), respectively [73]. In addition to these sequences, we detected gammaherpesvirus DNA in WGS data of red squirrels (Sciurus vulgaris) and the Amazon river dolphin (Inia geoffrensis), while betaherpesvirus DNA was detected in the stoat (Mustela ermina) WGS assembly, and DNA derived from an alphaherpesvirus (subfamily Alphaherpesvirinae) in the Oriximina lizard (Tretioscincus oriximinensis) WGS (Additional file 8: Fig. S7). Germline integration of human betaherpesviruses has been demonstrated [74, 75], and the presence of a betaherpesvirus-derived EVE in the tarsier genome EVE has been established [73]. However, herpesviruses can also establish latent infections and might be considered likely to occur as contaminants of DNA samples used to generate whole genome sequence assemblies. Due to the limitations of the WGS assemblies in which they were identified, it was not possible to confirm that the novel herpesvirus DNA sequences detected here represent EVEs rather than DNA derived from contaminating exogenous viruses.

DNA derived from alloherpesviruses (family Alloherpesviridae) was detected in fish and amphibians. In ray-finned fish, most of these sequences belonged to the “teratorn” lineage of transposable elements, which have arisen via fusion of alloherpesvirus genomes and piggyBac transposons, and have been intragenomically amplified in the genomes of teleost fish (infraclass Teleostei) [62]. Additional alloherpesvirus-related elements were identified in three amphibian species and five ray-finned fish species [63]. One of these elements, identified in the Asiatic toad (Bufo gargarizans) occurred within a contig that was significantly larger than a herpesvirus genome, demonstrating that it represents an EVE rather than an exogenous virus. Phylogenetic analysis revealed that alloherpesvirus-like sequences identified in amphibian genomes clustered robustly with amphibian alloherpesviruses, while those identified in fish genomes clustered with fish alloherpesviruses (Additional file 8: Fig. S7).

EVEs derived from viruses with single-stranded DNA genomes

EVEs derived from parvoviruses (family Parvoviridae) and circoviruses (family Circoviridae) are widespread in vertebrate genomes, being found in the majority of vertebrate classes (Fig. 4). Both endogenous circoviral elements (ECVs) and endogenous parvoviral elements (EPVs) are only absent in major vertebrate groups represented by a relatively small number of sequenced species genomes (i.e., between 1 and 6). No ECVs or EPVs were identified in the tuatara (order Rhynchocephalia) or in crocodiles (order Crocodilia). EPVs were not identified in agnathans, while ECVs were not identified in cartilaginous fish.

We identified a total of 1192 ECVs, most of which are derived from elements in carnivore (class Mammalia: order Carnivora) genomes that are embedded within non-LTR retrotransposons and have undergone intragenomic amplification (Additional file 9: Fig. S8). While many of the ECVs identified in our screen have been reported in previous publications [7, 32, 36, 42, 76], we also identified novel loci in mammals, reptiles, amphibians, and ray-finned fish [63]. Phylogenetic analysis (see Additional file 8: Fig. S7) revealed that a novel ECV locus in turtles groups with avian circoviruses, while amphibian ECV elements grouped with fish circoviruses, though bootstrap support for this relationship was lacking. A circovirus-like sequence detected in the WGS data of Allen’s wood mouse (Hylomyscus alleni) grouped robustly with exogenous rodent circoviruses, but integration of this sequence into the H. alleni genome could not be confirmed.

We identified 627 EPVs, representing two distinct subfamilies within the Parvoviridae and five distinct genera (see Fig. 4). The majority of these loci have been reported in a previous study [32] or are orthologs of these loci. However, we identified novel EPVs in reptiles, amphibians and mammals (Table 3, [63]). In reptiles the novel elements derived from genus Dependoparvovirus while the amphibian elements were more closely related to viruses in genus Protoparvovirus. Notably, the novel amphibian EPVs clustered basally within a clade of protoparvovirus-related viruses in phylogenetic reconstructions (Additional file 8: Fig. S7), consistent with previous analyses indicating that protoparvovirus ancestors may have broadly co-diverged with vertebrate phyla [32].

EVEs derived from reverse-transcribing DNA viruses

EVEs derived from hepadnaviruses (family Hepadnaviridae), which are reverse-transcribing DNA viruses, were identified in reptiles, birds and amphibians (Table 3, [63]). Most of these EVEs, commonly referred to as “endogenous hepatitis B viruses” (eHBVs), have been reported previously [35, 77]. However, we identified novel elements in the plateau fence lizard (Sceloporus tristichus) and others in vertebrate classes where eHBVs have not been reported previously. These include one element identified in a cartilaginous fish, the Australian ghostshark (Callorhinchus milii), and another identified in an amphibian, the common coquí (Eleutherodactylus coqui).

Phylogenetic analysis (see Additional file 8: Fig. S7) revealed that novel eHBV elements identified in lizards (suborder Lacertilia) group robustly with the exogenous skink hepadnavirus (SkHBV), while the amphibian element groups within a clade comprised of the exogenous spiny lizard hepadnavirus (SlHBV), Tibetan frog hepadnavirus (TfHBV) and eHBV elements identified in crocodile genomes. The eHBV identified in sharks was relatively short and not amenable to phylogenetic analysis but nonetheless provides the first evidence that hepadnaviruses infect this host group.

EVEs derived from viruses with single-stranded, negative sense RNA genomes

Screening revealed that vertebrate genomes contain numerous EVEs derived from mononegaviruses (order Mononegavirales), which are characterized by non-segmented ssRNA-ve genomes. These EVEs derive from four mononegavirus families: bornaviruses (family Bornaviridae), filoviruses (family Filoviridae), paramxyoviruses (family Paramyxoviridae) and chuviruses (family Chuviridae) (Fig. 4, Table 3, [63]). We did not detect any EVEs derived from other mononegavirus families that infect vertebrates (Pneumoviridae, Rhabdoviridae, Nyamiviridae, Sunviridae), nor any EVEs derived from virus families with segmented, negative sense RNA genomes (e.g., Peribunyaviridae, Orthomyxoviridae).

The majority of mononegavirus EVEs identified in our screen were derived from bornaviruses and filoviruses and have been described in previous reports [7, 32, 35, 36, 78]. However, we also identified novel EVEs derived from these groups, as well as previously unreported EVEs derived from paramyxoviruses and chuviruses (Table 3).

Germline integration of DNA derived from mononegaviruses can occur if, in an infected germline cell, viral mRNA sequences are reverse transcribed and integrated into the nuclear genome by cellular retroelements [79]. EVE loci generated in this way preserve the sequences of individual genes of ancient mononegaviruses, but not entire viral genomes. Among mononegavirus-derived EVEs, regardless of which family, elements derived from the nucleoprotein (NP) and large polymerase (L) genes predominate. However, other genes are also represented, including the glycoprotein (GP) genes of filoviruses, bornaviruses, and chuviruses, the VP30 and VP35 genes of filoviruses, and the hemagglutinin-neuraminidase (HA-NM) gene of paramyxoviruses.

Paramyxovirus-like EVEs were identified in ray-finned fish, amphibians, and sharks (Fig. 

留言 (0)

沒有登入
gif