Exploring the diversity and evolutionary strategies of prophages in Hyphomicrobiales, comparing animal-associated with non-animal-associated bacteria

Data mining

We curated the Hyphomicrobiales genomes that are available in the Bacterial and Viral Bioinformatics Resource Center (BV-BRC/ PATRIC database) [12]. The evaluation of the quality of the genomes was done according to the parameters of fine consistency and high coverage as provided by BV-BRC, such that only complete genomes of good quality (coverage > 90%) and fine consistency (≥ 95%) were analyzed; duplications or unclassified data were removed (Supplementary Fig. 1). Out of 11,511 genomes in the order Hyphomicrobiales that were deposited in BV-BRC (https://www.bv-brc.org/ accessed in October 1, 2022), only 560 genomes with a total of 895 contigs met our stringent quality criteria and were included in subsequent analyses (See Supplementary Table 1). The lifestyle and evolutionary traits of Hyphomicrobiales were identified based on the classification of Wang et al. [6].

Additionally, to classify APS, we followed the methodology proposed by Wang et al. (6, illustrated in Fig. 1). The primary genera associated with specific lifestyles, such as Hyphomicrobium, Methylobacterium, Bradyrhizobium, Devosia, Sinorhizobium, Agrobacterium, Rhizobium, Mesorhizobium, Bartonella, and Brucella, were selected for our analysis. These genera were then curated based on the criterion of ‘representative’ genomes available in BV-BRC (accessed on November 5, 2023, at https://www.bv-brc.org/). Consequently, the genomes were reclassified into 19 distinct genera, as detailed in Supplementary Table 2. Furthermore, we employed a dataset of 96 genomes comprising 265 contigs to classify the APS utilizing the Defense Finder server (https://defense-finder.mdmparis-lab.com/), which automatically identifies known antiviral systems in prokaryotic genomes [13]. All 265 contigs underwent analysis, and the results summarized the types of defense systems present in these genomes, enabling the characterization of APS frequency within the Hyphomicrobiales order.

Fig. 1figure 1

Percentage of prophages in host genomes and the PHASTER classification: (red) animal-associated bacteria (AAB) and (blue) non-animal-associated bacteria (NAAB) – other genera of Hyphomicrobiales. The numbers on the red and blue bars indicate the numbers of prophages of AAB and NAAB genomes in each PHASTER classifications. The stars indicate significant differences between the two groups of genomes at the 5% significance level, two-sided test, Bonferroni correction for three comparisons in each case

Prophage analysis

The genomes were submitted to the PHASTER server to annotate phage regions and to classify prophage sequences as intact, questionable, or incomplete [14]. These ‘completeness classes’ are determined by PHASTER using several factors, including prophage length, the number of phage-like genes identified within the sequence, and their identity [15]. The three categories reflect the certainty of the PHASTER algorithm in identifying a gene region as a full, functional prophage. Thus ‘intact’ prophages are considered very likely to be full prophages, whereas ‘incomplete’ prophages are likely partially degraded prophage remnants. Prophage sequences with scores between these two extremes are classified as ‘questionable’. This analysis allowed us to compute the number of prophages identified per genome, the fraction of these prophages in each completeness class, and the length distribution of prophage sequences, all of which were compared between prophages found in AAB and NAAB.

The genetic repertoire of prophages in these three completeness classes was also compared, as described in Khan et al. [16]. In brief, for each coding sequence within an identified prophage region, PHASTER annotates BLAST hits for that sequence, as described in Zhou et al. [15]. We recorded instances within these annotations of the following 13 classes of phage genes: injection (injn), plate (plat), flippase (flip), capsid (caps), terminase (term), head (head), lysin (lysn), portal (port), lysis (lyss), integrase (intg), tail (tail), transposase (tran), protease (prot). For these 13 phage gene classes, we counted the number of prophages identified as containing at least one gene of that class. We noted that some gene classes may not be observed in intact prophages in this dataset. This does not imply that these genes are necessarily absent in these prophages, but could be an observational bias based on the available annotated gene sequences used for BLAST comparisons, BLAST parameter settings, or simply small numbers. When a gene class was not observed in intact prophages, this class was excluded from further analysis.

We further examined the gene repertoire annotations based on whether the prophage sequence was classified as “intact”, “questionable” or “incomplete”. This allowed us to calculate the percent change in the frequency of specific classes of prophage genes in degraded prophage sequences, as compared to intact prophages [16]. The percentage change in gene frequency indicates whether certain genes are observed in incomplete prophages more or less frequently than expected, based on their frequency in intact prophages. Specifically, positive values indicate that a particular gene is enriched in incomplete prophages, or preferentially retained as the prophage sequence degrades. In contrast, negative values indicate that this gene group is lost more quickly than others. To evaluate the statistical significance of differences in percentage change, we randomly assigned the same total number of genes to one of the three completeness classes, preserving the proportion of genes assigned to each class [17]. We then computed the percentage change in gene frequency for these bootstrapped, randomly allocated data, and repeated this procedure 10,000 times to assess whether an observed change was significant at the 5% level, including Bonferroni corrections. Two categories of statistical significance were defined: (a) genes that are preferentially lost in incomplete prophages; and (b) genes enriched in incomplete prophages.

The prophage length distributions were also fit to a mathematical model to quantify any differences in the evolutionary rates affecting prophages in each bacterial category (AAB or NAAB). This model was developed to describe the prophage length distribution in two large prophage datasets [16] and later adapted to quantify differences in prophage content across classes of bacteria [17], as we do here. In this model, a partial differential equation describes changes to the length distribution of prophages over time:

$$\frac= \alpha f\left(x\right)+ \frac\left[rD\ x\ Q\left(x,t\right)\right]+\left[_S\left(x\right)-_I\left(x\right)\right]Q\left(x,t\right)-\delta \left(t\right)Q(x,t)$$

In the above equation, Q(x, t) is the density of prophages of length x (kb) at time t. The parameter α represents the rate of lysogeny, and f(x) is the length distribution of active phages entering bacterial genomes via lysogeny. The parameter rD is the rate at which prophage sequences decay in length over time. Thus, before the action of selection and induction, mutation acts at a constant and uniform rate across the prophage genome, and mutation is deletion-biased, gradually eroding prophages. The parameter rS is the selection coefficient (benefit or detriment) conferred to the host by an intact prophage, while S(x) is the fraction of this selective effect conferred by a prophage of length x. The parameter rI represents the rate of induction, while I(x) is the probability that a prophage of length x may be lost by induction. I(x) itself depends on the parameter nI, which is the number of genes required for the loss of the prophage to occur via induction, ie. the number of genes necessary for the prophage sequence to excise from the host genome. The function δ(t) is a normalizing constant which ensures that the density integrates to one at all times. In the steady-state solution, this function is a constant which we simply denote δ, reflecting the turnover rate of the prophage population.

As described in Pattenden et al. [16], we used maximum likelihood minimization to find parameter values such that the steady state solution of Eq. (1) best fits the prophage length distribution in our datasets. We note that the steady state solution does not depend independently on all of the rates in the model but only on the ratio of these rates. In other words, only four of the five rate constants in the model are identifiable, and the output of data fitting is not the rates themselves, but their relative ratios. We used the degradation rate as the normalizing factor for this ratio, since the rate of degradation, a mutational rate, is not expected to differ between host classes. This data fitting procedure thus yields four relative rates, α/rD, rS/rD, rI/rD and δ/rD, which correspond to: the relative rate of lysogeny (i.e. rate at which new prophages enter the genome), the selective effect (i.e. selective benefit to the host of carrying an intact prophage), the induction rate (rate at which fully competent prophages induce the lytic phase), and the turnover rate (loss of prophages from the population, independent of their length). This data fitting procedure also estimated nI, the number of genes required for induction.

Finally, we used a separate bootstrapping procedure to determine whether these five best-fit parameter values differed significantly between AAB and NAAB. Here, we pooled all the prophage sequence lengths from AAB and NAAB, drew samples from this pooled set with replacement, and computed the best-fit parameter values for each sample. We then compared whether the best-fit parameter values obtained for AAB or NAAB data differed significantly from the distribution of parameter values estimated from the pooled data. For each of AAB and NAAB, we drew 1000 samples from the pooled dataset, where in each case the number of prophage lengths in the sample was the same as the number in the AAB or NAAB datasets, respectively.

留言 (0)

沒有登入
gif