Genome entropy and network centrality contrast exploration and exploitation in evolution of foodborne pathogens

Modelling evolution of foodborne pathogens is crucial for mitigation and prevention of outbreaks. We apply network-theoretic and information-theoretic methods to trace evolutionary pathways of Salmonella Typhimurium in New South Wales, Australia, by studying whole genome sequencing surveillance data over a five-year period which included several outbreaks. The study derives both undirected and directed genotype networks based on genetic proximity, and relates the network's structural property (centrality) to its functional property (prevalence). The centrality-prevalence space derived for the undirected network reveals a salient exploration-exploitation distinction across the pathogens, further quantified by the normalised Shannon entropy and the Fisher information of the corresponding shell genome. This distinction is also analysed by tracing the probability density along evolutionary paths in the centrality-prevalence space. We quantify the evolutionary pathways, and show that pathogens exploring the evolutionary search-space during the considered period begin to exploit their environment (their prevalence increases resulting in outbreaks), but eventually encounter a bottleneck formed by epidemic containment measures.

Environmental reservoirs of foodborne pathogens harbour diverse microbial populations that evolve within their ecological niches. New strains associated with increased virulence or antimicrobial resistance pose a particular risk to human health. An understanding of the evolutionary dynamics of endemic foodborne pathogens is therefore important to enable effective management of threats to public health, food safety, and trade. To this end, new methods are required for identifying evolutionary phases and bottlenecks, and abrupt transitions in the dynamics that signal emerging threats.

Salmonella enterica causes a significant foodborne disease burden globally [1]. In Australia, S. enterica subspecies enterica serovar Typhimurium (STM) is responsible for most locally-acquired foodborne infections [25]. In New South Wales (NSW), STM surveillance includes routine whole genome sequencing (WGS) of bacterial isolates, which has identified genomic diversity in the endemic population including a multi-drug-resistant monophasic strain [6, 7]. The evolutionary dynamics of endemic strains can be modelled as a trade-off between an exploration phase where genetic diversity increases to explore the local fitness landscape, and an exploitation phase where a successful variant disseminates in its niche [810]. Pathogen surveillance targets human disease, and the majority of available genomic data is collected from clinical cases of salmonellosis. This renders visible only part of the STM population which infects human hosts (e.g. through contaminated food), whereas the dynamics of populations in reservoirs such as poultry flocks are largely invisible. Furthermore, traditional evolutionary tools such as phylogenetic analysis are limited in that they can only analyse bacterial sequences which are common to all isolates. However bacteria often have extensive accessory genomes, i.e. genes present only in a subset of strains. These accessory genes are not necessary to generate a fully functioning bacterial cell, but often encode virulence factors and mechanisms that provide antibiotic resistance. The accessory genome can also be rapidly mobilised and exchanged between and within bacterial populations therefore providing a rapid mechanism for genetic diversification [11]. The shell genome here is defined as the set of genes present in 15%–95% of isolates.

Previous studies identified an exploration-exploitation distinction in a space formed by two dimensions: structure (network centrality) and function (prevalence), using an STM surveillance dataset spanning a nine-year period in NSW [12, 13]. This disease predated routine genomic surveillance, employing multiple-locus variable-number tandem-repeats analysis (MLVA) instead of WGS. While MLVA often provides sufficient discriminatory power to support outbreak investigations [14, 15], WGS data offers a much higher resolution and the possibility to quantitatively verify the suggested distinction between exploration and exploitation [16, 17]. In this study, we aimed to apply network-theoretic and information-theoretic methods in order to trace STM evolutionary pathways and identify their salient phases and transitions, using a WGS dataset of 3939 isolates collected between 2016 and 2021 and sequenced by NSW Health Pathology.

We first applied an established network-theoretic approach [13, 18] to derive genotype networks for the new WGS dataset. This allowed us to identify a clear partition of pathogens in the constructed centrality-prevalence space. We then demonstrated, by computing the corresponding shell genome entropy, that pathogens with the low-to-mid centrality have a higher genetic diversity (identifying these with the exploration phase), while pathogens with the mid-to-high centrality have a lower genetic diversity (identifying these with the exploitation phase). A transition between these phases was quantified with the Fisher information in terms of the network centrality. Continuing with network analysis, we computed the probability density along evolutionary paths, contrasting the paths with increasing and decreasing connectivity. This analysis demonstrated that pathogens exploring the evolutionary search-space (i.e. with low-to-mid centrality) may move towards the exploitation phase (i.e. change their network centrality towards higher values) while increasing their prevalence and producing outbreaks. Importantly, however, we identified an evolutionary bottleneck in the observed population emerging in response to the epidemic containment measures taken by Health Protection NSW at that time.

2.1. Bacterial isolates and bioinformatic analysis

A total of 3939 STM isolates collected between 2016 and 2021 and sequenced at the Institute of Clinical Pathology and Medical Research (ICPMR), NSW Health Pathology were included in the analysis. Sequencing was performed as described in a previous study [18]. Demultiplexed sequencing reads with $1\times 10^7$ reads per isolate were trimmed [19], based on a minimum quality read score of 20, and then assembled de novo using SPAdes (version 3.13.0 [20]). The quality of assemblies was assessed with Quast (version 5.0.2 [21]); only assemblies with $200$ contigs and N50 values of $$50 000 bp were included in further analysis. Contigs were annotated with predicted genes using Prokka (version 1.13.3 [22]). The bioinformatic tool, Roary (the pangenome pipeline; version 3.12.0 [23], with default parameters) was used to partition the gene content of all isolates. The software defines the core genes as genes present in 99%–100% of isolates, soft core (in 95%–99% of isolates), shell (in 15%–95% of isolates) and cloud gene content (in 0%–15% of isolates). Single nucleotide polymorphisms (SNPs) were identified from the 3792 383 base pairs defined as the core genome, using SNP-sites [24] without recombination masking [25]. The SNP distance Gij between isolates i and j is defined as the number of core SNPs by which they differ. The median value of Gij across all 3939 STM isolates was 383 (range 0–2051).

2.2. Undirected genotype network

We first derived a complete undirected network in which nodes represent individual isolates and edges represent genetic distances. Specifically, following an earlier study [18], the edge weight between adjacent nodes i and j in the complete undirected network was set to the pairwise SNP distance, Gij . We then used the derived undirected network to construct a 2-dimensional space in terms of two quantities measured for each node i: the closeness centrality li , and the prevalence defined as the number of close neighbours within genetic distance $G_} = 20$ from node i. The constructed centrality-prevalence space relates a structural property of the pathogen (its network centrality) to the functional property (its neighbourhood prevalence). The latter can be interpreted as the pathogen's fitness capturing the average spread and virulence across its genetic neighbourhood [13]. Following prior studies [12, 13], we use the normalised node closeness centrality, computed as the inverse sum of the length of the shortest paths between the node of interest and other nodes, as described below.

2.2.1. Closeness centrality

Node closeness centrality captures the relative importance of node i to other nodes in the network by distance proximity. It is calculated as the inverse average shortest distance from node i to other nodes:

Equation (1)

where N is the number of nodes in the network, $d_$ is the length of the shortest path between any pair of nodes i and j, defined as the sum of the edge weights $G_$, where $k_1 = i$ and $k_m = j$, on the path with $(m-1)$ edges. If m = 2, then $d_ = G_$.

If we take inverse of li , $\lambda_i = 1 / l_i$, and average λi across the entire network, the result yields the average path length L of the network:

Equation (2)Equation (3)Equation (4)

In this study, we employ the average length of the shortest paths to measure how individual nodes contribute to the average genetic variation in the population. The relationship between the closeness centrality and the average path length suggests that the closeness centrality is a more suitable characteristic of genetic diversity of the population on average than alternative centrality measures, such as betweenness centrality [26] or eigenvector centrality [27].

2.2.2. Thresholded undirected network

Using the undirected network, we then constructed a thresholded undirected network (figure 1) where the edges are only inferred between genetically similar isolates, within the genetic distance threshold $G_} = 20$. When comparing Salmonella core genomes, distances of 0–5 SNPs are highly indicative of STM isolates collected from the same point source outbreak—if they are collected within a three month period—whereas distances of more than 20 SNPs suggest that isolates are epidemiologically unrelated [2830]. The choice of $G_} = 20$ therefore produces edges between pairs of plausibly-related isolates while excluding edges where the number of core SNPs would imply the divergence of strains occurred well outside of the study period. In our previous study of Salmonella Enteritidis genotype networks [18], we showed that the network analysis was robust to choices of Gmax between 10 and 30.

Figure 1. The thresholded undirected SNP sub-networks produced for the complete SNP network using SNP distance threshold $G_} = 20$. Total number of nodes $N = 3,939$, total number of edges $M = 336,522$. Singletons (193 isolates) are removed. Total number of components: 173 (excluding singletons). See figure A1(a) for the degree distribution. Node colours: (blue) low-to-mid centrality with low prevalence; (cyan) low-to-mid centrality with high prevalence; (orange) mid-to-high centrality with low prevalence; (yellow) mid-to-high centrality with high prevalence. Colour coding is in accordance with figure 5. See section 3.1 for more details.

Standard image High-resolution image

We highlighted nodes by applying a centrality threshold of $3.3\,\times\,10^$: nodes with centrality below the threshold are coloured blue or cyan and nodes with a higher centrality are coloured orange or yellow, as shown in figure 1. This centrality threshold is determined by an observed partition of isolates in the centrality/prevalence space (see section 3.1 for more details). The centrality-prevalence space is used for identifying patterns and phases in the evolutionary dynamics of the pathogens.

2.3. Directed genotype network

We also derived a directed SNP sub-network that captures both genetic and temporal proximity, with a directed edge only inferred between isolates if they are (i) genetically close ($G_} = 20$), and (ii) collected within a fixed time window ($T_} = 30$ days), with the edge direction pointing from the predecessor to the successor. In cases where the isolates are collected on the same day, two directional edges are inferred. We chose the collection date as a proxy to the date of infection because foodborne salmonellosis is characterised by acute symptoms with the onset periods varying between 6 hours to 6 days. In addition, the time window, Tmax, is an approximation of the median shedding period commonly associated with foodborne salmonellosis [31]. A visual representation of the directed SNP sub-network is shown in figure 2.

Figure 2. The thresholded directed SNP sub-networks derived using SNP distance threshold $G_} = 20$ and time window $T_} = 30$ days. The inset shows a zoomed-in view of one of the network components. Total number of nodes $N = 3,939$, total number of edges $M = 55,440$. Singletons (824 isolates) are removed. Total number of components: 338 (excluding singletons). Node colours: (blue) low-to-mid centrality with low prevalence; (cyan) low-to-mid centrality with high prevalence; (orange) mid-to-high centrality with low prevalence; (yellow) mid-to-high centrality with high prevalence. Colour coding is in accordance to figure 5. See section 3.1 for more details.

Standard image High-resolution image 2.4. Probability density of directed paths

A directed path is formed by connecting directed edges and may be considered as an approximation of pathogen adaptations over time. Previous studies of directed MLVA networks [13] suggested a distinction between paths leading to 'successful' or 'unsuccessful' adaptation, by associating these types with the increasing or decreasing connectivity along the path. In this work, we applied this distinction to the directed network derived using the WGS dataset. In order to trace genetic changes over time, we identified $38,698$ sufficiently long (m > 5) directed paths formed by connecting a sequence of directed edges. For each path we compared changes of the neighbourhood prevalence between the start and end nodes, distinguishing positive (successful adaptation) and negative (unsuccessful adaptation) changes. Some examples of successful and unsuccessful paths are shown in figure 3.

Figure 3. 'Successful' and 'unsuccessful' paths in the directed network. For the shown successful path, the end node has a higher neighbourhood prevalence than the start node (i.e. gaining neighbours with prevalence increasing from 144 to 387). The end node of the shown unsuccessful path, on the other hand, has a lower neighbourhood prevalence than the start node (i.e. losing neighbours with the prevalence decreasing from 88 to 74). The directed network shown on the right is the one depicted in figure 2.

Standard image High-resolution image

These directed paths were projected onto the centrality-prevalence space for the complete undirected SNP network, and for each point (i.e. each node) the prevalence changes were averaged across all paths originating from this node. For each point in the centrality-prevalence space (i.e. for each node), we quantify the average changes in prevalence along both successful and unsuccessful paths originating from this node. The resultant estimated probability density of directed paths distinguishes between the pathogens which tend to develop towards a higher or lower neighbourhood prevalence [13]. Thus, the probability density captures the expected changes in prevalence during a subsequent pathogen adaptation (measured via the likelihood of either expanding or contracting genetic neighbourhood).

2.5. Shell genome entropy

The pangenome comprises the entire set of $13,555$ genes from all $3,939$ isolates. After excluding the core and soft core genome (genes present in greater than 95% of isolates) and the cloud genome (genes present in fewer than 15% of isolates), the remaining shell genome consisted of 713 genes. The high-resolution WGS dataset enabled us to further explore the evolutionary pathways by looking at the genetic composition of different groups of isolates (e.g. isolates in the low-to-mid and mid-to-high centrality groups). We computed normalised entropy of the shell genome for each group.

The shell genome entropy is computed using gene presence/absence data over a fixed number of temporally ordered isolates (n = 10, note that collection date is not uniformly spaced). See figure 4 for a schematic of the shell genome entropy computation. The normalised Shannon entropy, S, is then calculated as follows [32]:

Equation (5)

where Q is the number of genes, pi is the probability of the gene i being present in the considered group of isolates, and $H_} = (Q)$ the maximum entropy.

Figure 4. A schematic of the shell genome entropy computation defined in equation (5). Note that the shell genome data shown in step (B) is a subset of the shell genome data for the full population shown in figure B1.

Standard image High-resolution image 2.6. Fisher information in the centrality space

In order to verify that the two proposed types of the evolutionary dynamics (exploration and exploitation) can be characterised as phases, separated by a properly defined phase transition, we considered the pathogen adaptation as a thermodynamic process developing in response to changes in the control parameter defined as the node closeness centrality.

We employed the Fisher information to capture the sensitivity of the shell genome probability distribution to changes in the closeness centrality. In general, Fisher information [33] measures the amount of information that an observable random variable $\mathcal$ contains about an unknown parameter θ, given the probability function $p(x|\theta)$ over $x \in \mathcal$ with respect to parameter θ:

Equation (6)Equation (7)

where $\mathrm [\cdot | \theta]$ denotes the conditional expectation with respect to θ.

There is a well-established association between phase transitions and the Fisher information [3441]: the Fisher information diverges at critical points, being proportional to the rate of change of the corresponding order parameter [37]. In the context of evolutionary dynamics of pathogens, the system possibly undergoing a transition is a pathogen which may change its adaptation behaviour from explorative to exploitative in response to changes in its node centrality (i.e. control parameter). In general, the node centrality may be expected to change in response to the underlying evolutionary dynamics. Hence, in this study, we consider changes in the node centrality as a process corresponding to varying the control parameter. On the other hand, the prevalence may correspond to the order parameter characterising the outcome of the evolutionary process. The advantage of using Fisher information in identifying possible phase transitions is that the computation of the corresponding order parameter is not required—instead, this model-independent method seeks specific critical points in the centrality space at which the Fisher information peaks.

We computed normalised Fisher information using gene presence/absence data over the isolates ordered by centrality with a fixed centrality increment of $1 \times 10^$. Following [4244], the normalised discrete Fisher information is calculated as follows:

Equation (8)

where R is the number of centrality increments (steps), and F0 is a normalisation constant:

Equation (9)3.1. SNP centrality-prevalence space

The constructed centrality-prevalence space (figure 5) shows a clear structure with at least two groups separated by their closeness centrality values: the isolates with low-to-mid centrality ($ \leqslant 3.3 \times 10^$, shown in blue/cyan), and the isolates with mid-to-high centrality ($3.3 \times 10^-3$, shown in yellow/orange). We draw an additional distinction by separating the isolates by neighbourhood prevalence (set at 100 isolates marked by the dashed line), above which isolates are considered high prevalence.

Figure 5 shows that (i) isolates in the same centrality group (low-to-mid, shown in blue; or mid-to-high, shown in orange) typically belong to the same network component, and (ii) isolates with high neighbourhood prevalence (shown in yellow and cyan) usually cluster together within the network components with some peripheral nodes from the same centrality group.

Figure 5. Centrality-prevalence space for the complete undirected SNP network. The neighbourhood prevalence (shown in log scale) of an isolated is measured by the number of neighbours within $G_} = 20$. Blue/cyan and orange/yellow colours distinguish the groups of isolates with lower and higher centrality, split at $3.3 \times 10^$. Grey nodes are singletons. The dashed line, set at 100 isolates, separates lower and higher neighbourhood prevalence (shown with lighter and darker colours).

Standard image High-resolution image

We also related the neighbourhood prevalence and the local clustering coefficient. This allowed us to investigate the role of clusters and hubs which are known to contribute to epidemic propagation [45]. However, the partition of isolates observed in the clustering-prevalence space (see figure A1(b) in appendix A) is not as structured as the one in the centrality-prevalence space.

3.2. Shell genome entropy and exploration-exploitation distinction

Previous studies distinguished between the explorative and exploitative pathways in the STM genetic space and argued that a mid-to-high centrality region contains higher-risk STM pathogens [12, 13]. These studies relied on coarse-grained MLVA datasets, and these results have not been verified with high-resolution WGS data. In this work, in order to associate the two groups separated in the centrality-prevalence space (figure 5) with the exploration-exploitation distinction, we compared the shell genome entropy between the groups.

Figure 6(a) shows that the isolates with low-to-mid centrality tend to have a higher normalised shell genome entropy, thus demonstrating a greater genetic diversity, while the normalised shell genome entropy for the isolates with mid-to-high centrality tends to be smaller, showing a reduced genetic diversity (figure 6(b)). This difference confirms the exploration-exploitation distinction: the isolates with low-to-mid centrality tend to diversify, while the isolates with mid-to-high centrality are more stable in their shell genome. Figure B1 in appendix B illustrates the corresponding shell genome data.

Figure 6. Normalised shell genome entropy (713 genes) over time for the two groups identified in figure 2. The normalised entropy is higher for (a) isolates with low-to-mid centrality (mean 0.873, std. dev. 0.064) than for (b) isolates with mid-to-high centrality (mean 0.709, std. dev. 0.044). Mean is marked by solid black line. Note that time intervals are not evenly distributed: each bin contains the same number of isolates (i.e. 10) but time intervals vary according to the temporal density of isolates.

Standard image High-resolution image 3.3. Fisher information and evolutionary dynamics

We also examined a possible transition between two phases (exploration and exploitation) in the centrality space by tracing the normalised Fisher information computed for all isolates ordered by their centrality values. Figure 7 shows that a small change in the centrality value from $2.6 \times 10^$ to $2.7 \times 10^$ results in an abrupt spike in the Fisher information. This indicates that the probability distribution of genes in the isolates around this centrality value is most sensitive to the centrality change. In thermodynamic terms, when the closeness centrality is considered as a control parameter, the observed spike in the Fisher information indicates that there is a phase transition between the lower and higher centrality regions. We interpret the resulting evolutionary dynamics as exploration and exploitation respectively. The critical regime is observed towards the end of exploration phase before the population self-organises into more exploitative dynamics.

Figure 7. Normalised shell genome Fisher information (713 genes) in the centrality space. The normalised Fisher information has two peaks: (i) a large peak for the isolates around critical centrality value between $2.6 \times 10^$ and $2.7 \times 10^$, and (ii) a smaller peak for isolates with a high centrality greater than

留言 (0)

沒有登入
gif