A global SARS-CoV-2 phylogeny [70] dated 20 May 2022 was downloaded from Global Initiative on Sharing All Influenza Data (GISAID) EpiCoV™ [71, 72] on 24 May 2022. This phylogenetic tree is based on the masked alignment [73] of 7,603,547 high coverage (i.e. having < 1% undefined bases (Ns)) SARS-CoV-2 genome sequences (Additional file 1: Fig. S1).
Introductions SARS-CoV-2 sequence metadataMetadata including the date and country of sample collection for all GISAID sequences was downloaded on 4 June 2022. We have employed the GISAID nomenclature to attribute samples to countries.
SARS-CoV-2 confirmed casesThe number of daily confirmed SARS-CoV-2 cases for NI and RoI were downloaded from the coronavirus (COVID-19) in the UK dashboard [74] and Ireland’s COVID-19 Data Hub [75] respectively.
Passengers arriving to Ireland by airOur study analysed aviation data pertaining to passengers travelling to Ireland between March 2020 and June 2022 as sourced from NISRA [76] for NI and CSO [77] for RoI. While CSO data for RoI was directionally resolved, allowing raw use of arriving passenger numbers, monthly arrival passenger populations for NI were estimated by halving the published passenger flow values (i.e. \(\text = \frac}\)).
Passengers arriving to Ireland by seaTo monitor the arrival of short sea ferry passengers travelling from Great Britain to the island of Ireland, we considered routes originating from Cairnryan (Scotland), Fishguard (Wales), Holyhead (Wales), Liverpool (England), or Pembroke (Wales) and terminating at Belfast (NI), Dublin (RoI), Larne (NI), or Rosslare (RoI). We also compiled data on international sea passenger routes to Ireland, which included routes departing from Bilbao (Spain), Cherbourg (France), and Rosscoff (France). These routes led to Cork, Dublin, and Rosslare in RoI. To obtain the sea passenger numbers for RoI, we used data from CSO. Specifically, we utilised the sum of the air and sea monthly directionally resolved passenger data [78] and then subtracted the corresponding figures for air travel [77]. For NI, sea passenger data, which derives from the forms available publicly (SPAS0107 and SPAS0201) [79], was obtained upon request from the UK Department for Transport Aviation and Maritime Analysis division. This data includes only passengers between Great Britain and NI, excluding Crown Dependencies like the Isle of Man, and encompasses both leisure passengers and Heavy Goods Vehicle (HGV) freight drivers.
Population density for government districts in IrelandPopulation data within RoI and NI were obtained from CSO [2] and NISRA [1], respectively. To compute the population densities, the population within each government district was divided by the corresponding geographic area of the district, as sourced by Ordnance Survey of Northern Ireland (OSNI) for NI [80] and Tailte Éireann for RoI [81]. See Additional file 1: Tab. S1 for the values used in the computation of population density.
Deprivation for government districts in IrelandThe population data of subgroups in NI are primarily available at the level of Super Output Areas (SOAs). These areas were developed by NISRA with the aim of improving small-area statistics, and they were first produced for the 2011 census outputs [82]. Multiple deprivation measures in SOAs are presented as rankings, with SOAs in NI ranked from 1 (most deprived) to 890 (least deprived) [83]. Ranked-based data is deemed unsuitable for measurements; thus, an indirect calculation of the level of deprivation is necessary using alternative data such as income level.
SOA data can be disaggregated into grid cells using the dasymetric technique [84]. As such, this method can be applied to disaggregate deprivation data (e.g. income level) from the SOA level into grid cells, generating grid-based deprivation data across NI. These outcomes can be subsequently merged with administrative areas to estimate the level of deprivation within those regions.
In NI, the proportion of the population residing in households with equivalised income below 60% of the NI median [83] at the SOA level is utilised as a metric for measuring deprivation (%). For this analysis, WorldPop data for the entirety of NI in 2020 is also utilised. The dataset is obtainable at a granular level of grid cells (100 m cell at the Earth’s equator) for both the UK and Ireland, with the units representing the number of individuals per cell [85]. To determine the number of individuals inhabiting each SOA (2020), population grid cells are utilised. Subsequently, the % of individuals living in deprivation is converted into the ‘number of deprived individuals in SOA’, and a straightforward algebraic expression is employed to devise a weighted overlay technique for disaggregating the number of deprived individuals into grid cells:
$$\begin \frac}} \times \text \end$$
The results, which represent the number of deprived individuals in grid cells, are superimposed onto administrative boundaries to estimate the number of deprived individuals within each area. Subsequently, these figures are transformed into a percentage of deprived individuals within administrative areas, utilising total population data in each SOA.
Deprivation data (deprivation score) for RoI is accessible at the administrative level of county councils [86]. It is crucial to note that, in this study, the calculation of deprivation in NI and RoI is based on distinct methodologies, and therefore, direct comparisons between the two cannot be made.
Country centroid positionsWorld country centroid positions were downloaded from gavinr’s GitHub repository ‘World Countries - Centroids’ [87]. In addition to these, centroid positions were added individually for England (latitude (Lat): 52.561928 °N, longitude (Lon): 1.464854 °W), Scotland (Lat: 56.816738 °N, Lon: 4.183963 °W), Wales (Lat: 52.33022 °N, Lon: 3.766409 °W), and NI (Lat: 54.607577 °N, Lon: 6.693145 °W) [88] as well as for Taiwan (Lat: 23.973861 °N, Lon: 120.982 °E) [89], Hong Kong (Lat: 22.3453 °N, Lon: 114.1372 °E) [90], Réunion (Lat: 21.114444 °S, Lon: 55.5325 °E) [91], Canary Islands (Lat: 28.47707 °N, Lon: 15.745798 °W) [92], Palestine (Lat: 31.4 °N, Lon: 35 °E) [93], Crimea (Lat: 45.25 °N, Lon: 34.6 °E) [94], and Kosovo (Lat: 42.55 °N, Lon: 20.85 °E) [95].
Irish SARS-CoV-2 genomes and metadata for substitution rate estimationThe entire GISAID FASTA and metadata databases were downloaded on 18 February 2023 to analyse population-wide SARS-CoV-2 genomics in Ireland. These data were then filtered to extract Irish samples (i.e. country=‘Ireland’ or country=‘NorthernIreland’) using augur filter v21.0.1 [96, 97].
Identifying and removing tree outliersTo identify samples that potentially have incorrect collection dates in their associated metadata or deviate substantially from their expected position in the tree under a molecular clock, the GISAID phylogenetic tree and sample collection dates were input to Chronumental v0.0.50 [98, 99] to estimate a time-tree from the phylogeny. The difference between sample collection dates and the date predicted by Chronumental was calculated and a z-score (i.e. the statistical standard score) was calculated for each sample. Samples with a z-score greater than + 3 or less than − 3 were excluded from further analysis. These z-scores equate to roughly a ± 60 days difference in date. This step excluded 0.6% (46,278/7,603,547) of samples in the tree (Additional file 1: Fig. S2).
Creating subtreesGoTree v0.4.3 [100, 101] was used to prune the GISAID phylogenetic tree to subtrees used in subsequent analyses.
Filtering Northern Irish sample metadataDue to inconsistencies between country and finer-grain location metadata, some NI samples were suspected to be erroneously labelled as being from NI. Inclusion of these samples could overestimate the number of identified introductions. Samples with an NIRE prefix identifier (ID) (e.g. NIRE-000107) should all be collected and sequenced in NI and correctly attributed a country location of NI. Samples with NIRE prefix IDs comprise 84.3% of NI samples in the phylogenetic tree, while RAND IDs comprise 8.2%, NORT IDs 2.7%, and 15 other prefixes total 4.8% of NI samples. Therefore, identified NI introductions were filtered to consider introductions that consisted of NIRE IDs only. This excluded 5,427 samples.
Reconstructing ancestral location of lineagesTo infer ancestral locations of samples, a phylogenetic tree and the country of each sample in the tree was provided to PastML. Locations, at the level of countries, were inferred for each ancestral node of the phylogenetic tree (internal points between tips and the root of the tree that represent an ancestral state) using the DELTRAN (delayed transformation) method [102] in PastML v1.9.34 [103, 104]. This method prioritises minimising state ambiguities by implementing character changes as close to the tips of the phylogenetic tree as possible, thus favouring parallel mutations. Additionally, the –resolve_polytomies PastML option was enabled; this optional algorithm iteratively identifies nodes with more than two children (‘polytomies‘) and introduces new internal nodes to group children by their predicted ancestral location states, ensuring the tree structure more accurately reflects evolutionary relationships.
To assess the robustness of our selected ancestral state reconstruction method, we conducted supplementary analyses comparing the DELTRAN maximum parsimony method with TreeTime [105], a maximum likelihood method. Comparisons of these different ancestral state reconstruction approaches are presented (Additional file 1: Fig. S3), demonstrating consistency in introduction allocations among the methods tested. Detailed results are provided in within Additional file 1 (i.e. Tables S2–S7 may be cross-referenced with Tables S8–S13 for a comprehensive comparison).
Clustering of sequences to identify independent introductions to IrelandDendroPy v4.5.2 [106, 107] was used to traverse the output trees from PastML to identify Irish tips of the tree that were descended from ancestral nodes with a non-Irish country label assigned by PastML. We define all Irish tips linked to an importation event to comprise an introduction cluster.
The following periods were selected to span from prior to the first detected sequence of each major lineage in Ireland until they reached dominance.
Period A-Initial introductionsTo identify initial introductions to Ireland, the tree was pruned to include only sequences from the start of the pandemic up to the end of May 2020 and rooted using Wuhan-Hu-1 (GISAID: EPI_ISL_402125) as an outgroup. Rooting a tree is the process of deciding which part of the tree is the oldest point and gives the direction of evolution moving from the root to the tips of the tree (Fig. 1). An outgroup, a tip or lineage outside the group of tips of interest yet related to the the group, allows the root to be determined. Here, we use one of the earliest SARS-CoV-2 sequences as the outgroup to the rest of sequences in the tree. This tree included 103,316 tips of which 714 are from RoI and 633 are from NI (Additional file 1: Fig. S4). Ancestral location of lineages was inferred by PastML as above. Location could not be resolved for 64 nodes (0.05%) and 2,053 new internal nodes were created by resolving polytomies.
Period B-B.1.177To identify Pango [68] lineage B.1.177 introductions to Ireland, the tree was pruned to include only B.1.177.* sequences (and renamed sublineages of B.1.177.*: AA.1, AA.2, AA.3, AA.4, AA.5, AA.6, AA.7, AA.8, Z.1, Y.1, W.1, W.2, W.3, W.4, V.1, V.2, U.1, U.2, U.3) from the first non-outlier occurrence of B.1.177 to the end of October 2020 and rooted using Wuhan-Hu-1 (GISAID: EPI_ISL_402125) as an outgroup (Additional file 1: Fig. S5). This tree included 31,869 tips of which 215 are from RoI and 235 are from NI. Ancestral location of lineages was inferred by PastML as above. Location could not be resolved for 4 nodes (0.01%) and 364 new internal nodes were created by resolving polytomies.
Period C-B.1.1.7 - AlphaTo identify Pango lineage B.1.1.7 introductions to Ireland, the tree was pruned to include only B.1.1.7 sequences (and renamed sublineages of B.1.1.7.*: Q.1, Q.2, Q.3, Q.4, Q.5, Q.6, Q.7, Q.8) from the first non-outlier occurrence of B.1.1.7 to the end of February 2021 and rooted using Wuhan-Hu-1 (GISAID: EPI_ISL_402125) as an outgroup (Additional file 1: Fig. S6). This tree included 183,092 tips of which 2850 are from RoI and 461 are from NI. Ancestral location of lineages was inferred by PastML as above. Location could not be resolved for 45 nodes (0.02%) and 2181 new internal nodes were created by resolving polytomies.
Period D-B.1.617.2/AY - DeltaTo identify Delta introductions to Ireland, the tree was pruned to include only B.1.617.2/AY sequences from the first non-outlier occurrence to the end of July 2021 and rooted using Wuhan-Hu-1 (GISAID: EPI_ISL_402125) as an outgroup (Additional file 1: Fig. S7). This tree included 512,420 tips of which 4923 are from RoI and 1613 are from NI. Ancestral location of lineages was inferred by PastML as above. Location could not be resolved for 287 nodes (0.05%), and 6363 new internal nodes were created by resolving polytomies.
Period E- B.1.1.529/BA.1 - OmicronTo identify Omicron introductions to Ireland, the tree was pruned to include only B.1.1.529/BA.1 sequences from the first non-outlier occurrence to the end of January 2022 and rooted using Wuhan-Hu-1 (GISAID: EPI_ISL_402125) as an outgroup (Additional file 1: Fig. S8). This tree included 1,313,634 tips of which 7888 are from the RoI and 3615 are from NI. Ancestral location of lineages was inferred by PastML as above. Location could not be resolved for 497 nodes (0.03%) and 22,546 new internal nodes were created by resolving polytomies.
Period F- BA.2 - OmicronTo identify BA.2 introductions to Ireland, the tree was pruned to include only B.2 sequences from the first non-outlier occurrence to the end of March 2022 and rooted using Wuhan-Hu-1 (GISAID: EPI_ISL_402125) as an outgroup (Additional file 1: Fig. S9). This tree included 644,494 tips of which 1235 are from RoI and 5136 are from NI. Ancestral location of lineages was inferred by PastML as above. Location could not be resolved for 338 nodes (0.05%) and 12,722 new internal nodes were created by resolving polytomies.
Downsampling tips from EnglandTo determine the effect of the relative over-representation of English samples in the tree on the high level of English introductions identified to Ireland, we randomly downsampled tips from England in the tree. Between 10 and 90% of English samples were randomly removed from the tree in 10% increments, with three replicates at each increment. Downstream analysis to identify introductions to Ireland was conducted as before.
Downsampling tips from IrelandTo determine the effect of the level of sequencing in Ireland on the number and size of Irish introductions for the island as a whole and for each jurisdiction, we randomly downsampled tips from NI, RoI, and Ireland in the tree. Between 10 and 90% of samples from Ireland were randomly removed from the tree in 10% increments, with one replicate at each increment. Downstream analysis to identify introductions to Ireland and clusters size was conducted as before.
Estimation of total vs. identified introductionsA goal of this study was to evaluate the number of genetic sequences that would be required to identify all SARS-CoV-2 introductions to Ireland. To determine the additional number of introductions (\(I\)) that could be identified by analysing more sequences (\(S\)), we utilised ordinary least-squares (OLS) linear regression function scipy.stats.linregress from the Python library SciPy [108].
For each period, let \(\mathbb \) be the total number of available sequences and \(\mathbb \) be the corresponding number of introductions to Ireland identified. During downsampling, it is appropriate to select \(N\) (e.g. 10) evenly spaced downsampled sequencing values \(s_i\) such that
$$\begin s_i = i\bullet \frac} \quad \text \ i=1,2,\ldots ,N, \end$$
which yields an equivalent array
where \(\Delta S \equiv \frac}\).
Through performing randomised downsampling of Irish tips in the phylogeny to each of the levels in array \(S\) and detecting introductions for each, a corresponding array for introductions, \(I\), will be obtained:
where \(I_1\) is the introductions predicted by downsampling to level \(s_1\), etc.
A numerical difference array (with size of \(N-1\)) can be calculated from \(I\):
which enumerates the successively decreasing additional introductions identified upon increasing sequencing by \(\Delta S\).
We suppose that these diminishing returns of increased sequencing will continue to hold, such that at a sequencing saturation point \(\mathbb _}\), all predicted introductions, \(\mathbb _}\), would be identified and there would be limited additional value to increasing sequencing levels.
To estimate \(\mathbb _}\), the sequencing level required to predict all introductions, we perform OLS linear regression to assess the relationship:
$$\begin\Delta I = m \bullet S + c,\end$$
where \(m\) represents the slope of the linear regression line and \(c\) represents the intercept of the linear regression line. Note that as \(\Delta I\) has one element fewer than \(S\), we must prepend an element to \(\Delta I\), such that the first element contains the value \(I_1\), as the number of additional introductions identified from zero sequencing to the level of \(s_1\) is \(I_1\).
To apply the regression equation for extrapolation, values for \(S\) greater than the number of sequences obtained (\(\mathbb \)) may be input to compute new values for \(\Delta I\). Therefore, increased sequencing will provide additional introductions until \(\Delta I = 0\), where \(S=\mathbb _}\), which is the intersection of the regression line with the \(S\) axis: \(\mathbb _} = \frac\) :
$$\begin \Delta I = 0 = m \bullet \mathbb }} + c \rightarrow \mathbb _}=\frac, \end$$
thereby creating an upper bound for the extrapolation of \(S\).
To calculate \(\mathbb _}\), extrapolated \(S\) values (\(S_}\)) are firstly generated in an array from \(\mathbb +\Delta S\) to \(\mathbb _}\):
Inputting these (\(S_}\)) values into the regression equation, we similarly obtain \(\Delta I_}\):
$$\begin\Delta I_} = m \bullet S_} + c.\end$$
Each extrapolated \(\Delta I\) value within \(\Delta I_}\) is summed with \(\mathbb \) to predict \(\mathbb _}\), the total number of introductions to Ireland within the period.
Note that our analysis assumes that the number of additional introductions is directly proportional to the total number of sequences. However, this assumption may not hold true, and future research in this area will need to be undertaken to determine the ideal functional form with which to perform such extrapolations.
Geospatial mappingGeoPandas v0.11.1 [109, 110] was used to facilitate geospatial visualisations of SARS-CoV-2 introductions to Ireland. As source data, two geospatial vector data files (i.e. shapefiles) were downloaded from the GADM Database of Global Administrative Areas [111], which aims to provide high-quality maps of the administrative areas of all countries, at all recognised levels of sub-division to be utilised freely for non-commercial use. The shapefile gadm41_IRL_1.shp as found within the IRL shapefile package [112] was used to map the counties of RoI, while gadm41_GBR_2.shp as found within the GBR shapefile package [113] was used to map the local government districts (LGDs (2014) [114]) of NI. The geographic resolution of RoI county and NI LGD is identical to the metadata fields accessible in GISAID or COG-UK metadata, allowing each sample to be associated to a distinct geographic region within Ireland. To create a map for the entire island of Ireland, NI was extracted from the shapefile containing the entirety of the UK and was merged with the shapefile containing RoI. To annotate the maps, Troendle–Rice–Simpson–Skvortsov 2-letter abbreviation codes were devised for each local government district as indicated in Additional file 1: Tab. S1. Samples with incomplete or ambiguous geographic metadata were excluded from geospatial mapping, resulting in 95% (31,529/33,131) of the Irish sequences being retained for these analyses.
Spatial statistical analysis of Geographic Information System (GIS) dataSpatial statistical analyses of Geographic Information System (GIS) data were automated by the Spatial Statistics toolbox in the ArcGIS Pro v3.0.2 (Build 3.0.2.36056) software developed by Esri [115]. Geospatial data was input to the Spatial Statistics toolbox in ArcGIS Pro in the form of shapefiles, geodatabases, or feature classes, containing both the attribute and location data of the features being analysed.
The ordinary least squares (OLS) functionalities used within the ArcGIS Pro Spatial Statistics toolbox provide the results of bivariate regression of two independent variables: population density, and deprivation in correlation with the dependent variable: the proportion of Irish tips sequenced in the region linked to introduction and spreading events within the periods studied. The specific statistical measures revealed by the OLS functionalities are p-values, robust p-values, adjusted R2s, variance inflation factors (VIFs), and Koenker (BP) statistic (i.e. Koenker’s studentised Bruesch-Pagan statistic) probabilities. The number of regions used to analyse spatial statistics in RoI was 26 (one per county, with the four administrative areas in the Dublin region—Dublin City, Dún Laoghaire-Rathdown, Fingal, and South Dublin—merged into one, ‘County Dublin’), while for NI, there were 11 regions (one per LGD).
Before conducting bivariate regressions, we performed OLS regression to examine the relationship between population density and deprivation in NI and RoI. The results indicated minimal risk of overadjustment bias [116], as the correlation between population density and deprivation was weak (regression coefficient = 0.00085, p-value = 0.04 for RoI; regression coefficient = 0.002963, p-value = 0.0031 for NI) at the geospatial resolution studied. These correlations are visualised as bivariate sequential choropleth maps (9-class) (Additional file 1: Fig. S10).
SARS-CoV-2 genomic trends in IrelandWe separately aligned each Irish genomic sequence from GISAID with Pango lineages corresponding to those of the six periods studied (N = 140,121) to the Wuhan-Hu-1 reference genome (GISAID: EPI_ISL_402125) using mafft v7.453 [73]. The number of substitutions in each Irish genome compared to the Wuhan-Hu-1 reference genome was determined by calculating the Hamming distance [117], which tallies the minimum number of substitutions required to change the reference sequence into a subject sequence of equal length, while excluding the contributions of insertions, deletions, and Ns in the subject sequence.
To analyse substitution rates per period, we conducted ordinary least squares (OLS) linear regression using the LinearRegression model from scikit-learn v1.1.3 [118, 119]. The data in Additional file 1: Tab. S14 includes error estimates and p-values from regressions that were nearly identical but conducted using the OLS functionality in statsmodels v0.13.5 [120, 121].
留言 (0)