Natural variables separate the endemic areas of Clonorchis sinensis and Opisthorchis viverrini along a continuous, straight zone in Southeast Asia

Study design

We conducted an integrative modelling study using a mix of data sources to map the niche boundaries and model the divergent epidemiology of clonorchiasis and opisthorchiasis in Asia. The integration of multiple data modalities aimed to provide novel insights into the differing ecology and transmission dynamics of these two liver fluke infections. The modelling framework incorporated three main components: (1) a systematic literature review of prior epidemiological studies reporting prevalence and risk factors; (2) compilation of environmental, socioeconomic, and disease burden data from regional databases; and (3) implementation of species distribution modelling algorithms to delineate environmental niches. The models enabled predictive mapping of each disease's ecological niche across Asia based on inferred associations between disease occurrence and environmental conditions.

Data sources

Literature review

We systematically searched major biomedical, regional, and grey literature databases, including PubMed, Scopus, EBSCOhost, Web of Science, Cochrane Library, Cairn, OpenGrey, and Scielo, from database inception through December 2018. The search strategy used Medical Subject Headings (MeSH) terms including ["Clonorchis sinensis", "Clonorchis sinenses", "Clonorchiasis", "Opisthorchis sinensis", "Opisthorchis sinenses", "Liver fluke"] AND ["Asia", "Epidemiology", "Prevalence"], and relevant variants in [All Fields]. Equivalent subject headings and keywords were used for searches in other databases.

Inclusion criteria were cross-sectional surveys, cohort studies, or case–control studies reporting primary prevalence data or risk factors for clonorchiasis and/or opisthorchiasis in Asia. Studies were required to have laboratory diagnostic testing for infection.

Exclusion criteria were case reports, reviews, opinion pieces, policy documents, animal studies, and studies without primary prevalence data. Two independent reviewers screened all titles, abstracts, and full texts for eligibility. Data on prevalence, diagnostics, location, sample size, demographics, and risk factors were extracted from included studies into a standardized form using Zotero version 6.0.31 (The Roy Rosenzweig Center for History and New Media, Fairfax, USA). Any discrepancies were resolved by consensus.

This comprehensive literature search aimed to compile all relevant epidemiological data on clonorchiasis and opisthorchiasis prevalence and risk factors needed to inform model development.

Databases

For C. sinensis infection data in the Mekong River region, we supplemented the literature review by compiling primary data on population infection rates from databases in Vietnam and Guangxi Zhuang Autonomous Region of China. Cross-sectional surveys conducted between 2000 and 2018 in Vietnam were systematically searched to extract geolocated presence/absence data based on faecal egg detection at the survey-point and regional levels. Infection status for C. sinensis was coded as Yes (positive) or No (negative) in the databases.

For Guangxi Zhuang Autonomous Region, China, population-level data on C. sinensis infections were obtained from the 3rd National Survey on Key Parasitic Diseases conducted between 2014 and 2016, which covered 31 provinces (municipalities, autonomous regions) in rural and urban areas of China. Apart from C. sinensis, testing included tapeworms, intestinal protozoa and other key parasitic infections via faecal examination in the sampled population. A stratified cluster sampling method was used, classifying China into 5 endemic zones for C. sinensis and sampling within each zone. All individuals in the selected clusters underwent testing. Stool specimens were examined by the Kato-Katz thick smear technique using two smears per specimen to detect intestinal helminth eggs.

By compiling primary epidemiological records from these standardized national surveys in China and Vietnam, we obtained the geolocated C. sinensis infection data needed to parameterize niche modelling and epidemiological comparisons between the two liver flukes. The original data on O. viverrini infection in endemic countries of Southeast Asia were extracted from the 113 studies and combined with data reported by WHO (Department of Neglected Tropical Diseases, WHO Western Pacific). Details of the systematic review screening are shown in Additional file 1.

Environmental data

We compiled 26 natural, climatic, and socio-cultural predictor variables, including distance to water bodies, elevation, slope, normalized difference vegetation index (NDVI), land cover, 19 bioclimatic variables (Bio1–Bio19), human influence index (HII), and human footprint index (HFP), based on the variables used in Zhao’s study [17] and Zheng’s study [18] for modelling liver flukes and snails. We additionally included the local habit of raw fish consumption as a predictor variable. Our approach was to use a comprehensive set of environmental and socio-economic factors to capture fine-scale differences between the endemic areas of the two species. Many factors may increase overall risk for both species in a similar way, but their specific values can pinpoint geographic boundaries. The machine learning framework, integrated with ecological data, successfully learned these distinct signatures, enabling accurate discrimination for mapping. All databases used for these 26 predictor variables are shown in Table 1.

Table 1 Environmental and climatic variables influencing liver fluke infection

Topographic variables, such as water distance, elevation, slope, NDVI, and land cover, were extracted from the Shuttle Radar Topography Mission (SRTM) at 5 km-resolution (http://srtm.csi.cgiar.org/). Water distance is the Euclidean distance from each grid cell to the nearest water body (lakes, wetlands, and river floodplains), representing proximity to water bodies (in meters). Elevation denotes altitude above mean sea level (in meters). Slope describes the rate of change in elevation. NDVI is an index of green vegetation density ranging from -1 to 1, with values below 0 indicating water, cloud, or snow; values near 0 indicating barren land; and values above 0 indicating vegetation cover of increasing density. Land cover was defined using the Moderate Resolution Imaging Spectroradiometer (MODIS) MCD12Q1 product (https://lpdaac.usgs.gov/products/mcd12c1v006/), aggregated and reprojected to match the 15-class University of Maryland scheme.

Geospatial data layers were extracted for each liver fluke survey location to assess environmental factors associated with C. sinensis and O. viverrini transmission. Univariate comparisons were conducted between survey points for each variable using Mann–Whitney U tests.
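As an illustration of the univariate comparison described above, the following is a minimal Python sketch of a two-sided Mann–Whitney U test. It is not the authors' code (the analyses were run in R): it uses the normal approximation with a tie correction and no continuity correction, and the elevation values are hypothetical.

```python
from itertools import groupby
from math import erfc, sqrt


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample a, approximate p-value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])

    rank_sum_a = 0.0
    tie_term = 0.0
    seen = 0  # number of pooled values already ranked
    for _, group in groupby(pooled, key=lambda item: item[0]):
        group = list(group)
        t = len(group)
        avg_rank = seen + (t + 1) / 2  # tied values share the mean rank
        rank_sum_a += avg_rank * sum(1 for _, label in group if label == 0)
        tie_term += t**3 - t
        seen += t

    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / sqrt(var_u)  # undefined if every value is identical
    p_value = erfc(abs(z) / sqrt(2))
    return u_a, p_value


# Hypothetical elevations (m) at survey points for each species.
elevation_cs = [120, 95, 210, 180, 150, 130]
elevation_ov = [160, 175, 140, 220, 300, 190]
u_stat, p = mann_whitney_u(elevation_cs, elevation_ov)
print(f"U = {u_stat:.1f}, p = {p:.3f}")
```

For small samples an exact test would be preferable to the normal approximation used here.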

Climatic data were obtained from WorldClim v.1.4 at 5 km-resolution (http://www.worldclim.org), interpolated from global weather station data from 1955 to 2000 for China. The 19 bioclimatic variables represent annual trends, seasonality, and limiting factors calculated from monthly temperature and rainfall. These are more biologically meaningful than temperature/rainfall alone.

To represent anthropogenic effects on the environment, we extracted two human influence indices, HII and HFP. HII quantifies direct human pressures on ecosystems using population density, built environments, transportation networks, land use/land cover, and nightlights (https://sedac.ciesin.columbia.edu). HII ranges from 0 to 64, with higher values indicating greater human environmental impacts. HFP shows relative human pressure, with red indicating more intense activity.

Eating habits of raw fish

We included data on raw fish consumption because eating raw or undercooked fish is a well-known risk factor for infection with C. sinensis and O. viverrini. These liver flukes infect humans who consume freshwater fish containing the larval stages of the parasites. This dietary habit is directly related to the transmission dynamics of both parasites, making it a critical factor in understanding the geographical distribution and risk of infection.

To capture where people consume raw fish, raw fish eating habits were recorded by province and municipality in all countries based on the literature review, coded as 1 if the habit was present and 0 if it was absent. We also collected data from affected populations through direct questioning about their dietary habits, specifically the consumption of raw or undercooked freshwater fish in specific municipalities, with the help of local disease control centres and experts consulted in each country. Finally, provincial polygons were rasterized to assign presence across each province.

Assessment and extraction of variable data

The compiled databases were separated into two groups based on human infection with C. sinensis and O. viverrini for comparative analysis. We statistically summarized and mapped the locations of the two parasite infections. For normally distributed continuous variables, means and standard deviations were calculated, with t-tests used to compare groups. Land cover comprised 15 categorical classes. Non-normally distributed variables were summarized using the median and interquartile range (IQR) and compared between groups with non-parametric tests.

All data processing and analyses were conducted in R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria). Variables were assessed for collinearity and eliminated if the variance inflation factor (VIF) exceeded 5. We then used random forest (RF) models to rank predictor importance based on the mean decrease in accuracy when each predictor was excluded from the models. The 10 most important variables for each parasite were retained for further niche modelling. This process filtered the database variables to retain only relevant, non-redundant predictors characterizing the fundamental and realized niches of the two liver flukes under study. It also allowed statistical comparisons to identify similarities and differences in their ecological and environmental constraints. These ‘curated’ variable values provided the inputs for the ensuing distribution modelling and mapping of transmission risk.
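The VIF screening step can be sketched as follows. This is a minimal Python illustration, not the authors' R code: it uses the identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse correlation matrix, and it assumes a stepwise rule (drop the predictor with the largest VIF, then recompute) that the text does not specify. The variable names and values are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [sqrt(sum(x * x for x in c)) for c in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear predictors")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF_j is the j-th diagonal entry of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def drop_collinear(named_columns, threshold=5.0):
    """Assumed stepwise rule: remove the worst predictor until all VIF <= threshold."""
    kept = dict(named_columns)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda i: scores[i])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return kept


# Hypothetical predictor values; "bio1_copy" nearly duplicates "bio1".
predictors = {
    "bio1": [1, 2, 3, 4, 5, 6],
    "bio12": [3, 1, 2, 6, 4, 5],
    "bio1_copy": [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
}
print(sorted(drop_collinear(predictors)))
```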

Model development

We developed predictive models to classify and discriminate human infections of O. viverrini versus C. sinensis based on environmental variables, following the framework Y = f(X). The binary response variable Y indicated O. viverrini (Y = 0) or C. sinensis (Y = 1) infection. The aim was to delineate the potential transition zone where the predicted class probability shifts between 0 and 1, indicating possible co-endemicity. The predictor variables (X) comprised 26 environmental, climatic and socio-cultural factors. To account for class imbalance, we used the SMOTE algorithm from the DMwR package to synthesize additional minority class examples. The models were constructed and evaluated using the Caret package in R. To enable consistent comparison across algorithms and assessment of variable importance, we selected six commonly used machine-learning classification methods to model environmental suitability for O. viverrini and C. sinensis transmission: linear regression (LM), decision trees (DT), neural networks (NNET), RF, gradient boosting machines (GBM) and extreme gradient boosting (XGBOOST). Details on each algorithm can be found in the Caret documentation (https://topepo.github.io/caret/index.html). All models were trained using tenfold cross-validation repeated 5 times, with hyperparameter tuning to optimize model performance. Model fitting performance, prediction accuracy, variable contributions, marginal response plots, and projected distribution maps were analysed and evaluated for each approach.
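The idea behind SMOTE can be sketched in a few lines. This is a simplified Python illustration of the general technique, not the DMwR implementation the authors used: each synthetic point is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours. The function name, the default k, and the sample coordinates are assumptions made for this example.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    minority: list of numeric feature tuples (at least two samples).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class records: (annual mean temperature, elevation).
minority_points = [(24.1, 150.0), (25.3, 90.0), (23.8, 210.0), (24.9, 120.0)]
new_points = smote(minority_points, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two observed samples, the generated values stay within the range of the observed minority data.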

The fitted machine learning models were applied to an independent testing dataset to evaluate generalizability. Liver fluke presence/absence predictions were generated for each testing location and compared to observed outcomes to assess model discrimination. Testing performance was quantified using AUC, accuracy, Kappa value, sensitivity, and specificity metrics.
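The discrimination metrics listed above can be computed from predicted probabilities as in the following minimal Python sketch. It treats C. sinensis (Y = 1) as the positive class; the 0.5 threshold and the toy data are assumptions for illustration only.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, specificity and Cohen's Kappa at a threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (accuracy - expected) / (1 - expected),
    }


def auc(y_true, y_prob):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = C. sinensis, 0 = O. viverrini.
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(classification_metrics(labels, probabilities))
print(auc(labels, probabilities))
```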

We evaluated and compared models based on the area under the receiver operating characteristic curve (AUC), together with sensitivity, specificity, and Cohen's Kappa statistic. The optimal model was selected as the one with the highest cross-validated AUC. This model was then finalized by refitting on the full dataset to generate the final prediction equation.

Model development aimed to maximize discrimination accuracy in predicting O. viverrini versus C. sinensis infections based on ecological and environmental factors relevant to their transmission dynamics and geographic distributions. The resulting model could then be applied to mapping transmission risk and predicting changes under climate change scenarios. Model development and validation followed a rigorous workflow for tuning, testing and application. The compiled database of values for the two infections was randomly split into a training set (70% of the data) for model calibration and a testing set (30% of the data) for independent evaluation. The training data underwent fivefold cross-validation, whereby the data were divided into 5 equal partitions. In each fold, models were fitted on 4 partitions and predictions generated for the held-out partition. This process was repeated, holding out each partition in turn, to identify the optimal hyperparameters that minimized the cross-validation error. Cross-validation was used because it helps guard against overfitting and provides a more realistic estimate of performance on new data.
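The partitioning scheme described above can be sketched as follows. This minimal Python illustration is not the authors' R workflow: the function names and the seed are assumptions, and the split is a simple random split without stratification by species.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Randomly split rows into a training and a testing list."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n, k=5):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    folds = [list(range(start, n, k)) for start in range(k)]
    for held_out in folds:
        excluded = set(held_out)
        train = [i for i in range(n) if i not in excluded]
        yield train, held_out


# Hypothetical record identifiers standing in for survey locations.
records = list(range(100))
training, testing = train_test_split(records)
for fit_idx, holdout_idx in k_fold_indices(len(training), k=5):
    # A model would be fitted on training[i] for i in fit_idx and
    # scored on training[i] for i in holdout_idx.
    print(len(fit_idx), len(holdout_idx))
```

Each training record is held out exactly once, so the five validation scores together cover the whole training set.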

Following cross-validation-based tuning, the final models were refit on the full training set using optimal hyperparameters. Model skill was quantified on the training set using the AUC as mentioned above. The tuned models were then applied to the previously held-out testing set to evaluate performance on new data. Variable importance was calculated by excluding each predictor and quantifying loss in testing AUC. Marginal effects of key predictors were generated from the finalized models to quantify variable-outcome relationships. Model predictions were mapped across the study region based on environmental inputs to predict risk areas for each species. Finally, an ensemble approach was taken by integrating predictions across algorithms to leverage model strengths.

Model assessment and prediction

Model calibration was assessed using calibration plots to evaluate agreement between predicted and observed outcomes. Classification metrics including AUC, accuracy, Kappa value, specificity, and sensitivity were calculated at the optimal probability threshold to quantify model discrimination ability. Variable importance was determined using the varImp function in the Caret package, which quantifies the decrease in model AUC with variable exclusion. This approach includes all predictors and ranks importance based on change in performance. Marginal effects of key variables were visualized using partial dependence plots (PDPs) from the pdp package. PDPs show the functional relationship between a predictor and the outcome while accounting for effects of other variables. To reduce computation time, PDPs were generated for the top three important variables.
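The computation behind a partial dependence curve can be sketched in Python as below. This illustrates the general technique rather than the pdp package: for each grid value, the predictor of interest is overwritten in every record and the model's predictions are averaged. The toy model, its coefficients, and the variable names are hypothetical.

```python
from math import exp


def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        predictions = [predict(row) for row in modified]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_model(row):
    """Hypothetical fitted model returning P(Y = 1); coefficients are made up."""
    score = 0.4 * (row["bio1"] - 24.0) - 0.005 * (row["elevation"] - 150.0)
    return 1.0 / (1.0 + exp(-score))


# Hypothetical survey records.
survey_rows = [
    {"bio1": 23.5, "elevation": 200.0},
    {"bio1": 25.0, "elevation": 80.0},
    {"bio1": 24.2, "elevation": 150.0},
]
for value, mean_prob in partial_dependence(
    toy_model, survey_rows, "bio1", [22.0, 24.0, 26.0]
):
    print(value, round(mean_prob, 3))
```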

The finalized models were applied to predict the probability of C. sinensis infection across gridded environmental data in China's Guangxi Zhuang Autonomous Region and the south-eastern Laos, Thailand, Cambodia, and Vietnam regions. Predictions were mapped to visualize the geographic distribution of estimated risk. Any predictions of O. viverrini in Guangxi Zhuang Autonomous Region of China were considered erroneous given known distributions. To delineate species boundaries, we focused on areas of Vietnam and Laos where both species are endemic. Grid cells with a predicted probability of C. sinensis of 100% were classified as high risk for that species. Areas with intermediate probabilities (between 0 and 1) were considered potential hybrid zones with sympatric transmission.
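The grid-cell classification rule can be written out as a short Python sketch. The labels, the numerical tolerance, and the treatment of a probability of 0 as O. viverrini (following the Y = 0 coding of the response variable) are assumptions made for this illustration.

```python
def classify_cell(p_clonorchis, tolerance=1e-6):
    """Label a grid cell from its predicted probability of C. sinensis."""
    if p_clonorchis >= 1.0 - tolerance:
        return "C. sinensis"
    if p_clonorchis <= tolerance:
        return "O. viverrini"
    return "transition zone"


# Hypothetical predicted probabilities for a row of grid cells.
row_of_cells = [1.0, 1.0, 0.83, 0.41, 0.07, 0.0]
print([classify_cell(p) for p in row_of_cells])
```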
