Driving role of climatic and socioenvironmental factors on human brucellosis in China: machine-learning-based predictive analyses

Spatial and temporal distributions of human brucellosis

From 2014 to 2020, 327,456 HB cases were reported in China. Overall, the incidence rate of HB declined from 2014 (57,480 cases, 0.35/100,000 people) to its lowest point in 2018, when 37,467 cases (0.22/100,000) were reported. Thereafter, the incidence increased slightly, and 46,884 cases (0.28/100,000) were reported in 2020. From 2014 to 2020, the average annual incidence of brucellosis was highest in the Inner Mongolia Autonomous Region (3.47/100,000), followed by Ningxia (2.78/100,000), Xinjiang (1.93/100,000), Shanxi (1.13/100,000), and Heilongjiang (1.09/100,000). Fifty cities had an annual average incidence rate greater than 1/100,000, all of them in northern China. Among these cities, the incidence rate ranged from 7.71/100,000 in Tacheng, Xinjiang Uygur Autonomous Region, to 1/100,000 in Chengde, Hebei Province. Other cities with high incidence rates include Xing’an League (7.14/100,000), Xilingol League (6.13/100,000), and Tongliao City (5.50/100,000) in Inner Mongolia; Hami City (5.62/100,000), Altay Region (5.27/100,000), and Changji Hui Autonomous Prefecture (5.11/100,000) in Xinjiang; and Wuzhong City (5.32/100,000) in Ningxia (see Additional file 1). Compared with 2014–2017, the annual average incidence rate of HB in 2018–2020 decreased significantly in some regions of the Qinghai-Tibet Plateau, most regions of Xinjiang, Shaanxi, Shanxi, Henan, and Hebei in central China, and Shandong, Beijing, and Tianjin in the east. However, the incidence rate of brucellosis in eastern Tibet, central Gansu, and most parts of the Inner Mongolia Autonomous Region increased significantly (see Additional file 1).

The results vary widely because of the vast size of the country and its large number of cities. Among prefecture-level cities, the highest incidence of brucellosis was more than 200 times the lowest. By climate region, from 2014 to 2020, the annual average incidence rate of HB was highest in the arid region (1.88/100,000), followed by the continental climate zone (0.47/100,000). The incidence rates of the temperate and tropical climate zones were low, at 0.048/100,000 and 0.003/100,000, respectively. By economic belt, the annual average incidence rate of HB was highest in the northeast economic belt (0.68/100,000), followed by the western economic belt (0.45/100,000); the incidence rates of the central and eastern economic belts were relatively low (0.18/100,000 and 0.14/100,000, respectively; Fig. 1).

Fig. 1

Spatial distribution of brucellosis in China by a climatic and b economic zones. Incidence rates are calculated for 2014–2020 per 100,000 people. The purple line is the Qinling Mountains-Huaihe River line, which divides northern and southern China

Most cases occur from March to August each year, with a peak in May. Because the incidence of brucellosis is significantly higher in northern China than in southern China, we analyzed the two regions separately in the temporal distribution study. From 2014 to 2020, the incidence rate of HB in northern China was 0.65/100,000, much higher than that in southern China (0.02/100,000). The results are shown in Fig. 2. Northern and southern China showed opposite trends: the incidence of brucellosis decreased year by year in northern China and increased year by year in southern China. Note that incidence rates in the North are, on average, approximately 40 times higher than those in the South. As a result, the relative upward trend in the South is greater than the relative downward trend in the North, although the slope of the trend line is the same. Accordingly, the average incidence rate in the North decreased by approximately 20% from 2014 to 2020, whereas that in the South increased by nearly 100%.
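
The effect of the baseline on relative change can be checked with simple arithmetic. The sketch below uses hypothetical baselines and a hypothetical slope (chosen only to reflect the roughly 40-fold gap between North and South, not taken from the study data) to show that the same absolute slope produces a much larger percentage change on the smaller baseline.

```python
def relative_change_percent(start, slope_per_year, years):
    """Relative change (%) of a linear trend after `years` years."""
    if start == 0:
        raise ValueError("start must be non-zero")
    return slope_per_year * years / start * 100.0


# Hypothetical values: a 40-fold gap in baseline, same absolute slope.
north_start, south_start = 0.80, 0.02   # incidence per 100,000 (illustrative)
slope = 0.003                           # absolute change per year (illustrative)
years = 6                               # 2014 -> 2020

north_change = relative_change_percent(north_start, -slope, years)
south_change = relative_change_percent(south_start, +slope, years)

print(f"North: {north_change:+.2f}%  South: {south_change:+.2f}%")
```

With equal slopes, the ratio of the two percentage changes equals the ratio of the baselines, which is why a visually similar trend line corresponds to a far larger relative increase in the South.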

Fig. 2

Temporal distribution of brucellosis in a northern and b southern China, divided by the Qinling Mountains-Huaihe River line. Incidence rates are calculated for 2014–2020 in units per 1 million people. The black line is the trend line

Correlation and seasonality between brucellosis and climate

Before modeling, we performed a correlation analysis of the data. None of the climatic, socioenvironmental, or brucellosis data satisfied the normality condition (see Additional file 1). Therefore, we excluded the Pearson correlation and used the Spearman and Kendall rank correlations instead. The results are shown in Fig. 3. Taking the MAI of brucellosis as the reference, 60% of the weather variables were negatively correlated with it and 40% were positively correlated. There was clear collinearity among the individual weather variables, with some of their mutual correlation coefficients being larger in absolute value than their correlations with the MAI. Among the weather factors, only MAS and MAH had Spearman correlation coefficients above 0.5 in absolute value, which lies within the moderate correlation interval; these two factors were also more influential than the others in the subsequent modeling analysis.
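
As an illustration of the two rank correlations named above, the following is a minimal standard-library sketch, not the authors' code. The function names and the sample values are hypothetical; Kendall's coefficient is computed in its tau-b form, which is an assumption because the text does not say which variant was used.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, adjusted for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue                        # tied in both variables
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )


# Hypothetical monthly humidity (%) and incidence values, for illustration only.
humidity = [40.0, 45.0, 52.0, 60.0, 66.0]
incidence = [10.0, 8.0, 6.0, 5.0, 1.0]
print(spearman(humidity, incidence), kendall_tau_b(humidity, incidence))
```

In practice, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` provide the same coefficients together with P values.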

Fig. 3

Correlation between climatic factors and incidence of brucellosis. Correlation coefficients and heat map matrices for climatic factors and incidence of brucellosis. a Spearman correlation, and b Kendall correlation. An asterisk (*) in the heat map represents \(P<0.05\), indicating that the corresponding correlation is statistically significant. MAP refers to monthly average precipitation, MAS refers to monthly average sunshine, MAH refers to monthly average humidity, MAWS refers to monthly average wind speed, MAT refers to monthly average temperature and MAI refers to monthly average incidence

The incidence of brucellosis was distinctly seasonal (Fig. 4), with a high incidence in spring and summer. Overall, the average quarterly incidence rates ranked, from lowest to highest, winter, fall, spring, and summer. The four seasons did not show a wide disparity, with a difference of approximately 30% in the incidence rate per 1 million persons. Zhangjiakou City, Hebei Province, had the highest incidence rate among prefecture-level cities in the eastern region, far surpassing all other cities there. Except for Zhangjiakou City, the incidence rates of the top 10 prefecture-level cities in the eastern region are slightly lower than those in the central and northern regions and far lower than those in the western region (average incidence rate per million people: 15.03 in northern China, 30.23 in western China, 13.53 in central China, 4.83 in eastern China), which is consistent with the distribution of animal husbandry in China.

Fig. 4

Incidence of brucellosis in different geographical regions of China by season between 2014 and 2020. The top 10 prefecture-level cities by average incidence of brucellosis in each of the 4 geographic regions, defined using economic division criteria, are presented in the figure. Colored dots represent the quarterly average incidence of brucellosis between 2014 and 2020

Classical statistics and SARIMAX prediction models

The weather and brucellosis data used in this study were compiled monthly, and the social and environmental data were compiled annually; all were time series spanning 6 years (see Fig. 2). The data did not strictly pass the Kolmogorov–Smirnov, Shapiro–Wilk, and Jarque–Bera normality tests (see Additional file 1). However, because the absolute value of the kurtosis was less than 10 and the absolute value of the skewness was less than 3, the data were treated as approximately normally distributed, although they were not strictly normal. Many models were built and screened based on the statistical nature and seasonality of the climatic, socioenvironmental, and brucellosis data. The indicators of the best-performing models are shown in Table 1. The outputs of these traditional statistical regression models were monthly brucellosis cases, and the input variables were climatic and socioenvironmental data.
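
The skewness and kurtosis screen described above can be sketched as follows. The thresholds (absolute skewness below 3, absolute kurtosis below 10) come from the text; the function name and sample values are hypothetical, and the use of population moment estimators and of excess kurtosis (kurtosis minus 3) is an assumption, since the source does not state which definitions were applied.

```python
def approx_normal_screen(values, max_abs_skew=3.0, max_abs_kurt=10.0):
    """Return (skewness, excess kurtosis, passes_screen) for a sample.

    Uses population (biased) central moments; this choice is an assumption.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("sample has zero variance")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    passes = abs(skew) < max_abs_skew and abs(excess_kurt) < max_abs_kurt
    return skew, excess_kurt, passes


# Hypothetical monthly case counts, for illustration only.
monthly_cases = [12, 15, 30, 42, 55, 48, 40, 33, 20, 14, 11, 10]
skew, kurt, ok = approx_normal_screen(monthly_cases)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}, accepted={ok}")
```

This screen is far more permissive than a formal normality test, so a series can fail Shapiro–Wilk and still pass it, which matches the situation described in the text.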

Table 1 Classical statistical model performance summary

Table 1 shows that none of the traditional statistical regression models fit the data very well. Stepwise, ridge, and robust regression performed similarly, with ridge regression additionally able to handle collinear data. The PLS regression model had the best GOF among these models but cannot handle data collinearity, which makes its results less objective. Although all models are significant, none of the adjusted \(R^2\) values exceeds 0.6, so the models are unsuitable as predictive models.

Machine-learning models may exhibit better analytical performance than classical statistical regression models. SARIMAX is a machine learning model suitable for seasonal time-series forecasting with exogenous variables. For model parameterization, we first inspected the brucellosis data to determine the seasonal period, \(S=12\), and found through the automatic optimization algorithm that the model \(p, d, q=(\mathrm,1)\) was the most appropriate for all input variables. Subsequently, we used the seasonal decomposition sequence diagram to determine the \(P, D, Q\) values for the different input variables. After obtaining these results, we used AIC, BIC, and HQIC to screen for the optimal model. The results of applying the models to the dataset used in this study are shown in Fig. 5 and Table 2.
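
The screening step can be illustrated with the standard definitions of the three criteria: AIC = 2k − 2 ln L, BIC = k ln n − 2 ln L, and HQIC = 2k ln(ln n) − 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The sketch below is not the authors' pipeline: the candidate names, log-likelihoods, and parameter counts are hypothetical, and n = 60 assumes the 60 monthly observations of the 2015–2019 training set.

```python
from math import log


def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC, and HQIC for a fitted model (lower is better)."""
    return {
        "AIC": 2 * n_params - 2 * log_likelihood,
        "BIC": n_params * log(n_obs) - 2 * log_likelihood,
        "HQIC": 2 * n_params * log(log(n_obs)) - 2 * log_likelihood,
    }


# Hypothetical candidates: name -> (maximized log-likelihood, parameter count).
candidates = {
    "model_A": (-120.0, 3),
    "model_B": (-118.5, 5),
    "model_C": (-119.8, 4),
}
n_obs = 60  # assumed: monthly data, 2015-2019 training set

scores = {
    name: information_criteria(ll, k, n_obs)
    for name, (ll, k) in candidates.items()
}

# Pick the candidate with the lowest value of each criterion.
best = {
    crit: min(scores, key=lambda name: scores[name][crit])
    for crit in ("AIC", "BIC", "HQIC")
}
print(best)
```

All three criteria penalize extra parameters, with BIC penalizing most heavily at this sample size, so they can disagree when a richer model fits only slightly better. In statsmodels, fitted SARIMAX results expose these values as the `aic`, `bic`, and `hqic` attributes.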

Fig. 5

Predicted MAI of brucellosis in a Baicheng (in Northeast China), b Datong (in Central China), c Jinchang (in Western China), and d Zhangjiakou (in Eastern China) based on SARIMAX model. The four prefecture-level cities in the figure are the cities with the highest average incidence of brucellosis among the four major economic regions in China that are used as typical data for analysis. The data from 2015 to 2019 was used as the model training set, and the data from 2020 was the prediction set. The black line represents the data as a comparison in the prediction set. The colored lines represent the SARIMAX results after different climatic data are input as exogenous variables

Table 2 SARIMAX model performance summary

The five types of climatic data entered as exogenous variables in the SARIMAX model had different effects on the prediction results. The standard errors of MAP, MAS, MAH, MAWS, and MAT in the prediction model of prefecture-level cities in the four geographical regions were 0.10725, 0.1145, 0.35325, 4.8195, and 1.1295, respectively. The prediction models using MAP, MAS, and MAH performed much better than those using the other two climatic variables and gave more accurate results, consistent with the findings illustrated in Fig. 6. Most of the SARIMAX prediction models that we constructed passed the white noise test and proved to be non-autocorrelated. The results of all models satisfied the normal distribution and showed no heteroscedasticity.

Fig. 6

Copula a three-dimensional contours and b two-dimensional joint distribution of sunshine and humidity. The range of all axes is 0–1, representing probability values 0–100%. MAS refers to monthly average sunshine and MAH refers to monthly average humidity

Figure 5c shows a significant deviation in the 2020 forecast for Jinchang, Gansu Province, which appears in the models for all climatic inputs. The prediction model failed because of an unexpected brucellosis outbreak in Jinchang in the summer of 2020. The average monthly incidence in July 2020 was the highest recorded in 2014–2020, nearly twice the second-highest value.

Copula extreme weather model

The five types of climatic data used in this study had significant covariance, and the rank correlation coefficients between them are shown in Fig. 3. Such interdependence can affect the accuracy and objectivity of the results in extreme weather analysis. For the extreme weather analysis, we therefore selected sunshine and humidity, the two climatic variables that combined high performance as predictive model inputs with high rank correlation coefficients.

We first performed a copula modeling analysis of the overall data, regardless of region and period, to select the marginal and joint distribution functions. The results are presented in Table 3 and Fig. 6. Second, we performed year-by-year modeling of the data regardless of region, and the results were not significantly different. The model performance is presented in Table 3, and the joint distribution figures are shown in Additional file 1. Based on these results, we performed a copula modeling analysis of the year-by-year climatic data from different climatic regions and explored the correlation between extreme weather and brucellosis incidence using the quantile threshold method. The results are shown in Fig. 7.

Table 3 Sunshine and humidity copula model performance, 2014–2020

Fig. 7

Trends in the sunshine and humidity extremes and incidence of brucellosis in a arid and continental, b temperate and tropical climatic zones after copula-processing. The differences were normalized because of order-of-magnitude gaps, which may make the presentation in the figure less clear; \(r_s\) represents the Pearson correlation coefficient for the year-to-year difference between sunshine and the incidence of brucellosis in the corresponding climate zone; \(r_h\) represents the same coefficient for humidity. Pearson correlation was selected because the data all conform to a normal distribution (see Additional file 1); the coefficients were not tested for statistical significance because the amount of data is too small (n = 6) to qualify for the test

Table 3 shows that the most suitable marginal distribution function for sunshine and humidity is the Weibull distribution and that the most suitable copula joint distribution function is the Frank copula. These results hold in all years. The strong performance parameters indicate that the copula model is effective in eliminating the covariance between the climatic data, so the influence of other weather factors can be excluded in subsequent analyses of single-factor extreme weather.
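
To make the model structure concrete, the sketch below combines two Weibull marginals with a Frank copula using their textbook closed forms: the Weibull CDF \(F(x)=1-e^{-(x/\lambda)^k}\) and the Frank copula \(C_\theta(u,v)=-\frac{1}{\theta}\ln\left(1+\frac{(e^{-\theta u}-1)(e^{-\theta v}-1)}{e^{-\theta}-1}\right)\). It is not the fitted model from Table 3: every shape, scale, and \(\theta\) value here is a hypothetical placeholder, as are the function names.

```python
from math import exp, log


def weibull_cdf(x, shape, scale):
    """Weibull cumulative distribution function; 0 for non-positive x."""
    if x <= 0:
        return 0.0
    return 1.0 - exp(-((x / scale) ** shape))


def frank_copula_cdf(u, v, theta):
    """Frank copula C(u, v) for theta != 0 and u, v in [0, 1]."""
    if theta == 0:
        raise ValueError("theta must be non-zero (theta -> 0 is independence)")
    numerator = (exp(-theta * u) - 1.0) * (exp(-theta * v) - 1.0)
    return -log(1.0 + numerator / (exp(-theta) - 1.0)) / theta


def joint_cdf(sunshine, humidity, sun_params, hum_params, theta):
    """P(S <= sunshine, H <= humidity) under Weibull marginals + Frank copula."""
    u = weibull_cdf(sunshine, *sun_params)
    v = weibull_cdf(humidity, *hum_params)
    return frank_copula_cdf(u, v, theta)


# Hypothetical parameters (shape, scale) and dependence strength.
sun_params = (2.0, 7.0)     # sunshine hours, illustrative
hum_params = (4.0, 60.0)    # relative humidity in %, illustrative
theta = -3.0                # negative theta encodes negative dependence

p = joint_cdf(6.0, 55.0, sun_params, hum_params, theta)
print(f"Joint probability of sunshine <= 6.0 and humidity <= 55.0: {p:.4f}")
```

The Frank copula allows both positive and negative dependence through the sign of \(\theta\), which makes it a convenient family when the direction of dependence between two weather variables is not fixed in advance.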

We conducted a correlation analysis between the copula-processed sunshine and humidity data, classified using the quantile threshold method, and the year-to-year difference in the incidence of brucellosis. The results showed a clear negative correlation between sunshine and humidity extremes above the 75th percentile and the variation in the incidence of brucellosis. For the sunshine data, a moderate-to-high negative correlation was found in the arid and temperate climatic zones. For the humidity data, a high negative correlation was found in the arid, temperate, and tropical climatic zones.
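
The quantile threshold step and the correlation of year-to-year differences can be sketched as below. This is an illustrative reading of the method, not the authors' code: it assumes that an extreme is a value above the pooled 75th percentile, that extremes are counted per year, and that the copula processing has already been applied upstream. All data values and function names are hypothetical.

```python
from math import sqrt


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def year_differences(series):
    """Differences between consecutive years (7 years give n = 6)."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical sunshine values per year (four observations each).
sunshine_by_year = {
    2014: [5.0, 7.0, 9.5, 6.0],
    2015: [5.2, 9.2, 9.8, 6.1],
    2016: [4.9, 7.5, 8.0, 5.8],
    2017: [5.5, 9.0, 9.9, 9.1],
    2018: [5.1, 7.2, 8.1, 6.3],
    2019: [5.0, 9.3, 9.6, 6.0],
    2020: [5.3, 7.8, 9.4, 6.2],
}
# Hypothetical yearly incidence (per million) for the same climate zone.
incidence = [30.0, 27.0, 33.0, 31.0, 34.0, 29.0, 32.0]

pooled = [v for vals in sunshine_by_year.values() for v in vals]
threshold = percentile(pooled, 75)

# Count observations above the 75th-percentile threshold in each year.
extreme_counts = [
    sum(1 for v in sunshine_by_year[year] if v > threshold)
    for year in sorted(sunshine_by_year)
]

r_s = pearson(year_differences(extreme_counts), year_differences(incidence))
print(f"threshold={threshold:.3f}, extremes per year={extreme_counts}, r_s={r_s:.3f}")
```

With only six differences per zone, a coefficient computed this way describes the direction and rough strength of the association but cannot support a formal significance test, as noted in the caption of Fig. 7.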
