Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach

Source and description of the T2DM-NFHS-4 dataset

Data preparation and pre-processing are key steps when approaching a problem from a machine learning perspective. In this section, we describe the pre-processing approach adopted to generate the T2DM-NFHS-4 dataset.

The NFHS-4 dataset was downloaded from The Demographic & Health Surveys (DHS) Program website. NFHS-4 is the fourth round of the national health survey conducted under the supervision of the Ministry of Health and Family Welfare, Government of India, with the International Institute for Population Sciences (IIPS), Mumbai, serving as the main nodal agency for all the surveys. NFHS-4 followed a stratified two-stage sampling procedure covering all 640 districts of India. The survey was successfully conducted in 601,509 households, in which 112,122 men and 699,686 women were successfully interviewed. Four survey questionnaires (Household Questionnaire, Woman’s Questionnaire, Man’s Questionnaire and Biomarker Questionnaire) were administered in 17 local languages to collect information on basic demographics, socio-economic parameters, family planning issues, nutritional status, health indicators, contact with community health workers, etc. A distinctive feature of NFHS-4 was that it collected data on diabetes status and measured random blood glucose in individuals aged 15–54 years using a finger-stick blood specimen. As a result, the survey included biomarker measurements and tests such as anaemia testing, blood pressure measurement, blood glucose testing and HIV testing, in addition to anthropometric measurements.

Dataset preparation

For dataset preparation and cleaning, three questionnaires were merged: the Woman’s Questionnaire, the Man’s Questionnaire and the Biomarker Questionnaire. The first two contained information about background characteristics (location, age, sex, religion, social group, literacy, wealth status, etc.), nutritional practices, addictions and co-morbidities, while the Biomarker Questionnaire contained information on height, weight, blood pressure and random blood glucose. A unique code was generated for every individual in all three questionnaires by appending the country code and phase, cluster number, household number and line number. The three datasets were joined on this unique code to prepare a single dataset of 810,971 individuals, consisting of all men and women between 15 and 54 years of age. Pregnant women were then excluded to rule out the possibility of Gestational Diabetes Mellitus. Individuals with missing diabetic or blood pressure status were also excluded. Variables known to be risk factors for DM (body mass index (BMI), age, place of residence, wealth index, smoking frequency, alcohol intake frequency, hypertension), socio-economic factors (sex, religion, social group, educational status), dietary frequencies and haemoglobin level were selected for the final analysis. BMI, age and haemoglobin level were treated as continuous variables and the rest as categorical variables. Outliers were removed separately for each of the three continuous variables to obtain the final dataset of 610,498 individuals (526,678 females and 83,820 males).
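The linkage step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the field names (`country_phase`, `cluster`, `household`, `line`), the `-` separator, the placeholder value `XX0` and the toy records are all hypothetical, and dropping individuals without a biomarker row is an assumed join rule that the text does not specify.

```python
# Hypothetical field names; the real survey files use their own variable names.
KEY_FIELDS = ("country_phase", "cluster", "household", "line")


def make_uid(record):
    """Build a person-level code by appending the key fields with a separator."""
    return "-".join(str(record[field]) for field in KEY_FIELDS)


def link_records(women, men, biomarker):
    """Stack the women's and men's records and attach biomarker rows by unique code."""
    biomarker_by_uid = {make_uid(row): row for row in biomarker}
    linked = []
    for sex, table in (("female", women), ("male", men)):
        for record in table:
            uid = make_uid(record)
            bio = biomarker_by_uid.get(uid)
            if bio is None:
                continue  # assumed rule: drop individuals without a biomarker row
            linked.append({**record, **bio, "uid": uid, "sex": sex})
    return linked


# Toy records, invented for illustration only.
women = [{"country_phase": "XX0", "cluster": 1, "household": 12, "line": 2, "age": 34}]
men = [
    {"country_phase": "XX0", "cluster": 1, "household": 12, "line": 1, "age": 40},
    {"country_phase": "XX0", "cluster": 2, "household": 3, "line": 1, "age": 51},
]
biomarker = [
    {"country_phase": "XX0", "cluster": 1, "household": 12, "line": 1, "glucose": 110},
    {"country_phase": "XX0", "cluster": 1, "household": 12, "line": 2, "glucose": 95},
]

linked = link_records(women, men, biomarker)
for row in linked:
    print(row["uid"], row["sex"], row["glucose"])
```

The separator matters: without it, cluster 1 with household 12 and cluster 11 with household 2 would produce the same code.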

Dataset pre-processing

We were interested in detecting significant T2DM sub-populations in the data and further sought to characterize these sub-populations based on their socio-demographic and co-morbid conditions. For this purpose, we extracted patients with a known history of diabetes from the dataset: a total of 10,125 patients. We considered a diverse collection of socio-demographic and co-morbid conditions as ‘features’ in our dataset. Qualitatively, our features can be divided into several categories:

1. Co-morbid conditions: This class of features considers the co-morbid diseases among T2DM patients. We considered whether a T2DM patient had medical conditions such as asthma, thyroid disorder, heart disease, cancer, tuberculosis and hypertension. Thus, there were six features in this category. These features are binary in nature denoting whether a T2DM patient suffered from a given comorbidity or not.

2. Food habits: This class of features considered the food habits of T2DM patients. The features considered here were how frequently the patient took the food items: milk or curd, pulses or beans, dark leafy vegetables, fruits, eggs, fish, chicken, fried food and aerated drinks. Thus, there were nine features in this category. Features were categorical and ordinal in nature having four possible values: ‘daily’, ‘occasionally’, ‘weekly’ and ‘never’.

3. Addiction history: This class of features considered the addiction pattern of T2DM patients. There were two features in this class, both binary in nature encoding whether a patient is a smoker or whether a patient takes alcohol.

4. Socio-demographic features: These included features such as sex, age, wealth index, education level, religion and caste along with BMI and haemoglobin level of the patient. There were eight features in this category.

5. Living conditions: This class of features quantifies the living conditions of the patients. The features in this class considered whether a patient lives in a household possessing refrigerator, bicycle, motorbike, four-wheeler vehicle and livestock. Moreover, there were features denoting the type of residence, household structure, frequency of household members smoking inside the house, type of cooking fuel used, source of drinking water and time to reach the nearest drinking water source. Thus, there were eleven features belonging to this category.

In total, 36 features or factors were considered to identify significant sub-populations among the diabetes patients. These thirty-six features include both continuous and categorical features, and the categorical features comprise both ordinal and nominal features. Ordinal features have a sense of order among their values, such as the features from the ‘food habits’ category described above. Nominal features are categorical features with no sense of order, such as the sex of a patient. In our dataset the continuous features are age, BMI, haemoglobin level and time to get to the drinking water source, whereas the nominal features are sex, religion, caste, household structure, type of place of residence, type of cooking fuel and source of drinking water. The rest of the features are ordinal. The categorization of features into continuous, nominal and ordinal is of utmost importance in our clustering paradigm, which we discuss in the section “Clustering paradigm using UMAP”.
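To make the distinction between ordinal and nominal features concrete, the sketch below shows one possible numeric encoding for each type. It is an illustration only: the ranking of the four food-frequency answers (never < occasionally < weekly < daily) is an assumption not stated in the text, and the function names and the placeholder category labels `A` and `B` are hypothetical.

```python
# Assumed ordering of the four food-frequency answers (not stated in the text).
FREQUENCY_RANK = {"never": 0, "occasionally": 1, "weekly": 2, "daily": 3}


def encode_ordinal(values, rank=FREQUENCY_RANK):
    """Map ordered categories to integer ranks, so that differences reflect the order."""
    return [rank[value] for value in values]


def encode_nominal(values):
    """Map unordered categories to arbitrary integer codes; only equality is meaningful."""
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in values]


milk = encode_ordinal(["daily", "never", "weekly"])
religion = encode_nominal(["A", "B", "A"])  # placeholder labels
print(milk, religion)
```

With such an encoding, the difference between two ordinal codes carries information, whereas nominal codes should only ever be compared for equality.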

Identification of T2DM sub-populations using UMAP and DBSCAN

As described above, our dataset contains both continuous and categorical features, and the categorical features include both ordinal and nominal ones. A simple UMAP on the entire dataset is depicted in Fig. 2a, revealing two broad clusters. For this embedding, the UMAP parameter n_neighbours was set to 30 and the metric parameter to Euclidean. However, we have a number of important nominal and ordinal categorical features whose effect would not be apparent from such a clustering. Moreover, the Euclidean distance does not always make sense for categorical features, especially nominal ones. For example, Fig. 2d shows a UMAP computed on the nominal features only, with the metric parameter set to hamming (based on the Hamming distance). This reveals a completely different picture of the dataset, showing several small clusters. Our clustering paradigm is designed to balance these effects, so that no single type of feature has an overpowering effect on the clustering process.

Fig. 2: The low-dimensional UMAP visualisations of data for several data types.

a UMAP clusters for all the features with the Euclidean metric. b UMAP clusters for continuous features with Euclidean metric. c UMAP clusters for ordinal features with Canberra metric. d UMAP clusters for nominal features with Hamming metric.

Clustering paradigm using UMAP

Our clustering paradigm applies UMAP separately to the continuous, nominal and ordinal features. For each of these feature categories, we create a lower-dimensional embedding of the dataset. Finally, we combine the lower-dimensional embeddings and extract clusters from them using the DBSCAN algorithm, a density-based clustering algorithm that treats closely or densely located points as clusters [24]. One advantage of this algorithm is that the number of clusters does not need to be specified beforehand. For UMAP, we use the same parameter values, n_neighbours = 30 and min_distance = 0.1, for all the feature types.
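To illustrate how density-based clustering finds clusters without a preset cluster count, the following is a compact sketch of the textbook DBSCAN procedure in pure Python. It is not the implementation used in the study, which the text does not specify; the brute-force neighbour search, the toy points and the convention that a point counts as its own neighbour are choices made for this example.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_points):
    """Textbook DBSCAN with brute-force neighbour search; returns one label per point."""
    n = len(points)
    # Each point's eps-neighbourhood (the point itself is included).
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_points:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        cluster += 1  # i is an unvisited core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_points:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


# Two dense groups and one isolated point (toy data).
points = [
    (0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
    (5, 5), (5, 5.5), (5.5, 5), (5.5, 5.5),
    (10, 0),
]
labels = dbscan(points, eps=1, min_points=3)
print(labels)
```

The number of clusters emerges from the density of the data, and the isolated point is labelled as noise rather than forced into a cluster. For datasets of realistic size a library implementation would normally be preferred, for example scikit-learn's `DBSCAN`, where the second parameter is called `min_samples`.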

For the continuous features, we use the Euclidean metric. The Euclidean distance between two vectors x and y with n components is given by:

$$d\left( {x,y} \right) = \sqrt {\sum\nolimits_{i = 1}^n {\left( {x_i - y_i} \right)^2} }$$

(1)

For the nominal features, we use the Hamming metric. The Hamming distance is defined as:

$$d\left( {x,y} \right) = \sum\nolimits_{i = 1}^n {\delta \left( {x_i,y_i} \right)}$$

(2)

where δ(x_i, y_i) = 1 if x_i ≠ y_i and δ(x_i, y_i) = 0 otherwise, so that the distance counts the positions at which the two vectors differ. Recall that nominal features are categorical features that do not have a sense of order associated with them. For such features, Hamming distance is widely used as a similarity measure between data points [23].

For the ordinal features, we use the Canberra metric, which is a weighted version of the Manhattan measure. The Canberra distance is given by:

$$d\left( {x,y} \right) = \sum\nolimits_{i = 1}^n {\frac{{\left| {x_i - y_i} \right|}}{{\left| {x_i} \right| + \left| {y_i} \right|}}}$$

(3)

Ordinal features are also a type of categorical feature. However, the Hamming metric cannot capture the inherent ordered relationships and statistical information in ordinal values [23]. We therefore tried UMAP with several metrics and noticed that the Canberra distance retains a high variance in the lower dimensions. Thus, we chose the Canberra distance as the similarity metric for ordinal features.
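The three metrics discussed above can be written out in a few lines of pure Python. This is a sketch of the standard definitions rather than the code used in the study. Two choices here are assumptions: the Hamming distance is returned as a count (some libraries report the fraction of differing positions instead), and Canberra terms in which both entries are zero are skipped.

```python
from math import sqrt


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator:
            total += abs(a - b) / denominator
    return total


# An ordinal feature coded by rank (hypothetical coding: 0 = lowest, 3 = highest).
near = ((3,), (2,))  # adjacent categories
far = ((3,), (0,))   # opposite ends of the scale

print(hamming(*near), hamming(*far))    # Hamming cannot tell the two pairs apart
print(canberra(*near), canberra(*far))  # Canberra gives the distant pair a larger value
print(euclidean((0, 0), (3, 4)))
```

The example shows the limitation mentioned in the text: Hamming assigns the same distance to adjacent and to distant ordinal categories, whereas Canberra distinguishes them.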

For the continuous and ordinal features, we thus produce a two-dimensional representation of each data point by taking the first two UMAP coordinates. For the nominal features, we produce a one-dimensional representation, since the data points are too scattered in this case, as shown in Fig. 2d, and could thus lead to too many clusters. Thus, we reduce every data point to a five-dimensional representation: two dimensions each for the continuous and ordinal features and one for the nominal features. Finally, we look for clusters in the five-dimensional representation using DBSCAN (eps = 1, min_points = 200). After selecting the final clusters, we characterized them by summarizing all 36 variables separately for each cluster. The continuous variables were summarized by their mean and the standard error of the mean. The categorical variables were summarized by their frequency distribution and the proportion of each value within each cluster.
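The per-cluster summaries described above can be sketched as follows. This is an illustrative outline, not the study's code: the function names, the toy values and the variable names `bmi` and `smoker` are hypothetical, cluster labels are assumed to be given as one label per patient, and the standard error of the mean is computed as the sample standard deviation divided by the square root of the cluster size.

```python
from collections import Counter, defaultdict
from math import sqrt
from statistics import mean, stdev


def summarise_continuous(values, labels):
    """Per-cluster (mean, standard error of the mean); clusters of size 1 are skipped."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {
        label: (mean(vs), stdev(vs) / sqrt(len(vs)))
        for label, vs in groups.items()
        if len(vs) > 1
    }


def summarise_categorical(values, labels):
    """Per-cluster (count, proportion) for every observed category."""
    groups = defaultdict(Counter)
    for value, label in zip(values, labels):
        groups[label][value] += 1
    return {
        label: {cat: (n, n / sum(counts.values())) for cat, n in counts.items()}
        for label, counts in groups.items()
    }


# Toy data, invented for illustration.
labels = [0, 0, 0, 1, 1]
bmi = [22.0, 24.0, 26.0, 30.0, 32.0]
smoker = ["yes", "no", "no", "yes", "yes"]

bmi_summary = summarise_continuous(bmi, labels)
smoker_summary = summarise_categorical(smoker, labels)
print(bmi_summary)
print(smoker_summary)
```

If the labels contain an outlier marker, those patients would need to be filtered out before summarising, depending on whether outliers are to be reported as a group of their own.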

Extraction of T2DM sub-populations using DBSCAN

Using the clustering paradigm described above, we detected seven sub-populations among the patients, with 261 patients labelled as outliers. We show the distribution of clusters in Fig. 3a. We further perform a UMAP on the five-dimensional reduced representation of our data to visualize the clusters detected by DBSCAN. For this, we label the data points with their DBSCAN cluster labels and colour code them in the UMAP representation of the five-dimensional reduced data, as shown in Fig. 3b. This visual agreement supports the validity of the clustering produced by DBSCAN. Note that, among our clusters, we can identify four significant patient sub-populations containing 2898, 2301, 2226 and 1315 data points.

Fig. 3: The information on clusters detected in the data.

a Distribution of clusters detected by DBSCAN on the five-dimensional reduced representation of the data. b UMAP clusters for five-dimensional reduced representation of the data annotated by the DBSCAN generated clusters.
