Extending electronic medical records vector models with knowledge graphs to improve hospitalization prediction

Predicting hospitalization from text-based representations of electronic medical records

Our prediction task can be defined as follows. Let R be a representation of an EMR from the PRIMEGE database P, and let C = {hospitalized, not hospitalized} be the set of classes to predict. We learn a mapping M such that M(R)=L, where M is a classification algorithm that predicts a class L∈C for an EMR R.

Before enriching an EMR representation R with ontological knowledge, the first question is which EMR representation is best suited to predicting a patient’s hospitalization. Since EMRs are essentially based on text data (e.g., the observation field, personal history, family history), we considered text-based representations. Another important concern with text representations is to retain control over the interpretability of the decisions made by the machine learning algorithms, so that they can be justified and presented to the referring physicians.

Vector models of text data in electronic medical records

EMRs in the PRIMEGE corpus contain highly specialized French terminology with abbreviations: the vocabulary is that of general medicine, with occasional references to specialists consulted by the patient. This led us to adopt our own vector representation. In particular, we use a bag-of-words (BOW) representation to avoid the missing or misused specialized terms from which other approaches (e.g., word embeddings) suffer. This representation does not require a large amount of data and makes it possible to identify the contribution of each feature to the hospitalization (or not) of a patient. More advanced representation models lose information (by compressing the training data) and may require a larger corpus, whereas we wanted to give GPs feedback that stays as close as possible to the details of their patient records.

Temporal models of electronic medical records. There is a great deal of variability in the patient-physician relationship, with some people seeing their doctors on a regular basis over many years and others coming to see them only occasionally. In order to take this temporal dimension into consideration, medical records can be studied under two representations, a sequential representation and a non-sequential representation, that we compared.

We evaluated the alternatives on a balanced dataset DSB containing 714 hospitalized patients and 732 patients who were not hospitalized over a 4-year period. These data were collected between 2012 and 2015, and therefore before the SARS-CoV-2 pandemic. This detail is important because the pandemic introduces a major bias that would require modifying the models, for instance by adding hospitalization weighting factors, or addressing the issue in some other way.

Non sequential modelling of electronic medical records. The PRIMEGE database is structured with different text fields, so we introduced prefixes in the creation of the bag-of-words to track the respective contributions. Thus, it is possible to trace the fields used to generate the features and to distinguish them in the vector representation of EMRs, e.g., a patient’s personal history vs. his family history.
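The field-prefixing idea can be illustrated with a minimal sketch. The field names, the `field:token` separator and the tokenizer below are assumptions made for illustration; the source does not specify how prefixes are encoded.

```python
import re
from collections import Counter


def prefixed_bag_of_words(record):
    """Count tokens per field, prefixing each token with its source field.

    `record` maps a (hypothetical) field name to its free text.
    """
    bag = Counter()
    for field, text in record.items():
        for token in re.findall(r"\w+", text.lower()):
            # The prefix keeps 'diabetes' in the personal history distinct
            # from 'diabetes' in the family history.
            bag[f"{field}:{token}"] += 1
    return bag


# Hypothetical record with two text fields.
record = {
    "personal_history": "diabetes hypertension",
    "family_history": "diabetes",
}
bag = prefixed_bag_of_words(record)
```

With this encoding the same word yields one feature per field, so the weight learned for each feature can be traced back to the field it came from.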

Our non sequential representation of EMR is as follows. Let \(V^{i}=\left\{w_{1}^{i},w_{2}^{i},\ldots,w_{n}^{i}\right\}\) be the bag-of-words obtained from the textual data in the EMR of the ith patient. To consider this non sequential representation, we had to aggregate all the consultations occurring before a hospitalization. For patients who have not been hospitalized, all their consultations are aggregated. On the one hand, it contains consultation notes on the reasons for the consultation, diagnoses, prescribed drugs, observations. On the other hand, it contains textual information conveyed throughout the patient’s life including, for instance, familial history, personal history, personal information, past problems, the environmental factors as well as allergies. We are in the presence of two classes, thus the labels yi associated with Vi used for this representation are either ‘hospitalized’ or ‘not hospitalized’.

Sequential modelling of electronic medical records. For a sequential modelling of EMRs, we chose to represent the different consultations of a patient as a sequence (t1,...,tn). This n-tuple contains all his consultations in chronological order, with t1 his first consultation and tn, his last consultation present in the database. Each consultation ti contains both persistent patient data and data specific to the ith consultation. Similarly to the non sequential representation of EMRs, for patients who have not been hospitalized, all their consultations are integrated in the sequential representation of EMRs whereas for patients who have been hospitalized only their consultations occurring before hospitalization are integrated.

Thus ti=(xi,yi), where xi contains two broad types of information: general information about the patient and information obtained during a consultation, as described in the section on non sequential modelling of EMRs. Fig. 4 shows how this data is handled in this representation.

Fig. 4

Diagram illustrating the sequential representation of an electronic medical record

Textual information carried throughout the patient’s life is thus repeated across all xi of ti.
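The construction of the sequence (x1,...,xn) can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout, the feature names and the ISO date strings are hypothetical, and the per-consultation labels yi are left out because the source does not detail how they are assigned.

```python
def build_sequence(persistent, consultations, hospitalization_date=None):
    """Return the chronologically ordered feature dicts x_1..x_n of a patient.

    Persistent (life-long) features are copied into every x_i; for a
    hospitalized patient, only consultations before the hospitalization
    date are kept.
    """
    ordered = sorted(consultations, key=lambda c: c["date"])
    if hospitalization_date is not None:
        ordered = [c for c in ordered if c["date"] < hospitalization_date]
    return [{**persistent, **c["features"]} for c in ordered]


# Hypothetical patient: ISO dates compare correctly as plain strings.
persistent = {"personal_history:diabetes": 1}
consultations = [
    {"date": "2014-03-02", "features": {"observation:fever": 1}},
    {"date": "2013-11-20", "features": {"observation:cough": 1}},
    {"date": "2015-01-10", "features": {"observation:dyspnea": 1}},
]
sequence = build_sequence(persistent, consultations,
                          hospitalization_date="2014-06-01")
```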

Selected machine learning algorithms

For non sequential classification algorithms, we focus on three different machine learning algorithms which are frequently used in the literature: the logistic regression (LR) [11], random forests (RF) [12], and support vector machine (SVM) [13]. These algorithms, in particular logistic regression and random forests are widely used in the prediction of risk factors from EMR [14]. Moreover, they are natively interpretable in their decision: they provide both the features that are involved in a prediction and the weights learned for the features in a vector representation, except for SVMs where this is the case only for models with a linear kernel.

Markovian models are sequential machine learning algorithms that share the particularity of being interpretable, since it is possible to obtain the weights of the state and transition features. Among them, hidden Markov models (HMMs) are generative models and assume that the features are independent, which is not the case with EMRs (e.g., drug interactions, drug-disease relations). This leaves us with two candidate methods: maximum entropy Markov models (MEMMs) and conditional random fields (CRFs). Both are discriminative models. However, MEMMs suffer from label bias [15] because they normalize at each state of the sequence, whereas CRFs normalize over the whole sequence. This is why we opted for CRFs.

Experiments on the two models

We used the \(F_{tp,fp}\) metric [16], whose definition is given in Equation 1, to assess the performance of the tested machine learning algorithms on both the sequential and non-sequential representations for the hospitalization prediction task.

Let TN be the number of negative instances correctly classified (True Negative), FP the number of negative instances incorrectly classified (False Positive), FN the number of positive instances incorrectly classified (False Negative) and TP the number of positive instances correctly classified (True Positive). Let K be the number of folds used for cross-validation (in our experiment K=10), and let the subscript f distinguish a quantity summed across all folds, such as the total number of true positives, from its per-fold counterpart.

$$TP_{f} = \sum\limits_{i=1}^{K}TP^{(i)} \quad FP_{f} = \sum\limits_{i=1}^{K}FP^{(i)} \quad $$

$$FN_{f} = \sum\limits_{i=1}^{K}FN^{(i)} $$

$$ F_{tp,fp}=\frac{2\,TP_{f}}{2\,TP_{f}+FP_{f}+FN_{f}} $$

(1)
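As a small worked sketch of Equation 1: the confusion counts are pooled over the K folds first, and a single F-measure is computed from the pooled counts. The code assumes the standard F-measure form 2TP/(2TP+FP+FN); the fold counts are made up for illustration.

```python
def f_tp_fp(fold_counts):
    """Pool TP, FP and FN over all folds, then compute one F-measure."""
    tp = sum(c["tp"] for c in fold_counts)
    fp = sum(c["fp"] for c in fold_counts)
    fn = sum(c["fn"] for c in fold_counts)
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts for two folds (the paper uses K=10).
folds = [
    {"tp": 8, "fp": 2, "fn": 1},
    {"tp": 7, "fp": 1, "fn": 3},
]
score = f_tp_fp(folds)  # pooled counts: TP=15, FP=3, FN=4
```

Pooling before computing the ratio differs from averaging per-fold F1 scores, which is the other aggregate reported later for logistic regression.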

We rely on state-of-the-art non-sequential algorithms available in the Scikit-Learn library [17] and on the CRF implementation of sklearn-crfsuite. The hyperparameters optimized by nested cross-validation are as follows (the hyperparameter search space is detailed between brackets; the continuous random variable was generated with scipy.stats.expon):

SVC, C-Support Vector Classifier, whose implementation is based on libsvm [13]: the penalty parameter C [continuous random variable], the kernel used by the algorithm [linear, radial basis function (RBF) or polynomial kernel] and the kernel coefficient gamma [continuous random variable].

RF, Random Forest classifier [12]: The number of trees in the forest [integer between 10 and 500], the maximum depth in the tree [integer between 5 and 30], the minimum number of samples required to split an internal node [integer between 1 and 30], the minimum number of samples required to be at a leaf node and the maximum number of leaf nodes [integer between 10 and 50].

LR, Logistic Regression classifier [11]: The regularization coefficient C [continuous random variable] and the penalty used by the algorithm [l1 or l2].

CRFs, Conditional Random Fields algorithm [18]: The regularization coefficients c1 and c2 [continuous random variable for both] used by the solver limited-memory BFGS (the default algorithm used in this library).

We evaluated our representations with the K-fold method (K fixed at 10), a cross-validation strategy that allows us to test a classification algorithm across all the considered data. We optimized the hyperparameters of the machine learning algorithms used in this study with nested cross-validation [19] in order to avoid bias, and the exploration was done with random search [20]. The inner loop was executed with L fixed at 2 over 7 iterations, which corresponds to 14 fits per machine learning algorithm. This process ensures that the hyperparameters are optimized without introducing new biases, since the training, validation and testing sets are distinct at each step. This hyperparameter optimization step aims to improve the predictive power of the algorithms so that they better distinguish patients to be hospitalized from others. The experiments were conducted on a Precision Tower 5810, 3.7 GHz, 64 GB RAM, in a virtual environment under Python 3.5.4.
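The protocol above can be sketched with Scikit-Learn as follows. This is an illustrative sketch, not the authors' code: the synthetic data, the random seeds, the `scale` of the exponential distribution and the `f1` scoring are assumptions, and only the regularization coefficient C of logistic regression is searched here.

```python
from scipy.stats import expon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic stand-in for the EMR vectors (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

outer = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10
inner = KFold(n_splits=2, shuffle=True, random_state=0)   # L = 2

# Random search: 7 sampled configurations x 2 inner folds = 14 fits.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": expon(scale=1.0)},
    n_iter=7,
    cv=inner,
    scoring="f1",
    random_state=0,
)

# Each outer fold tunes on its own training part and is scored on held-out data.
scores = cross_val_score(search, X, y, cv=outer, scoring="f1")
```

Because the search object is itself cross-validated, the test fold of the outer loop is never seen during hyperparameter selection.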

Table 3 presents the values of \(F_{tp,fp}\) obtained with the state-of-the-art machine learning algorithms described above on the dataset DSB, shaped with our sequential and non sequential representations. Training CRFs with this model was expensive (22 hours with our protocol), and since CRFs do not outperform logistic regression (best score: 0.85), we decided to consider only the non-sequential EMR representation in the following experiments on the enrichment of vector representations with ontological knowledge.

Table 3 \(F_{tp,fp}\) of the selected classifiers on the balanced dataset DSB

Predicting hospitalization from ontology-augmented representations of electronic medical records

Electronic medical records contain both structured data with fields relating to prescriptions and reasons for consultations, and also unstructured data such as free text. This section presents the different experiments we have conducted to perform a semantic enrichment of this data and the methods we designed to determine the relevant concepts in the assessment of hospitalization risk.

Ontology-augmented vector models of medical records

We reused the dataset DSB to generate vectors as well as the non sequential text representations discussed in the previous section. Compared to the previous representation, here we proceed to the concatenation of the bag-of-words vector representations with a vector of concepts:

Let \(V^{i}=\{w_{1}^{i},w_{2}^{i},\ldots,w_{n}^{i}\}\) be the bag-of-words obtained from the textual data in the EMR of the ith patient. Let \(C^{i}=\{c_{1}^{i},c_{2}^{i},\ldots,c_{m}^{i}\}\) be the bag-of-concepts (BOC) belonging to knowledge graphs and extracted from the EMR of the ith patient. The data subject to extraction include both text fields listing drugs and pathologies with their related codes, and unstructured data from free texts such as observations. The vector representation of the ith patient is the concatenation of Vi and Ci: xi=Vi⊕Ci. More details about this representation can be found in [21]. The different machine learning algorithms that we tested to predict hospitalization from the enriched representation of EMRs exploit these aggregated vectors. The resulting representations are sparse: most patients (instances) do not share the same features.

Concepts from knowledge graphs are considered as a token in a textual message. When a concept is identified in a patient’s medical record, this concept is added to the concept vector. This attribute will have as value the number of occurrences of this concept within the patient’s health record. For instance, the concepts ‘Organ Failure’ and ‘Medical emergencies’ (from DBpedia) are identified for ‘acute pancreatitis’, and the value for these attributes in our concept vector will be equal to 1.

Similarly, if a property-concept pair is extracted from a knowledge graph (as is the case for Wikidata and NDF-RT, with feature sets +wa, +wi, +wm and +d), it is added to the concept vector. For instance, in vectors exploiting NDF-RT (enrichment with +d), we find the pair consisting of the property CI_with (contraindicated with) and the name of a pathology or condition, for instance ‘Pregnancy’ (triple found for the drug ‘Tahor’, main molecule ‘Atorvastatin’). The resulting feature of the BOC vector is named after the property-concept pair. This example is depicted in Fig. 5, where we show how the Vi and Ci vectors are concatenated.
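A minimal sketch of the concatenation xi=Vi⊕Ci, using the examples given in the text. The dictionary encoding, the `word:`/`concept:` namespaces and the token names are assumptions for illustration; counts are occurrence numbers within one patient record.

```python
def concatenate(bow, boc):
    """Merge bag-of-words and bag-of-concepts counts into one feature dict.

    Distinct namespaces keep a word and a concept with the same label apart.
    """
    features = {f"word:{name}": count for name, count in bow.items()}
    features.update({f"concept:{name}": count for name, count in boc.items()})
    return features


# Hypothetical record mentioning 'acute pancreatitis' and the drug 'Tahor'.
bow = {"acute": 1, "pancreatitis": 1, "tahor": 1}
boc = {
    "Organ Failure": 1,        # DBpedia concept
    "Medical emergencies": 1,  # DBpedia concept
    "CI_with Pregnancy": 1,    # NDF-RT property-concept pair
}
x = concatenate(bow, boc)
```

A dictionary of counts like this one can then be turned into a sparse matrix row, for instance with Scikit-Learn's `DictVectorizer`.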

Fig. 5

Concatenation of a bag-of-words representation V and a bag-of-concepts representation C of EMRs. In this example, we use the drug Tahor, whose main molecule is atorvastatin, and show how we extract and use one of its contraindications (property CI_with) from the NDF-RT ontology

Extraction of relevant knowledge for prediction

In this section, we detail how to extract knowledge from both structured and unstructured data in EMRs referring to both specialized and cross-domain knowledge graphs. The knowledge extracted will be used to build the BOC. The workflow is shown in Fig. 6.

Fig. 6

Workflow to link ATC codes, ICPC-2 codes and named entities in the EMRs with medical domain ontologies and with the knowledge graphs Wikidata and DBpedia

Knowledge extraction based on specialized ontologies. We leveraged structured data to query OWL and SKOS representations of domain-specific ontologies and thesauri. From the ICPC-2 codes linked to reasons for consultation and the ATC codes of the drugs prescribed to patients in the PRIMEGE database, we generate links to the corresponding resources in the ICPC-2 and ATC ontologies available through BioPortal. We also generate links to the NDF-RT ontology, which contains specifications about drug interactions. The choice of these ontologies came naturally, since the ATC and ICPC-2 codes are adopted in the PRIMEGE database and NDF-RT contains additional information on drugs that captures interactions between drugs, diseases, and mental and physical conditions.

For each ATC or ICPC-2 code present in a medical record, we extracted its super classes in the corresponding ontology by using a SPARQL query with an rdfs:subClassOf property path. For instance, ‘tenitramine’ (ATC code: C01DA38) has as super class ‘Organic nitrates used in cardiac disease’ (ATC code: C01DA), which itself has as super class ‘VASODILATORS USED IN CARDIAC DISEASES’ (ATC code: C01D), which in turn has as super class ‘CARDIAC THERAPY DRUGS’ (ATC code: C01). As for ICPC-2 codes, the ontology does not have a high level of granularity, so it is only possible to extract one super class per diagnosed health problem or identified care procedure.
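The effect of the rdfs:subClassOf property path can be sketched in plain Python by walking a child-to-parent map. The map below only encodes the ATC example from the text; the query template and the example IRI are assumptions and are not the authors' actual query.

```python
# Hypothetical query template: the '+' path follows rdfs:subClassOf
# transitively, returning every ancestor of the given class.
SUPERCLASS_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE {{ <{class_iri}> rdfs:subClassOf+ ?super }}
"""

# Child -> parent links taken from the tenitramine example.
SUBCLASS_OF = {"C01DA38": "C01DA", "C01DA": "C01D", "C01D": "C01"}


def super_classes(code, subclass_of):
    """Return all ancestors of `code`, nearest first."""
    chain = []
    while code in subclass_of:
        code = subclass_of[code]
        chain.append(code)
    return chain


ancestors = super_classes("C01DA38", SUBCLASS_OF)
# Placeholder IRI, used only to show how the template is filled in.
query = SUPERCLASS_QUERY.format(class_iri="http://example.org/atc/C01DA38")
```

Each ancestor code can then be added to the bag-of-concepts, with the +c depth levels deciding how far up the chain to go.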

The link to NDF-RT resources was achieved via the CUI codes retrieved in the ATC ontology (with property umls:cui). The successor of NDF-RT is MED-RT (Medication Reference Terminology), but it does not yet have a Semantic Web formalization.

Knowledge extraction based on cross-domain knowledge graphs. DBpedia knowledge graph. DBpedia is a crowdsourced extraction of knowledge pieces from Wikipedia articles, formalized with Semantic Web languages. DBpedia’s applications are varied and can range from organizing content on a website to uses in the domain of artificial intelligence.

We identified named entities in free-text fields of EMRs by using both a dictionary based approach to handle abbreviations and the semantic annotator DBpedia Spotlight [22]. We focused on the subject of the resources identified by DBpedia Spotlight (retrieved by querying DBpedia for the values of property dcterms:subject).

Initially, together with domain experts, we carried out a manual analysis of the named entities detected on a sample of approximately 40 consultations with complete information and selected 14 SKOS top concepts designating medical aspects relevant to the prediction of hospitalization, as they relate to severe pathologies. These concepts are listed in Table 4.

Table 4 List of manually selected concepts to determine a hospitalization. These concepts are translated from French to English (the translation does not necessarily exist for the English DBpedia chapter)

We now propose an automated and more integrative approach to limit the scope of possible entities identified by DBpedia Spotlight and bind them to the medical field. To do so, we formalized and executed two constraints modeled by a federated SPARQL query shown in Listing 1. Figure 7 represents the workflow using this query.

Fig. 7

Workflow to extract candidate subjects from EMRs using DBpedia

The first SERVICE clause of the SPARQL query carried out on the French chapter of DBpedia retrieves entities identified by DBpedia Spotlight and belonging to the medical domain: they are the labels (property skos:prefLabel) of resources having as subject (property dcterms:subject) a concept that belongs to the SKOS hierarchy (property skos:broader) of one of the French terms for disease, health, medical genetics, medicine, urgency, treatment, anatomy, addiction and bacteria.

The second SERVICE clause of the query further refines the set of retrieved entities by constraining them to be equivalent (property owl:sameAs) to English entities belonging to at least one of the following medical classes (property rdf:type): dbo:Disease, dbo:Bacteria, yago:WikicatViruses, yago:WikicatRetroviruses, yago:WikicatSurgicalProcedures, yago:WikicatSurgicalRemovalProcedures. We empirically restricted the query to these few classes and discarded many other medical classes that would introduce noise. For instance, dbo:Drug, dbo:ChemicalCompound, dbo:ChemicalSubstance, dbo:Protein, or yago:WikicatMedicalTreatments retrieve entities related to chemical compounds, that is, entities that can range from drugs to plants or fruits. Types referring to other living things, such as umbel-rc:BiologicalLivingObject, dbo:Species or dbo:AnatomicalStructure, would select entities describing a wide range of species, since the scope of these types is not restricted to humans and includes bacteria, viruses, fungi or parasites affecting humans. Likewise, the class dbo:AnatomicalStructure was used for describing different things in previous versions of DBpedia (e.g., ‘Barrier layer (oceanography)’, ‘Baseball doughnut’). We also discarded biomedical types in the yago namespace defined in DBpedia whose URI ends with an integer (e.g., http://dbpedia.org/class/yago/Retrovirus101336282) because they are too numerous and too semantically close to each other.

In the end, the entities retrieved by this SPARQL query on DBpedia are used to build the vector representation of EMRs from the features extracted from their text fields.

Table 5 presents two examples of observations with their extracted DBpedia concepts. In the first one, the expression ‘insuffisance cardiaque’ (heart failure) leads to the entity dbpedia-fr:insuffisance_cardiaque (cardiac insufficiency) which has for dcterms:subject category-fr:Défaillance_d’organe (organ failure) and category-fr:Maladie_cardiovasculaire (cardiovascular disease). In the second observation, the expression ‘kyste’ (cyst) leads to the entity dbpedia-fr:Kyste_(médecine) which has for dcterms:subject category-fr:Anatomo-pathologie_des_tumeurs (neoplasm stubs).

Table 5 Examples of concepts extracted from free text in EMRs with our approach using a dictionary to handle abbreviations (brackets indicate corrections including typos and abbreviations), using DBpedia Spotlight to recognize entities, and querying DBpedia to retrieve relevant medical concepts

Wikidata knowledge graph. Wikidata is an open, collaboratively edited knowledge base that centralizes data from projects of the Wikimedia Foundation. For specific datasets in the biomedical domain, Wikidata also benefits from automatic laboratory submissions of the latest research works. For Wikidata, we focused on augmenting our data with information extracted from the properties linked to drugs, as we did with the NDF-RT and ATC ontologies. To link to Wikidata, we used the ATC (property wdt:P267), CUI UMLS (property wdt:P2892) and CUI RxNorm codes (property wdt:P3345), since Wikidata contains at least one of them for each drug. To use RxNorm, we proceed in a similar way as for NDF-RT, with the CUI codes contained in the ATC ontology. We then queried the SPARQL endpoint of Wikidata to extract knowledge related to drugs by using three properties: ‘subject has role’ (property wdt:P2868), ‘significant drug interaction’ (property wdt:P769), and ‘medical condition treated’ (property wdt:P2175).
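A query of this kind could look like the sketch below, which looks a drug up by its ATC code and collects the values of the three properties. The query shape, the helper name and the example ATC code (C10AA05, atorvastatin) are assumptions for illustration; the authors' actual query is not given in this section.

```python
# Feature-set notation -> Wikidata property used for the enrichment.
WIKIDATA_PROPERTIES = {
    "+wa": "wdt:P2868",  # subject has role
    "+wi": "wdt:P769",   # significant drug interaction
    "+wm": "wdt:P2175",  # medical condition treated
}


def build_drug_query(atc_code):
    """Build a SPARQL query string for one drug identified by ATC code."""
    values = " ".join(WIKIDATA_PROPERTIES.values())
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?drug ?property ?value WHERE {\n"
        f'  ?drug wdt:P267 "{atc_code}" .\n'
        f"  VALUES ?property {{ {values} }}\n"
        "  ?drug ?property ?value .\n"
        "}"
    )


query = build_drug_query("C10AA05")
```

The string would then be sent to the Wikidata SPARQL endpoint; each returned property-value pair becomes one feature of the bag-of-concepts.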

Inter-rater reliability of concept annotation. Now that we have shown how we extracted knowledge from knowledge graphs, we investigate the particular case of the relevance of DBpedia concepts in predicting hospitalization. We aim to distinguish knowledge that introduces noise from knowledge beneficial for the prediction and establish a strategy to improve decision making.

A total of 285 concepts were extracted from DBpedia with the query in Listing 1 and were independently annotated by two general practitioners and one biologist. The annotations were compared with Krippendorff’s alpha metric [23]. We also used the correlation metric to compare pairs of vectors from human or machine annotation.

The initial Krippendorff’s α score between the three annotators is 0.51, and the score between the two GPs is 0.27. Some expressions were problematic because they are compound (composed terms) creating terminological conflict by including one or several other terms. As a result they were annotated in the same way by an annotator. It was for instance the case for compounds starting with ‘Biology’ (i.e., ‘Biology in nephrology’, ‘Biology in hematology’, etc.), ‘Screening and diagnosis’ (i.e., ‘Infectious disease screening and diagnosis’, ‘Screening and diagnosis in urology’, etc.), ‘Pathophysiology’ (i.e., ‘Pathophysiology of the cardiovascular system’, ‘Pathophysiology in hematology’, etc.), ‘Psychopathology’ (i.e., ‘Psychoanalytical psychopathology’, ‘Psychopathology’), ‘Clinical sign’ (i.e., ‘Clinical signs in neurology’, ‘Clinical signs in otorhinolaryngology’, etc.), ‘Symptom’ (i.e., ‘Symptoms in gynecology’, ‘Symptom of the digestive system’, etc.) and ‘Syndrome’ (i.e., ‘Syndrome in endocrinology’, ‘Syndrome in psychology or psychiatry’, etc.). Even by excluding these compounds from the considered concepts, which brings us back to 243 concepts, the three annotators obtained a Krippendorff’s α score of 0.66, and 0.52 for the inter-rater reliability between the two GPs.

From the 285 concepts, on average 198 were estimated as relevant to the study of patients’ hospitalization risks by experts: the two GPs estimated respectively 217 and 181 concepts as relevant, and the biologist 196 concepts.

Artstein and Poesio [24] state that such a score is insufficient to draw conclusions. This shows that the annotation task is more difficult than it may seem, in particular because identifying the entities involved in the hospitalization of a patient is subjective, which makes agreement hard to reach.

Automatically selecting these concepts can be a way to find a consensus based on data. This is the reason why in the following sections, we generated vectors where knowledge was selected by machine annotations through feature selection and we compare them to the results of human annotations.

Experiments

Experimental protocol. Vector representations were evaluated by nested cross-validation [19], with K fixed at 10 for the external loop and L fixed at 3 for the internal loop. The exploration of hyperparameters was performed with random search [20] with 150 iterations. The HP EliteBook was used to generate vector representations and to deploy DBpedia Spotlight as well as domain-specific ontologies with the Corese Semantic Web Factory [25].

The experiments were conducted on an HP EliteBook 840 G2, 2.6 GHz, 16 GB RAM, in a virtual environment under Python 3.6.3, as well as on a Precision Tower 5810, 3.7 GHz, 64 GB RAM, in a virtual environment under Python 3.5.4. As in the experiment reported in the previous section, we rely on the algorithms available in the Scikit-Learn library (SVC, RF and LR), and we optimized the same hyperparameters.

We used the \(F_{tp,fp}\) metric [16], defined in Equation 1, to assess the performance of selected machine learning algorithms using our vector representations of EMRs enriched with ontological knowledge. We also computed PRavg, REavg, F1avg, AUCavg and their standard error variations for LR, the algorithm that performs best.

Since our experimental protocol uses cross-validation, the training sets overlap, which violates the independence assumption in many statistical tests in the literature [26]. Thus, we opted for the correction of dependent Student’s t test [27] that addresses this issue to confirm the statistical impact of the features extracted from knowledge graphs. It is defined as follows:

$$t=\frac{\frac{1}{n}\sum_{j=1}^{n}x_{j}}{\sqrt{\left(\frac{1}{n}+\frac{n_{2}}{n_{1}}\right)\widehat{\sigma}^{2}}} $$

where xj=Aj−Bj, with Aj the metric obtained at the jth fold in the set of metrics A and Bj the corresponding metric in B; A and B are the vectors of size n produced by the two compared methods. Thus xj represents the difference between two evaluations in fold j (here we compared the metrics obtained with the baseline against the metrics of the other feature sets), n2 is the number of testing folds (in our case n2=1), n1 is the number of training folds (in our case n1=9) and \(\widehat{\sigma}^{2}\) is the sample variance of x.
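The corrected statistic can be computed in a few lines. The sketch below follows the formula above, with the correction term n2/n1 inflating the variance to account for overlapping training sets; the function name and the toy inputs are illustrative.

```python
import math
import statistics


def corrected_resampled_t(a, b, n_train=9, n_test=1):
    """Corrected dependent t statistic for two methods scored on the same folds.

    `a` and `b` hold the per-fold metric of each method, in the same fold order.
    """
    x = [a_j - b_j for a_j, b_j in zip(a, b)]
    n = len(x)
    mean = sum(x) / n
    variance = statistics.variance(x)  # sample variance (n - 1 denominator)
    return mean / math.sqrt((1 / n + n_test / n_train) * variance)


# Toy per-fold values chosen so the arithmetic is easy to follow:
# x = [1, 2, 3], mean = 2, variance = 1, 1/3 + 1/9 = 4/9.
t = corrected_resampled_t([2, 4, 6], [1, 2, 3])
```

The resulting value is compared against a Student distribution with n−1 degrees of freedom; with 10-fold cross-validation, n=10 and n2/n1=1/9.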

Feature sets variations and notation. We aimed to measure the impact of enriching the vector representations of EMRs with different features extracted from knowledge graphs when predicting hospitalization. We detail below the notations used to refer to the different vector representation evaluated in our experiments:

baseline: bag-of-words representation of EMRs, no ontological enrichment is made on EMR data.

+t : refers to an enrichment with concepts from the OWL-SKOS representation of ICPC-2.

+c: refers to an enrichment with concepts from the OWL-SKOS representation of ATC, the number or number interval indicates the different hierarchical depth levels used.

+wa: refers to an enrichment with Wikidata’s ‘subject has role’ property (wdt:P2868).

+wi: refers to an enrichment with Wikidata’s ‘significant drug interaction’ property (wdt:P769).

+wm: refers to an enrichment with Wikidata’s ‘medical condition treated’ property (wdt:P2175).

+d: refers to an enrichment with concepts from the NDF-RT OWL representation, prevent indicates the use of the may_prevent property, treat the may_treat property and CI the CI_with property.

Here, we detail the additional notations to refer to vector representations built from the different methods of selection of concepts from DBpedia. For features sets other than +s∗ and +s, we evaluated the impact of the selection of concepts extracted from DBpedia, whether this feature selection process is performed by machines or humans. This is to observe whether various feature selection methods are relevant to improve the prediction of hospitalization and thus have an impact on reducing the noise that knowledge graphs can bring:

The +s∗ notation refers to an approach using the enrichment of representations with concepts among the list of the 14 manually selected concepts (see Table 4) from DBpedia. This approach does not exploit all text fields to extract knowledge from DBpedia, these fields are related to the patient’s own record with: the patient’s personal history, allergies, environmental factors, current health problems, reasons for consultations, diagnosis, drugs, care procedures, reasons for prescribing drugs and physician observations.

The +s notation refers to an approach using the enrichment of representations with concepts among the list of the 14 manually selected concepts (see Table 4) from DBpedia. This approach uses all text fields to identify entities with: the patient’s personal history, family history, allergies, environmental factors, past health problems, current health problems, reasons for consultations, diagnosis, drugs, care procedures, reasons for prescribing drugs, physician observations, symptoms and diagnosis.

+s∗T refers to an enrichment with the labels of concepts automatically extracted from DBpedia with the help of the SPARQL query in Listing 1, 285 concepts are thus considered with this approach. Like all representations starting with prefix +s∗, concepts were extracted from fields related to the patient’s own record: history, allergies, environmental factors, current health problems, reasons for consultations, diagnosis, drugs, care procedures, reasons for prescribing drugs and physician observations.

+s∗∩ refers to an enrichment with a subset of the labels of concepts automatically extracted from DBpedia acknowledged as relevant by all the expert human annotators. This approach uses the same text fields as the previous features set.

+s∗∪ refers to an enrichment with a subset of the labels of concepts automatically extracted from DBpedia acknowledged as relevant by at least one expert human annotator. This approach uses the same text fields as the previous features sets.

+s∗m refers to an enrichment with a subset of the labels of concepts automatically selected by using a feature selection algorithm. We chose the Lasso algorithm [28] and we executed it within the internal loop of the nested cross-validation (with L, the number of folds fixed at 3) in the global machine learning algorithm chosen to predict hospitalization. This approach uses the same text fields as the previous features sets.

+sm uses the same enrichment procedure of +s∗m to automatically select a subset of the labels of concepts. Contrary to the other features sets built with DBpedia, this one uses all text fields, so in addition to the ones from s∗, we consider: family history, past health problems, symptoms.

+sm∩ uses a subset of +sm with concepts selected by feature selection in all the 10 folds (external loop). This approach uses the same text fields as the previous features set. In total, it considers 14 different concepts (or 19 concepts if we consider that 2 concepts with the same name but different prefixes are different).

+sm∪ uses a subset of +sm with concepts selected by feature selection in at least one fold out of 10 (external loop). This approach uses the same text fields as the previous features sets. In total, it considers 51 different concepts (or 63 concepts when taking into account prefixes).
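The two consensus sets +sm∩ and +sm∪ amount to an intersection and a union over the concepts kept in each external fold. A minimal sketch, with made-up concept names and three folds instead of ten:

```python
def consensus_sets(selected_per_fold):
    """Return (concepts kept in every fold, concepts kept in at least one fold)."""
    sets = [set(selected) for selected in selected_per_fold]
    return set.intersection(*sets), set.union(*sets)


# Hypothetical output of feature selection on three external folds.
selected_per_fold = [
    {"Organ failure", "Cardiovascular disease", "Infection"},
    {"Organ failure", "Cardiovascular disease"},
    {"Organ failure", "Medical emergencies"},
]
in_all_folds, in_any_fold = consensus_sets(selected_per_fold)
```

The intersection gives a small, stable set of concepts, while the union generalizes the per-fold selections into a single vector that can be compared with human annotations.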

Results. First, we compared human and machine annotations with the generalization of the vectors (U1 or +sm∪ approach) produced through machine annotations, since the concepts selected with feature selection and nested cross validation may differ from one training set to another. Table 6 displays correlation metric values between experts and machine annotators (its value ranges from 0 to 2, meaning that 0 is a perfect correlation, 1 no correlation and 2 perfect negative correlation). We compare pairs of vectors in this table, if they are deemed relevant, irrelevant or not annotated (in the case of human annotation) to study the patient’s hospitalization risks.
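The correlation metric described above (0 for perfect correlation, 1 for none, 2 for perfect negative correlation) can be written directly from its definition, which is also the one implemented by `scipy.spatial.distance.correlation`. The numeric encoding of annotations in the example is an assumption for illustration.

```python
import math


def correlation_distance(u, v):
    """1 minus the Pearson correlation between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    dot = sum(a * b for a, b in zip(du, dv))
    norm_u = math.sqrt(sum(a * a for a in du))
    norm_v = math.sqrt(sum(b * b for b in dv))
    # Undefined (division by zero) when one vector is constant.
    return 1 - dot / (norm_u * norm_v)


# Hypothetical annotations over four concepts: 1 = relevant, 0 = irrelevant.
annotator_a = [1, 0, 1, 0]
annotator_b = [1, 0, 1, 0]
annotator_c = [0, 1, 0, 1]
same = correlation_distance(annotator_a, annotator_b)
opposite = correlation_distance(annotator_a, annotator_c)
```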

Table 6 Correlation metric (\(1-\frac{(u-\bar{u})\cdot(v-\bar{v})}{\Vert u-\bar{u}\Vert_{2}\,\Vert v-\bar{v}\Vert_{2}}\), with \(\bar{u}\) the mean of the elements of u and \(\bar{v}\) the mean of the elements of v) computed on the 285 concepts. A1 to A3 refer to human annotators and M1 to M10 refer to machine annotators obtained through feature selection with the +sm approach (one per fold of the 10-fold cross-validation). U1 (or +sm∪) is the union of the subjects from the sets M1 to M10. Cell shading distinguishes values strictly above 0.5, values between 0.25 and 0.5, and values strictly below 0.25

Then, Table 7 reports the results for each representation we tested on the DSB dataset with the \(F_{tp,fp}\) metric. Table 8 shows the average metrics we computed and their standard errors, to give more details on the behavior of the enriched vectors with the best-performing machine learning algorithm, logistic regression.

Table 7 \(F_{tp,fp}\) for the different vector sets considered on the balanced dataset DSB under logistic regression

Table 8 PRavg, REavg, F1avg, AUCavg and their standard error variations computed between folds for the different vector sets considered on the balanced dataset DSB under logistic regression

Figure 8 shows the average F1 score (average between the different F1 scores obtained by cross-validation) and standard deviations associated to the vector sets under logistic regression considered in Table 7. By comparing this figure with the above-mentioned table, it appears that, contrary to the trend shown in the table, there is no approach that performs better than another. Overall, in 6 to 8 out of 10 folds for SVMs a linear kernel was chosen, and in 2 to 4 out of 10 folds an RBF kernel was selected.

Fig. 8

Histograms that represent the average F1 score (y-axis) and standard deviations under logistic regression for most of the vector sets considered in Table 
