Linking common human diseases to their phenotypes; development of a resource for human phenomics

Disease–phenotype datasets

We generated four datasets of disease–phenotype associations (see Fig. 1) by text mining and by semi-automatic collection of associations from public data resources. The “Text Mined” and “Semi-automatic” datasets contain all of the text-mined and semi-automatically gathered ICD-10–phenotype associations, respectively. “Text Mined (UKB)” is the subset of “Text Mined” restricted to the common diseases found in UK Biobank. “Semi-automatic (UKB)” covers the known associations of these common diseases after further manual curation. Table 1 presents the distribution of the associations in the generated datasets by provenance. Our aim is specifically to associate common diseases in ICD-10 with phenotypes, so that datasets coded in ICD-10 can be mapped to phenotypes.

Table 1 Distribution of disease-phenotype associations in the generated datasets by provenance

The Text Mined dataset covers a total of 2,755,333 positive disease–phenotype associations (NPMI score > 0) between 13,610 distinct phenotype classes (from either MP or HP) and 6,263 distinct diseases (from ICD-10) from the literature. A total of 985,511 out of 2,755,333 disease–phenotype annotations can be linked to 1,557 of 2,106 common ICD-10 codes (Text Mined (UKB)). For the remaining 549 diseases, we could not find any positive association from the literature based on our approach.
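The positive-association filter (NPMI score > 0) can be illustrated with a minimal sketch. It assumes the standard definition of normalized pointwise mutual information computed from co-occurrence counts; the exact formulation and counting unit used for the Text Mined dataset are given in Materials and Methods, and the function name and arguments below are illustrative only.

```python
import math


def npmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Normalized pointwise mutual information from co-occurrence counts.

    n_xy    -- number of text units mentioning both the disease and the phenotype
    n_x     -- number of text units mentioning the disease
    n_y     -- number of text units mentioning the phenotype
    n_total -- total number of text units in the corpus
    (The choice of text unit is an assumption of this sketch.)
    """
    if n_xy == 0:
        return -1.0  # conventional limit when the pair never co-occurs
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 1.0  # avoid dividing by -log(1) = 0
    p_x = n_x / n_total
    p_y = n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# A pair that co-occurs more often than expected by chance scores above 0
# and would be kept as a positive association.
print(npmi(n_xy=50, n_x=100, n_y=100, n_total=1000) > 0)
```

Under this definition the score ranges from -1 (never co-occurring) through 0 (independent) to 1 (always co-occurring), which is why a threshold of 0 separates positive from non-positive associations.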

The Semi-automatic dataset covers a total of 57,671 ICD-10–HPO associations among 7,610 distinct ICD-10 classes and 6,741 distinct phenotypes, obtained by integrating a number of manually curated datasets (see Table 1). We gathered the majority of these associations (37,810 out of 57,671, linked to 4,207 of the 7,610 ICD-10 classes) from resources covering rare or common diseases. We obtained a total of 1,838 associations from Wikidata, 32,323 associations from the HPO database through OMIM–ICD-10 links from Wikidata and 2,362 through OMIM–ICD-10 links from UMLS; we also obtained 1,287 associations directly from UMLS. We gathered the remaining 19,861 associations (linked to 3,403 of the 7,610 ICD-10 classes) by propagating phenotype annotations to diseases from their superclasses in the ICD-10 hierarchy. We obtained 10,201 of these 19,861 associations by propagating phenotypes from the superclass based on the ICD-10 hierarchy, and the remaining 9,660 by lexical match between the superclass labels and the phenotype labels in HPO.

We sub-selected 2,106 distinct ICD-10 diseases from the Semi-automatic dataset covering all the common ICD-10 codes within UK Biobank. We curated their phenotype associations manually and filtered out the false positives. This curated dataset (Semi-automatic (UKB)) contains a total of 7,576 disease–phenotype associations gathered in a semi-automated way (see Materials and Methods) between 1,995 (of 2,106) common ICD-10 diseases and 2,757 distinct phenotypes linked to HPO. We gathered the majority of phenotype associations (4,337 out of 7,576 associations) for 334 distinct ICD-10 codes from HPO through ICD-10–OMIM links in either Wikidata (3,914/4,337 pairs) or UMLS (423/4,337 pairs). We gathered 541/7,576 associations linked to 473 distinct ICD-10 codes through direct mappings of ICD-10 and HPO in UMLS. We gathered 295/7,576 associations for 43 distinct ICD-10 codes from Wikidata. We generated 1,214/7,576 associations for 335 distinct ICD-10 codes by propagating phenotypes from their superclass based on the ICD-10 hierarchy.

We manually curated 433/7,576 disease–phenotype associations for 433 ICD-10 codes. We generated a total of 756/7,576 associations linked to 483 ICD-10 codes by propagating phenotypes from their superclasses when we found a lexical match between the superclass labels and the phenotype labels in HPO.

Phenotypic similarity of text mined and known associations

We measured the semantic similarity between the text-mined phenotypes and the known phenotypes of each disease. There are 296 diseases in our dataset that are contained in both ICD-10 and OMIM and for which we can obtain phenotype associations both from our text mining approach and from curated data in the HPO database. For a given disease, we measured the semantic similarity between its two phenotype profiles as the cosine similarity between the ontology embeddings of the profiles, generated through OWL2Vec* [49].
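As an illustration of the similarity measure, the sketch below computes the cosine similarity between two embedding vectors. It assumes that each phenotype profile has already been mapped to a fixed-length vector (for example by OWL2Vec*); how the embeddings are produced is outside the scope of the sketch, and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity with a zero vector is treated as 0 here
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings of a text-mined and a curated profile.
text_mined_profile = [0.2, 0.7, 0.1]
curated_profile = [0.25, 0.6, 0.05]
print(cosine_similarity(text_mined_profile, curated_profile))
```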

Our Text Mined dataset consists of disease–phenotype associations, each with a score that quantifies the association strength. The diseases in our dataset are positively associated with between 1 and 2,592 phenotypes. We hypothesize that not all positive associations are relevant and that only the stronger associations provide useful information about a disease. We test this hypothesis by ranking the phenotypes of each disease by their association (NPMI) score. We then build disease–phenotype profiles using varying thresholds for the number of top-ranked phenotypes to include. To determine a threshold that yields a phenotype profile similar to the manually curated ones, we compare the thresholded phenotype profiles to the manually curated profiles of the same disease by semantic similarity; we evaluate the similarity using receiver operating characteristic (ROC) curves [51]. We find that a threshold of 76 phenotypes results in maximal similarity to the manually curated disease–phenotype associations (ROCAUC 0.95). Figure 4 shows the results of our experiment.
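The rank-based thresholding described above can be sketched as follows. The function keeps the k highest-scoring positive phenotypes of one disease; the tie-breaking rule, the phenotype identifiers and the scores are assumptions made for illustration.

```python
from typing import Dict, List


def top_k_profile(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k phenotypes with the highest positive NPMI score.

    Ties are broken by phenotype identifier so the result is deterministic
    (an assumption of this sketch).
    """
    positive = [(phenotype, s) for phenotype, s in scores.items() if s > 0]
    ranked = sorted(positive, key=lambda item: (-item[1], item[0]))
    return [phenotype for phenotype, _ in ranked[:k]]


# Hypothetical NPMI scores for one disease; identifiers are placeholders.
scores = {"HP:A": 0.42, "HP:B": 0.10, "HP:C": -0.05, "HP:D": 0.31}
print(top_k_profile(scores, 2))  # ['HP:A', 'HP:D']
```

Repeating this for each candidate value of k and comparing the resulting profiles to the curated ones gives one evaluation point per rank, as plotted in Fig. 4.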

Fig. 4

AUC values obtained for the phenotypic similarity of text-mined and known diseases from HPO at different NPMI ranks

Predicting gene-disease associations

We further evaluated whether our Text Mined and Semi-automatic (UKB) datasets are useful in identifying gene–disease associations based on phenotype similarity. We found 53 diseases in ICD-10 that can be mapped directly to OMIM and are also present in our Text Mined and Semi-automatic (UKB) datasets. These 53 diseases are associated with 216 genes in our gene–disease dataset gathered from MGI.

Using the text-mined disease–phenotype associations and their association scores, we followed the same procedure as before: we ranked the phenotypes of each disease by association score and varied the rank as a threshold parameter. We then compared these phenotype profiles to the phenotypes resulting from loss-of-function mouse models, using the cosine similarity between their ontology embeddings, and evaluated how well this method recovers known gene–disease associations. Figure 5 shows the resulting ROCAUC at different NPMI ranks. We find the maximal ROCAUC value at rank 74 (ROCAUC 0.62).
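For readers unfamiliar with the metric, the following sketch computes a ROC AUC from similarity scores and binary labels (1 for a known gene–disease association, 0 otherwise). It uses the pairwise-comparison formulation of the AUC; the exact evaluation protocol behind the reported values (for example, how candidate genes are pooled across diseases) is not specified here, and the scores and labels are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored higher
    than a randomly chosen negative, counting ties as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores between one disease and four candidate genes.
similarities = [0.9, 0.8, 0.3, 0.1]
is_known_association = [1, 0, 1, 0]
print(roc_auc(similarities, is_known_association))  # 0.75
```

The quadratic loop is adequate for small examples; a rank-based implementation would be preferable for large candidate sets.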

Fig. 5

AUC values obtained for the phenotypic similarity of text-mined diseases and known genes from MGI at different NPMI ranks

We further used different datasets to find gene–disease associations through phenotype similarity: our Text Mined dataset restricted to the 74 top-ranked phenotypes per disease; our Semi-automatic (UKB) dataset collected from multiple databases; the phenotypes associated with the 53 diseases in the HPO database; and combinations thereof. Figure 6 shows the ROC curves resulting from this comparison. The ROCAUC values range from 0.62 for the Text Mined dataset alone to 0.79 for the combination of the Text Mined and Semi-automatic (UKB) datasets.

Fig. 6

Comparison of ROC curves for predicting gene–disease associations using cosine similarity

Comparison to expert-curated disease–phenotype associations

We created an expert-curated disease–phenotype association dataset for validation. This validation dataset consists of 830 disease–phenotype associations for 53 diseases. To generate it, we first gathered the semi-automatically curated ICD-10–HPO associations for these 53 diseases from our dataset. An expert then filtered out false positive HPO terms and added missing associations; 269 annotations were added. Because the HPO database mainly contains annotations of rare Mendelian diseases, most of its phenotype annotations are predicated on a single-gene, oligogenic, recurrent CNV or chromosomal structural disease etiology. While much of the phenotype annotation needed for common diseases can be obtained from these annotations, the HPO data include many phenotypes that are found only in the genetic syndromic disease and not in sporadic occurrences; this is discussed below. Consequently, in assembling the validation dataset, phenotypes that are not found in sporadic disease were treated as false positives unless the ICD class explicitly referred to an OMIM disease. In addition, high-level terms such as HP:0002664, Neoplasm, were excluded as being of low information content.

We used this corpus to evaluate the datasets we generated by comparing phenotype classes associated with diseases directly, using two types of evaluation, “strict” and “soft”. We called an evaluation strict if we ignored the hierarchy and semantics of phenotype ontologies and only compared whether phenotype classes matched exactly between our dataset and our benchmark. In the soft evaluation, we first propagated disease–phenotype associations over the phenotype ontology hierarchy and then evaluated on all levels of the ontology.
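The difference between the two evaluation modes can be sketched on a toy hierarchy. The strict mode compares phenotype sets directly, while the soft mode first closes each set under the superclass relation and then compares. The class names and the parent map below are hypothetical placeholders, not terms from the phenotype ontology used in the evaluation.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All superclasses of a term, following the parent map transitively."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Close a phenotype set under the superclass relation."""
    closed = set(terms)
    for term in list(closed):
        closed |= ancestors(term, parents)
    return closed


def confusion(predicted: Set[str], gold: Set[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    return len(predicted & gold), len(predicted - gold), len(gold - predicted)


# Toy hierarchy: A1 and A2 are siblings under A, which is under root.
parents = {"A1": ["A"], "A2": ["A"], "A": ["root"]}
predicted, gold = {"A1"}, {"A2"}

print(confusion(predicted, gold))                                       # strict: (0, 1, 1)
print(confusion(propagate(predicted, parents), propagate(gold, parents)))  # soft: (2, 1, 1)
```

In the toy case the strict mode gives no credit for predicting a sibling class, whereas the soft mode credits the shared superclasses.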

Our semi-automatically curated dataset covered a total of 649 disease–phenotype associations for those 53 diseases. Of these, 568/649 were true positives and 81/649 were false positives. We missed a total of 262/830 annotations (false negatives). We estimated the precision as 0.88, the recall as 0.68 and the F-score as 0.77.
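The reported values follow from the counts above under the usual definitions of precision, recall and F-score (harmonic mean of the two). A short check, with an illustrative function name:

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Strict evaluation of the semi-automatically curated dataset:
# 568 true positives, 81 false positives, 262 false negatives.
p, r, f = precision_recall_f1(tp=568, fp=81, fn=262)
print(f"precision={p:.2f} recall={r:.2f} F-score={f:.2f}")
# precision=0.88 recall=0.68 F-score=0.77
```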

Figure 7 shows the performance of the text mining extracts against the validation dataset. The performance of the text mining process varied over different NPMI ranks. The maximum F-score of 0.21 was achieved at NPMI rank 16.

Fig. 7

Performance analysis of text mining against the validation dataset over different NPMI ranks (strict)

When we propagate annotations over the PhenomeNET ontology, the validation dataset contains a total of 3,499 disease–phenotype annotations, and our semi-automatically curated dataset covers a total of 2,830 disease–phenotype annotations. In the “soft” setting, we found that 2,454/2,830 associations are true positives, 376/2,830 are false positives, and 1,045/3,499 are false negatives. We estimated the precision as 0.87, the recall as 0.70 and the F-score as 0.78.

Figure 8 shows the performance of the text-mined extracts against the validation dataset in the “soft” setting. The performance of the text mining process varies over different NPMI ranks. The best F-score, 0.44, is achieved at NPMI rank 27.

Fig. 8

Performance analysis of text mining against the validation dataset over different NPMI ranks (soft)

Coverage of the generated datasets

There are a total of 19,133 distinct ICD-10 codes. We linked 6,263 and 7,610 ICD-10 codes to their phenotypes by text mining and by the semi-automatic strategy, respectively. Of these, 4,118 ICD-10 classes were linked to their phenotypes by both methods (overlap), so that 9,755 (51%) ICD-10 classes were linked to their phenotypes by at least one method. Hence, we were unable to link 9,378 (49%) ICD-10 classes to their phenotypes. We discuss the main reasons for this in detail in the Discussion section.
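The coverage figures follow from inclusion–exclusion over the two sets of linked codes, as this short check shows (variable names are illustrative):

```python
total_codes = 19_133   # distinct ICD-10 codes
text_mined = 6_263     # codes linked by text mining
semi_automatic = 7_610 # codes linked by the semi-automatic strategy
both = 4_118           # codes linked by both methods

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
either = text_mined + semi_automatic - both
unlinked = total_codes - either

print(either, f"{either / total_codes:.0%}")      # 9755 51%
print(unlinked, f"{unlinked / total_codes:.0%}")  # 9378 49%
```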

Error analysis

Semi-automatically curated data

We identified a total of 1,369 false positives during the semi-automatic curation of the associations for all of the 2,106 common diseases. We found that 963/1,369 false positives originated from associations in existing resources, while the remaining 406/1,369 were due to the propagation of annotations. Of these, 170/406 false positives are due to lexical superclass matches in the HPO dataset and 236/406 are due to annotation propagation from ICD-10 superclasses. For example, ICD-10:C43.5, Malignant melanoma of the trunk, received the annotation HP:0007716, Uveal melanoma, through propagation from ICD-10:C43, Malignant melanoma of skin. We gathered the association between ICD-10:C43 and HP:0007716 from the HPO database through the OMIM:155600–ICD-10:C43 mapping from UMLS.
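The melanoma example can be reproduced with a minimal sketch of superclass-based propagation. It assumes that the superclass of a four-character ICD-10 subcategory such as C43.5 is its three-character category (C43), and that a subcategory simply inherits all annotations of that category; the actual propagation rules may be more selective, and the function names are illustrative.

```python
from typing import Dict, Iterable, Optional, Set


def parent_code(code: str) -> Optional[str]:
    """Three-character category of an ICD-10 subcategory, or None for a category."""
    return code.split(".", 1)[0] if "." in code else None


def propagate_from_superclass(
    annotations: Dict[str, Set[str]], codes: Iterable[str]
) -> Dict[str, Set[str]]:
    """Give each code its own phenotypes plus those of its ICD-10 category."""
    result: Dict[str, Set[str]] = {}
    for code in codes:
        own = set(annotations.get(code, set()))
        parent = parent_code(code)
        inherited = annotations.get(parent, set()) if parent else set()
        result[code] = own | set(inherited)
    return result


# ICD-10:C43 (Malignant melanoma of skin) carries HP:0007716 (Uveal melanoma).
annotations = {"C43": {"HP:0007716"}}
propagated = propagate_from_superclass(annotations, ["C43", "C43.5"])
print(propagated["C43.5"])  # {'HP:0007716'}
```

The subcategory C43.5 inherits the uveal melanoma phenotype although it refers to the trunk, which is the kind of false positive that manual curation had to remove.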

Breaking down the 963 false positives generated from the known data, we found that 12/963 came from the Wikidata set, 3/963 were due to the direct ICD-10–HPO mappings in UMLS, 19/963 were incorrect associations introduced during the manual expert curation through inclusion of syndromic phenotypes, as discussed above, and the remaining 929/963 were due to the use of the asserted disease–phenotype annotations in the HPO database. We investigated these 929 false positives further. Because the HPO database annotates OMIM diseases with HPO phenotypes, ICD-10 identifiers for these diseases had to be obtained through OMIM–ICD-10 mappings; we therefore examined what share of the false positives was introduced through the mappings in UMLS and in Wikidata. We found that 44/929 false positives were introduced by OMIM–ICD-10 mappings from UMLS and the remaining 885/929, which constitute the majority, were introduced by OMIM–ICD-10 mappings from Wikidata.

For example, ICD-10:I77.1, Stricture of artery, is annotated to HP:0002036, Hiatus hernia, because Wikidata maps this ICD-10 class to OMIM:208050, Arterial tortuosity syndrome, which has a wide clinical phenotype spectrum that includes Hiatus hernia. Phenotypes that would not normally be considered a manifestation of sporadic non-syndromic arterial stricture, such as Arachnodactyly or Hiatus hernia, were considered false positives. However, a correct annotation for this class, HP:0100545, Arterial stenosis, was obtained directly from UMLS. In general, the ICD-10 to OMIM mappings from Wikidata generate candidate HPO annotations that are associated with Mendelian, syndromic disease, which accounts for the high number of false positives through this route. These had to be removed manually on a case-by-case basis using expert judgement, where sporadic disease would not be expected to have these associations.

False negatives, i.e. missing annotations, were usually called when the annotation was sparse although clearly associated phenotypes are available in HPO. The causes of this are interesting. For example, HP:0000979, Purpura, was missing from the annotation of ICD-10:M31.3, Wegener granulomatosis [52], and HP:0025188, Retinal vasculitis, was missing from ICD-10:M32.9, Systemic lupus erythematosus [53]. In the former case, although Wegener granulomatosis is in OMIM (OMIM:608710), there is no clinical synopsis and it was therefore not possible to gather annotations from the HPO database. In the latter case, Systemic lupus erythematosus (HP:0002725) is treated as a “bundled term” phenotype in the HPO database, and therefore no more granular phenotype annotations are available. There are no direct HPO annotations for Systemic lupus erythematosus in UMLS. We cannot guarantee that all missing annotations have been added to the dataset, but we have made our best efforts with the resources available. We hope that users will, over time, request the addition of phenotypes to their diseases of interest.

Text mined data

For the analysis of the text-mined associations, we used the extracts at NPMI rank 16, which gave the best result on the validation dataset under the strict evaluation (precision 0.25, recall 0.17, and F-score 0.21). This text-mined dataset contains a total of 568 ICD-10–HPO pairs. We found that 143/568 are true positives and 425/568 are false positives. We missed a total of 687 associations (false negatives).

Our manual analysis of the 425 false positives shows that only a small portion of them (47/425) are genuine false positives and that the majority (376/425) are actually true positive associations that are not covered by our validation dataset. Our validation dataset includes only the obvious and distinguishing phenotypes of diseases, whereas these 376 associations link diseases to high-level HPO classes. For example, Malignant neoplasm of stomach, unspecified (ICD-10:C16.9) is associated with Neoplasm (HP:0002664) according to our text mining extracts. This is a true positive by manual analysis, but it was counted as a false positive because it is not covered by our validation dataset: Neoplasm is a high-level phenotype for all malignant and benign proliferative lesions and is of low information content. The genuine false positives are mainly due to co-mentions of associated disease concepts, or to negations in the publications (X is not a Y). Examples of such associations include Acute myeloid leukaemia (ICD-10:C92.0) with Chronic myelomonocytic leukemia (HP:0012325), and Primary open-angle glaucoma (ICD-10:H40.1) with Angle closure glaucoma (HP:0012109).

Analysis of the 687 false negatives showed that 473 of the 687 pairs (69%) have in fact been extracted from the literature but do not rank in the top 16 by their NPMI association scores. The other missing associations are mainly due to weak or no evidence in the literature. For example, there are no publications mentioning Marfan syndrome (ICD-10:Q87.4) and Decreased muscle mass (HP:0003199) together, and there are only 2 publications mentioning Parkinson’s disease (ICD-10:G20) and Macrocephaly (HP:0000256) together in the title or abstract in PubMed (search done on 15th April 2021). One of these publications was published in 2021 and is not covered by our current dataset. Therefore, there is no significant supporting evidence in the literature to infer a positive association between these disease–phenotype pairs. Other false negatives could be due to missing disease or phenotype synonyms.

Altogether, we estimated the actual performance of the text mining method (at NPMI rank 16) as a precision of 0.92, a recall of 0.43 and an F-score of 0.59.
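The corrected estimate can be reproduced from the counts reported above under one plausible reading, which is an assumption of this sketch: the 376 manually confirmed associations are added to the 143 true positives, only the 47 confirmed errors are kept as false positives, and the 687 false negatives are left unchanged. The text accounts for 423 of the 425 original false positives, so the sketch uses only the two stated counts.

```python
# Counts from the strict evaluation at NPMI rank 16.
true_positives = 143
false_negatives = 687

# Manual review of the 425 original false positives.
confirmed_correct = 376  # correct associations absent from the validation dataset
confirmed_errors = 47    # genuine false positives

# Assumed reclassification: confirmed-correct pairs count as true positives.
tp = true_positives + confirmed_correct
fp = confirmed_errors
fn = false_negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F-score={f_score:.2f}")
# precision=0.92 recall=0.43 F-score=0.59
```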
