Elucidating the semantics-topology trade-off for knowledge inference-based pharmacological discovery

We study the relationship between knowledge graph relation quality and network topology by applying pre-processing perturbations to the KG before inference and analyzing the downstream effect on performance. This framework elucidates how much the model relies on relational knowledge versus topology alone. We measure this effect by evaluating the drop in performance when corrupting relations under different network topologies. We provide a schematic illustrating the graph perturbation pipeline and subsequent evaluation for a downstream pharmacological task in Fig. 1.

Fig. 1

Overview of the KG processing and evaluation pipeline. Input KGs are first pre-processed by altering their topology via degree-based downsampling or hub removal. The semantics of KG relations are experimentally perturbed via corruption or flattening down to a single edge type for non-whitelist triples. After pre-processing, the KG is used for downstream tasks by predicting links using KG embedding methods. The performance under different experimental conditions is evaluated

Data

We define a knowledge graph, \(\mathcal{G}\), as a collection of triples of the form \((h, r, t) \in \mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}\), where \(h, t \in \mathcal{E}\) are entities (or equivalently, nodes) and \(r \in \mathcal{R}\) is a relation (or equivalently, an edge). Two knowledge graphs were used in this study: GNBR [4], which is NLP-derived, and Hetionet [25], which is derived from structured databases.

Table 1 Knowledge graph statistics. Med. ND = median node degree. Max ND = maximum node degree. EE = entity entropy (see “Metrics” section)

GNBR

The Global Network of Biomedical Relations (GNBR) is a knowledge graph of relationships between drugs, genes, proteins, and diseases extracted from PubMed abstracts [4]. Sentences containing co-occurring pairs of drugs, genes, proteins, and diseases identified via named entity recognition (NER) were clustered together based on dependency parsing and co-occurrence frequency. Common dependency paths were assigned one of 32 high-level semantic themes by annotators, which defined 32 relations.

Hetionet

Hetionet is a biomedical knowledge graph built from 29 structured data sources [25]. The full KG contains 11 types of nodes and 24 types of edges describing interactions between genes, compounds, diseases, side effects, symptoms, pathways, and other entity types. We restricted the graph to chemicals, genes, proteins, and diseases so that both KGs encode comparable mechanistic knowledge for driving repurposing inference. Data for GNBR and Hetionet were downloaded from the compiled Drug Repurposing Knowledge Graph (DRKG) network [26]. In both KGs, genes and proteins are treated as a single entity type and not disambiguated, as is standard in the field. Network statistics for GNBR and the Hetionet subset used in this work are described in Table 1.

KG pre-processing perturbations

We evaluated the effect of four knowledge graph perturbation strategies, two changing the topology of the graph and two ablating the semantics of KG relations.

Topology perturbation via degree-based downsampling

Knowledge graph topology was perturbed by downsampling entities or triples based on degree before embedding and evaluation. We define the degree of a node in the knowledge graph as the sum of the in- and out-edges adjacent to the node:

$$\deg(i) := |\{(h, r, t) \in \mathcal{G} : h = i\}| + |\{(h, r, t) \in \mathcal{G} : t = i\}|$$

In the entity downsampling condition, a fraction, \(f_{ent}\), of entities with degree above the \(p^{th}\) percentile, \(\deg_p\), were removed uniformly at random.
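
For concreteness, a minimal Python sketch of this step, assuming the KG is held as a list of `(h, r, t)` tuples; the function names and the nearest-rank percentile method are our own illustrative choices, not from the paper:

```python
import random
from collections import Counter

def degrees(triples):
    """deg(i): number of triples in which entity i appears as head or tail."""
    deg = Counter()
    for h, _, t in triples:
        deg[h] += 1
        deg[t] += 1
    return deg

def downsample_entities(triples, f_ent, p, seed=0):
    """Remove a fraction f_ent of the entities whose degree exceeds the
    p-th percentile, along with every triple incident to a removed entity."""
    rng = random.Random(seed)
    deg = degrees(triples)
    ordered = sorted(deg.values())
    # p-th percentile of the degree distribution (nearest-rank method)
    deg_p = ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]
    candidates = [n for n, d in deg.items() if d > deg_p]
    removed = set(rng.sample(candidates, int(f_ent * len(candidates))))
    return [(h, r, t) for (h, r, t) in triples if h not in removed and t not in removed]
```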

In triple-based downsampling, triples were removed from the graph based on degree until a fraction, \(d\), of the initial triples remained. We define the degree of a triple, \(e\), as the sum of the degrees of its two entities:

$$\deg(e) = \deg((h, r, t)) := \deg(h) + \deg(t).$$

To account for the correlation of triples’ degrees, whereby removal of one triple might affect the degree of another, downsampling was done iteratively in batches using Algorithm 1, where


Algorithm 1 Degree-based KG triple downsampling protocol

$$u^{(\alpha)}(e) := (1+\deg(e))^{\alpha}$$

and

$$p^{(\alpha)}(e) := u^{(\alpha)}(e) \bigg/ \sum\limits_{e' \in \mathcal{G}} u^{(\alpha)}(e').$$

The degree strength parameter, \(\alpha\), determines how degree is used for downsampling: its magnitude controls the strength of the degree-based selection, and its sign controls whether high-degree triples (positive \(\alpha\)) or low-degree triples (negative \(\alpha\)) receive greater probability mass for downsampling.
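
A sketch of our reading of Algorithm 1 follows, with the batch size as an assumed free parameter; degrees are recomputed between batches precisely because removing one triple changes the degree of triples that share its entities:

```python
import numpy as np
from collections import Counter

def downsample_triples(triples, d, alpha, batch_size=1000, seed=0):
    """Iteratively drop triples with probability p(e) proportional to
    (1 + deg(e))**alpha until a fraction d of the original triples remains."""
    rng = np.random.default_rng(seed)
    triples = list(triples)
    target = int(d * len(triples))
    while len(triples) > target:
        # Recompute node degrees on the current (partially downsampled) graph
        deg = Counter()
        for h, _, t in triples:
            deg[h] += 1
            deg[t] += 1
        # u(e) = (1 + deg(h) + deg(t))**alpha, normalized to probabilities p(e)
        u = np.array([(1.0 + deg[h] + deg[t]) ** alpha for h, _, t in triples])
        p = u / u.sum()
        k = min(batch_size, len(triples) - target)
        drop = set(rng.choice(len(triples), size=k, replace=False, p=p))
        triples = [tr for i, tr in enumerate(triples) if i not in drop]
    return triples
```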

Relation perturbation experiments

Two pre-processing procedures were employed to ablate the biologically meaningful semantics of triples in the input knowledge graphs: flattening and corrupting. In the corrupting condition, a fraction, \(f_c\), of non-whitelist triples were corrupted, where corrupting is defined as resampling the triple’s relation to another relation, \(r' \in \mathcal{R}\), uniformly at random. The flattening procedure is analogous: the relations of a fraction, \(f_f\), of non-whitelist triples were mapped to a single arbitrary relation, “relates”.
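
A sketch of the two perturbations, assuming the same list-of-tuples representation as above (`f_c` and `f_f` name the corrupted and flattened fractions):

```python
import random

def corrupt_relations(triples, whitelist, f_c, seed=0):
    """Resample the relation of a fraction f_c of non-whitelist triples
    to another relation drawn uniformly at random."""
    rng = random.Random(seed)
    relations = sorted({r for _, r, _ in triples})
    out = []
    for h, r, t in triples:
        if r not in whitelist and rng.random() < f_c:
            r = rng.choice([x for x in relations if x != r])
        out.append((h, r, t))
    return out

def flatten_relations(triples, whitelist, f_f, seed=0):
    """Map the relation of a fraction f_f of non-whitelist triples to the
    single arbitrary relation 'relates'."""
    rng = random.Random(seed)
    return [(h, "relates", t) if r not in whitelist and rng.random() < f_f else (h, r, t)
            for h, r, t in triples]
```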

Models

Knowledge inference models

In this study we considered four knowledge graph embedding models for knowledge inference: TransE [27], DistMult [28], ComplEx [29], and RotatE [30]. These models map entities and relations to numerical embeddings in a vector space such that knowledge graph triples have a meaningful geometric interpretation in the learned space. This representation enables downstream tasks, including knowledge inference, by measuring the plausibility of inferred triples, i.e., those not seen in training. In this work, embeddings are used for our knowledge reconstruction task, where we infer known but obscured whitelist relationships.

In TransE, entities and relations are mapped to \(k\)-dimensional vectors such that triples, \((h, r, t)\), in the KG can be represented as translations from \(\textbf{h}\) to \(\textbf{t}\) via \(\textbf{r}\), where \(\textbf{h}, \textbf{r}, \textbf{t} \in \mathbb{R}^k\). The TransE score function is:

$$f(h, r, t) = -\|\textbf{h} + \textbf{r} - \textbf{t}\|_2$$

The notion of learning embeddings to optimize for translation is conceptually simple, but it fails to capture relation properties that may be intrinsically semantically important, such as symmetry.

DistMult [28] learns embeddings using a semantic matching approach, optimizing for the head, relation, and tail embeddings of KG triples to point in the same direction in real space. The scoring function for DistMult is:

$$f(h, r, t) = \textbf{h}^{T} \text{diag}(\textbf{r}) \, \textbf{t},$$

where \(\textbf{h}, \textbf{r}, \textbf{t} \in \mathbb{R}^k\).

ComplEx [29] uses a semantic matching approach like DistMult, but in the complex plane, \(\textbf{h}, \textbf{r}, \textbf{t} \in \mathbb{C}^k\):

$$f(h, r, t) = \text{Re}(\textbf{h}^{T} \text{diag}(\textbf{r}) \, \overline{\textbf{t}}),$$

where \(\overline{\textbf{t}}\) denotes the complex conjugate of \(\textbf{t}\).

Finally, RotatE [30] learns embeddings such that the relation embedding represents a rotation of the head vector to the tail vector in the complex plane, \(\textbf{h}, \textbf{r}, \textbf{t} \in \mathbb{C}^k\):

$$f(h, r, t) = \|\textbf{h} \circ \textbf{r} - \textbf{t}\|^2_2,$$

where \(\circ\) denotes the Hadamard (element-wise) product. This model has been shown to be the most expressive of the four methods, with the ability to capture symmetry, antisymmetry, inversion, and composition properties in relations. We focused our investigation on TransE, the model with the simplest geometric interpretation, and RotatE, the model that is the most expressive and consistently outperforms the other three on KG prediction tasks [30].
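
The four score functions are easy to state directly in NumPy; below is a toy sketch using random, untrained embeddings purely to illustrate the score computations, keeping the sign conventions exactly as written above:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50

# Real-valued embeddings (TransE, DistMult)
h, r, t = (rng.normal(size=k) for _ in range(3))
# Complex-valued embeddings (ComplEx, RotatE); RotatE constrains |r_i| = 1,
# so the relation acts as an element-wise rotation
hc, tc = (rng.normal(size=k) + 1j * rng.normal(size=k) for _ in range(2))
rc = np.exp(1j * rng.uniform(0, 2 * np.pi, size=k))

transe   = -np.linalg.norm(h + r - t, ord=2)     # -||h + r - t||_2
distmult = h @ (r * t)                           # h^T diag(r) t
complex_ = np.real(hc @ (rc * np.conj(tc)))      # Re(h^T diag(r) conj(t))
rotate   = np.linalg.norm(hc * rc - tc) ** 2     # ||h o r - t||_2^2
```

Note that the RotatE expression as written above is a squared distance; the original RotatE paper [30] scores with the negative of this distance so that higher scores indicate more plausible triples.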

Implementation details

Model training and evaluation were done using the PyKEEN package [31]. Hyperparameter values were set based on existing work on hyperparameter tuning of KG embeddings for biomedical link prediction, particularly for Hetionet [17]. Models were trained for 500 epochs with a learning rate of 0.02 and 50 negative samples generated per positive triple. The PyKEEN default embedding dimensions were used: \(k = 50\) for TransE and DistMult, and \(k = 200\) for ComplEx and RotatE. In all experiments, we used the negative sampling loss with self-adversarial training [30] and AdaGrad [32] for optimization.
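
A hedged sketch of how these settings map onto PyKEEN's `pipeline` API; the file name is a placeholder, option spellings may differ slightly across PyKEEN versions, and for brevity this uses PyKEEN's random split rather than the whitelist-constrained split described below:

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Placeholder path: a pre-processed KG stored as tab-separated (head, relation, tail) lines
tf = TriplesFactory.from_path("perturbed_kg.tsv")
training, testing = tf.split([0.95, 0.05], random_state=0)

result = pipeline(
    training=training,
    testing=testing,
    model="RotatE",                        # or TransE / DistMult / ComplEx
    model_kwargs=dict(embedding_dim=200),  # 50 for TransE / DistMult
    loss="NSSA",                           # negative sampling loss with self-adversarial weighting
    optimizer="Adagrad",
    optimizer_kwargs=dict(lr=0.02),
    negative_sampler_kwargs=dict(num_negs_per_pos=50),
    training_kwargs=dict(num_epochs=500),
)
print(result.metric_results.get_metric("adjusted_mean_rank_index"))
```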

Table 2 Task-specific whitelist relations

Evaluation

Performance was evaluated on a held-out test set using a typical KG embedding evaluation framework based on concealing and inferring head and tail nodes in test triples [31].

Pharmacological evaluation tasks

We evaluate three different biomedical knowledge inference tasks: drug-disease prediction (drug repurposing), disease-gene association, and drug-target (equivalently, “drug-gene”) interaction. For each task, a set of relations from each dataset is considered the whitelist, whose members are candidates for test set sampling. Whitelist relations are listed in Table 2. These comprise the standard set of whitelist relations for various pharmacological knowledge inference tasks [15].

Test set sampling

To split triples into training and test sets, candidate test triples were first determined after all network pre-processing. For a given task, a triple, \((h, r, t)\), is eligible for inclusion in the test set if it satisfies two criteria: (a) the triple’s relation, \(r\), is in the whitelist set of relations for the task, and (b) \(\min(\deg(h), \deg(t)) \ge 4\). We sampled 5% of eligible triples to create the test set. All other triples, including the remaining whitelist triples, comprised the training set.
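
A sketch of this split under the same representation as before; the eligibility threshold and 5% rate come straight from the text, and `whitelist` is the task's relation set from Table 2:

```python
import random
from collections import Counter

def split_train_test(triples, whitelist, min_deg=4, test_frac=0.05, seed=0):
    """Test candidates: whitelist relation and min(deg(h), deg(t)) >= min_deg.
    Everything not sampled into the test set stays in training."""
    rng = random.Random(seed)
    deg = Counter()
    for h, _, t in triples:
        deg[h] += 1
        deg[t] += 1
    eligible = [(h, r, t) for h, r, t in triples
                if r in whitelist and min(deg[h], deg[t]) >= min_deg]
    test = set(rng.sample(eligible, int(test_frac * len(eligible))))
    train = [tr for tr in triples if tr not in test]
    return train, sorted(test)
```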

Metrics

As in [23], we calculated entity entropy (EE) as a global metric of network topology. The intuition for this metric is that hub-dominated networks will have low EE, while networks where each node has approximately the same degree will have high EE. EE is calculated as:

$$EE(\mathcal{G}) = \sum\limits_{n \in \mathcal{E}} -P_{\mathcal{G}}(n) \log P_{\mathcal{G}}(n),$$

where \(P_{\mathcal{G}}\) is the entity selection probability distribution, which describes the probability that an entity appears in a triple sampled uniformly from \(\mathcal{G}\). \(P_{\mathcal{G}}\) is calculated as:

$$P_{\mathcal{G}}(n) := \frac{\deg(n)}{2|\mathcal{G}|}.$$

Lastly, we define normalized entity entropy, \(EE_{norm}\), as \(EE_{norm}(\mathcal{G}) := \frac{EE(\mathcal{G})}{\log(|\mathcal{E}|)}\) such that \(EE_{norm}: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \rightarrow [0,1]\).
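
Under these definitions, EE reduces to a Shannon entropy over degree-proportional probabilities. A short sketch, where the \(2|\mathcal{G}|\) normalizer reflects the two entity slots per triple:

```python
import math
from collections import Counter

def entity_entropy(triples, normalized=False):
    """EE(G) = -sum_n P_G(n) log P_G(n), with P_G(n) = deg(n) / (2|G|)."""
    deg = Counter()
    for h, _, t in triples:
        deg[h] += 1
        deg[t] += 1
    total = 2 * len(triples)
    ee = -sum((d / total) * math.log(d / total) for d in deg.values())
    if normalized:
        ee /= math.log(len(deg))  # divide by the maximum possible entropy, log |E|
    return ee
```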

Knowledge inference performance was evaluated using adjusted mean rank index (AMRI) scores, as in [16]. AMRI is a metric that accounts for the expected rank of entities under a uniform random ranking, giving a more faithful representation of the quality of embedding-based predictions. AMRI scores lie in the range \([-1, 1]\), where a value of 1 indicates perfect performance (i.e., the obscured entity is always ranked at the top of the predicted list), 0 indicates random-like predictions, and negative values indicate worse-than-random predictions.
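
AMRI can be computed from raw ranks alone; the following sketch follows our reading of the definition in [16], in which the observed mean of \(rank - 1\) is compared against its expectation under a uniform random ordering of the candidates:

```python
def amri(ranks, num_candidates):
    """Adjusted mean rank index: 1 = perfect, ~0 = random, < 0 = worse than random.

    ranks[i]          : 1-based rank of the held-out entity for test query i
    num_candidates[i] : number of candidate entities scored for query i
    """
    n = len(ranks)
    observed = sum(r - 1 for r in ranks) / n
    # E[rank - 1] under a uniform random ordering of c candidates is (c - 1) / 2
    expected = sum((c - 1) / 2 for c in num_candidates) / n
    return 1 - observed / expected
```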

Fig. 2

AMRI evaluation of drug-disease inference for two knowledge graphs (GNBR and Hetionet) and two KG embedding inference methods (TransE and RotatE). Results are reported for varying entity entropy conditions induced by downsampling high-degree triples (\(\alpha = 2\)). Relation semantics were perturbed by two procedures: corrupting, or shuffling relations randomly, and flattening, or mapping all non-whitelist relations to a single arbitrary relation
