Synthesizing evidence from clinical trials with dynamic interactive argument trees

Our method generates a conclusion from the existing evidence with respect to the superiority of a given treatment in comparison to another treatment. While we focus on the direct comparison between two treatments in our exposition of the method, the approach can be extended to comparing multiple treatments.

The conclusion generated has the form of a tree in the sense that it consists of an overall conclusion about the superiority of the treatment at the root level which points to several children representing the (interim) conclusions for specific comparison criteria. Take the following example of an automatically generated conclusion comparing two types of insulin, the Neutral Protamine Hagedorn insulin (NPH) and insulin glargine (IGlar) as treatments for Type 2 Diabetes Mellitus (T2DM):

↦ IGlar is overall superior to NPH insulin in terms of safety (considering nocturnal hypoglycemia) and efficacy (considering HbA1c reduction) when weighted equally.

  ⇒ IGlar is superior to NPH insulin in terms of efficacy.

    - Benedetti et al. [9] show that IGlar is superior to NPH insulin in reducing HbA1c.

    - Hsia et al. [10] show that IGlar is NOT superior to NPH insulin in reducing HbA1c.

    - “n other arguments from corresponding studies” show that IGlar is superior to NPH insulin in reducing HbA1c.

  ⇒ IGlar is superior to NPH insulin in terms of safety.

    - Benedetti et al. [9] show that IGlar is superior to NPH insulin in terms of nocturnal hypoglycemia.

    - No study shows that IGlar is NOT superior to NPH insulin in terms of nocturnal hypoglycemia.

    - “n other arguments from corresponding studies” show that IGlar is superior to NPH insulin in terms of nocturnal hypoglycemia.

The overall conclusion (marked with ↦) claims the superiority of IGlar over NPH insulin when efficacy and safety are weighted equally. As justification of this overall conclusion, we have the (interim) conclusions/arguments claiming the superiority of IGlar over NPH insulin in terms of efficacy and safety, respectively (marked with ⇒). As a child of the (interim) conclusion on efficacy, we have an argument claiming the superiority of IGlar over NPH insulin in terms of higher effectiveness in reducing HbA1c. As a child of the (interim) conclusion on safety, we have an (interim) conclusion that IGlar is superior to NPH insulin regarding the reduction of nocturnal hypoglycemia. Finally, the children of these two (interim) conclusions point to claims in specific publications backing up the superiority claims with respect to reducing HbA1c and reducing cases of nocturnal hypoglycemia. Each node in the argumentation tree thus represents an (interim) conclusion that is justified by the nodes below it, down to the claims of specific publications. The conclusions derived directly from claims of specific publications are called Atomic Arguments, while the arguments generated by our method by aggregating results across clinical trials are called Aggregated Arguments.

The method relies on a knowledge base in which all relevant trials have been semantically described in the Resource Description Framework (RDF) following the C-TrO Ontology [11]. We note that any other correspondingly expressive ontology could be used. The argumentation tree is computed by a recursive procedure that starts from the root of the tree and recursively invokes procedures to generate the child arguments. Thus, the first arguments/conclusions that are generated are the atomic arguments, with information flowing up to higher levels of the tree where it is aggregated.

In the following, we first describe the C-TrO ontology and how it is used in our approach to semantically capture the results from clinical trials in a knowledge base. We further describe the procedure for automatically generating the Dynamic Interactive Argumentation Tree (DIAeT) representing the hierarchical conclusion on the basis of the given knowledge base. We present the relevant definitions and other important concepts needed to expose our approach before describing the method formally. We also hint at requirements that NLP methods that automatically extract evidence from publications need to fulfill.

The C-TrO ontology and knowledge base

In order to provide a proof of concept for our method, we have manually populated an RDF knowledge base following the structure of the C-TrO ontology [11]. Existing clinical ontologies [12–15] have been designed to support the searching, question formulation, and retrieval of evidence from the scientific literature, and focus on a coarse-grained representation of the PICO elements. For example, in the PICO ontology [14], the outcomes are represented as textual descriptions but not in more detail as numerical values for each result of the interventions. Although the Study Cohort Ontology (SCO) [15] considers some pertinent entities for clinical trials such as diseases, drugs, and populations, it does not include all the entities and relationships useful for clinical trial synthesis (e.g., quantitative results of endpoints). In contrast, C-TrO was designed to support the aggregation/synthesis of clinical evidence. It describes fine-grained information about results comparing a certain interventional group (or arm) to a baseline condition, allowing claims about differences from the mean, reductions, effect sizes, etc. Figure 1 shows the schema of C-TrO used in this work.

Fig. 1 Diagram of the main classes of C-TrO. Data properties are in green and object properties in black. The arrows start at the domain classes and end at the range classes

C-TrO has been developed as a general schema to represent the design and results of clinical studies, and it is independent of a particular data source. We used Protégé [16] to populate the C-TrO knowledge base by manually extracting the information from the clinical trials studied in the meta-analyses on glaucoma and on T2DM that are included in the use cases presented later. As a result, the information of the relevant clinical trials is captured in the form of RDF triples in the knowledge base. The example in Fig. 2 illustrates part of the description of the results in a published clinical trial on glaucoma [17] (PMID 8628544) that has been formalized in the knowledge base. An excerpt of the triple representation describing the corresponding study in RDF is given in Table 1. The full RDF file can be downloaded from the repository indicated in “Availability of Data and Materials”. Once the information is in the knowledge base, the method, implemented as a tool, retrieves the information with a SPARQL query formed according to the parameters selected in the user interface (see Table 2). The retrieved information is the base evidence used in the construction of the DIAeTs.
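To illustrate how such a parameterized retrieval could look programmatically, the following sketch uses Python and rdflib. The namespace, class, and property names are placeholders chosen for illustration only; they do not reproduce the actual C-TrO vocabulary or the query shown in Table 2, and the file name is hypothetical.

```python
# Illustrative sketch only: placeholder namespace/properties, not the real C-TrO terms.
from rdflib import Graph, Literal, Variable

g = Graph()
g.parse("ctro_knowledge_base.ttl", format="turtle")  # hypothetical file name

QUERY = """
PREFIX ctro: <http://example.org/ctro#>              # placeholder namespace
SELECT ?trial ?outcome ?value WHERE {
    ?trial   a ctro:ClinicalTrial ;
             ctro:hasArm ?arm1, ?arm2 .
    ?arm1    ctro:hasIntervention/ctro:hasDrug/ctro:hasDrugName ?drugName1 .
    ?arm2    ctro:hasIntervention/ctro:hasDrug/ctro:hasDrugName ?drugName2 .
    ?arm1    ctro:hasOutcome ?outcome .
    ?outcome ctro:hasEndpointDescription ?endpointDesc ;
             ctro:hasChangeValue ?value .
}
"""

# As with Table 2, the user's selections are passed in as variable bindings.
results = g.query(
    QUERY,
    initBindings={
        Variable("drugName1"): Literal("latanoprost"),
        Variable("drugName2"): Literal("timolol"),
        Variable("endpointDesc"): Literal("IOP reduction"),
    },
)
for row in results:
    print(row.trial, row.outcome, row.value)
```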

Fig. 2 Annotated excerpt from a glaucoma clinical trial. Only the pieces of information related to the latanoprost intervention and one of its outcomes are annotated for illustrative purposes. (The annotations were made with INCEpTION [31])

Table 1 Triples corresponding to some information from the clinical trial PMID 8628544

Table 2 SPARQL query to retrieve clinical evidence from the C-TrO knowledge base. The values for the variables ?drugName1, ?drugName2, ?endpointDesc, and ?AEName are passed from the system

Natural language processing (NLP) requirements

While we have modeled the evidence manually for this work, the option of applying NLP methods to extract the evidence from publications automatically is appealing. However, such NLP methods must fulfill a number of requirements to be applicable in our context. They should be able to generate a machine-readable representation of a publication that comprises the study design, the population characteristics (in particular the condition, the inclusion and exclusion criteria, and the age of participants), the duration of the study, and, most importantly, the arms of the study with the corresponding treatment information, including dosage, frequency of application, etc. Further, the central outcomes, including values and units, need to be extracted for every primary and secondary endpoint, comparing the different arms. Corresponding semantic medical vocabularies such as the Medical Subject Headings (MeSH) or the International Classification of Diseases (ICD) should be used to normalize treatments, conditions, etc.
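As a rough, hypothetical sketch of the machine-readable target representation such an NLP pipeline would need to produce, one could imagine structures like the following; the field names are our own illustration, not a prescribed schema.

```python
# Hypothetical target schema for automatically extracted trial information.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Outcome:
    endpoint: str              # e.g. "HbA1c reduction" (primary or secondary)
    value: float               # numeric result for the arm
    unit: str                  # e.g. "%" (DCCT) or "mmHg"

@dataclass
class Arm:
    treatment: str             # normalized, e.g. a MeSH term
    dosage: Optional[str] = None
    frequency: Optional[str] = None
    outcomes: List[Outcome] = field(default_factory=list)

@dataclass
class TrialRecord:
    pmid: str
    condition: str             # normalized, e.g. an ICD code
    inclusion_criteria: List[str]
    exclusion_criteria: List[str]
    age_range: Optional[str]
    duration_weeks: Optional[int]
    arms: List[Arm] = field(default_factory=list)
```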

Definition of concepts

Arguments Structured arguments consist of a set of premises and a conclusion or claim, in which the premises are statements that support the conclusion. In our approach, arguments represent a valid conclusion about the superiority of a therapy/intervention that can be reached on the basis of the clinical trial evidence available in a given knowledge base. The arguments can be nested in the sense that each argument consists of a set of premises and a conclusion, where each premise itself can be an argument. In this context, we define an argument as a 5-tuple (C, t, T′, d, P) where:

C is a conclusion about the superiority of therapy t compared to a set of other therapies T′,

d is a dimension (i.e., a clinical endpoint) along which therapy t is compared to the alternative therapies,

P is a set of premises from which the conclusion follows. A premise pᵢ ∈ P can itself be an argument or a set of facts from a knowledge base.

For demonstrative purposes, in the remainder of this article we only consider a singleton set for the competing therapies, i.e., T′ = {t′}. We distinguish between two types of arguments: Atomic Arguments (AtAs) and Aggregated Arguments (AgAs).
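As an illustration of these definitions, the two argument types could be represented with the following data structures. This is a simplified sketch in Python; the field names are our own and do not reflect the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class AtomicArgument:
    study: str        # reference to the publication, e.g. "Benedetti et al. [9]"
    treatment: str    # t
    comparator: str   # t'
    dimension: str    # d, e.g. "HbA1c reduction"
    supports: bool    # True if the study claims superiority of t over t'

@dataclass
class AggregatedArgument:
    conclusion: str   # C, verbalized later via templates
    treatment: str    # t
    comparator: str   # t'
    dimension: str    # d
    premises: List[Union["AggregatedArgument", AtomicArgument]] = field(default_factory=list)
```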

Atomic Arguments (AtA) represent a single result from a published clinical trial that warrants a superiority conclusion with respect to a specific dimension d. An example of an atomic argument is the annotated statement taken from a published clinical trial (PMID 12734781 [9]) depicted in Fig. 3. This statement claims that insulin glargine (IGlar) is superior to NPH insulin in reducing HbA1c, since it decreases the HbA1c levels from baseline by a significantly larger amount (i.e., −0.46 vs. −0.38, where "−" denotes a reduction). In this example, the comparative dimension d is HbA1c reduction.

Fig. 3 Example of an annotated statement that involves an atomic argument. "%" refers to the Diabetes Control and Complications Trial (DCCT) unit used to measure HbA1c levels. (Annotations made with INCEpTION [31])

Aggregated Arguments (AgA) are arguments whose premises are atomic arguments or other aggregated arguments, and their conclusion is an aggregated claim. An example of an aggregated argument would be an argument generated by considering the results from multiple papers comparing the IGlar therapy to the NPH insulin therapy, claiming that in a certain percentage (e.g., 80%) of studies, it has been demonstrated that IGlar is superior to NPH insulin in terms of HbA1c reduction.

The dimension tree hierarchically encodes the relevant dimensions used to compare treatments. Each node in the dimension tree corresponds to a certain dimension (i.e., a clinical endpoint) along which therapies can be compared with each other. The dimensions are hierarchically ordered along the tree in the sense that there is a specialization/generalization relation between children and parent nodes. For example, the dimension safety for a given treatment could have the sub-dimensions "risk of mortality", "mild/high pressure", and "nausea". Each dimension is associated with a weight according to the importance given to the corresponding clinical endpoint.

The dimension tree is specific to a certain therapeutic area or indication, representing the community consensus on which endpoints are relevant and accepted as evidence in clinical trials. An example of a dimension tree is depicted in Fig. 4.

Fig. 4 Example of a dimension tree for glaucoma. The tree contains the dimensions efficacy, safety, IOP reduction, and conjunctival hyperaemia, and their respective weights
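A dimension tree such as the one in Fig. 4 could be encoded with a simple recursive structure. The sketch below is our own illustration; only the weights and dimension names are taken from the glaucoma example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dimension:
    name: str
    weight: float                                   # importance of the clinical endpoint
    children: List["Dimension"] = field(default_factory=list)

# Weighted dimension tree mirroring Fig. 4 (glaucoma).
glaucoma_dimensions = Dimension("overall", 1.0, [
    Dimension("efficacy", 0.5, [Dimension("IOP reduction", 1.0)]),
    Dimension("safety", 0.5, [Dimension("conjunctival hyperemia", 1.0)]),
])
```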

Degree of confidence Since the clinical trial evidence may be affected by inconsistencies or contradictions (called 'attacks' in the computational argumentation literature [18]) between pieces of evidence, the conclusion about the superiority of one therapy over other therapies may not be unanimously warranted. To address this, we indicate the degree of confidence to which the conclusion of an argument is warranted by its premises. It quantifies the certainty that a certain claim holds as the number of studies in which the given result has been shown relative to the overall number of studies.

Let ⟦A⟧ denote the degree of confidence of an argument A. We compute the degree of confidence for a specific claim as follows:

For atomic arguments, the degree of confidence ⟦AtA⟧ is 1 if a certain study claims superiority of t compared to t′, and 0 otherwise. That is, 1 denotes a supporting statement, and 0 a contradictory one. For example, when comparing IGlar to NPH insulin, for the atomic argument AtA1 “Benedetti et al. [9] show that IGlar is superior to NPH insulin in reducing HbA1c”, ⟦AtA1⟧=1, while for the atomic argument AtA2 “Hsia et al. [10] show that IGlar is NOT superior to NPH insulin in reducing HbA1c”, ⟦AtA2⟧=0.

For aggregated arguments, the degree of confidence, written ⟦AgA⟧, is computed as follows:

$$ \llbracket AgA \rrbracket = \frac{1}{Z} \sum_{A_i \in \{A_1,\dots,A_n\}} w_{d_i} \cdot \llbracket A_i \rrbracket $$

(1)

where \(\{A_1,\dots,A_n\}\) is the set of arguments to be aggregated, \(w_{d_i}\) is the weight of the corresponding dimension \(d_i\) (assigned in the dimension tree) for the argument \(A_i\) being aggregated, and the normalization factor Z is:

$$ Z = \sum_{A_i \in \{A_1,\dots,A_n\}} w_{d_i} $$

(2)

Note that the weights are non-negative values and ⟦AgA⟧ ∈ [0,1], since the weights are normalized.
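A minimal sketch of the computation in Eqs. (1) and (2) is given below, with each child argument represented as a (weight, confidence) pair. It is an illustration only, not the authors' implementation.

```python
from typing import List, Tuple

def aggregate_confidence(weighted_args: List[Tuple[float, float]]) -> float:
    """Weighted average of the children's degrees of confidence.

    Each pair is (w_di, conf_i): the weight of the child's dimension in the
    dimension tree and the degree of confidence of the child argument.
    """
    z = sum(w for w, _ in weighted_args)              # Eq. (2): normalization factor
    if z == 0:
        return 0.0                                    # degenerate case: no weighted evidence
    return sum(w * c for w, c in weighted_args) / z   # Eq. (1)
```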

Confidence acceptance threshold Since in the general case the evidence cannot be assumed to be homogeneous and studies may have contradictory findings, our method introduces a confidence threshold τ that the degree of confidence of an aggregated argument needs to reach or surpass for the argument to be accepted. The threshold corresponds to the relative share of clinical studies that need to agree on a certain result (e.g., superiority of therapy A compared to therapy B for a specific outcome).

If a user wants to consider only results for which no contradictory evidence exists, the threshold has to be set to 1. In the general case, a user can set the threshold to a value corresponding to the degree of inconsistency they are willing to accept regarding the conclusion. The default value for the threshold is 0.5 (or 50%), indicating that at least half of the clinical trials need to agree on a certain outcome. A user can set the threshold higher to impose a stricter requirement on the homogeneity of the evidence.

Construction of a DIAeT

The DIAeT is a tree where the nodes represent arguments and the edges connect arguments with sub-arguments. The atomic arguments correspond to the leaf nodes and the aggregated arguments to the inner nodes. The children of a node are sub-arguments (or sub-conclusions) that occur in the premises of the given argument node.

The construction of the DIAeT is driven by a given dimension tree and follows a recursive procedure. Each node recursively calls the procedure that generates sub-arguments that support the conclusion at the node in question. The procedure starts at the general conclusion located at the root node of the argument tree and stops at the leaf nodes that correspond to atomic arguments.

The end of the recursion coincides with the generation of the atomic arguments, one for each relevant result in the knowledge base. Moving up the tree, an aggregated argument \(AgA_i=(C_i, t, \{t'\}, d_i, \{A_{i_1},\dots,A_{i_m}\})\) is generated for each dimension \(d_i\), where the \(A_{i_j}\) are atomic arguments (if \(d_i\) is a leaf dimension) or aggregated arguments (otherwise). In both cases, \(A_{i_j}\) claims superiority of treatment t over treatment t′ with respect to dimension \(d_i\).

An aggregated argument is accepted if its degree of confidence ⟦AgA⟧ is not less than the user-defined (or default) acceptance threshold τ. Thus, if ⟦AgA⟧ ≥ τ, the conclusion C states that treatment t is superior to treatment t′ with respect to dimension \(d_i\). Afterwards, the generated arguments are verbalized using domain-specific templates. The procedure to construct a DIAeT is summarized in Algorithm 1.
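The sketch below illustrates this bottom-up recursion. It is not the authors' Algorithm 1; it reuses Dimension, AtomicArgument, AggregatedArgument, and aggregate_confidence from the earlier sketches, and retrieve_atomic_arguments is a hypothetical helper that would query the knowledge base for the results of a leaf dimension (clinical endpoint).

```python
def build_diaet(dim, treatment, comparator, kb, tau=0.5):
    """Return (aggregated_argument, degree_of_confidence) for dimension dim."""
    if not dim.children:
        # Leaf dimension: collect atomic arguments from the knowledge base.
        premises = retrieve_atomic_arguments(kb, treatment, comparator, dim.name)
        weighted = [(1.0, 1.0 if a.supports else 0.0) for a in premises]
    else:
        # Inner dimension: recurse into the sub-dimensions first, so that
        # information flows upwards from the atomic arguments.
        child_results = [build_diaet(c, treatment, comparator, kb, tau)
                         for c in dim.children]
        premises = [arg for arg, _ in child_results]
        weighted = [(c.weight, conf)
                    for c, (_, conf) in zip(dim.children, child_results)]

    confidence = aggregate_confidence(weighted) if weighted else 0.0
    accepted = confidence >= tau  # acceptance test against the threshold
    conclusion = (f"{treatment} is superior to {comparator} w.r.t. {dim.name}"
                  if accepted else
                  f"it cannot be concluded that {treatment} is superior to "
                  f"{comparator} w.r.t. {dim.name}")
    return (AggregatedArgument(conclusion, treatment, comparator,
                               dim.name, premises), confidence)
```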

Example of the construction of a DIAeT

Figure 4 depicts a dimension tree for glaucoma. We can see that the dimensions IOP reduction and conjunctival hyperemia have weights of 1 because they are leaf nodes and therefore there are no other dimensions with which they could be compared. Next, both efficacy and safety have weights of 0.5 meaning that both dimensions are equally important in this example.

Figure 5 depicts the construction of a DIAeT derived from the dimension tree in Fig. 4. The weight of all the atomic arguments is 1. The next level in the recursive process corresponds to the leaf nodes of the dimension tree (i.e., d4 and d5). For IOP reduction (d4), 11 out of the 11 clinical trials state that latanoprost is more effective in reducing IOP than timolol, such that ⟦A_d4⟧ = 1 (i.e., 11/11). For conjunctival hyperemia (d5), only one of the six clinical trials that report this adverse effect found that fewer patients suffered conjunctival hyperemia when applying latanoprost, such that ⟦A_d5⟧ = 0.17 (i.e., 1/6). Further, ⟦A_efficacy⟧ = 1 and ⟦A_safety⟧ = 0.17, because the weights of their child nodes (d4 and d5, respectively) are 1. Finally, ⟦A_overall⟧ = 0.59 results from the weighted sum of ⟦A_efficacy⟧ and ⟦A_safety⟧ (i.e., (0.5 ∗ 1) + (0.5 ∗ 0.17) = 0.59) (Footnote 1). We thus obtain the following conclusions:

Efficacy: "the evidence shows that latanoprost is superior to timolol", as ⟦A_efficacy⟧ = 1 > 0.5 = τ.

Safety: "the evidence does not show that latanoprost is superior to timolol", as ⟦A_safety⟧ = 0.17 < 0.5 = τ.

Overall conclusion: "latanoprost is superior to timolol", as ⟦A_overall⟧ = 0.59 > 0.5 = τ.

Fig. 5 Example of the construction of a DIAeT for glaucoma. The confidence acceptance threshold used in this example is τ = 0.5
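As a computational check, the numbers above can be reproduced with the aggregation sketch given earlier; the variable names are ours, and the 0.17 and 0.59 reported in the text are rounded values.

```python
# Glaucoma example from Fig. 5, recomputed with aggregate_confidence.
atomic_iop   = [(1.0, 1.0)] * 11                    # 11/11 trials support superiority on IOP reduction
atomic_hyper = [(1.0, 1.0)] * 1 + [(1.0, 0.0)] * 5  # 1/6 trials support superiority on conjunctival hyperemia

efficacy = aggregate_confidence(atomic_iop)     # 1.0
safety   = aggregate_confidence(atomic_hyper)   # ~0.17
overall  = aggregate_confidence([(0.5, efficacy), (0.5, safety)])
# overall ~ 0.58 (reported as 0.59 in the text after rounding 1/6 to 0.17)

tau = 0.5
print(overall >= tau)   # True -> "latanoprost is superior to timolol"
```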

Exploration of other scenarios (Sensitivity analysis)

Our approach allows the user to modify the weights of the dimensions and/or to exclude certain pieces of evidence. For example, studies that are biased or whose methodology is unclear can be excluded by adjusting the parameters. One can also explore other scenarios (or "what-if" simulations) by filtering on different criteria, such as publication year, duration of the study, and the number or age of the participants in the clinical trials.

In the previous example, we could for instance explore another scenario by assigning a higher weight of 0.7 to safety and a lower weight of 0.3 to efficacy. The new weights would yield different degrees of confidence. For example, the degree of confidence of the overall argument would now be ⟦A_overall⟧ = 0.3 ∗ 1 + 0.7 ∗ 0.17 = 0.42, and since 0.42 < 0.5, the new overall conclusion would be the opposite of the one obtained before:

Overall conclusion: "it cannot be concluded that latanoprost is superior to timolol", as ⟦A_overall⟧ = 0.42 < 0.5 = τ

Further, if one excludes a study that compares the two drug treatments but does not mention any result about conjunctival hyperemia (e.g., Mishima et al., 1996 in the tool demo), then the degree of confidence of the efficacy argument changes to 0.9 (i.e., 10/11). In contrast, ⟦A_safety⟧ remains 0.17 (i.e., 1/6). As a consequence, ⟦A_efficacy⟧ = 0.9, ⟦A_safety⟧ = 0.17, and ⟦A_overall⟧ = 0.53. Thus, the overall conclusion would change to "latanoprost is superior to timolol". Table 3 summarizes the given example: when safety has a significantly higher weight than efficacy (e.g., 0.7 vs. 0.3), the overall conclusion changes to "It cannot be concluded that latanoprost is superior to timolol". Otherwise, the conclusion indicates that "Overall, the evidence showed that latanoprost is superior to timolol", including the case in which one study is excluded.
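Both scenarios can be recomputed with the aggregation sketch from above; the values are rounded as in the text.

```python
# Scenario 1: safety weighted 0.7, efficacy 0.3 (equal-weight confidences unchanged).
overall_reweighted = aggregate_confidence([(0.3, 1.0), (0.7, 0.17)])
# ~0.42 < tau -> rejected

# Scenario 2: equal weights, but one IOP study excluded (efficacy confidence 0.9 per the text).
overall_filtered = aggregate_confidence([(0.5, 0.9), (0.5, 0.17)])
# ~0.53 >= tau -> accepted
```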

Table 3 Conclusions generated with different settings

Different acceptance thresholds

The user can also explore the conclusions generated according to different acceptance thresholds. For example, Table 4 shows the conclusions generated according to different threshold ranges. This example compares two kinds of insulin treatments for a T2DM case, where balanced dimension weights and no evidence filters are considered. It can be seen that the low thresholds lead to the conclusion stating that IGlar is superior to NPH insulin overall and with respect to safety and efficacy. Thresholds between 0.45 and 0.70 lead to the conclusion that the superiority of IGlar over NPH with regard to efficacy is not supported by the available evidence. Stricter thresholds ranging from 0.71 to 1 lead to the conclusion that the superiority of IGlar over NPH insulin overall and in terms of efficacy is not supported by the given evidence.

Table 4 Conclusions generated according to different acceptance threshold ranges

Figure 6 depicts an example of the effect of changing the acceptance threshold. When the degree of confidence of an argument is not less than the acceptance threshold, the argument is accepted; otherwise, it is rejected. The higher the threshold (i.e., the closer to one), the stricter the acceptance of an argument becomes; conversely, the lower the threshold (i.e., the closer to zero), the less restrictive the acceptance becomes.

Fig. 6 Example of the confidence acceptance threshold. An argument is accepted if its degree of confidence ⟦A⟧ is not less than a given acceptance threshold, and rejected otherwise

Figure 7 shows the DIAeTs generated when using relaxed, majority, and strict acceptance thresholds, combined with three different dimension weight configurations, to generate arguments on the superiority of the IGlar insulin treatment over the NPH insulin treatment. The threshold represents an acceptance condition for this statement and corresponds to the relative share of clinical evidence that must support (i.e., agree with) the argument at the overall conclusion node and the arguments at the dimension nodes that correspond to sub-conclusions. Setting the confidence threshold to 1 (strict) requires the evidence to be unanimous, without any contradicting results. Setting the threshold to 0.5 (majority) requires the majority of studies to support the conclusion, while a value between 0 and 0.5 is very lenient, leading to the generation of arguments even given very weak evidence. Across the configurations in Fig. 7, we can see that the stricter the threshold, the more red nodes appear in the generated tree, that is, the more superiority arguments are rejected; with more relaxed thresholds, there are more green nodes, meaning that more superiority arguments are accepted.

Fig. 7 Trees generated with different confidence acceptance thresholds and weights. OS: Overall Conclusion, E: Efficacy, S: Safety, Hb: reduction of HbA1c, NH: Nocturnal hypoglycemia. The nodes in green are accepted arguments and the nodes in red rejected arguments on the superiority of IGlar insulin

The DIAeT approach implemented as a web tool

The DIAeT approach has been implemented as a web tool as a proof of concept to support its evaluation with end users. Figure 8 provides an overview of the processing steps of the implemented method. The knowledge base containing the clinical trial information and the weighted dimension tree are the starting points for the system. The evidence is retrieved from the knowledge base via predefined SPARQL queries that are aligned with the dimensions in the dimension tree. Based on these elements, an argument synthesis process, in which evidence can also be filtered, generates a DIAeT that represents a nested conclusion about the superiority of some therapies compared to others. The DIAeT is verbalized using domain-specific templates that make the conclusion accessible to the user. By defining filters or modifying weights, users can interactively change the generated argument tree and thus explore the impact of certain choices on the synthesis of results.
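As a minimal illustration of the template-based verbalization step, a sketch is given below; the template wording is hypothetical and much simpler than the domain-specific templates actually used by the tool.

```python
# Hypothetical verbalization templates; wording is illustrative only.
TEMPLATES = {
    True:  "{t} is superior to {c} in terms of {d}.",
    False: "It cannot be concluded that {t} is superior to {c} in terms of {d}.",
}

def verbalize(treatment: str, comparator: str, dimension: str, accepted: bool) -> str:
    return TEMPLATES[accepted].format(t=treatment, c=comparator, d=dimension)

print(verbalize("IGlar", "NPH insulin", "efficacy", True))
```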

Fig. 8 Overview of the DIAeT framework

Figure 9 depicts the user interface of the DIAeT tool. The user can select treatments to compare, set the confidence acceptance threshold, and assign the weights for each dimension of a predefined dimension tree (Footnote 2). The conclusion reached for each dimension is presented hierarchically, following the ordering of the criteria in the dimension tree. Each section can be expanded or hidden interactively. At the lowest level, the atomic arguments are displayed together with an indication of whether they support or contradict the conclusion. Supporting statements are displayed in green and contradictory statements in orange.

Fig. 9 User interface of the DIAeT web tool. Left: evidence filters, confidence threshold (in the red square), and dimension tree weights. Right: generated conclusion tree and clinical evidence table

Figure 10 shows an example for conjunctival hyperemia in which five atomic arguments attack a single piece of supportive evidence (study CT_7). Supportive arguments in this example are those stating that latanoprost causes fewer cases of conjunctival hyperemia than timolol, while attacking arguments are those that contradict the supportive arguments by stating either that latanoprost causes more cases of conjunctival hyperemia or that both drugs cause an equal number of cases (i.e., "latanoprost is not superior to timolol"). The evidence used to generate the DIAeT is displayed in the clinical evidence table. In this table, the user can find more information about the clinical trials, such as duration, number of patients, sources of possible bias, etc. (see Fig. 11).

Fig. 10 Atomic arguments for conjunctival hyperemia. For each atomic argument, contradictory (attacking) and supportive information is displayed. The values in bold font denote "superiority" of the respective drug (i.e., the drug that provokes fewer cases of conjunctival hyperemia). Supportive arguments are in green and contradictory arguments in orange

Fig. 11 Clinical evidence table. The unticked study (on top) is not considered in the generation of the DIAeT

Once the conclusions are generated, the tool allows the user to explore different scenarios by changing filter parameters (e.g., publication year, number and age of the participants), weights, and the confidence threshold, and by excluding or including clinical studies (i.e., rebuttal of data), and then re-generating the conclusions. For example, specific studies can be excluded from the considered evidence if the user deems that a study does not meet certain criteria. All the studies that are considered by the system as supporting evidence are ticked in the evidence table. The user can then untick them to explore what happens when they are not included. Figure 11 depicts an example in which there are two studies in the evidence table; only the ticked study will be considered in the construction of the arguments (Footnote 3).

Although the final decision on the best treatment is made by medical expert users, the method implemented as a tool helps them explore the information by, for example, narrowing the search space or clarifying under which conditions and assumptions a certain treatment can be considered superior to other treatments. If the medical experts find interesting, unexpected, or contradictory conclusions, they can directly check possible explanations for these conclusions in the published clinical trials.
