Exploiting document graphs for inter sentence relation extraction

This section has four main parts: an overview of the evaluation dataset; the overall picture of the proposed architecture and its three main components in detail; additional techniques to improve model performance; and the experimental configuration.

Dataset

Our experiments were conducted on the BioCreative V Chemical-Disease Relation (CDR) dataset [19]. This corpus contains a total of 1500 PubMed articles separated into three subsets of 500 articles each for the training, development and test sets (details are shown in Table 1). The dataset is annotated with chemicals, diseases and chemical-induced disease (CID) relationships at the abstract level. Relation annotations are asserted both within and across sentence boundaries. According to the data survey of BioCreative [26], about 30% of all instances are inter sentence relationships.

Table 1 Summary of the BioCreative V CDR dataset

Model overview

Figure 1 illustrates our proposed model for extracting semantic relations at the abstract level, which contains four main phases: (i) first, we construct a document subgraph to represent the relationship between entity pairs; (ii) to represent an instance by a set of paths, we apply several techniques for finding, merging and choosing the relevant paths between entity pairs; (iii) next, an advanced attention mechanism and several types of linguistic information are applied to explore the information in the document subgraphs more effectively; (iv) lastly, to exploit these enriched representations effectively, we develop a shared-weight Convolutional Neural Network model (swCNN).

Fig. 1

Proposed model for inter sentence relation classification. Red dotted and striped nodes indicate two types of disease. Blue filled nodes indicate one type of chemical

Document subgraph construction

As noted above, two entities that participate in a relation may belong to different sentences. Dependency trees are often used to extract the local dependencies of semantic relations in intra sentence relation extraction. However, such dependencies are not sufficient for inter sentence RE, since different sentences have separate dependency trees that are not connected. Because of this limitation, the shortest dependency path cannot be used to extract the local dependencies of semantic relations that cross sentence boundaries.

To overcome these limitations, we construct a graph over consecutive sentences based on their dependency trees, called the document subgraph. In this graph, nodes correspond to words and edges represent the connections between them. We make two assumptions: (i) the two entities participating in a relation should not be too far apart (experimentally, within five consecutive sentences); if they are farther apart, the method's effectiveness would be reduced, so such pairs may be ignored. (ii) The title of the abstract is a special sentence that is related in some manner to every sentence in the abstract; for this reason, the title is always used together with the abstract sentences to generate each subgraph.

Creating a document subgraph is a three-step process:

Step 1: Generate the dependency tree for each sentence. All directed dependency labels are kept in the subgraphs as local dependency information.

Step 2: Merge the dependency trees of the sentences in each sliding window into a document subgraph.

The sliding window of size w indicates the number of consecutive sentences used to create the document subgraphs. w=1 indicates a single sentence, i.e. the model only extracts intra sentence relations. With w=j, every j consecutive sentences are used to create a subgraph. Since two entity mentions can appear in different sentences, an unrestricted selection of text spans would risk generating many unexpected examples and lead to an explosion of the computation space (see 'Instance merging'). We therefore limit w to 5, i.e., all relations whose two entities are not within 5 consecutive sentences are ignored. After this phase, each abstract consists of several subgraphs.

Step 3: Create virtual edges for the subgraphs. The dependency trees already provide local dependency information. In this step, we add new virtual edges based on several additional sources of information:

NEXT-SENT edges connect root nodes in dependency trees of two consecutive sentences. They bring sequential non-local dependency information.

TITLE edges are created between two dependency tree roots of the Title and the first sentence in the sliding window. They provide non-local dependency information.

COREFERENCE edges link an anaphoric expression to its antecedent if identified by the multi-pass sieves coreference resolution method [23]. These edges show the semantic relation between terms. We divide this connection type into three specific types: (i) COREF-sent: anaphor and antecedent belong to two normal sentences, (ii) COREF-to-title: anaphor is in a normal sentence and antecedent is in the Title, (iii) COREF-from-title: anaphor is in the Title and antecedent is in a normal sentence.

KB-CTD edges are created between the head nodes of two entities if the pair is annotated with relation 'M' in the Comparative Toxicogenomics Database (CTD) (Footnote 1). We call this knowledge-based information.

These virtual edges are undirected and labeled by their names. We give a realistic example of the document subgraph in Additional file 1: Appendix A.
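As an illustration, Steps 1–3 can be sketched with spaCy and networkx as below. This is a minimal sketch under our own naming, not the original implementation: the COREFERENCE and KB-CTD edges are omitted because they require an external coreference system and the CTD resource, and the spaCy model name is only an assumption.

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any spaCy dependency parser works

def add_sentence(graph, doc, sent_id):
    """Step 1: add one sentence's dependency tree to the graph; return its root node."""
    root = None
    for tok in doc:
        node = (sent_id, tok.i)                      # unique node id per token
        graph.add_node(node, text=tok.text, pos=tok.tag_)
        if tok.head is tok:                          # spaCy marks the root as its own head
            root = node
        else:
            # directed dependency label kept as edge attribute (local dependency information)
            graph.add_edge((sent_id, tok.head.i), node, label=tok.dep_)
    return root

def build_subgraph(title, window_sents):
    """Steps 2-3: merge the title and a window of consecutive sentences into one
    document subgraph, then add the TITLE and NEXT-SENT virtual edges."""
    g = nx.Graph()
    title_root = add_sentence(g, nlp(title), sent_id=0)
    roots = [add_sentence(g, nlp(s), sent_id=i + 1) for i, s in enumerate(window_sents)]
    g.add_edge(title_root, roots[0], label="TITLE")      # title <-> first window sentence
    for r1, r2 in zip(roots, roots[1:]):                 # roots of consecutive sentences
        g.add_edge(r1, r2, label="NEXT-SENT")
    return g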

Given the constructed subgraphs, the path-finding module finds all possible paths between two entities in each graph by performing a breadth-first search. The graph we construct is quite complex, and its complexity increases with the sliding window size w and the number of virtual edges. Traversing such a large graph with cycles in breadth-first order is resource-consuming, even though we never revisit nodes on the current path to avoid infinite loops.

To bound this cost, we use two thresholds in the search: a maximum traversal depth and a maximum number of paths per entity pair (the values used in our experiments are given in the 'Experimental configuration' section).
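A bounded breadth-first path enumeration under these two thresholds can be sketched as follows; the function name is ours, and the default values mirror the configuration reported later (maximum depth 15, maximum number of paths 150).

from collections import deque

def enumerate_paths(graph, source, target, max_depth=15, max_paths=150):
    """Breadth-first enumeration of simple paths between two entity nodes,
    bounded by a maximum depth and a maximum number of returned paths."""
    paths = []
    queue = deque([[source]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) > max_depth:
            continue
        for neighbor in graph[node]:      # adjacency view (networkx graph or dict of lists)
            if neighbor not in path:      # never revisit a node on the current path
                queue.append(path + [neighbor])
    return paths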

Nearly all previous studies in relation extraction consider co-occurring entity pairs with known relations as positive training instances. This assumption is reasonable for intra sentence relations, but the inter sentence setting presents a new challenge, since this strategy risks generating too many wrong examples: the fact that a document asserts a relation between two entities does not mean that every text span containing these entities expresses that relation. Quirk and Poon [16] addressed the case where an entity pair co-occurs in a large text span and also co-occurs in a smaller text span that overlaps with the larger one; if there is a relation between the pair, it is most likely expressed in the smaller span, where the entities are closer to each other. To reduce the noise from large text spans, we apply a restriction on path generation called the 'minimal span' [16], i.e., only the minimal span is used to generate the paths between two entities. A co-occurring entity pair has the minimal span if there is no other overlapping co-occurrence of the same pair. Since each abstract can have several subgraphs, this phase yields several sets of paths.

Instance merging

Figure 2 illustrates the instance merging technique. First, we address two unexpected problems that arise when generating instances from the document subgraphs. In Fig. 2-A, a pair of entities appears several times at different positions in an abstract. Because the BC5 CDR corpus annotates relations at the abstract level, all of these co-occurrences are treated as positive examples of the CID relation, although only a few of them actually express it. This may introduce considerable noise during training.

Fig. 2

Examples of two unexpected problems when generating instances from document subgraphs

The example in Fig. 2-B shows the problem of unexpected instance repetition, which becomes worse as the window used to create subgraphs widens. In this example, we can generate three identical training instances, i.e., the training patterns of this instance are produced three times, changing the actual frequency of the representation in the training data. This may lead the model to give this instance higher priority (a larger weight). We give a realistic example of these problems below:

“<Title> Hemolysis of human erythrocytes induced by tamoxifen is related to disruption of membrane structure.

<S1> TAM induces hemolysis of erythrocytes as a function of concentration.

<S2> The extension of hemolysis is variable with erythrocyte samples, but 12.5 microM TAM induces total hemolysis of all tested suspensions.

<S3> Despite inducing extensive erythrocyte lysis, TAM does not shift the osmotic fragility curves of erythrocytes.

<S4> The hemolytic effect of TAM is prevented by low concentrations of alpha-tocopherol (alpha-T) and alpha-tocopherol acetate (alpha-TAc) (inactivated functional hydroxyl) indicating that TAM-induced hemolysis is not related to oxidative membrane damage.

<S5> This was further evidenced by absence of oxygen consumption and hemoglobin oxidation both determined in parallel with TAM-induced hemolysis. …”

(PMID: 10704919)

Given the title and five sentences shown above and a sliding window of size w=3, we obtain 42 valid pairs for CID: TAM-Hemolysis. Each entity pair can potentially be described by up to 15 paths. As a result, if each TAM-Hemolysis pair is considered a separate positive instance, we obtain many very similar positive instances. The same problem also arises for negative instances. To solve this problem, we propose a technique called instance merging, in which we extract all possible dependency paths between a pair of entity mentions and merge them into a single set for this entity pair. To reduce overlapping training instances, we remove repeated paths (i.e., if several paths are completely identical, only one is kept).
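The merging step can be sketched as follows. The helper name and data layout are illustrative assumptions: each mention pair of the same entity pair contributes one set of paths, identical paths collapse to one entry, and their frequency is recorded so that the 'most repeated' selection strategy (next subsection) can use it.

from collections import Counter

def merge_instances(path_sets):
    """Instance-merging sketch: pool all paths found for one entity pair
    (possibly from several subgraphs / mention pairs) into one de-duplicated
    collection, while keeping the number of repetitions of each path."""
    counts = Counter()
    for paths in path_sets:
        for path in paths:
            counts[tuple(path)] += 1      # identical paths hash to the same key
    return counts                         # {path: number of repetitions}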

Choosing top-k paths

After the instance merging phase, we have a set of several paths to represent a pair of entities. Some of them are useful, but others may be noise.

Prior work on intra sentence relation extraction often explored the single shortest path between two entities [27, 28]. Applying these traditional approaches to the inter sentence relation classification problem raises several issues: first, we cannot take advantage of all the local and global features, since they may appear in different paths; second, the shortest path may not be the 'best' path.

In contrast to these previous approaches, we propose to use a set of multiple paths as a novel representation of an entity pair. To reduce noise and model complexity, we only keep the top-k best paths, which raises the question of how to choose advantageous paths. In this work, we implement two strategies to choose the top-k paths (a short selection sketch follows the list):

Top-k shortest dependency paths; this strategy was also used in [16].

Top-k paths with the highest number of repetitions.
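Both strategies can be sketched over the path-frequency mapping produced by the merging step above; the function and parameter names are ours, and k=3 is the value chosen later in the experimental configuration.

def top_k_paths(path_counts, k=3, strategy="shortest"):
    """Select the top-k paths for an entity pair under one of the two strategies:
    shortest dependency paths, or paths with the highest number of repetitions."""
    if strategy == "shortest":
        ranked = sorted(path_counts, key=len)                         # shortest first
    elif strategy == "most_repeated":
        ranked = sorted(path_counts, key=path_counts.get, reverse=True)
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return [list(p) for p in ranked[:k]]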

To explore the information in this novel representation, we cannot use our previous models. Instead, we propose a new deep learning architecture, based on the swCNN, that can process multiple paths simultaneously.

Path representation

Before being input to the model, each component of the dependency paths must be transformed into an embedding vector. To obtain an informative representation, we take advantage of various types of linguistic information along the dependency path, drawn from the original dependency tree and other resources.

Dependency relations with directions have been shown to be more effective than undirected ones for the relation extraction task [27]. However, treating the two directions of a dependency relation as two separate relations induces two vectors for the same relation. We therefore represent a dependency relation with two discrete components: \( \mathbf{d}^{typ}_{i}\in \mathbb{R}^{dim^{typ}} \) represents the dependency relation type among 72 labels, and \( \mathbf{d}^{dir}_{i}\in \mathbb{R}^{dim^{dir}} \) is the direction of the dependency relation, i.e. left-to-right or vice versa on the Shortest Dependency Path (SDP). The final representation \( \mathbf{d}_{i} \) of the dependency relation is obtained through a nonlinear transformation as follows:

$$ \mathbf{d}_{i} = \tanh\left(\left[ \mathbf{d}^{typ}_{i} \oplus \mathbf{d}^{dir}_{i} \right] \mathbf{W}_{d} + \mathbf{b}_{d} \right) $$

(1)

where the \( \mathbf{d}^{typ} \) and \( \mathbf{d}^{dir} \) vectors are generated by looking up the embedding matrices \( \mathbf{W}_{typ}^{e}\in \mathbb{R}^{dim^{typ}\times 72} \) and \( \mathbf{W}_{dir}^{e}\in \mathbb{R}^{dim^{dir}\times 2} \) respectively; \( \mathbf{W}_{d} \) and \( \mathbf{b}_{d} \) are trainable parameters of the network.

For token representation, we utilize two types of embeddings to represent the word information in different aspects, including:

Pre-trained fastText embeddings [29] learn word representations from external context and n-gram sub-word information. Each token in the input paths is transformed into a vector \( \mathbf{t}^{w}_{i} \) by looking up the embedding matrix \( \mathbf{W}_{w}^{e}\in \mathbb{R}^{dim^{we}\times \left|V\right|} \), where \( dim^{we} \) is the word embedding dimension and V is the vocabulary of all words we consider.

POS tag embeddings capture (dis)similarities between the grammatical properties of words and their syntactic roles within a sentence. We concatenate the part-of-speech (POS) tag information into the token representation vector. We randomly initialize the embedding matrix \( \mathbf{W}_{p}^{e}\in \mathbb{R}^{dim^{pos}\times 56} \) for the 56 POS tags of the OntoNotes 5.0 version of the Penn Treebank. Each POS tag label is then represented as a corresponding vector \( \mathbf{t}^{p}_{i} \).

We concatenate the two embedding vectors of each token and transform them into the final token embedding as follows:

$$ \mathbf{t}_{i}=\tanh\left(\left[\mathbf{t}^{w}_{i}\oplus\mathbf{t}^{p}_{i}\right]\mathbf{W}^{t}+\mathbf{b}^{t} \right) $$

(2)
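A minimal NumPy sketch of Eqs. (1) and (2) is given below, with illustrative (untuned) dimensions and randomly initialised matrices standing in for the learned fastText, POS and dependency embeddings; all names and sizes here are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not the tuned values from Table 2.
DIM_TYP, DIM_DIR, DIM_WE, DIM_POS = 20, 5, 200, 25
N_DEP_TYPES, N_DIRECTIONS, VOCAB, N_POS = 72, 2, 10000, 56

W_typ = rng.normal(size=(N_DEP_TYPES, DIM_TYP))    # dependency-type embeddings
W_dir = rng.normal(size=(N_DIRECTIONS, DIM_DIR))   # direction embeddings
W_d = rng.normal(size=(DIM_TYP + DIM_DIR, DIM_TYP + DIM_DIR))
b_d = np.zeros(DIM_TYP + DIM_DIR)

def dependency_embedding(type_id, dir_id):
    """Eq. (1): d_i = tanh([d_typ (+) d_dir] W_d + b_d)."""
    concat = np.concatenate([W_typ[type_id], W_dir[dir_id]])
    return np.tanh(concat @ W_d + b_d)

W_word = rng.normal(size=(VOCAB, DIM_WE))          # would be pre-trained fastText in practice
W_pos = rng.normal(size=(N_POS, DIM_POS))
W_t = rng.normal(size=(DIM_WE + DIM_POS, DIM_WE + DIM_POS))
b_t = np.zeros(DIM_WE + DIM_POS)

def token_embedding(word_id, pos_id):
    """Eq. (2): t_i = tanh([t_w (+) t_p] W_t + b_t)."""
    concat = np.concatenate([W_word[word_id], W_pos[pos_id]])
    return np.tanh(concat @ W_t + b_t)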

Each token \( \mathbf{t}_{i} \) is concatenated with the corresponding attentive augmented information from its child nodes in the original dependency tree, as proposed by Can et al. [30]. Given a token t, the attentive augmented information is calculated from the token itself and the set of its M child nodes. The word embedding and POS tag embedding are concatenated to form the token embedding vector \( \mathbf{t} \), while the dependency relation from the direct ancestor is added to form the child node representation \( \mathbf{c}_{i} \). The position embedding \( d_{i} \) is also used to reflect the relative distance from the i-th child to its parent in the original sentence.

Two sequential attention layers over the children of a token are used to produce children context vectors. A simple self-attentive network is applied to the child nodes \( \left\{\mathbf{c}_{i}\right\}^{M}_{i=1} \), where the attention weights are calculated from the concatenation of each child with the parent information and the distance from the parent, i.e.,

$$ \begin{aligned} \bar{\mathbf{c}} &= \left\{\mathbf{c}_{i} \oplus \mathbf{t} \oplus d_{i} \mathbf{W}_{d} \right\}^{M}_{i=1} = \left\{\bar{\mathbf{c}}_{i}\right\}^{M}_{i=1}\\ \mathbf{e}&=\left\{\bar{\mathbf{c}}_{i} \mathbf{W}_{e} + b_{e} \right\}^{M}_{i=1}= \left\{ e_{i} \right\}^{M}_{i=1}\\ \alpha^{self}_{i}&=\text{softmax}(e_{i})\\ \mathbf{c}^{self}_{i}&=\alpha^{self}_{i} \mathbf{c}_{i} \end{aligned} $$

(3)

where \( \mathbf{W}_{d} \) is the base distance embedding; \( \mathbf{W}_{e} \) and \( b_{e} \) are the weight and bias terms.

A distance-based heuristic attention layer is then applied to the self-attentive children context vectors to keep track of how close each child is to the target token, as follows:

$$ \begin{aligned} \alpha^{dist}_{i}&=\text{softmax}\left(\beta d^{2}_{i}\right)\\ \mathbf{c}^{dist}_{i}&=\alpha^{dist}_{i} \mathbf{c}^{self}_{i} \end{aligned} $$

(4)

where \( f(d)=\beta d^{2} \) with \( \beta=-0.03 \) is a heuristically chosen weighting function.

Afterward, to capture the relevant and essential information from the output of the attention layers while preserving the integrity of the word information, K kernel filters are applied to each child's attentive vector to produce K features per child. The final augmented information \( \mathbf{a} \) is obtained by a max-pooling layer, i.e.,

$$ \begin{aligned} \mathbf{h}&= \left\{\tanh\left(\mathbf{c}^{dist}_{i} \mathbf{W}_{f} + \mathbf{b}_{f} \right) \right\}^{M}_{i=1}\\ \mathbf{a} &= \left\{\max_{i}\left(\mathbf{h}^{k}_{i} \right) \right\}^{K}_{k=1} \end{aligned} $$

(5)

where \( \mathbf{W}_{f} \) is the weight of the K kernel filters and \( \mathbf{b}_{f} \) is the bias term.

Finally, the concatenation of the token embedding and the augmented information is transformed into an X-dimensional vector to form the representation \( \mathbf{x}_{i}\in \mathbb{R}^{X} \) of the token, i.e.,

$$ \mathbf{x}_{i} = \tanh\left(\left[ \mathbf{t}_{i} \oplus \mathbf{a}_{i} \right] \mathbf{W}_{x} + \mathbf{b}_{x} \right) $$

(6)

where \( \mathbf{W}_{x} \) and \( \mathbf{b}_{x} \) are trainable parameters of the network.
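The whole augmentation step, Eqs. (3)–(6), can be sketched in NumPy as below. The weight shapes and the superscript names (self, dist) are our own labels for the reconstructed equations, and all parameters are illustrative placeholders rather than the trained values.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def augmented_token(t, children, distances, W_d, W_e, b_e, W_f, b_f, W_x, b_x,
                    beta=-0.03):
    """Sketch of Eqs. (3)-(6).
    t: token embedding (Eq. 2); children: (M, dim_c) child representations;
    distances: (M,) relative distances to the parent; W_d: base distance embedding;
    W_e, b_e: attention weight/bias; W_f: (dim_c, K) kernel filters; W_x, b_x: Eq. (6)."""
    M = children.shape[0]
    # Eq. (3): self-attention over children, conditioned on the parent and the distance.
    c_bar = np.concatenate(
        [children, np.tile(t, (M, 1)), np.outer(distances, W_d)], axis=1)
    e = c_bar @ W_e + b_e                        # (M,) attention scores
    alpha_self = softmax(e)
    c_self = alpha_self[:, None] * children      # (M, dim_c)
    # Eq. (4): distance-based heuristic attention with f(d) = beta * d^2.
    alpha_dist = softmax(beta * distances ** 2)
    c_dist = alpha_dist[:, None] * c_self
    # Eq. (5): K kernel filters per child, then max-pooling over the children.
    h = np.tanh(c_dist @ W_f + b_f)              # (M, K)
    a = h.max(axis=0)                            # (K,) augmented information
    # Eq. (6): final token representation.
    return np.tanh(np.concatenate([t, a]) @ W_x + b_x)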

Shared-weight convolutional neural network

Convolutional Neural Networks (CNNs) [31] are good at capturing n-gram features in flat structures and have proved effective in many natural language processing tasks, including relation classification [14, 17]. The structure of a shared-weight CNN (swCNN) is similar to that of the original CNN, comprising convolution, pooling, fully-connected and softmax layers. The novel point is that weights are shared between several convolutions, which allows the model to process multiple data instances at once.

Figure 3 illustrates the overall architecture of our swCNN model, which comprises two main components: multi-path representation and classification. Given a set of k paths as input, each path is converted into a separate embedding matrix. A shared-weight convolution with ReLU activation then captures convolved features from these embedding matrices simultaneously. The essential features are gathered by a filter-wise pooling layer before being classified by a fully connected layer with softmax.

Fig. 3

Diagram illustrating the swCNN architecture

In the embedding layer, each component of a dependency path (i.e., a token or a dependency relation) is represented by a d-dimensional vector \( \mathbf{x}_{e}\in \mathbb{R}^{d} \), where d is the desired embedding dimension as described in the previous section 'Path representation'.

After the embedding layer, the k input paths are transformed into:

$$ \mathbf{X} = \left[\mathbf{x}_{t_{1}}, \mathbf{x}_{d_{1}}, \mathbf{x}_{t_{2}},\ldots,\mathbf{x}_{t_{n-1}}, \mathbf{x}_{d_{n-1}}, \mathbf{x}_{t_{n}} \right]^{k}_{i=1} $$

(7)

In general, let us define the vector \( \mathbf{x}_{i,j:j+m} \) as the concatenation of m tokens and the m−1 dependency relations between them, i.e.,

$$ \mathbf{x}_{i,j:j+m} = \mathbf{x}_{t_{j}} \oplus \mathbf{x}_{d_{j}} \oplus \mathbf{x}_{t_{j+1}} \oplus \ldots \oplus \mathbf{x}_{d_{j+m-2}} \oplus \mathbf{x}_{t_{j+m-1}} $$

(8)

In the convolution layer, we apply N filters of region size r to these embedding matrices simultaneously. The filters move by dependency unit to preserve the dependency information between tokens. Since the same filters are used for all matrices, our model can extract information from all of them at the same time while suppressing growth in the number of weight parameters, which reduces the computational complexity. The filter-wise pooling step converges all outputs of a filter to a single element by choosing the most essential feature among all CNN features. This architecture lets the swCNN use the information from multiple paths simultaneously and select the truly salient features. I.e., the convolutional layer computes an element \( \mathbf{f}_{p} \) of the convolved feature vector \( \mathbf{f} \) as follows:

$$ \mathbf{f}_{p} = \max_{i,\,j} \left[ \mathbf{x}_{i,j:j+r}\, \mathbf{W}_{c} + \mathbf{b}_{c} \right]_{p} $$

(9)

where \( \mathbf{W}_{c} \) and \( \mathbf{b}_{c} \) are the weight matrix and bias vector of the convolutional layer.

In the classification phase, the number of features equals the number of filters used. These features are flattened into a feature vector and passed through a softmax layer to make the final prediction. I.e., the output \( \mathbf{f} \) of the convolutional layer is fed to a softmax classifier to predict a (K+1)-class distribution over labels \( \hat{\mathbf{y}} \):

$$ \hat{\mathbf{y}} = \text{softmax} \left(\mathbf{f}\, \mathbf{W}_{y} + \mathbf{b}_{y} \right) $$

(10)

where Wy and by are the parameters of the network to be learned.
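A compact NumPy sketch of the swCNN forward pass in Eqs. (7)–(10) follows, assuming each of the k paths has already been embedded into a (length × d) matrix. The sliding scheme is simplified relative to the 'move by dependency unit' behaviour described above, and all parameter names are placeholders rather than the trained model.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def swcnn_forward(paths, W_c, b_c, W_y, b_y, region=3):
    """Shared-weight convolution over k path matrices, filter-wise max pooling,
    and a softmax classifier. `paths` is a list of k arrays of shape (length, d);
    W_c: (region * d, N) shared filters, b_c: (N,); W_y: (N, K+1), b_y: (K+1,)."""
    n_filters = W_c.shape[1]
    f = np.full(n_filters, -np.inf)
    for X in paths:                                    # the same W_c/b_c for every path
        for j in range(X.shape[0] - region + 1):
            window = X[j:j + region].reshape(-1)       # Eq. (8): concatenated region
            conv = np.maximum(0.0, window @ W_c + b_c) # convolution step with ReLU
            f = np.maximum(f, conv)                    # Eq. (9): filter-wise max pooling
    return softmax(f @ W_y + b_y)                      # Eq. (10): (K+1)-class distribution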

The proposed model can be stated as a parameter tuple θ=(W,b). To compute the model parameters θ, we define the training objective for a data sample as:

$$ L(\theta) = -\sum_{i=1}^{K+1} \mathbf{y}_{i} \log \hat{\mathbf{y}}_{i} + \lambda \left\| \theta \right\|^{2} $$

(11)

where \( \mathbf{y}\in \mathbb{R}^{K+1} \) is the one-hot vector representing the ground-truth label, and λ is a regularization coefficient.
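For completeness, Eq. (11) as a small sketch; the regularisation coefficient used here is illustrative, not the tuned value.

import numpy as np

def training_loss(y_true_onehot, y_pred, params, lam=1e-4):
    """Eq. (11): cross-entropy over the K+1 classes plus L2 regularisation,
    where `params` is an iterable over the arrays in the parameter tuple theta."""
    ce = -np.sum(y_true_onehot * np.log(y_pred + 1e-12))   # small epsilon for stability
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return ce + l2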

Additional techniques

Ensemble mechanism

Overfitting is one of the most notable problems of deep learning models. It happens when a neural network learns its training set very well but cannot generalize beyond it (the generalization problem). The ensemble method [32] is one of the most effective paradigms for reducing variance; it helps to avoid overfitting and improves the stability and accuracy of the model. Moreover, random initialization has been shown to affect a model's performance on unseen data, i.e. individual training runs may perform substantially better (or worse) than the averaged results [17, 28, 33]. An ensemble mechanism was found to reduce this variability while yielding better performance than simple averaging [17].

In this paper, we use a strict majority vote – a simple but effective ensemble method that has been used successfully in related work [28, 33]. Our ensemble system runs the model 20 times and uses the strict majority vote to obtain the final results.
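A sketch of the strict majority vote over the runs is shown below; the fallback to the negative label when no strict majority exists is our assumption for illustration, not a detail stated in the paper.

from collections import Counter

def strict_majority_vote(predictions_per_run, default_label="negative"):
    """predictions_per_run: list of N label lists (one list per run, aligned by instance).
    A label is accepted only if more than half of the runs predict it; otherwise the
    default label is used."""
    n_runs = len(predictions_per_run)
    final = []
    for votes in zip(*predictions_per_run):           # votes for one instance across runs
        label, count = Counter(votes).most_common(1)[0]
        final.append(label if count > n_runs / 2 else default_label)
    return final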

Distant supervision learning

Distant supervision learning has been shown to benefit relation classification by exploiting knowledge bases in several studies [17, 23, 24]. In this work, we also apply distant supervision learning to the proposed subgraph models.

In order to take advantage of available resources, we do not rebuild the distant data ourselves. Instead, we use the CTD-Pfizer dataset [34], which has been applied successfully in [17, 24]. Since this dataset does not contain entity annotations, we used the DNorm [35] and tmChem [36] tools to annotate the entities. The dataset contains 18,410 documents with 33,224 CID pairs (15,439 unique).

Experimental configuration and model hyper-parameters

Our model was implemented in Python 3.5 and TensorFlow v1.15.0 (Footnote 2). Dependency trees are generated using spaCy (Footnote 3). To generate the document subgraphs, we set the maximum depth md=15 and the maximum number of paths k=150 for the breadth-first search in the path-finding phase. Widening w beyond 5 may introduce a lot of noisy information and cause a computational burden; we therefore limit the sliding window size w to at most 5, i.e., we exclude all entity pairs that are more than 5 consecutive sentences apart. Heuristically, we choose the top-k paths with k=3 for each entity pair.

The shared-weight CNN employs the Adam optimizer [37] and Glorot uniform random initialization [38]. The mini-batch size is set to 128. A survey of the data revealed an undesirable consequence of the subgraph representation: an unexpected increase in negative data. For the intra sentence problem, the ratio of positive to negative instances is about 1:2, but with the subgraph representation this ratio becomes 1:2.95, 1:3.53, 1:3.85, 1:4.05 and 1:4.20 for window sizes 1, 2, 3, 4 and 5, respectively (note that the title is always connected to the first sentence in the sliding window). This leads to an imbalanced data problem, which may hurt system performance through bias toward the negative label. To reduce its impact, we assign class weights that give priority to the minority (positive) class. Since we currently cannot learn these weights automatically, we set them heuristically to 3:1 for positive:negative.
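One simple way to realise the 3:1 class weighting is to scale each example's loss by its class weight, as sketched below; the original implementation may apply the weights differently (for example inside the optimizer or sampler).

import numpy as np

# Class weights giving priority to the minority (positive) class, 3:1 as stated above.
CLASS_WEIGHTS = {"positive": 3.0, "negative": 1.0}

def weighted_cross_entropy(y_true_onehot, y_pred, label):
    """Weighted version of the cross-entropy term in Eq. (11): an example's loss is
    scaled by its class weight, so errors on positive instances cost three times more."""
    ce = -np.sum(y_true_onehot * np.log(y_pred + 1e-12))
    return CLASS_WEIGHTS[label] * ce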

We fine-tuned our deep learning model using the training and development subsets (as described in Table 1). The optimized hyper-parameters are shown in detail in Table 2. For the final results, we run the training process 100 times with these configurations and report the average results over the 100 runs. Each training run takes about 17.5 hours; prediction on the BC5 test set with a trained model takes about 2 minutes.

Table 2 Tuned hyper-parameters of the proposed model

We also apply several techniques to mitigate overfitting, including max-norm regularization for gradient descent [11]; adding Gaussian noise [13] with a mean of 0.001 to the input embeddings; applying dropout [39] at a rate of 0.5 after all embedding and CNN layers; and early stopping [40].
