A Boundary Regression Model for Nested Named Entity Recognition

In our experiments, the ACE 2005 corpus [3] and the GENIA corpus [46] are adopted to evaluate the BR model. To show the performance of the BR model in recognizing flattened NE structures, in the “Ablation Study” section, the BR model is also evaluated on the OntoNotes 5.0 [47] and CoNLL 2003 English [48] corpora.

The ACE 2005 corpus is collected from broadcasts, newswires and weblogs. It is the most popular source of evaluation data for NE recognition. The corpus contains three datasets: Chinese, English and Arabic. In this paper, the BR model is mainly evaluated on the ACE Chinese corpus. To show the extensibility of the BR model regarding other languages, it is also evaluated on the ACE English corpus and the GENIA corpus.

The GENIA corpus is collected from biomedical literature. It contains 2000 MEDLINE abstracts retrieved from PubMed using three medical subject heading terms: human, blood cells and transcription factors. This dataset contains 36 fine-grained entity categories. In the GENIA corpus, many NEs have discontinuous structures. They are transformed into nested structures by treating each discontinuous NE as a single mention.

In the ACE Chinese dataset, there are 33,238 NEs in total. The number of NEs in the ACE English dataset is 40,122. The GENIA corpus is annotated with 91,125 NEs. The distributions of NE lengths in the three corpora are shown in Fig. 4.

Fig. 4 Distributions of NE lengths

In the basic network, the default sentence length L is 50. Longer sentences are trimmed and shorter ones are padded. In the total loss function, \(\alpha =1\) is used. Two BERT\(_{\text{base}}\) [49] models are fine-tuned on the innermost and outermost NE recognition tasks, respectively. Every sentence is then encoded into two sequences of vectors by the two tuned BERT models, where every word in a sentence is encoded as a concatenated \(2\times 768\) dimensional vector. This vector is fed into a Bi-LSTM layer, which outputs a \(2\times 128\) dimensional recurrent feature map. In the training process, word representations are fixed and not subject to further tuning.
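The fixed-length preprocessing described above can be sketched in a few lines (a minimal illustration; the function and token names are ours, not from the released code):

```python
# Sketch of the fixed-length preprocessing: sentences longer than L
# are trimmed, shorter ones are right-padded (assumed pad token).
L = 50  # default sentence length used in the basic network
PAD = "<pad>"

def pad_or_trim(tokens, max_len=L, pad_token=PAD):
    """Trim a token sequence to max_len, or right-pad it with pad_token."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))
```

The padded/trimmed sequence is then what the two BERT encoders consume.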

In region proposal, two strategies have been introduced: exhaustive enumeration and interval enumeration, which correspond to two BR models referred to as “BR\(_{E}\)” and “BR\(_{I}\)”. The BR\(_{E}\) model exhaustively enumerates all NE candidates with lengths up to 6; NEs with lengths larger than 6 are ignored. In the BR\(_{I}\) model, we intermittently enumerate bounding boxes from left to right with lengths [1, 3, 5, 7, 11, 15, 20]. To collect the positive bounding box set \(\mathbf{B}_p\) used to train the linear layer, \(\gamma\) is set as 0.7 and 0.6 for BR\(_{E}\) and BR\(_{I}\), respectively. The quantitative test used to set \(\gamma\) is discussed in the “Influence of IoU” section.
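The two proposal strategies can be sketched as follows (an illustrative sketch with our own function names; spans are written as (start, length) pairs):

```python
def exhaustive_spans(n, max_len=6):
    """Exhaustive enumeration: every span with length <= max_len."""
    return [(s, l) for s in range(n)
            for l in range(1, max_len + 1) if s + l <= n]

def interval_spans(n, lengths=(1, 3, 5, 7, 11, 15, 20)):
    """Interval enumeration: spans only at the listed lengths."""
    return [(s, l) for s in range(n)
            for l in lengths if s + l <= n]
```

For a 50-word sentence, exhaustive enumeration yields far more candidates than interval enumeration, which is the trade-off between the two models.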

In the output layer, a correct NE requires that its start and end boundaries be precisely identified. Because the BR model uses a regression operation to predict the spatial locations of NEs in a sentence, all entity locations are mapped into the interval [0, 1] for a smooth learning gradient. The output of the BR model is therefore rounded to the nearest character location.
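The mapping between absolute character positions and the normalized interval [0, 1] can be sketched as (a simplified illustration; names are ours):

```python
def to_relative(start, length, sent_len):
    """Map absolute character positions into [0, 1]."""
    return start / sent_len, length / sent_len

def to_absolute(rel_start, rel_len, sent_len):
    """Round real-valued predicted positions back to character indices."""
    return round(rel_start * sent_len), round(rel_len * sent_len)
```

Rounding recovers exact character boundaries whenever the predicted offsets are close enough to a true location.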

Comparison with Related Work

To show the superiority of our model, we first compare the BR model with related work. The BR model is first evaluated on the Chinese corpus. Then, to show the extensibility of this model, it is adapted to the English corpora for further assessment.

Evaluation on the Chinese Corpus

On the Chinese corpus, we first implement a popular sequence model (Bi-LSTM-CRF) [50]. It consists of an embedding layer, a Bi-LSTM layer, an MLP (multilayer perceptron) layer and a CRF layer. The embedding layer and Bi-LSTM layer have the same settings as the basic network of the BR model. We adopt cascading and layering strategies to solve the nesting problem [20]. The layering model proposed by Lin et al. [11] is adopted for comparison.

On the Chinese ACE corpus, BA is a pipeline framework for nested NE recognition that has achieved state-of-the-art performance [9]. The original BA is a “Shallow” model, which uses a CRF model to identify NE boundaries and a maximum entropy model to classify NE candidates. NNBA is a neural network version, where the LSTM-CRF model is adopted to identify NE boundaries, and a multi-LSTM model is adopted to filter NE candidates.

In this experiment, the Adam optimizer is adopted. The learning rate, weight decay rate and batch size are set as 0.00005, 0.01 and 30, respectively. Shallow models refer to CRF-based models. In the BR model, we use the same settings as those used by Chen et al. [9] to configure the basic neural network, where BERT is adopted to initialize word embeddings. These models are implemented with the same data and settings as Chen et al. [9]. The results are shown in Table 3.

Table 3 Evaluation in the Chinese corpus

Table 4 Evaluation in the English corpus

In Table 3, all deep models outperform shallow models because neural networks can effectively utilize external resources by using a pretrained lookup table and have the advantage of learning abstract features from raw input. In deep models, the performances of the innermost and outermost models are heavily influenced by a lower recall rate, which is caused by ignoring nested NEs. The deep cascading model also suffers from poor performance because predicting every entity type by an independent classifier does not make full use of the annotated data. The deep layering model is impressive. This model is produced by implementing two independent classifiers that separately recognize the innermost and outermost NEs. It offers higher performance, even outperforming the NNBA model. The reason for the improvement is that, in our experiments, entities with lengths exceeding 6 are ignored, which decreases the nesting ratio. Most of the nested NEs have two layers, which can be handled appropriately by the layering model. In Table 3, the BR model exhibits the best performance.

The Chinese language is logographic. It contains very little morphological information (e.g. capitalization) to indicate word usage. Because there is no delimitation between words, it is difficult to distinguish monosyllabic words from monosyllabic morphemes. However, the Chinese language has two distinctive characteristics. First, Chinese characters are similarly shaped squares; they are known as square-shaped characters, and their locations are uniform. Second, because the meaning of a Chinese word is usually derived from the characters it contains, every character is informative. Therefore, character representation can effectively capture the syntactic and semantic information of a sentence, and the BR model works well on the Chinese corpus.

Evaluation on the English Corpus

On the ACE English corpus and the GENIA corpus, we adopt the same settings as those used by Lu and Roth [32] to evaluate the BR model, where the evaluation data are divided in the proportion 8:1:1 for training, development and testing. On the GENIA corpus, researchers often report performance with respect to five NE types (DNA, RNA, protein, cell line and cell type). To compare with existing methods, we also generate results for these five NE types.

As shown in Table 4, Lu and Roth [32] and Katiyar and Cardie [51] represented nested NEs as hypergraphs. Ju et al. [34] fed the output of a BiLSTM-CRF model to another BiLSTM-CRF model. This strategy generates layered labelling sequences. The stack-LSTM [33] uses a forest structure to model nested NEs. Then, a stack-LSTM is implemented to output a set of nested NEs. Sequence-to-nuggets [11] first identifies whether a word is an anchor word of an NE with specific types. Then, a region recognizer is implemented to recognize the range of the NE relative to the anchor word. Xia et al. [6] and Fisher and Vlachos [52] are pipeline frameworks. They first generate NE candidates. Then, all candidates are further assessed by a classifier. Shibuya and Hovy [53] iteratively extracted the entities outermost to innermost. Strakova et al. [36] encode an input sentence into a vector representation. Then, a label sequence is directly generated from the sentence representation. Wang et al. [54] used a CNN to condense a sentence into a stacked hidden representation with a pyramid shape, where a layer represents NE candidate representations with different granularities. Shen et al. [56] proposed a two-stage method, where NE candidates are located by a linear layer, then fed into a classifier for prediction. These models are all nesting-oriented models. Their performances are listed in Table 4.

Table 4 shows that the performance on the GENIA corpus is lower than that on the ACE corpus. There are three reasons for this phenomenon. First, the GENIA corpus was annotated using discontinuous NEs. For the example mentioned in the “Motivation” section, “HEL, KU812 and K562 cells” contains two discontinuous NEs. Second, in the GENIA corpus, nested NEs may occur in a single word. For example, “TCR-ligand” is annotated as an “other_name” entity, and it is nested with a “TCR” protein. Third, a large number of abbreviations are annotated in the GENIA corpus, which brings about a serious feature sparsity problem. Therefore, the performance is lower on the GENIA corpus.

In related work, many models, e.g. Xu and Jiang [7], Sohrab and Miwa [5], Xia et al. [6] and Tan et al. [8], also exhaustively verify every possible NE candidate with a length of up to 6, because limiting the length of NEs reduces the influence of negative instances. As a result, these models achieve higher performance. In comparison, the BR\(_{E}\) model achieves significantly improved performance. In the testing data, the ratios of NEs with lengths [1, 3, 5, 7, 11, 15, 20] on the ACE English and GENIA corpora are 79.47% and 58.34%, respectively (the ratio in the ACE Chinese corpus is 39.89%). Therefore, the BR\(_{I}\) model also achieves competitive performance on the ACE English and GENIA corpora.

In Table 4, all neural network-based models exhibit higher performance. Especially in the BERT-based models, the performance is improved considerably. Li et al. [55] presented a model based on machine reading comprehension, where manually designed questions are required to encode NE representations. It achieves higher performance on the GENIA corpus. However, because this model benefits from prior knowledge and experience, which essentially introduce descriptive information about the categories, it is rarely used for comparison with related work. In comparison with related work on the English corpora, the BR model also shows competitive performance.

Table 5 Feasibility of boundary regression

Ablation Study

In natural language processing, continuous location representation, which denotes the positions of linguistic units in a sentence, has not previously been used. Therefore, the regression operation is rarely used to support information extraction. To our knowledge, the BR model represents the first attempt to locate linguistic units in a sentence by a regression operation. To analyse the mechanism of boundary regression for nested NE recognition, we design a traditional NE classification model named the bounding box classifier (BBC) for comparison. It is generated by omitting the linear layer from the deep architecture in Fig. 3. In the output, only a softmax layer is adopted to predict the class probability for every bounding box.

In this section, three experiments are conducted to show the usefulness of the regression operation. We first conduct two ablation studies to show the feasibility of boundary regression. In the first experiment, exhaustive enumeration is adopted in region proposal. The BBC model and the BR model are compared to show the ability of the BR model to refine the spatial locations of NEs in a sentence. In the second experiment, the BR model is implemented on intermittently enumerated bounding boxes. The experiment shows the ability of the BR model to locate true NEs from mismatched NE candidates. The BR model is mainly designed to support nested NE recognition. It can also be used to recognize flattened NEs. Therefore, in the third experiment, we evaluate the BR model on flattened NE recognition. The first experiment and the second experiment are conducted on the ACE Chinese corpus. The third experiment is implemented on two English corpora with flattened NE annotations: the OntoNotes 5.0 [47] corpus and the CoNLL 2003 [48] corpus.

Performance with Exhaustive Enumeration

In this experiment, we compare the BR\(_{E}\) model with two BBC models: BBC (0.7) and BBC (1.0). The BBC (0.7) model is implemented on the same evaluation data as the BR model, with \(\gamma =0.7\) used to collect positive bounding boxes. In the BBC (1.0) model, \(\gamma =1.0\) is applied. Under this setting, the positive and negative bounding box sets can be denoted as \(\mathbf{B}_G\) and \((\mathbf{B}_p \cup \mathbf{B}_n)-\mathbf{B}_G\), which means that every positive bounding box precisely matches a ground truth box. Therefore, the BBC (1.0) model is a traditional classifier implemented on precisely annotated evaluation data.
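Under these settings, the positive/negative split reduces to a one-dimensional IoU test against the ground truth boxes, which can be sketched as (our own names; spans are (start, length) pairs):

```python
def iou_1d(box, gt):
    """IoU between two 1-D spans given as (start, length)."""
    s1, l1 = box
    s2, l2 = gt
    inter = max(0, min(s1 + l1, s2 + l2) - max(s1, s2))
    union = l1 + l2 - inter
    return inter / union if union else 0.0

def split_boxes(boxes, gt_boxes, gamma):
    """A box is positive if it overlaps some ground truth with IoU >= gamma."""
    pos, neg = [], []
    for b in boxes:
        (pos if any(iou_1d(b, g) >= gamma for g in gt_boxes) else neg).append(b)
    return pos, neg
```

With \(\gamma =1.0\), only boxes identical to a ground truth box pass the test, which reproduces the BBC (1.0) setting.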

We implement the BBC models and the BR\(_{E}\) model with the same data and settings. The results are shown in Table 5, where the “number” column refers to the number of annotated NEs in the corpus. The performance is reported with respect to 7 true entity types. The “total” column denotes the microaverage over all entity types.

In NE recognition, a correct output requires that both the start and end boundaries be precisely matched to a manually annotated NE. Because the BBC model is a traditional classifier that cannot regress mismatched boundaries, as Table 5 shows, it suffers from significantly diminished precision caused by mismatched NE boundaries. The BBC (1.0) model is implemented on the evaluation data with \(\gamma =1.0\), where boundaries of positive bounding boxes are precisely matched to true NEs. The result in Table 5 shows that, in comparison with the BBC (0.7) model, BBC (1.0) achieves higher performance.

In the BR\(_{E}\) model, because bounding boxes in \(\mathbf{B}_p\) have a high overlapping ratio with a ground truth box, they contain sufficient semantic features with respect to a true NE to support boundary regression. In the prediction process, mismatched boundaries of bounding boxes can approach a ground truth box through the regression operation. In comparison with the BBC (0.7) model, mismatched boundaries can be revised, which considerably improves performance. This result indicates that the regression operation truly regresses boundaries and locates NEs in a sentence.

Comparing the BR\(_{E}\) model with the BBC (1.0) model, all NEs with lengths up to 6 are enumerated and verified. Under this condition, in the prediction process of the BR\(_{E}\) model, approaching an NE that has already been verified is less helpful for improving performance. However, because the BR\(_{E}\) model can refine the spatial locations of bounding boxes in \(\mathbf{B}_p\) and shares model parameters in the bottom network, it achieves a higher recall ratio, which improves its final performance.

Performance with Interval Enumeration

In the second experiment, the BBC (1.0) model is compared with the BR\(_{I}\) model, which only verifies bounding boxes with lengths [1, 3, 5, 7, 11, 15, 20] in the testing dataset. The results are listed in Table 6.

Table 6 Superiority of boundary regression

Because the BBC is a traditional classifier that only assigns a class tag to every NE candidate, it cannot regress NE boundaries to locate possible NEs. Therefore, if a true NE is not enumerated in the testing data, it will be missed by the traditional classifier, which leads to greatly reduced recall. For example, in the ACE Chinese corpus, the sentence “中国要把广西发展为连接西部地区和东南亚的桥梁” (China wants to develop Guangxi into a bridge connecting the western region and Southeast Asia) contains five NEs: “中国” (China), “广西” (Guangxi), “西部地区” (the western region), “西部” (the west), and “东南亚” (Southeast Asia), which correspond to five ground truth boxes: [0, 2, GPE], [4, 2, GPE], [11, 4, LOC], [11, 2, LOC], and [16, 3, GPE]. In the BBC model, only “东南亚” can be enumerated and verified, which considerably worsens its performance.

In the BR\(_{I}\) model, suppose that a true NE is missed in the region proposal process. If it overlaps with one or more bounding boxes, a regression operation can be implemented to refine their spatial locations in the sentence, enabling these boxes to approach the missing true NE. In the previous example, “西部地区” cannot be enumerated in the testing data of the BR\(_{I}\) model. However, it is overlapped by at least two bounding boxes: [11, 3, ?] (“西部地”) and [11, 5, ?] (“西部地区和”), where “?” means that the class tag is unknown. Because their IoU values with the ground truth box [11, 4, LOC] are larger than 0.7, they contain semantic information about “西部地区”, and the softmax layer outputs a high confidence score on “LOC”. Above all, the offsets relative to the ground truth box [11, 4, LOC] are learned, which enables the NE “西部地区” to be correctly recognized.

Evaluation on the Flattened Corpus

In this section, the OntoNotes 5.0 [47] and CoNLL 2003 English [48] corpora are employed to evaluate the performance of the BR model in recognizing NEs with flattened structures. The OntoNotes corpus is collected from a wide variety of sources, e.g. magazines, telephone conversations and newswires. It contains 76,714 sentences and is annotated with 18 entity types. The CoNLL corpus consists of 22,137 sentences collected from Reuters newswire articles. It is divided into 14,987, 3466 and 3684 sentences for training, development and testing.

The BR model is compared with several SOTA models conducted on the OntoNotes and CoNLL corpora. Ma and Hovy [57] used a BiLSTM-CNN-CRF model that automatically encodes semantic features from words and characters. Ghaddar and Langlais [58] also used a BiLSTM-CRF model to learn lexical features from word and entity type representations. Devlin et al. [49] used the BERT framework, which is effective in learning semantic features from external resources. Li et al. [55] presented a model based on machine reading comprehension. Yu et al. [59] used a biaffine model to encode dependency trees of sentences. Luo et al. [60] also used a Bi-LSTM model based on hierarchical contextualized representations. The results are shown in Table 7.

Table 7 Evaluation in the flattened corpus

In Table 7, the compared models are all sequence models. Three of them are directly based on the BiLSTM network. The other three models (BERT, MRC and biaffine) also apply a Bi-LSTM as an inner structure for capturing the semantic dependencies in a sentence. Because sequence models output a maximized labelling sequence for each input sentence, they are effective in encoding the syntactic and semantic structures of a sentence. Therefore, in flattened NE recognition, sequence models achieve the best performance.

Compared with sequence models, the BR model can be seen as a span classification model that applies a regression operation to refine the spatial locations of NEs in a sentence. Because classification is based on enumerated spans, and because of the vanishing gradient problem, the model is weaker at encoding the long-distance semantic dependencies that flattened NEs rely on. Nevertheless, as Table 7 shows, the BR model also achieves competitive performance in flattened NE recognition.

Influence of Model Parameters

Because the IoU value and the NMS algorithm are influential on the BR model, in this section, we conduct experiments to analyse the influences of IoU and NMS.

Influence of IoU

In Eq. 2, a predefined parameter \(\gamma\) is adopted to divide the training data into a positive bounding box set \(\mathbf{B}_p\) and a negative bounding box set \(\mathbf{B}_n\). Every bounding box in \(\mathbf{B}_p\) has a high overlapping ratio with a ground truth box. This overlap ensures that each bounding box contains semantic features about the ground truth box, which are used to train the linear layer. This is the key to supporting boundary regression. As Eq. 5 shows, the location loss, which is computed from \(\mathbf{B}_p\), aggregates all position offsets between each bounding box and its relevant ground truth box. Therefore, the IoU value directly determines the number of bounding boxes used for computing the location loss.
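The dependence of the location loss on the positive set can be illustrated with a simplified sketch. We use a plain L1 offset over matched (box, ground truth) pairs here; the exact form of the paper's Eq. 5 may differ:

```python
def location_loss(pos_boxes, matched_gts):
    """Aggregate position offsets between each positive box (start, length)
    and its matched ground truth box. A plain L1 form for illustration."""
    loss = 0.0
    for (s, l), (gs, gl) in zip(pos_boxes, matched_gts):
        loss += abs(s - gs) + abs(l - gl)
    return loss
```

When \(\gamma =1.0\), every positive box equals its ground truth, so every offset term, and hence the loss, is zero, which is exactly why the linear layer then receives no training signal.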

This experiment is conducted to analyse the influence of the IoU value \(\gamma\) on the final performance. Because \(\gamma =0.0\) cannot be used to collect positive bounding boxes, the value is initialized from 0.1 to 1.0 with a step size of 0.1. The result is shown in Fig. 5.

Fig. 5

In both the BR\(_{E}\) model and the BR\(_{I}\) model, if \(\gamma\) has a small value, \(\mathbf{B}_p\) contains many bounding boxes with small overlapping ratios with a true NE. These bounding boxes contain insufficient semantic features with respect to the true NE, so the regression operation cannot guarantee appropriate learning of the location offset, which worsens performance. This result indicates that a bounding box that is farther away from any ground truth box is less helpful for boundary regression.

The BR\(_{E}\) model achieves high performance when \(\gamma\) is approximately 0.7. When \(\gamma > 0.7\), the output of the BR\(_{E}\) model exhibits stable performance. The reason for this phenomenon is that, when the value of \(\gamma\) is large enough, \(\mathbf{B}_p\) contains almost exclusively enumerated ground truth boxes. As Eq. 5 reveals, the influence of regression is thereby weakened. Because the BR\(_{E}\) model verifies every NE candidate with a length from 1 to 6, in this condition it almost degenerates into a traditional classification model, and its performance depends heavily on the output of the softmax layer.

In the BR\(_{I}\) model, the highest performance is achieved when \(\gamma\) is approximately 0.6. When the value of \(\gamma\) exceeds 0.6, the performance is considerably diminished. In the BR\(_{I}\) model, a large \(\gamma\) means that \(\mathbf{B}_p\) contains a smaller number of positive bounding boxes, whose boundaries almost precisely match ground truth boxes. In particular, when \(\gamma =1.0\), \(\mathbf{B}_p\) only contains ground truth boxes. As Eq. 5 shows, the location loss is then always zero in the training process, so the linear layer cannot be trained appropriately.

Influence of NMS

In the testing process, the BR model adopts a one-dimensional NMS algorithm to select true bounding boxes from the output (as Table 2 shows). The NMS algorithm was originally designed to support object detection in computer vision, where a rectangle is adopted to frame an object. One difference between object detection and entity recognition is that, when detecting an object, a rectangle is permitted to have mutual overlap with the reference object. In contrast, recognizing an NE requires that both the start and end boundaries of the NE be precisely matched. In this experiment, we study the influence of NMS on nested NE recognition. The result is shown in Fig. 6.
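A minimal sketch of such a one-dimensional NMS is given below. This is our own greedy, per-class variant with assumed names; the paper's exact procedure may differ in detail:

```python
def nms_1d(boxes, lam):
    """Greedy 1-D NMS. boxes: (start, length, cls, score) tuples.
    Keep boxes in descending score order; drop a box if a kept box of the
    same class overlaps it with IoU greater than the threshold lam."""
    def iou(a, b):
        inter = max(0, min(a[0] + a[1], b[0] + b[1]) - max(a[0], b[0]))
        return inter / (a[1] + b[1] - inter)

    kept = []
    for b in sorted(boxes, key=lambda x: -x[3]):
        if all(k[2] != b[2] or iou(k, b) <= lam for k in kept):
            kept.append(b)
    return kept
```

With a small \(\lambda\), closely overlapping boxes of the same class are aggressively merged; with \(\lambda = 1.0\), only fully overlapping duplicates are removed.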

Fig. 6

The results show that the lowest recall is obtained when \(\lambda =0.0\) because many bounding boxes are discarded. When \(\lambda >0.1\), increasing \(\lambda\) slowly improves the performance. Because bounding boxes belonging to a true NE overlap closely, if \(\lambda\) is not large enough, increasing \(\lambda\) exerts little influence on the performance. Therefore, stable performance is achieved when \(\lambda\) takes a value from 0.1 to 0.6. The BR models achieve the best performance at approximately \(\lambda =0.6\). Comparing the BR\(_{E}\) model with the BR\(_{I}\) model, the performance of the BR\(_{E}\) model decreases when \(\lambda > 0.6\). The reason is that BR\(_{E}\) exhaustively enumerates all NE candidates with lengths up to 6, so its output contains a large number of bounding boxes that have precisely matched boundaries.

In the NE recognition task, identifying an NE heavily depends on its contextual features. Therefore, highly overlapped bounding boxes may refer to different true NEs, which will be discarded when \(\lambda \ge 0.6\). This problem can be avoided by setting \(\lambda = 1.0\), which is equivalent to disabling the NMS algorithm: only fully overlapped bounding boxes are treated as redundant and removed. Because many bounding boxes are retained even when they have high overlapping ratios, this setting yields higher recall. However, it worsens the precision. As shown in Fig. 6b, \(\lambda = 1.0\) leads to a poor F1 score.

Time Complexity of Boundary Regression

In object detection in computer vision, compared with multistage pipeline models (e.g. R-CNN [61]), an end-to-end framework (e.g. Faster R-CNN [16]) is employed due to its superior speed. The reason is that the background of an image is learned in a single pass in the training process and is shared by all proposed regions in the image.

To show the time complexity of boundary regression, in this experiment, we compare our model with those of Zheng et al. [10] and Wang et al. [54]. Zheng et al. [10] presented a boundary-aware neural model that detects entity boundaries. Boundary-relevant regions are then utilized to predict entity categorical labels. The boundary detection and region prediction share the same bidirectional LSTM for feature extraction. Wang et al. [54] presented a pyramid-shaped model stacked with neural layers. This model directly implements NE span prediction. Therefore, it has a higher speed.

In this experiment, we implement these models on the ACE English corpus with the same data split, settings and GPU platform. The times required to train these models are shown in Fig. 7, where the height of the histogram represents the time cost in seconds.

Fig. 7 Time complexity comparison

In comparison with the two models described above, boundary regression achieves the lowest time cost. The BR model has two characteristics that support high-speed recognition. First, feature maps are generated from a basic network and shared by all bounding boxes in a sentence. In fact, all bounding boxes mutually overlap; they are parts of the shared feature maps, which considerably reduces the number of model parameters. Second, every bounding box has location parameters. Therefore, in the learning process, the IoU value can be adopted to filter negative bounding boxes. This strategy is effective in reducing the time complexity.

Visualization of Boundary Regression

For a better understanding of boundary regression and to investigate more details of the BR model, in the following, we present a visualization of boundary regression.

The sentence “埃及是中东地区最重要的国家” (Egypt is the most important country in the Middle East area) is selected from the testing data. It contains four nested NEs: “埃及” (Egypt, GPE), “中东地区最重要的国家” (the most important country in the Middle East area, GPE), “中东地区” (the Middle East area, LOC), and “中东” (the Middle East, GPE). A bounding box is denoted by 3 parameters \(s_i\), \(l_i\) and \(c_i\), which represent the starting position of the box, the length of the box and the class probability of the box, respectively. To visualize a bounding box, it is drawn as a rectangle. The horizontal ordinate represents the boundary positions of the bounding boxes in a sentence, normalized to [0, 1]. The vertical coordinate represents the classification confidence score. The colours of the bounding boxes represent NE types. To generate bounding boxes, the selected sentence is predicted by a pretrained BR model. All output bounding boxes are collected and drawn with respect to the sentence. The result is shown in Fig. 8.

Fig. 8 Visualization of the bounding box regression

In Fig. 8a, bounding boxes are predicted by the BR model without training (0 iterations). From Fig. 8b to f, the BR model is trained for different numbers of rounds (denoted in the titles of the subfigures). Because the regression operation may output negative values for the parameters \(s_i\) and \(l_i\), we filter out bounding boxes with \(l_i \le 0\) or \(s_i+l_i > 1\) (beyond the sentence range).
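This filtering step can be sketched as follows (an illustrative snippet with our own names; boxes are (start, length, confidence) triples in normalized coordinates):

```python
def filter_boxes(boxes):
    """Discard boxes with a non-positive length or extending past the
    sentence range, with positions normalized to [0, 1]."""
    return [(s, l, c) for s, l, c in boxes if l > 0 and s + l <= 1]
```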

In Fig. 8a, there is no tendency among the bounding boxes; they are distributed evenly across the whole sentence and all NE types. In Fig. 8b, the BR model is trained on the training data for only one round. One interesting phenomenon is that the red and blue bounding boxes are quickly grouped around NEs, while bounding boxes of other entity types are appropriately suppressed. From Fig. 8c to f, as the number of iterations increases, two tendencies appear with respect to the bounding boxes. First, the BR model becomes more confident in entity type prediction, which increases the classification confidence of the bounding boxes. Second, the locations of the bounding boxes approach the true NEs. This indicates that the regression operation gradually moves bounding boxes towards the true NEs in a sentence.
