Transfer language space with similar domain adaptation: a case study with hepatocellular carcinoma

Datasets

Stanford US dataset

With the approval of the Stanford Institutional Review Board (IRB), we collected all free-text radiology reports of abdominal ultrasound examinations performed at Stanford Hospital between August 2007 and December 2017. In total, 92,848 US reports were collected over 10 years, with an average of roughly 9,000 US exams performed every year. Among them, 13,860 exams were performed for HCC screening. A total of 1,744 abdominal US studies were coded with the US LI-RADS reporting format, with a unique LI-RADS score reported in the impression section.

EUH MRI dataset

With the approval of the Emory University IRB, we retrieved 10,018 MRI exams performed between 2010 and 2019 at EUH for HCC screening. Among these, only 548 studies were reported using the LI-RADS structured template, with a unique LI-RADS score reported in the impression section (Fig. 2a). Of the LI-RADS-coded reports, 99% were malignant cases (LI-RADS score > 2), since benign cases are often not coded with LI-RADS. The remaining 9,470 abdominal MRI exams were documented as free-text narratives in which the final diagnosis was recorded without following any structured scoring schema (Fig. 2b). Since benign cases represented only 1% of the LI-RADS-coded reports, two radiologists manually annotated 537 benign cases from the EUH MRI dataset to obtain a representative sample. We selected benign cases from reports performed after 2018 in order to match the report structure of the annotated malignant cases.

Fig. 2 Sample MR reports: (a) Sample with LI-RADS structured template, (b) Sample free-text report

Synopsis of the datasets

In Table 1, we present a synopsis of the Stanford US dataset and the EUH MRI dataset in terms of report-level and word-level statistics, which reflect a slight diversity in reporting style. For instance, the number of words per report ranged from 24 to 331 in the US dataset, while for the MR dataset it varied from 11 to 2,116. The same observation holds for the number of sentences per report. It is also interesting to note that there were 1,790 words common to the MR and US vocabularies.

Table 1 Statistics of the cohorts before processing - Stanford US dataset and EUH MRI dataset

Annotated datasets

We evaluated the effectiveness of our transfer learning scheme on the task of HCC malignancy scoring. Malignancy information is available for the following sets of reports.

i) Templated US reports from the Stanford dataset: These reports are associated with LI-RADS scores, where a score > 2 indicates that the lesion is not definitely benign. We experimented with a collection of 1,462 reports split into training and test sets. The test set contains 29 reports with the ‘malignant’ label and 264 reports with the ‘benign’ label.

ii) Templated MR reports from the EUH MRI dataset: We have a total of 944 MR reports with associated LI-RADS scores, where again a score > 2 indicates that the lesion is not definitely benign. This set is split into training and test sets. The test set contains 81 reports with the ‘malignant’ label and 108 reports with the ‘benign’ label.

iii) US reports without template from the Stanford dataset: We randomly sampled 142 US reports, and two expert radiologists assigned each selected report a LI-RADS score (Cohen’s kappa = 0.85). 11 reports were labelled ‘malignant’ while the remaining 131 were labelled ‘benign’. The malignancy prediction model trained on structured US reports is tested on these unstructured reports.

iv) MR reports without template from the EUH MRI dataset: We randomly sampled 112 unstructured MR reports with no LI-RADS scores. Two expert radiologists assigned LI-RADS scores to these reports, with malignancy indicated by LI-RADS > 2 (Cohen’s kappa = 0.92). 21 reports were labelled ‘malignant’ while the remaining 91 were labelled ‘benign’. We test the malignancy prediction model trained on structured MR reports on these sampled unstructured reports.

Report pre-processing

Segmentation

We designed a Python-based liver section segmentation module that works with both MRI/CT and US reports. The module uses a combination of regular expressions and dictionary-based sentence retrieval, with anatomical vocabularies derived from the Foundational Model of Anatomy (FMA) [16], to extract only the findings related to the liver and its sub-regions from the whole report. The module maintains dependencies between anatomical entities (e.g., all lesions found within the liver are extracted even if they are not described in the same paragraph). It was manually validated on 100 randomly selected MRI and 100 randomly selected US reports and obtained perfect accuracy in segmenting the liver section and liver-related statements from both recent (more structured) and historic radiology reports. To perform a valid experiment on the LI-RADS-formatted US and MRI reports, we excluded the Impression section of the reports, since the final LI-RADS assessment category is often reported explicitly in the Impression. The Findings section includes only the imaging characteristics of the liver abnormalities and thus contains no clear-cut statement of the final LI-RADS assessment class.
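The segmentation code itself is not published with the paper; the following is a minimal sketch of the regex-plus-dictionary idea, using a small hand-picked stand-in for the FMA-derived vocabulary and omitting the anatomical dependency tracking described above.

```python
import re

# Hand-picked stand-in for the FMA-derived liver vocabulary; the real
# module uses anatomical terms extracted from the Foundational Model
# of Anatomy and also tracks dependencies between anatomical entities.
LIVER_TERMS = {"liver", "hepatic", "hepatis", "lobe", "segment", "lesion"}

def extract_liver_sentences(report: str) -> list[str]:
    """Return Findings-section sentences that mention the liver or its
    sub-regions; the Impression section is excluded entirely."""
    # Keep the text before the Impression header and after the Findings header.
    findings = re.split(r"(?i)\bimpression\s*:", report)[0]
    findings = re.split(r"(?i)\bfindings\s*:", findings)[-1]
    sentences = re.split(r"(?<=[.!?])\s+", findings)
    return [s for s in sentences
            if any(term in s.lower() for term in LIVER_TERMS)]
```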

Text cleaning

We normalize the text by converting it to lowercase, removing general stop words (e.g. ‘a’, ‘an’, ‘are’, ‘be’, ‘by’, ‘has’, ‘he’), and removing words with very low frequency (< 50). We also removed unwanted terms/phrases (e.g. medicolegal phrases such as “I have personally reviewed the images for this examination”); such phrases appear in either all or very few reports and are thus of little to no value for document classification. We used the Natural Language Toolkit (NLTK) library [17] to determine the stop-word list and discarded these words during report indexing. We also discarded radiologist-, clinician-, and patient-identifying details from the reports. Date and time stamps were replaced by a ‘<datetoken>’ token.
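A minimal sketch of this cleaning pipeline, assuming NLTK for the stop-word list; the boilerplate phrase list and the date/time pattern are illustrative stand-ins for the institution-specific rules.

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

# Illustrative stand-ins; the actual boilerplate list is institution-specific.
BOILERPLATE = [r"i have personally reviewed the images for this examination"]
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}(\s+\d{1,2}:\d{2}(:\d{2})?)?\b")

def clean(report: str) -> list[str]:
    text = report.lower()
    text = DATE_RE.sub("<datetoken>", text)           # mask date/time stamps
    for pattern in BOILERPLATE:
        text = re.sub(pattern, " ", text)             # drop medicolegal phrases
    tokens = re.findall(r"<datetoken>|[a-z]+", text)  # keep the mask token intact
    return [t for t in tokens if t not in STOP_WORDS]

def drop_rare(corpus: list[list[str]], min_count: int = 50) -> list[list[str]]:
    """Corpus-level removal of words with frequency below 50."""
    freq = Counter(t for doc in corpus for t in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in corpus]
```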

Language modeling

Language modeling learns vector representations of text such that the semantic characteristics of the text are preserved. Since training language models requires no supervised labels, we were able to use both labelled and unlabelled US and MR reports to train language models for the US and MR domains. We used two approaches for language modeling: context-dependent (BERT) and context-independent (word2vec).

Word2vec language model

The word2vec language modeling scheme captures finer semantic properties of words by learning a vector representation for each word. Instead of a one-to-one mapping between a word and an element of the vector, the representation is distributed across several hundred dimensions, mapping words to a new representation space in which semantically similar words lie closer to each other than semantically dissimilar ones. Our model learns such vector representations by training a skip-gram model with hierarchical softmax [11]. Since training such models requires no labels, LI-RADS scores are not needed for language model training.

We trained two base word2vec models: i) US word2vec and ii) MR word2vec. The US word2vec language model was trained on all US reports from the Stanford US dataset, regardless of the availability of LI-RADS scores. The MR word2vec model was likewise trained on all MR reports from the EUH MRI dataset.
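Using gensim, training the two base models might look like the following; `us_sentences` and `mr_sentences` are hypothetical lists of tokenized reports from the cleaning step, and the vector size, window, and epoch count are assumptions (the paper specifies only the skip-gram architecture with hierarchical softmax).

```python
from gensim.models import Word2Vec

# us_sentences / mr_sentences: tokenized reports from the cleaning step.
# vector_size, window, and epochs are assumptions; the paper specifies
# only skip-gram (sg=1) with hierarchical softmax (hs=1, negative=0).
us_word2vec = Word2Vec(sentences=us_sentences, sg=1, hs=1, negative=0,
                       vector_size=300, window=5, min_count=1, epochs=10)
mr_word2vec = Word2Vec(sentences=mr_sentences, sg=1, hs=1, negative=0,
                       vector_size=300, window=5, min_count=1, epochs=10)

us_word2vec.save("us_word2vec.model")
mr_word2vec.save("mr_word2vec.model")
```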

We also trained two cross-domain language models using transfer learning: i) US-finetuned word2vec, and ii) MR-finetuned word2vec. For US-finetuned word2vec, words of the US domain that are shared with the MR domain are initialized with MR word2vec vectors, and the model is then finetuned on US reports. Similarly, for the MR-finetuned word2vec model, the shared words are initialized with US word2vec vectors and the model is further finetuned on MR reports from the EUH MRI dataset. These models do not have to learn from scratch; instead, they take advantage of language model training performed in a separate but similar domain.
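A sketch of the common-word initialization for the US-finetuned model (the MR direction is symmetric), again using gensim; variable names and hyperparameters are assumptions.

```python
from gensim.models import Word2Vec

# US-finetuned word2vec; the MR-finetuned model is built symmetrically.
mr_base = Word2Vec.load("mr_word2vec.model")

us_ft = Word2Vec(sg=1, hs=1, negative=0, vector_size=300, window=5, min_count=1)
us_ft.build_vocab(us_sentences)

# Copy MR vectors into the words shared by both domains; US-only words
# keep their random initialization, and everything is finetuned on US text.
shared = set(us_ft.wv.key_to_index) & set(mr_base.wv.key_to_index)
for word in shared:
    us_ft.wv.vectors[us_ft.wv.key_to_index[word]] = mr_base.wv[word]

us_ft.train(us_sentences, total_examples=us_ft.corpus_count, epochs=10)
```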

BERT language model

BERT learns bidirectional representations of text by jointly conditioning on both left and right context [13]. We used BERT to train a masked language model optimized to predict masked tokens in sentences. We trained two language models with BERT: i) US BERT, trained on all US reports from the Stanford dataset, and ii) MR BERT, trained on all MR reports from the EUH MRI dataset. As with word2vec, we also trained two cross-domain language models using transfer learning: iii) US-finetuned BERT, and iv) MR-finetuned BERT. For US-finetuned BERT, the words common to the US and MR domains are initialized with MR BERT vectors and the model is finetuned on US reports. Similarly, for the MR-finetuned BERT model, the common words are initialized with US BERT vectors and the model is further finetuned on MR reports from the EUH MRI dataset. BERT models are limited in the length of the input text sequence they can process. Since we employ only the portion of the report discussing the liver as input to all our models, they generally work with shorter text sequences; wherever needed, text is clipped to fit the BERT input limit.
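A sketch of such masked language model training with the Hugging Face transformers API; `us_reports` is a hypothetical list of cleaned liver-section texts, and warm-starting from the public bert-base-uncased checkpoint is an assumption, as the paper does not state the initialization.

```python
import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

class ReportDataset(torch.utils.data.Dataset):
    """Wraps the cleaned liver-section texts; long reports are clipped
    to BERT's 512-token input limit, as described above."""
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, max_length=512)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

# Randomly mask 15% of the tokens and train the model to recover them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="us_bert", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=ReportDataset(us_reports)).train()
```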

Predictive modeling

We experimented with the following three predictive models.

i) Language model vectors + RF classifier: In this setup, we use our language models to generate embeddings for the input reports and then train a discriminative model, a Random Forest (RF) binary classifier, to predict the ‘malignant’ or ‘benign’ label for each input report (see the first sketch following this list).

ii) 1D-CNN model: Preserving context is one of the prominent differences between BERT and word2vec. We therefore applied a one-dimensional convolutional neural network (1D-CNN) with randomly initialized word embeddings; its one-dimensional convolutional filters process textual sequences and can learn important semantic structures such as phrases while avoiding memorization of the entire text sequence. The model consists of an Embedding layer, a one-dimensional Convolutional layer, and Dropout and Dense layers.

iii) Word2vec embedding + 1D-CNN: Normally, the weights of the Embedding layer in a 1D-CNN are initialized randomly, like the weights of all other layers, and then finetuned over the training input-output tuples. We designed a separate classifier by initializing the Embedding layer of the 1D-CNN with the weights of the word2vec model. While the overall classifier is still trained on the input-output tuples of the training data, the Embedding layer is thus able to take advantage of unlabelled data as well, since the word2vec models are trained over both labelled and unlabelled reports. A sketch of this architecture also follows the list.
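The paper does not release code for these classifiers; the sketches below show one plausible realization of each. For setup i), each report is embedded by averaging the word2vec vectors of its in-vocabulary tokens (the pooling strategy is an assumption, as the paper does not specify it); `train_docs`, `test_docs`, and `y_train` are hypothetical tokenized reports and labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def report_embedding(tokens, wv):
    """Average the word2vec vectors of in-vocabulary tokens."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X_train = np.stack([report_embedding(doc, us_word2vec.wv) for doc in train_docs])
X_test = np.stack([report_embedding(doc, us_word2vec.wv) for doc in test_docs])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)            # y_train: 1 = 'malignant', 0 = 'benign'
predictions = rf.predict(X_test)
```

For setups ii) and iii), a minimal Keras 1D-CNN matching the listed layers; the global max-pooling step bridging the convolutional and dense layers is an assumption, as are all layer sizes, and `vocab_size`, `word_index`, and `X_train_padded` are hypothetical artifacts of the tokenization/padding step.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Setup iii): embedding matrix copied from the word2vec model.
# word_index: the tokenizer's token -> integer id mapping.
emb_matrix = np.zeros((vocab_size, 300))
for word, idx in word_index.items():
    if word in us_word2vec.wv:
        emb_matrix[idx] = us_word2vec.wv[word]

model = keras.Sequential([
    # Setup ii) uses the default random initializer instead of Constant(emb_matrix).
    layers.Embedding(input_dim=vocab_size, output_dim=300,
                     embeddings_initializer=keras.initializers.Constant(emb_matrix),
                     trainable=True),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),        # pooling step assumed, not stated
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_padded, y_train, epochs=10, batch_size=32)
```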

Experimental setup

We used the following four experimental setups to thoroughly evaluate the advantages of our transfer learning scheme for the language space.

Setting 1) - MR language model, tested on MR: Under this setting, the language model is trained on MR reports. The trained language model is used to generate feature vectors for MR reports with and without a template. The vectorized reports with a template are used to train the classifier, which is then evaluated on two test sets: MR reports with a template and MR reports without a template. Since we are working with multiple language models and classifiers, the following experiment titles fall under this setting: i) ‘MR Word2Vec+Random Forest Classifier’, ii) ‘MR Word2Vec Embedding+1DCNN’, iii) ‘MR BERT+Random Forest Classifier’.

Setting 2) - MR-finetuned language model, tested on MR: Under this setting, the language model trained on US reports is used to initialize the vectors of the terms common to the US and MR domains. The initialized language model is further finetuned on MR reports to generate vectors for all MR terms, and is then used to generate feature vectors for MR reports with and without a template. The same classifier training and testing process is applied as described for setting 1). The following experiment titles fall under this setting: i) ‘MR-finetuned Word2Vec+Random Forest Classifier’, ii) ‘MR-finetuned Word2Vec Embedding+1DCNN’, iii) ‘MR-finetuned BERT+Random Forest Classifier’.

Setting 3) - US language model, tested on US: Under this setting, the language model is trained on US reports. Feature vectors for all structured (with template) and unstructured (without template) US reports are generated using this trained language model. Feature vectors of reports from the training set of templated US reports are used to train the chosen classifier to detect malignancy. The trained classifier is then evaluated on two test sets: US reports with a template and US reports without a template. The following experiment titles fall under this setting: i) ‘US Word2Vec+Random Forest Classifier’, ii) ‘US Word2Vec Embedding+1DCNN’, iii) ‘US BERT+Random Forest Classifier’.

Setting 4) - US-finetuned language model, tested on US: Under this setting, the language model trained on MR reports is used to initialize the vectors of the terms common to the US and MR domains. The initialized language model is further finetuned on US reports to generate vectors for all US terms, and is then used to generate feature vectors for US reports with and without a template. The same classifier training and testing process is applied as described for setting 3). The following experiment titles fall under this setting: i) ‘US-finetuned Word2Vec+Random Forest Classifier’, ii) ‘US-finetuned Word2Vec Embedding+1DCNN’, iii) ‘US-finetuned BERT+Random Forest Classifier’.

Experiment titles in the Results section follow the terminology presented above for clarity. Note that under all of these settings, the classifier is trained using only reports with a template, while it is tested on two test sets: one consisting of reports with a template and one consisting of reports without a template.
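As a concrete illustration of this protocol, a sketch of Setting 1 with the word2vec + RF pipeline; `embed` stands for the averaging pooling sketched under predictive modeling, and the dataset variables are hypothetical placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train only on templated MR reports; evaluate on both MR test sets.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(embed(mr_templated_train), y_templated_train)

test_sets = {
    "MR with template": (mr_templated_test, y_templated_test),
    "MR without template": (mr_untemplated_test, y_untemplated_test),
}
for name, (docs, labels) in test_sets.items():
    print(name)
    print(classification_report(labels, rf.predict(embed(docs))))
```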
