Topic modeling of maintenance logs for linac failure modes and trends identification

1 INTRODUCTION

Medical linear accelerators (linacs) are the most important equipment in radiotherapy. At their core, these are particle accelerators reconfigured as medical devices. Linacs rely on the same physical principles as the high-energy accelerators used in particle physics research. However, the medical devices operate in hospital environments that pose additional operational challenges owing to the different physical environment and the limited availability of technical support. Operating medical linacs requires skilled personnel to repair, adjust, and otherwise maintain the proper operation of the devices. Further, medical linacs have many subsystems that must all operate faithfully for the device to function correctly. Given their complexity, linacs can fail in many different ways.

It is desirable to understand the nature of medical linac failures for several reasons. First, components used in medical linacs are costly, and improved knowledge of components that fail more often can be of help in projecting service and maintenance costs for medical linacs. Second, the training of the qualified technical staff able to maintain these devices can be simplified with better knowledge of failure modes since emphasis can be placed on areas that fail more frequently. Third, a better understanding of failure modes can help medical linac operators in stocking components that are more likely to be needed in maintenance, which can help reduce repair times.

There have been relatively few studies of failure modes for linacs. Wroe and colleagues1 studied downtime and failure modes for radiotherapy equipment in lower income and developed countries. Sheehy and colleagues2 performed a reliability analysis of radiotherapy equipment in lower income countries. Both studies commented on the difficulty of obtaining data and statistics of sufficient quality to conduct their analysis. The first study relied on the manual review of linac maintenance records, in both electronic and paper form, to estimate failure modes and times between failures. The second study improved on the first by building failure mode models from its results.

Modeling of linac failure modes would be greatly simplified and improved with high-quality, consistent input data for model development. However, medical linac maintenance data are often kept in generic equipment maintenance databases whose primary purpose is to keep records of maintenance, not to classify and analyze the failure modes. The core information in maintenance logs is recorded as narratives by maintenance personnel. In general, these logs describe the repair procedures and maintenance outcomes, making them useful for analysts to evaluate linac performance and identify failures. However, the logs are colloquial, unformatted, and noisy, and may contain spelling and grammatical errors, which makes them unsuitable for general analytical tools. In particular, manual analysis of this type of data is time-consuming when the volume of data is large.

This research was motivated by natural language processing (NLP) applications in the transportation domain, where researchers applied topic modeling (TM), a type of unsupervised learning algorithm, to categorize safety reports with the goal of reducing incidents.3, 4 NLP provides a suite of methods capable of interpreting, evaluating, and generating narratives in human language. The literature on medical record analysis grew modestly during the COVID-19 pandemic. Shah et al.5 investigated patient online reviews on physician rating websites to examine trends in patient concerns arising from COVID-19. A coherence-based TM method was applied to generate topics and corresponding keywords, and the experimental results showed that policymakers can benefit from topic analysis to deal with the COVID-19 crisis efficiently. Kaveh-Yazdy and Zarifzadeh6 investigated the top-ranked public concerns about COVID-19 in Iran. Based on the output of the TM model, the researchers concluded that the major concerns were PCR labs and testing, education-system policy, and personal protection actions such as hand washing and mask wearing. In our study, the emphasis was placed on the maintenance work of linacs, where the TM method was applied to analyze the large volume of unformatted linac maintenance logs to identify the most frequent failure modes.

The purpose of this work is to investigate the feasibility of using TM to analyze electronic medical linac maintenance logs. The main contribution of this article is the application of TM to unstructured maintenance log data to identify the most frequent failure modes of linacs during daily use. A second purpose is to examine the performance of different linacs over time by following the trends of the different failure modes. With a data-driven analysis method, it is hoped that the larger pool of existing medical linac maintenance logs can be used to better understand medical linac failure modes.

2 MATERIALS AND METHODS

2.1 Linac maintenance logs

The maintenance logs used in this study were collected from the linacs of the BC Cancer center, Kelowna (Canada) under the regulations set by the Canadian Nuclear Safety Commission. The linacs were in service from April 1998 until the study date. There were nine linacs in total: four of the original linacs were replaced partway through the study period, and a fifth linac was added in 2009. These linacs were installed in five different treatment rooms, labeled A–E as shown in Table 1, which also lists the dates of service, manufacturer, and model. Of the four original linacs, two were equipped with multi-leaf collimators (MLCs) and amorphous silicon-based electronic portal imaging devices (EPIDs); the other two did not have MLCs but had fluorescence-based portal imaging, which was later upgraded to amorphous silicon EPIDs. The five accelerators in service from 2011 onward are modern medical linacs with MLCs, EPIDs, and kV-based on-board imaging. The dataset used in our study was recorded from the nine linacs between April 1998 and December 2019 and consists of 4323 entries in total.

TABLE 1. Linac specifications and service dates

Treatment room | Manufacturer | Model | Starting service date | End service date
A | Elekta | SL75 | April 1998 | October 2008
A | Varian | Clinac iX | July 2009 | September 2021
B | Elekta | SL75 | April 1998 | December 2009
B | Varian | Clinac iX | September 2010 | January 2021
C | Elekta | SL 20 | July 1998 | July 2010
C | Varian | TrueBeam | March 2011 | Present
D | Elekta | SL 20 | July 1998 | February 2011
D | Varian | TrueBeam | August 2011 | Present
E | Varian | Clinac iX | November 2009 | Present

The maintenance log is a collection of narrative maintenance records of linac repair and service work completed by maintenance personnel. In our study, there are two main parts in the logs, namely "Comments" and "Repair Description." "Comments" briefly describes the linac status and the breakdown that occurred on the linac. "Repair Description" records the maintenance procedure, the repair action, and the related broken linac components. Some metadata were also recorded, such as the date of the maintenance service. Table 2 shows two maintenance log entries taken from the original dataset. Apart from "Comments," "Repair Description," and "Date," the keyword "TaskKey" indicates the type of maintenance service.

TABLE 2. Example entries of linac maintenance logs

TaskKey: Corrective
Comments: Touchguard interlock could not be removed with iView detector in place.
Repair Description: Adjusted touchguard microswitches at the locking pin end for proper contact. Adjusted the alignment nuts on both sides of detector for easier locking pin insertion.
Date: April 16, 1998

TaskKey: Major repairs
Comments: CARR/FOIL W29 cable repaired.
Repair Description: Error 7F when calibrating carrousel, was also getting error 70 when exiting calibration, pointing to switch S16. Cleaned all five switches and reseated connectors J82 and J83. Adjusted the Carr pot voltage to 5.05 V from 4.74 V. Replaced and adjusted S16 on the carrousel switch assy, no change. We could reproduce the fault by moving the gantry from zero degrees to 350°. Lubed Carr chain with TriFlow. Replaced the PWM pcb A4, no change.
Date: January 29, 2020

As shown in Table 2, the narratives in the logs ("Comments" and "Repair Description") contain a wealth of information describing the health condition of the linacs and the repair actions. However, to identify frequent failure modes by examining the logs, it would clearly be time-consuming for humans to extract the key information from the lengthy sentences, especially over the whole dataset. In such a case, NLP techniques can extract the key information, in our case the failure modes and related linac components, automatically and quickly from the logs. Furthermore, a temporal analysis method using the metadata "Date" was applied to identify the trends of specific failure modes over time. As mentioned previously, linacs from a different manufacturer were installed and added around 2010; the temporal analysis was therefore also used to examine whether there is a difference between the different linac models.
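To make the data preparation concrete, the following is a minimal sketch of how each log entry could be turned into a single text document for the analysis described below; the file name and exact column labels are assumptions for illustration and are not part of the original study.

```python
# Minimal sketch: combine the two narrative fields of each log entry into one
# document and keep the service year for the later temporal analysis.
# File name and column labels are assumed for illustration.
import pandas as pd

logs = pd.read_csv("linac_maintenance_logs.csv")   # hypothetical export of the log database
logs["document"] = (
    logs["Comments"].fillna("") + " " + logs["Repair Description"].fillna("")
)
logs["year"] = pd.to_datetime(logs["Date"], errors="coerce").dt.year

documents = logs["document"].tolist()               # one combined narrative per maintenance entry
```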

2.2 Topic modeling of linac maintenance logs

2.2.1 Latent Dirichlet allocation model

Building on the document representation method latent semantic indexing (LSI), Blei et al.7 proposed the latent Dirichlet allocation (LDA) algorithm and formulated a general technique named probabilistic TM. TM is an unsupervised machine learning algorithm: it does not require a labeled dataset but constructs a model solely from the distribution of words in documents. TM is capable of extracting core information by distilling topics from messy documents.

LDA is the most commonly used algorithm for performing TM on a collection of documents. LDA constructs a three-layer architecture between documents, topics, and words through independent multinomial distributions. Each document is represented by several latent topics, and each topic is governed by a multinomial distribution over words. In our application scenario, the "Comments" and "Repair Description" of each maintenance log entry were combined into one document, and all documents in the maintenance log dataset comprise the corpus. LDA summarizes the documents into several topics by searching for similar "bags" of words co-occurring across the documents. The most frequent words in each topic describe the core information of the topic; in our case, the top-ranked words often point to some kind of failure mode. Thus, TM can help us find the most frequent failure modes by searching for keywords in the dataset. It should be noted that a single document is often represented by several topics, which is reasonable because one maintenance service usually handles multiple failures.

To further explain the mechanism of the LDA model, some notation and assumptions are introduced here. A document, denoted $d$, consists of $N$ words $\mathbf{w} = (w_1, \ldots, w_N)$, where $w_n$ is the $n$th word. A topic, denoted $z$, is a bag of semantically related words and can be expressed as a distribution over words. The topics are indexed by $k$, which runs from 1 to $K$. The process of building an LDA model is displayed as a "plate" graphical model in Figure 1.7 In the graphical model, nodes represent variables and arrows represent variable dependencies. Plates denote repeated sampling, and the number of samples is indicated at the bottom right corner of each plate. The LDA graphical model can be interpreted algorithmically by the following steps8:

Set the prior parameters $\alpha$ and $\beta$.

For each document $d$, choose a topic mixture $\theta_d \sim \mathrm{Dir}(\alpha)$.

For each word position $n$, the topic to which the word belongs, denoted $z_n$, is drawn from $\mathrm{Multinomial}(\theta_d)$.

The word $w_n$ itself is a variable drawn from another distribution, $p(w_n \mid z_n, \beta)$.

Here, $\alpha$ and $\beta$ are two independent symmetric Dirichlet priors; $\theta$ and $\varphi$ are the document–topic distribution and the topic–word distribution drawn from $\mathrm{Dir}(\alpha)$ and $\mathrm{Dir}(\beta)$, respectively.

FIGURE 1. Graphical representation of the latent Dirichlet allocation model
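As an illustration only, the short simulation below samples a toy corpus according to the generative steps listed above; the vocabulary, topic number, and hyperparameter values are arbitrary and are not those of the maintenance-log model.

```python
# Toy simulation of the LDA generative process described above.
# All values (vocabulary, K, alpha, beta, document length) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["mlc", "leaf", "motor", "fuse", "power", "supply", "replaced", "checked"]
V, K, n_docs, doc_len = len(vocab), 2, 3, 8
alpha, beta = 0.5, 0.1                              # symmetric Dirichlet priors

phi = rng.dirichlet([beta] * V, size=K)             # topic-word distributions (one row per topic)
for d in range(n_docs):
    theta = rng.dirichlet([alpha] * K)              # document-topic distribution theta_d ~ Dir(alpha)
    doc = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)                  # topic assignment z_n ~ Multinomial(theta_d)
        doc.append(vocab[rng.choice(V, p=phi[z])])  # word w_n drawn from topic z_n
    print(f"doc {d}: theta={np.round(theta, 2)}, words={doc}")
```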

The joint distribution of $\theta$, $\mathbf{z}$, and $\mathbf{w}$ is given by

$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$ (1)

By applying Bayes' theorem, the topic mixture $\theta$ can be obtained by computing the posterior distribution:

$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \dfrac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$ (2)

2.2.2 Metrics for topic modeling parameter fine-tuning

The number of topics $K$ is the most important parameter that needs to be determined when training an LDA model. A $K$ that is too small will concentrate too much information in a single topic, making it difficult to identify the specific failure mode and map it to the responsible components. Likewise, a $K$ that is too large will scatter the information, leading to some meaningless topics. However, because TM is an unsupervised method, there is no ground truth to provide a reference for selecting the optimal number of topics. Thus, several metrics have been proposed to address this issue by evaluating the topic model.9-13

Evaluation metrics can provide a good reference for finding the optimal $K$. Kuhn3 used a pair of trade-off metrics, namely coherence and exclusivity, and selected the outlier as the optimal $K$. Wang et al.14 selected the Jensen–Shannon divergence and perplexity to find the optimal $K$. Another, more direct, way is to check the result of the models and judge whether it is reasonable.15 Tanguy et al.16 chose 50 as the optimal $K$ based on subject matter expert review in an application analyzing aviation safety reports.

The divergence metric proposed by Arun et al. aims to find the optimal $K$ by computing the Kullback–Leibler (KL) divergence between the distribution of singular values of the topic–word matrix and the distribution of the document–topic matrix, and is defined as11:

$\mathrm{Divergence}(M_1, M_2) = \mathrm{KL}(C_{M_1} \parallel C_{M_2}) + \mathrm{KL}(C_{M_2} \parallel C_{M_1})$ (3)

where $C_{M_1}$ is the distribution of singular values of the topic–word matrix $M_1$, and $C_{M_2}$ is the distribution of the normalized document–topic matrix $M_2$. Several experiments showed that the optimal $K$ can be reached by minimizing this metric. Another commonly used metric is perplexity, which indicates the uncertainty of a model in predicting the topics of held-out data. Mathematically, the perplexity is "equivalent to the inverse of the geometric mean per-word likelihood":

$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left(-\dfrac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)$ (4)

where a model with lower perplexity tends to give more reasonable predictions. A trial-and-error process may be needed if the perplexity and the divergence become a pair of trade-off metrics for a specific problem. The above two metrics are statistical evaluation methods. However, the perplexity has been shown to be insufficient to determine the optimal $K$ in some real applications. Thus, another metric called coherence was proposed to evaluate the interpretability of the model.17 Coherence rests on the assumption that a topic is easier to interpret when its top-ranked words co-occur more frequently in the documents of the corpus. For example, a topic with the top words "MLC" and "leaf" is easy to interpret because these two words co-occur many times in different log entries. The coherence is defined as follows17:

$C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \dfrac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}$ (5)

where $V^{(t)} = (v_1^{(t)}, \ldots, v_M^{(t)})$ is the list of the $M$ top words in topic $t$, $D(v)$ is the number of service records that contain the word $v$, and $D(v, v')$ is the number of service records in which both words $v$ and $v'$ occur at least once. Generally speaking, the coherence decreases as the number of topics in a model increases, and a model with a higher coherence score is more interpretable.

In this article, perplexity, divergence, and coherence were used to narrow the range of the optimal $K$. The final optimal number of topics was selected by three subject matter experts by checking the interpretability of the models.
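As a sketch of how such a scan could be carried out with the gensim library (assuming a corpus and dictionary prepared as in the pipeline below, and UMass coherence as the coherence variant), one could loop over candidate values of $K$ and record perplexity and coherence; the Arun divergence is not built into gensim and would need a custom implementation.

```python
# Sketch: evaluate LDA models over a range of topic numbers K.
# `corpus` and `dictionary` are assumed to come from the preprocessing pipeline.
from gensim.models import CoherenceModel, LdaModel

scores = {}
for k in range(5, 75, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=0)
    bound = lda.log_perplexity(corpus)          # per-word likelihood bound (lower perplexity is better)
    umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()
    scores[k] = (bound, umass)
    print(f"K={k}: log-perplexity bound={bound:.3f}, UMass coherence={umass:.3f}")
```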

2.3 Temporal analysis of linac failure modes

The most frequent failure modes of linacs may change as the service time increases. To examine whether specific subsystems of the linacs degrade, the occurrences of the different failure modes in each year should be analyzed. Because linacs from a different manufacturer were introduced around 2010, the analysis over time can also be used to compare the performance of linacs from different manufacturers. The temporal analysis method was used to investigate the trends of the different failure modes over the years.18 As the metadata "Date" is available, each entry in the maintenance log was assigned to a specific date, and each entry was modeled as a topic cluster of failure modes through the LDA model. Thus, the proportion of any failure mode over the years can be obtained by calculating the probability $p(z \mid y)$ of topic $z$ in a given year $y$9:

$p(z \mid y) = \dfrac{\sum_{d:\, t_d = y} \theta_d(z)}{\sum_{d:\, t_d = y} \sum_{z'} \theta_d(z')}$ (6)

where $\theta_d$ is the document–topic distribution and $t_d$ is the year in which document $d$ was recorded.

NLP can also analyze phrases or combinations of words by building a language model using N-grams. An N-gram is a sequence of N words, and an N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. The most commonly used are bigram and trigram models. In our work, we built two topic models using bigrams and trigrams, respectively.19 In our experiments, the bigram-based model outperformed the trigram-based model, so the results in this article are based on the bigram model. It is worth noting that the maintenance logs are noisy and often include meaningless or incorrect information such as spelling mistakes. Several text pre-processing techniques were therefore employed before building the LDA model, including the following (a sketch of the resulting pipeline is given after the list):

Tokenization: Tokens are the basic elements in a topic model. This process breaks each sentence into individual tokens for subsequent processing and analysis.

Word cleaning: Remove punctuation characters, numbers, and stop words (such as prepositions) that occur frequently in most topics while contributing little to the topic building. In addition, any word that occurred fewer than three times in the whole dataset was removed.

Lemmatization: Lemmatization returns the base or dictionary form of words so that inflected variants can be treated as a single element in TM. A more aggressive technique, stemming, was tested and found ineffective because it may merge distinct tokens into one and convert linac-specific jargon into other words.

Lowercase conversion: Convert words to lowercase.
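The following is a minimal sketch of the pre-processing and modeling pipeline described above, using nltk and gensim as stand-in libraries; the thresholds, bigram settings, and variable names (documents and logs from the earlier loading sketch) are illustrative assumptions rather than the exact configuration used in the study. The last few lines compute the per-year topic proportions in the spirit of Equation (6).

```python
# Sketch of the pre-processing, bigram, LDA, and per-year proportion steps.
# Assumes `documents` and `logs["year"]` from the earlier loading sketch;
# nltk resources ("stopwords", "wordnet") must be downloaded beforehand.
import re
from collections import defaultdict

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.phrases import Phraser, Phrases

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize, lowercase, drop punctuation/numbers and stop words, lemmatize.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stops]

texts = [preprocess(doc) for doc in documents]

# Join frequently co-occurring word pairs into bigram tokens, e.g. "mlc_leaf".
bigram = Phraser(Phrases(texts, min_count=5, threshold=10.0))
texts = [bigram[t] for t in texts]

dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=3, no_above=1.0)   # drop words seen fewer than 3 times
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=28,
               passes=10, random_state=0)

# Per-year topic proportions: average the document-topic distributions of all
# entries recorded in a given year (cf. Equation (6)).
year_sum = defaultdict(lambda: np.zeros(lda.num_topics))
year_count = defaultdict(int)
for bow, year in zip(corpus, logs["year"]):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        year_sum[year][topic_id] += prob
    year_count[year] += 1
year_topic_proportion = {y: year_sum[y] / year_count[y] for y in year_sum}
```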

3 EXPERIMENTS AND RESULTS

There are three parts in this section. First, the process of selecting the optimal $K$ using the metrics described in Section 2 is demonstrated. Then, the method of interpreting the topic contents produced by the LDA model is described and the most frequent failure modes of linacs are summarized. Finally, the temporal analysis method is used to identify the trends of some failure modes.

3.1 Experiments on LDA model of maintenance logs

3.1.1 Topic number selection

The number of topics $K$ is the most significant parameter in building a good LDA model. To find the optimal $K$, the three metrics described in Section 2 were used to evaluate LDA models with different values of $K$. Figure 2 illustrates the trends of the divergence metric, perplexity, and coherence as the number of topics increases from 5 to 70.

FIGURE 2. Metrics used to select the optimal number of topics

The divergence reaches its lower level, indicating a possibly good LDA model, when the number of topics is set between 25 and 50. However, the perplexity increases once the number of topics exceeds 25; since a model with smaller perplexity has better predictive power for new data, these two metrics become a pair of trade-off metrics for this problem. Based on these two indicators, the range of the optimal $K$ was initially narrowed to 25–35. The third metric, coherence, was then used to determine the final value within this interval. From the figure, the coherence of the model with 28 topics stands out compared with the others in the range of 25–35. With the help of subject matter expert interpretation of the models with $K$ in this range, 28 was selected as the optimal $K$, and the following analysis is based on the model with 28 topics.

3.1.2 Identified topics and latent failure modes

From the output of the well-built LDA model, 28 topics were clustered with specific words selected from the linac maintenance log dataset. However, explicit concepts and topic meanings are not generated automatically.3 A post-analysis is required to identify the core information of each topic based on its top-ranked words. With this topic interpretation procedure, the specific failure modes and the related components or subsystems of the linac can be identified and summarized.

The most straightforward approach to determining what a topic represents is to rank words by their frequency within the topic and look for common narratives among those words, from which the underlying failure mode and subsystem can be found. However, words that have a high overall frequency across the corpus then show up in many topics, and they can mask words that have a lower probability yet contribute more to interpreting a topic. Therefore, another statistical metric called lift was introduced to rank the top words within topics. The lift is defined as "the probability of word occurrence conditional on the topic divided by the probability of word occurrence across the corpus." This metric highlights words that have a high probability locally within a topic rather than across the corpus.

To use the frequency and the lift metric more flexibly, the visualization and analysis package pyLDAvis20 was used to sort the top-ranked words. It introduces a parameter $\lambda$ to adjust the relative weight of the frequency and lift metrics for a given word. The "relevance" of a word $w$ to a topic $t$ is defined as:

$r(w, t \mid \lambda) = \lambda \log \varphi_{tw} + (1 - \lambda) \log \dfrac{\varphi_{tw}}{p_w}$ (7)

where $\varphi_{tw}$ denotes the probability of word $w$ for topic $t$ and $p_w$ denotes the marginal probability of word $w$ in the corpus. Given $\lambda = 1$, the relevance is identical to ranking the words in decreasing order of probability within a topic, and the relevance equals the lift when $\lambda = 0$.
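A short sketch of how Equation (7) could be evaluated directly from a fitted gensim model is given below; the helper function and the choice of topic index are hypothetical, and in practice pyLDAvis computes and visualizes the same quantity interactively.

```python
# Sketch: rank the words of a topic by the relevance of Equation (7),
# r(w, t | lambda) = lambda*log p(w|t) + (1 - lambda)*log(p(w|t)/p(w)).
# Assumes `lda`, `dictionary`, and `corpus` from the earlier pipeline sketch.
import numpy as np

phi = lda.get_topics()                         # topic-word probabilities, shape (K, V)
word_counts = np.zeros(phi.shape[1])
for bow in corpus:
    for word_id, count in bow:
        word_counts[word_id] += count
p_w = word_counts / word_counts.sum()          # marginal word probabilities in the corpus

def top_relevant_words(topic_id, lam, n=10):
    # Hypothetical helper: return the n words with the highest relevance.
    relevance = lam * np.log(phi[topic_id]) + (1 - lam) * np.log(phi[topic_id] / p_w)
    return [dictionary[i] for i in np.argsort(relevance)[::-1][:n]]

print(top_relevant_words(topic_id=3, lam=0.2))  # lambda = 0.2: weighted toward lift (topic-specific words)
print(top_relevant_words(topic_id=3, lam=0.8))  # lambda = 0.8: weighted toward overall frequency
```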

Take topic 3 as an example to demonstrate the process of determining the meaning of a specific topic. When $\lambda = 0.2$, the ranking is weighted toward lift, that is, toward words specific to the topic. As shown in Figure 3, the top four words are "mlc, leaf, motor, stuck," which clearly point to a failure of the MLC, with the failure mode "mlc leaf stuck." When $\lambda = 0.8$, the ranking is weighted more toward the overall word frequency across the corpus; from Figure 3, the weight of the word "replaced" increases, indicating the repair action "mlc leaf motor replaced." Therefore, the failure modes and the corresponding repair actions can be found by setting $\lambda$ to 0.2 and 0.8, respectively. It should be noted that the words in a topic are not all related to one failure mode or one component; what we seek are the dominant words in a topic, which are then related to specific failure modes.

FIGURE 3. Top-ranked words within topic 3 and the corresponding failure mode

All 28 topics were examined by the same process to identify the dominant failure modes in each topic. However, not all topics are related to specific failure modes; some topics are generated from words describing general maintenance work. For example, the top-ranked words in topic 6 are "pm day routine carried maintenance," which point to routine maintenance records. Under this topic, the co-occurring verbs and nouns are "cleaned, checked" and "rim, iview, processor," respectively, indicating that the main work in routine maintenance concerns cleaning and lubrication. We then categorized the topics with identified failure modes into related subsystems according to Wroe's paper.1 Furthermore, the LDA model also gives the overall proportion of words contained in each topic across the corpus. Given the assumption that topics are dominated by a few top-ranked words, the proportion of a topic reflects the frequency of the failure modes it captures. Table 3 displays the topics with their top-ranked words and identified failure modes. It is worth noting that topic 14 appears under two subsystems, as it contains two different failure modes; a possible explanation is that these two failure modes co-occurred many times in service. The identified topics and related failure modes are summarized below.

TABLE 3. Keywords within topics and identified failure modes of linacs

Subsystem | Topic (word frequency) | Keywords within topic | Identified failure modes
 | Topic 2 (6.5%) | Fuse supply power tube generator relay rectifier blown outage black bridge magnetron replaced checked | Fuse blown; power outage; bridge rectifier, magnetron, modulator failure
Electrical | Topic 3 (6.4%) | MLC leaf motor stuck moving emitter reflector replaced pushing initialized cleaned | MLC motor failure
 | Topic 21 (2.2%) | Reflector reference line detector lost locked verified adjusted calibration reset | MLC leaf lost reflector; reflector out of calibration
Control | Topic 4 (5.9%) | Console physics keyswitch sound timing frame board replaced check | Keyswitch replaced; buzzing sound from board
 | Topic 8 (4.4%) | PCB controller carriage program fitting driver change clear tightened | Replaced controller PCB; tightened up connection
