Text mining-based measurement of precision of polysomnographic reports as basis for intervention

Polysomnography reports

In a retrospective quality survey, 243 PSG medical reports were retrieved from the Sleep Center of the Cantonal Hospital St. Gallen. These reports were taken from consecutive patients with suspected SA referred for a whole-night PSG. All patients had been included in a prior study investigating the clinical validity of a novel wearable electrocardiogram (ECG) device [16–18]. The study was performed in accordance with the Declaration of Helsinki, following the principles of Good Clinical Practice. The study was approved by the local institutional review board (EKSG 15/140) and patients gave written informed consent to participate. Patient data were analyzed in a fully anonymized manner.

Altogether, the PSG medical reports were assessed by 7 sleep technicians and validated by 9 sleep physicians. Diagnoses included obstructive, central and mixed sleep apnea of varying severity. Data from PSG records are evaluated by sleep technicians based on information presented in the form of tables and graphics. Technicians typically provide a provisional interpretation of the sleep record, highlighting its main features and characteristics. This initial interpretation is then validated by a pulmonologist, who adapts and corrects the report if necessary. A snapshot of an example PSG report is provided in Additional file 4 (Snapshot of a PSG medical report); the narrative interpretation is highlighted in the bottom inset.

Text block standardization

A standardization of the PSG reports was implemented using predefined blocks of text that assess sleep features sequentially and systematically. The resulting standardized approach – hereafter called text block standardization – increases the uniformity of the diagnostic information contained in these reports. This standardization automates the generation of PSG reports with a systematic sequential description of the following items: sleep latency (normal, shortened, lengthened), sleep efficiency (normal, reduced), sleep architecture (fragmented, shortened, with lack of rapid eye movement [REM] phase), sleep stages and the position in which the patient slept (lateral position, on the back, on the abdomen). The report then states whether the patient had obstructive, mixed or central sleep apnea, together with an indication of sleep apnea severity (mild, moderate, severe) and whether the sleep apnea was associated with the patient’s position and/or the REM phase. Furthermore, the following items are highlighted: oxygen saturation, hypoxemia and hypercapnia, presence of snoring, arousal index and presence of periodic movements of the lower limbs. The specialized pulmonologist finally checks (and, if necessary, adapts or corrects) the automatically generated report. For the purpose of the current analysis, one hundred consecutive reports from independent patients were extracted.
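As a purely illustrative sketch, the following R snippet shows how predefined text blocks can be assembled into a narrative section; the feature names and sentence wordings are hypothetical and do not reproduce the actual template used at the Sleep Center.

```r
## Hypothetical, minimal sketch of text block standardization: scored
## sleep features are mapped onto predefined sentence blocks.
features <- list(latency    = "normal",
                 efficiency = "reduced",
                 apnea_type = "obstructive",
                 severity   = "moderate")

blocks <- c(
  sprintf("Sleep latency was %s.", features$latency),
  sprintf("Sleep efficiency was %s.", features$efficiency),
  sprintf("The recording indicates %s sleep apnea of %s severity.",
          features$apnea_type, features$severity)
)

## Concatenate the predefined blocks into the report narrative
cat(paste(blocks, collapse = " "))
```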

Statistical approaches

Text mining approach

The narrative section of the PSG electronic reports was extracted and analyzed using TM. TM summarizes the usage of key terms throughout a corpus of textual documents by generating a term-document matrix. More specifically, TM requires several pre-processing steps of data cleansing [19]. The TM procedure used in the current study follows the guidelines provided in the vignette of the R package tm [20]. The procedure includes the elimination of extra white spaces, stop words (common words in the German language), punctuation, numbers and sparse terms, as well as transformation to lower case. The filtered terms were cross-tabulated in a term-document matrix. The term-document matrix tends to be very large and, as suggested in the introductory guidelines of the R package tm, a step consisting in removing sparse terms occurring in only a few documents can be employed to reduce the matrix without losing significant relations inherent to the matrix.
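A minimal sketch of this pre-processing pipeline, assuming the narrative sections are available as a character vector `report_texts` (a hypothetical name), could look as follows:

```r
library(tm)

## Build a corpus from the extracted narrative sections
corpus <- VCorpus(VectorSource(report_texts))

## Cleansing steps described above
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("german"))
corpus <- tm_map(corpus, stripWhitespace)

## Cross-tabulate the filtered terms in a term-document matrix
tdm <- TermDocumentMatrix(corpus)

## Remove sparse terms; the 0.9 threshold is an illustrative assumption
tdm <- removeSparseTerms(tdm, sparse = 0.9)
```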

(Constrained-)correspondence analysis and variation partitioning

The term-document matrix was analyzed using correspondence analysis (CA), a multivariate dimension reduction method appropriate for the analysis of contingency tables. Theoretical aspects underlying CA can be summarized by defining the following:

\(\mathbf{X}\) the n×m term-document matrix (n documents, m terms)

\(\mathbf{P} = \mathbf{X}/N\) the data matrix divided by its grand total (\(N = \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}\), the sum of all elements in \(\mathbf{X}\))

\(\mathbf{r}\) the n-dimensional vector of row sums of \(\mathbf{P}\) (row weights)

\(\mathbf{c}\) the m-dimensional vector of column sums of \(\mathbf{P}\) (column weights)

\(\mathbf{D}_{r}\) the n×n diagonal matrix of row sums

\(\mathbf{D}_{c}\) the m×m diagonal matrix of column sums

In CA, the main table of interest (the term-document matrix) is converted into a χ2 distance matrix by performing the following transformation:

$$\mathbf{Z} = \mathbf{D}_{r}^{-1/2} \left(\mathbf{P} - \mathbf{r}\mathbf{c}^{\top}\right) \mathbf{D}_{c}^{-1/2}$$

CA consists in the singular value decomposition of Z:

$$\mathbf{Z} = \mathbf{U} \mathbf{\Lambda} \mathbf{V}^{\top}$$

with \(\mathbf{\Lambda}\) the k×k (k = rank(\(\mathbf{Z}\))) diagonal matrix of singular values associated with \(\mathbf{Z}\), with \(\lambda_{1} \ge \cdots \ge \lambda_{k} > 0\), \(\mathbf{U}\) the n×k matrix of left singular vectors and \(\mathbf{V}\) the m×k matrix of right singular vectors. The total inertia of the contingency table is given by the sum of the squared singular values (\(I = \sum_{i=1}^{p} \lambda_{i}^{2}\), with p the smaller dimension of \(\mathbf{X}\)).
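As indicated in the software section below, CA was run with the function dudi.coa of ade4. A minimal sketch, assuming the term-document matrix `tdm` from the pre-processing step above:

```r
library(ade4)

## Documents as rows, terms as columns, matching the definition of X
X <- as.data.frame(t(as.matrix(tdm)))

## Unconstrained CA; nf is the number of axes kept (illustrative choice)
ca <- dudi.coa(X, scannf = FALSE, nf = 2)

## Eigenvalues are the squared singular values; their sum is the total inertia
ca$eig
sum(ca$eig)
```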

The contingency table was partitioned with respect to explanatory variables using variation partitioning techniques [21]. The following four explanatory variables were considered: type of apnea, apnea severity, physician and technician. The partitioning was based on constrained correspondence analysis (CCA), a supervised counterpart of CA (e.g., [22]). In CCA, linear constraints are applied observation-wise: each categorical explanatory variable is used to define row blocks. If we define \(\mathbf{M}\) as the n×g matrix of dummy variables defining g blocks among observations, the observation-wise constraint is given by the projection operator:

$$\mathbf{O}_{r} = \mathbf{M} \left(\mathbf{M}^{\top} \mathbf{D}_{r} \mathbf{M}\right)^{-1} \mathbf{M}^{\top} \mathbf{D}_{r}$$

The projection onto \(\mathbf{O}_{r}\) computes, for each variable, the mean per block of observations (a small numeric check is given after the decomposition below). CCA consists in performing the following singular value decomposition:

$$\mathbf{Z}^{*} = \mathbf{D}_{r}^{-1/2} \mathbf{O}_{r} \left(\mathbf{P} - \mathbf{r}\mathbf{c}^{\top}\right) \mathbf{D}_{c}^{-1/2} = \mathbf{U}^{*} \mathbf{\Lambda}^{*} \mathbf{V}^{*\top}$$

with \(\mathbf{\Lambda}^{*}\) the k∗×k∗ (k∗ = rank(\(\mathbf{Z}^{*}\))) diagonal matrix of singular values associated with \(\mathbf{Z}^{*}\), with \(\lambda_{1}^{*} \ge \cdots \ge \lambda_{k^{*}}^{*} > 0\), \(\mathbf{U}^{*}\) the n×k∗ matrix of left singular vectors and \(\mathbf{V}^{*}\) the m×k∗ matrix of right singular vectors.
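A tiny numeric check (not part of the original analysis) illustrates that the projection onto \(\mathbf{O}_{r}\) indeed returns block means:

```r
## Two blocks of two observations each, with uniform row weights
M  <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1))  # n x g dummy matrix (g = 2)
Dr <- diag(rep(0.25, 4))                   # D_r: uniform row weights

Or <- M %*% solve(t(M) %*% Dr %*% M) %*% t(M) %*% Dr

x <- c(1, 3, 10, 20)   # one variable observed on 4 documents
Or %*% x               # returns 2, 2, 15, 15: the per-block means
```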

The percentage of explained variance associated with a specific explanatory variable is given by the ratio of the total inertia of the constrained over the unconstrained CA. In a first step, the total inertia of CA was partitioned according to each explanatory variable using univariate analyses, and the reported percentage of explained variance corresponds to the unadjusted R-squared, i.e. the fraction of variance explained by each individual explanatory variable independently of the other variables. In a second step, adjusted R-squared values were calculated in which the joint effect among variables was taken into account. For each explanatory variable, the percentage of explained variance and its significance were assessed using permutation tests. The inter-rater variability was defined as the percentage of explained variance associated with both physicians and technicians.
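A hedged sketch of the univariate partitioning step with vegan's cca, assuming the document-by-term data frame `X` from above and a data frame `meta` holding the four explanatory factors under hypothetical column names:

```r
library(vegan)

## Unadjusted R-squared for one explanatory variable (e.g., physician):
## constrained inertia over total inertia
cca_phys <- cca(X ~ physician, data = meta)
cca_phys$CCA$tot.chi / cca_phys$tot.chi

## Permutation test of the constrained axes
anova(cca_phys, permutations = 999)

## Partial (adjusted) effect of physician, conditioning on the other
## explanatory variables
cca_part <- cca(X ~ physician + Condition(technician + apnea_type + severity),
                data = meta)
anova(cca_part, permutations = 999)
```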

Predictive accuracy of the final diagnosis

The predictive value of the text standardization was assessed using a linear support vector machine (SVM) classifier, and the prediction accuracy of the classifier was estimated using repeated 10-fold cross-validation. In 10-fold cross-validation, the original sample is randomly partitioned into 10 equal-size subsamples. Of the 10 subsamples, a single subsample is retained as test data and the remaining 9 subsamples are used as training data. The process is repeated 10 times, each subsample being used exactly once as validation data, so that all observations are used for both training and validation. Furthermore, the cross-validation procedure itself was repeated 3 times. The SVM classifier and its cross-validation were implemented using the function train of the R package caret with the following control parameters: the resampling method was set to “repeatedcv”, the number of folds to 10 and the number of repetitions of the k-fold procedure to 3. The following diagnostic classes were considered: OSAS severe, OSAS mild, OSAS light, central SA, mixed SA, undetected SA. The class distribution and detailed class-wise performance are provided.
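A minimal sketch of the cross-validated SVM with caret, assuming a data frame `svm_data` whose predictor columns are the term frequencies and whose outcome is a factor `diagnosis` (hypothetical names):

```r
library(caret)

## Repeated 10-fold cross-validation, as described above
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(42)  # illustrative seed for reproducible fold assignment
svm_fit <- train(diagnosis ~ ., data = svm_data,
                 method = "svmLinear",  # linear SVM (via kernlab)
                 trControl = ctrl)

## Cross-validated accuracy and resampled class-wise confusion matrix
print(svm_fit)
confusionMatrix(svm_fit)
```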

Statistical software implementations

All analyses were done using the R statistical software (v. 4.0.3), including the following extension packages: tm [23], ade4 [24], vegan [25] and caret [26]. CA was performed using the function dudi.coa implemented in ade4, and CCA using the function cca implemented in vegan. Variation partitioning was performed using the function varipart implemented in ade4. Source code can be provided upon request to the corresponding authors.
