The Graphical User Interface (GUI) of the tested software allows users to view medical data, images, and other relevant information. The designed protocol admits only two modalities for mirroring the screen during the test: cabled (via a High-Definition Multimedia Interface - HDMI connection) or wireless. These modalities are available on all devices on which the tested SaMD can be installed, regardless of type: workstations, laptops, and tablets. They are mandatory and cannot be changed, in order to reduce variability and ensure the reproducibility of the testing protocol. Therefore, the machine on which the SaMD is installed must have an Operating System (OS) that supports display mirroring and at least one HDMI port or a stable Wi-Fi connection.
According to IEC 62366-1:2015 [2], the goal of usability testing is risk reduction. However, for the scope of this work, the fastest access to all basic functionalities and overall user satisfaction were identified as the main goals beyond risk evaluation. Moreover, the usability test may also reveal useful functionalities missing from the current release of the software [20] that were not detected with analytical approaches.
2.1 The multimodal biomedical imaging platform
The platform under testing is intended for pre-, post-, and intra-operative use in the human and veterinary fields. Although the imaging platform supports the visualization of any DICOM (Digital Imaging and COmmunications in Medicine) image, it was mainly developed to satisfy the needs of orthopaedics and is therefore focused on Computed Tomography (CT), fluoroscopy, and radiography. However, at the time of preparing the first iteration of the usability testing procedure, the CT modality was the one with the highest level of readiness, leading us to limit the tasks in the administered tests, as described in this work, to CT acquisitions only. The proposed software provides a three-dimensional (3D) representation of the CT volumetric data, including Multiplanar Reformation (MPR) and 3D volumetric reconstruction views (see Fig. 1). The basic controls (zoom, pan, rotation, and scroll) enable series navigation. The MPR functionalities include image enhancement, colour inversion, slab thickness control, annotation and measurement, histogram window control, and intensity presets. The 3D functionalities cover different volume crop and rendering modes, auto-play, transfer function control, and preset saving. Moreover, advanced functionalities such as compare, surgical planning, and multislice viewing are supported.
Fig. 1 Multimodal Biomedical Imaging Platform All-in-One
The set of proposed functionalities is based on interviews with a large group of orthopaedic surgeons, radiologists, and veterinarians. Additionally, the imaging software systems already on the market were analysed. The GUI was developed according to Gestalt visual design principles. Moreover, Fitts's law, which predicts the time taken to acquire a target, and the Hick-Hyman law, which predicts human choice-reaction time, were taken into account.
After completing the first proposal of the GUI design, the heuristic evaluation method proposed by Zhang was applied. The results of the evaluation were analysed, and the requested modifications and fixes were implemented. The heuristic evaluation process was repeated until only minor issues remained. At that point, development was suspended and the current version of the software was used for usability testing.
2.2 Participants
The complexity of the software and the environment in which it is meant to be used determine the optimal number of participants. According to research [21,22,23], between three and twenty people can yield trustworthy results, with five to ten being a good start. Generally speaking, more testers are needed for more difficult, risky initiatives, while fewer participants are needed to test more creative ideas. Following these considerations, twelve volunteers were recruited to compose the testing population.
2.3 Environment
According to the above-mentioned requirements, only two modalities are allowed for mirroring the screen during the administration of the usability test: HDMI cable and wireless. In both modalities, the testing occurred in person.
2.3.1 HDMI testing
The environment for the HDMI test consisted of two adjacent rooms: one designated as the test room and the other as the observation room (see Fig. 2).
Fig. 2 The chosen rooms were adjacent to allow an HDMI cable to be passed through. This enabled the duplication of the test machine screen on the observer's monitor (see Section 2.4). As a precaution, the test machine was not connected to the Internet, while superfluous operating system processes were suspended to avoid compromising software performance (see Appendix A)
2.3.2 Wireless testing
For wireless testing, no external cable was needed for mirroring: a Google Meet session was set up on the same machine on which the tested software was installed, and the screen was shared. In this scenario, the superfluous operating system processes were not suspended, in order to evaluate their impact on the performance of the SaMD during the test.
2.4 Test conductors
The testing procedure requires a minimum of two people to conduct the test:
Moderator: in charge of managing the progress of the test; responsible not only for administering the tasks, but also for observing the user's facial expressions, resolving any problems, and answering any questions arising during the session;
Observer: responsible for recording the user's performance of the tasks, tracking the time taken to perform each task, and noting any issues and user difficulties. In specific cases, when reporting the task performance while taking notes may prove complicated, the presence of more than one Observer can be useful.
Optionally, a third person (the Recorder) can be involved. The Recorder observes and analyses the footage from the camera that frames the user and the entire scene during the test. In the absence of this third person, the footage recorded with the external camera can still be analysed after the conclusion of the test.
2.5 Equipment
A list of tools needed to carry out a usability test in both HDMI and wireless modalities, for both touchscreen and mouse/keyboard configurations, is provided in Appendix A. The settings for the different modalities/configurations are quite similar and are described in Tables A.1 and A.2.
A custom Stopwatch software was developed to ease the Observer's tasks. The software is written in C/C++ and its source code has been made publicly available at https://github.com/eletiri93/Stopwatch. It enables the Observer to record the amount of time spent on each task during the usability test while also noting any noteworthy user behaviour: the GUI shows the current task, a stopwatch, and a space for taking notes. On the right side of the screen, a table summarises the recorded times and notes. After the completion of each task, the Observer is able to export a CSV (Comma Separated Values) file containing the recorded times and notes.
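For illustration only, the following minimal command-line sketch reproduces the core record-and-export logic of the Stopwatch tool in Python; the actual application is a C/C++ GUI available at the repository above, and the task names, field names, and interaction flow below are hypothetical. Timing starts when the moderator finishes reading the task and stops when the user declares completion, matching the efficiency measure defined in Section 2.7.

```python
# Minimal sketch of the Stopwatch record-and-export logic (illustration only;
# the actual tool is a C/C++ GUI application, see the repository above).
import csv
import time

def run_session(tasks, out_path="session_times.csv"):
    records = []
    for task in tasks:
        # Timing starts only after the moderator has finished reading the task
        input(f"Task: {task}\nPress Enter when the moderator finishes reading... ")
        start = time.monotonic()
        # The Observer stops the clock when the user declares the task complete,
        # optionally typing a short note about any difficulty observed
        note = input("Type an optional note and press Enter at task completion: ")
        elapsed = time.monotonic() - start
        records.append({"task": task, "seconds": round(elapsed, 2), "note": note})
    # Export the recorded times and notes to a CSV file
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["task", "seconds", "note"])
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    run_session(["Open a CT series", "Adjust the histogram window"])
```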
2.6 Exploratory tasks and specific scenarios
Based on the experience acquired during the heuristic evaluation, a list of 55 exploratory tasks (see Appendix F) was produced. The objective was to ensure that participants did not become overly fatigued while covering all of the most crucial capabilities within realistic time constraints. The tasks had to be clear, short, and as independent as possible (e.g., the failure of one task should not compromise the success of the following tasks). Moreover, four specific scenarios were developed with the help of an external radiologist consultant (see Appendix G). The exploratory tasks and specific scenarios can be modified to tailor the testing protocol to the specific application and the provided functionalities of the tested software.
2.7 Test evaluation
Tests were evaluated in terms of effectiveness, efficiency, and user satisfaction. The first two parameters were evaluated with the aid of the purposely developed Stopwatch software, the notes taken by the Observer, and the camera footage. User satisfaction was evaluated using a post-test questionnaire administered at the end of each test session.
More specifically, effectiveness evaluates the participant's capacity to complete each proposed task, regardless of the amount of time required. It is evaluated through the following scoring system (a minimal aggregation sketch follows the list):
Score: 1. Failure. The user fails to complete the task, despite some suggestions
Score: 2. Partial success. The task is partially completed or completed after suggestions
Score: 3. Complete success. The task is completed without any difficulties or suggestions
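As an illustration, a possible aggregation of these scores across participants is sketched below; the task names and score values are made-up placeholders, not measured results.

```python
# Illustrative aggregation of effectiveness scores: one list of 1/2/3 scores
# per task across the twelve participants (placeholder values, not real data).
from statistics import mean

scores = {
    "Open a CT series": [3, 3, 2, 3, 3, 3, 1, 3, 2, 3, 3, 3],
    "Measure a distance on an MPR view": [2, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3],
}

for task, s in scores.items():
    # Share of participants achieving a complete success (score 3)
    success_rate = sum(1 for v in s if v == 3) / len(s)
    print(f"{task}: mean score {mean(s):.2f}, complete-success rate {success_rate:.0%}")
```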
On the other hand, efficiency measures how fast each participant completes each task and is evaluated by timing the performance of each task. The above-mentioned Stopwatch software was used to record the timestamps. The recorded value corresponds to the time interval between the end of the reading of the task by the Moderator and the moment when the user declares the task complete.
Finally, the scores given to each of the statements proposed in the post-test questionnaire were analysed to assess user satisfaction. A five-point agreement scale was used, where 5 represented the "strongly agree" and 1 the "strongly disagree" option. All statements were designed with consistent (positive) phrasing (e.g., "The software is intuitive to use", "I had no problem using the basic features").
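A brief sketch of how such ordinal questionnaire responses could be summarised (as median and IQR, in line with Section 2.8) is given below; the statements are taken from the examples above, while the response values are hypothetical placeholders.

```python
# Sketch of post-test questionnaire aggregation on the 1-5 agreement scale
# (hypothetical responses; ordinal data summarised as median and IQR).
import numpy as np

responses = {
    "The software is intuitive to use": [5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 3, 4],
    "I had no problem using the basic features": [4, 5, 4, 4, 4, 5, 5, 3, 4, 4, 5, 4],
}

for statement, r in responses.items():
    q1, med, q3 = np.percentile(r, [25, 50, 75])
    print(f"{statement!r}: median {med:.1f}, IQR [{q1:.1f}, {q3:.1f}]")
```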
For comparison purposes, it is important to evaluate only the functionalities common to all the compared SaMD. This enables a direct comparison of effectiveness and efficiency, task by task and scenario by scenario. It is also crucial for the user satisfaction evaluation: if the tasks and scenarios differ between the assessed SaMD, user satisfaction may be higher for the simpler test suite and lower for the one testing specific and innovative features. However, the methods for comparing similar SaMD using the proposed testing protocol are beyond the scope of this paper.
2.8 Statistical analysis
The distributions of the efficiency, effectiveness, and user satisfaction variables were tested using the Shapiro-Wilk test. The variables were statistically described as mean ± standard deviation (SD) for normally distributed quantitative data; median and interquartile range (IQR) for non-normally distributed data; and frequency count and percentage for qualitative data. The difference between the groups of participants and between the hardware modalities was evaluated using the independent t-test for normally distributed quantitative variables and the Mann-Whitney test for non-normally distributed quantitative variables. The relationship between efficiency, effectiveness, and user satisfaction was tested with Spearman's rank correlation test. When a significant difference was detected, Cohen's d (for normally distributed data) or Cliff's delta (for non-normally distributed data) was calculated as a measure of the effect size. The significance level for all tests was set to 0.05 (\(p < 0.05\)).
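The decision logic described above can be sketched as follows. This is a minimal illustration using SciPy; the group labels and all numeric values are hypothetical placeholders, not the study's data.

```python
# Sketch of the analysis pipeline: Shapiro-Wilk normality check, then either
# independent t-test with Cohen's d or Mann-Whitney U with Cliff's delta.
# All numeric values below are hypothetical placeholders.
import numpy as np
from scipy import stats

ALPHA = 0.05

def compare_groups(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    normal = (stats.shapiro(a).pvalue > ALPHA) and (stats.shapiro(b).pvalue > ALPHA)
    if normal:
        p = stats.ttest_ind(a, b).pvalue
        # Cohen's d with pooled standard deviation
        pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                         / (len(a) + len(b) - 2))
        effect, name = (a.mean() - b.mean()) / pooled, "Cohen's d"
    else:
        p = stats.mannwhitneyu(a, b).pvalue
        # Cliff's delta: P(a > b) - P(a < b) over all sample pairs
        diff = a[:, None] - b[None, :]
        effect, name = ((diff > 0).sum() - (diff < 0).sum()) / diff.size, "Cliff's delta"
    if p < ALPHA:
        print(f"p = {p:.3f} (significant); {name} = {effect:.2f}")
    else:
        print(f"p = {p:.3f} (not significant)")

# Example: per-task completion times (s) for HDMI vs wireless sessions
compare_groups([42, 37, 55, 48, 39, 61], [50, 58, 47, 66, 52, 71])

# Relationship between metrics, e.g., effectiveness score vs completion time
rho, p = stats.spearmanr([3, 2, 3, 3, 1, 2], [40, 75, 38, 45, 120, 80])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```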
2.9 Pilot testing
Three weeks before the test administrations, a pilot test in touchscreen mode was carried out to validate the proposed test method, as well as the above-described environment, equipment, and tasks. Four people were involved in the pilot test administration: a moderator, two observers, and a user. The observers prepared all the necessary environment and equipment, while the moderator made sure that all the procedures were followed correctly. The pilot test proved very useful, as it uncovered some task-related issues, such as the duration of some tasks and the lexicon used.
2.10 Testing procedure
After the administration of the pilot test and its analysis, the final testing procedure was set up. Before the participant arrived, the moderator and the observers verified the instrumentation, making sure that nothing was missing or malfunctioning, by using a dedicated checklist (see Appendix B). A single testing session performed by one user took about 1.5 h and consisted of the following steps:
1. Introduction to the test, including the software description and the desirable goals (about 5 mins). See Appendix C
2. Signing of the recording agreement (about 2 mins). See Appendix E
3. Compilation of the pre-test questionnaire (about 5 mins). See Appendix D
4. Execution of exploratory tasks (45-50 mins). See Appendix F
5. Specific scenarios (10-15 mins). See Appendix G
6. Compilation of the post-test questionnaire, including user's feedback (10-15 mins). See Appendix H
Figure 3 illustrates the entire workflow of the testing protocol as described in this section.
Fig. 3 The flow chart of the proposed testing protocol illustrates the sequence of steps to perform before testing (I), during testing (II), and after testing (III). Dashed-border boxes represent the steps that could be repeated more than once