Design of interactive augmented reality functions for robotic surgery and evaluation in dry‐lab lymphadenectomy

2.1 Developed functions

We iteratively designed and implemented four AR functions that can be used in both lab and clinical settings: external camera view, medical records, distance computation and instrument position warning system. We intended to create easy and intuitive interactions that could improve surgical procedures by speeding up processes, increasing effectiveness, helping in decision making and reinforcing safety. Concise descriptions and motivations of each function follow, and Figure 1 shows our functional prototypes.

The external camera view function allows the surgeon to see the surrounding OR environment without removing their head from the robotic console. As shown in Figure 1A, the external camera might point at the patient cart (e.g., to look for collisions between the robotic arms), or it could look elsewhere in the OR.

The medical records function aims to expedite the procedure by providing the surgeon with preoperative data that can affect the surgery, for example CT scan, histology report and consent form. Staying in the console rather than removing one's head reduces sight diversion, a common issue in image-guided surgery.6 To avoid a shift of focus and any size reduction of the intraoperative view,23 we overlay the preoperative data directly in the primary tile rather than on an auxiliary window. The surgeon can define the position and size of the displayed medical records.

The distance computation function allows the surgeon to measure 3D Euclidean distances and sizes intraoperatively, such as during tumour enucleation or tissue replacement. Similar to a previous patent,24 the surgeon uses the tip of the surgical instruments (with closed jaws) to select the points between which the distance should be measured. Once a point is defined, a virtual marker appears and the system computes the distance between two markers. The surgeon can choose between two methods to place the markers:

- Method 1: placing both markers simultaneously. The distance is measured between the two instrument tips, as shown in Figure 1C.
- Method 2: placing the markers sequentially. The surgeon applies one marker at a time with either the left or the right instrument. The distance is calculated after the placement of the second marker.

The instrument position warning system function has the goal of improving safety. Moving a surgical instrument out of the camera's view can be dangerous since unseen tissue could be damaged. A visual indicator appears right before either robotic instrument moves out of the surgeon's field of vision. The warning persists until the missing instrument returns. In contrast with the built-in ‘off-screen indicator’ of the da Vinci Xi,10 our implementation relies on vision rather than robot kinematics for improved accuracy and easier integration into different surgical robots.

FIGURE 1. Our functional prototypes of the four proposed AR functions: (A) External camera view: the surgeon sees the view from an external camera, which here shows the robotic arms and the laparoscopic box trainer. (B) Medical records: preoperative images and other data are shown on top of the camera view. Herein, we provide an X-ray image of the simulated tissue on which the surgeon is practicing. (C) Distance computation: the 3D Euclidean distance is computed between the tips of the robotic instruments (white dots) and displayed at the top of the monitor (here, 4.08 cm). (D) Instrument position warning system: a warning symbol appears at the upper left corner of the camera view to tell the surgeon that the left instrument has moved outside the visual field.

Sample videos of all four functions recorded during the user study can be seen in the Supporting Information S2–S5 of this article. The audio was distorted to protect the identities of the surgeon participants.

2.2 Visual augmented reality

To meet the inter-operability requirement defined by Sielhorst et al.,19 we developed a platform that does not require access to any custom data interfaces of the surgical robot or information on the kinematic operation of its instruments. Instead, our platform uses visual augmented reality and image processing. Visual AR consists of the superimposition of computer-generated visual information on a user's view of the real world, resulting in a composite view. To achieve this superimposition, two-way video communication between the surgical robot and a workstation computer is required: one direction streams images of the surgical field as they are captured by the stereo camera (the user's view of the real world), and the other direction displays the augmented images in the stereo viewer (the composite view). Our requirements for video communication are:

- to have real-time processing for the vision pipeline,
- to be able to read the synchronised stereo images acquired by the surgical robot and
- to output the processed content in HD into the stereo viewer of the surgical robot.

A description of our setup, its integration with the da Vinci system and the computer, and the superimposition of the computer-generated images follows.

2.2.1 Video capture and playback card

To acquire the video signal, most research groups working with the dVRK use off-the-shelf USB frame grabbers. This method suffers from high latency in the grabbing process. Furthermore, most low-cost frame grabbers cannot read digital video signals (HDMI, DVI and SDI) and are limited to noise-sensitive analog video formats. These issues make this approach unsuitable for our requirements. In addition to capturing the images seen by the stereo endoscope, the video interface must also display the virtual content in the stereo viewer.

To fulfil the video communication requirements, we selected a Blackmagic Design DeckLink Quad 2 card that can perform keying, that is, composite two full-frame images together. We use this technique to overlay other video signals, images and computer-generated stereoscopic content over the source video with minimal latency.25 As shown in Figure 2, we create the augmented world by compositing each real-world frame with a transparent frame of the same size wherein the virtual content is placed at the desired position; the left and right stereoscopic frames are processed individually.
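To make the compositing operation concrete, the following minimal Python sketch illustrates keying in software: an RGBA overlay frame (virtual content on a transparent background) is alpha-blended onto a real-world frame. In our platform this operation is performed in hardware by the DeckLink card, so the sketch is purely illustrative and all values shown are stand-ins.

```python
import numpy as np

def composite_overlay(real_frame: np.ndarray, overlay_rgba: np.ndarray) -> np.ndarray:
    """Alpha-blend an RGBA overlay (virtual content on a transparent frame)
    onto a BGR real-world frame of the same width and height."""
    alpha = overlay_rgba[:, :, 3:4].astype(np.float32) / 255.0  # per-pixel opacity
    overlay_bgr = overlay_rgba[:, :, :3].astype(np.float32)
    blended = alpha * overlay_bgr + (1.0 - alpha) * real_frame.astype(np.float32)
    return blended.astype(np.uint8)

# Example: one opaque white square marker on an otherwise transparent 1080p frame
overlay = np.zeros((1080, 1920, 4), dtype=np.uint8)
overlay[500:520, 900:920] = (255, 255, 255, 255)
real_frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in for an endoscope frame
augmented = composite_overlay(real_frame, overlay)
```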

FIGURE 2. The keying technique is used to superimpose the virtual content on the real world. The real-world image and the transparent frame with the virtual content are composited to generate the augmented-world image.

It is essential to consider that most surgical stereo endoscopes have high-definition (HD) resolution; only advanced video playback cards can perform keying in HD.

The DeckLink Quad 2 provides up to eight independent I/O ports that can be used simultaneously, each configured for input (reading), output (writing) or keying. This means that it is not possible to read and perform keying simultaneously on the same channel. Nonetheless, we need to read the real-world images in order to process them and, at the same time, perform keying to superimpose the transparent frame we have created on top of the images acquired by the stereo endoscope. To overcome this challenge, we split the left and right input video signals taken from the da Vinci system with two SDI splitters, one for each channel of the stereo camera.

2.2.2 Integration of the video card with the robotic system

The left and right camera control units (CCUs) provide HD-SDI output for the respective cameras. Normally, this output is connected to the HD-SDI input of the Core (the processing centre for the robotic system) through BNC cables. Thus, it is possible to intercept the stream of the standard vision pipeline by placing the video capture and playback card in this location, between the CCUs and the Core (the exact location varies depending on the generation of the da Vinci system25). The modified pipeline is shown in Figure 3. This setup requires the computer to be booted and running the video pipeline in order for the da Vinci to work in its standard mode. In an OR, additional hardware would be required to provide a video connection that does not pass through the video capture and playback card and, in turn, through the computer, to avoid losing the intraoperative video input in case of computer failure. Specifically, one would need three-output splitters instead of the two-output splitters, plus a switcher before each channel of the Core.

FIGURE 3. The standard video connection between the left and right camera control units and the Core is replaced with two SDI splitters and the DeckLink Quad 2 video capture and playback card. I, input; L, left; O, output; R, right.

In contrast to the da Vinci video outputs, the selected card has mini-BNC connectors, so mini-BNC-to-BNC cables are required. Further, RG-59 coaxial cables with BNC connectors of impedance 75 Ω are used in order to maintain a consistent impedance throughout the system and avoid reflections.

2.2.3 Integration of the video card with the workstation computer

The selected video capture and playback card requires a second-generation PCI Express slot (8 or 16 lanes). To interface with the vision system in real time and in HD, we developed appropriate drivers that can be downloaded from our GitHub site.26 These drivers expose DeckLink cards to a ROS network by leveraging libdecklink, a higher-level interface to the BlackMagic Design SDK used to control the cards.
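As a hedged illustration of how a vision node might consume the stereo streams that these drivers expose on the ROS network, the sketch below subscribes to two image topics with rospy and cv_bridge. The topic names are hypothetical placeholders, not the actual names used by our drivers.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()

def on_left(msg):
    left = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    # ... feed the left-channel frame into the vision pipeline ...

def on_right(msg):
    right = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    # ... feed the right-channel frame into the vision pipeline ...

if __name__ == "__main__":
    rospy.init_node("ar_vision_pipeline")
    # Hypothetical topic names; actual names depend on the driver configuration.
    rospy.Subscriber("/decklink/left/image_raw", Image, on_left, queue_size=1)
    rospy.Subscriber("/decklink/right/image_raw", Image, on_right, queue_size=1)
    rospy.spin()
```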

2.2.4 Overlaying of the computer-generated images

As shown in Figure 4, the images acquired by the stereo endoscope and, in turn, by the video capture and playback card, have a resolution of 1920 × 1080 (16:9 ratio). Importantly, the stereo images shown in the stereo viewer are cropped by the da Vinci system itself to a resolution of 1340 × 1072 (5:4 ratio). The cropping of the internal and external edges (respectively, lighter and darker areas in Figure 4) reduces problems related to the keystone distortion typical of the toe-in stereo-rendering method, as distortions are more severe towards the edges and in particular at the corners.27 This strategy is also confirmed by the patent of the endoscope.28 Furthermore, the cropping of the internal edges compensates for the small endoscope baseline (∼6 mm), which is the distance between the two cameras. Because of this cropping, the virtual content has to be positioned in the 1920 × 1080 transparent frame while considering the resolution of the images seen by the surgeon, that is, 1340 × 1072. Thus, researchers adapting our technology to a new system would need to provide as input the cropping of the images shown in their stereo viewer, if any. A further consideration is needed when using a da Vinci robot: the x- and y-origin of the crop varies slightly for both channels every time the da Vinci stereo endoscope 3D calibration is performed to correct for the misalignment of the optical axes.28 Consequently, the stereo window position, that is, where the disparity of the raw stereo images is zero, also varies. These subtle shifts must be taken into account to properly overlay the virtual content, in particular when overlaying directly in the raw images.
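As an illustration of this coordinate bookkeeping, the small helper below maps a point expressed in the 1340 × 1072 area visible to the surgeon back into the full 1920 × 1080 overlay frame. The crop-origin values in the example are hypothetical, since the true origin is channel-specific and changes slightly after each 3D calibration.

```python
def viewer_to_full_frame(x_vis, y_vis, crop_origin_x, crop_origin_y):
    """Map a pixel defined in the 1340 x 1072 area shown in the stereo viewer to
    coordinates in the full 1920 x 1080 overlay frame, given the top-left corner
    (crop origin) of the visible area for that channel."""
    return x_vis + crop_origin_x, y_vis + crop_origin_y

# Example with hypothetical crop origins for the left channel
x_full, y_full = viewer_to_full_frame(670, 536, crop_origin_x=300, crop_origin_y=4)
```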

FIGURE 4. The left and right images acquired by the stereo endoscope have a resolution of 1920 × 1080. In the stereo viewer, they are cropped mostly horizontally, on the internal edges (lighter area) and on the external edges (darker area), leaving a visible area of 1340 × 1072 (area outlined by the red dashed rectangle). The axes are oriented as shown by the arrows.

2.2.5 Inter-operability

DeckLink cards support the most popular video formats and can detect input resolution and pixel format. Our drivers have been validated on several variants of DeckLink cards with different generations of da Vinci systems; every tested combination worked. Furthermore, the described video communication is not limited to work with da Vinci systems; it can be extended to any SDI-compatible devices. Our platform can thus be integrated into any surgical robotic system equipped with a stereo camera and a stereo viewer.

2.3 Image processing

In the external camera view and medical records functions, the video acquired by an external camera (Figure 1A) and the patient's preoperative data (Figure 1B), respectively, are overlaid through keying on the stereo window plane, the same plane used for the da Vinci's own overlaid elements (e.g., instrument status and critical messages).

The instrument position warning system and distance computation functions include image processing of the surgical field captured by the stereo endoscope to identify the instruments and their tips. An alternative strategy would be to use robot kinematic data either alone or combined with vision; however, previous work showed that accurate detection of the tip of an articulated instrument is not achievable with only kinematic data29 due to friction, flexibility and other non-idealities in the robot's kinematic chain. Achieving positioning errors on the millimetre scale requires at least two-dimensional (2D) images.29 Furthermore, relying on robot kinematic data could undermine the platform's inter-operability and would also require accurate hand-eye calibration.29

The instrument position warning system function uses the neural network TernausNet-16 to locate the surgical instruments.30 This neural network can be used to perform three different tasks: binary segmentation, part segmentation and instrument segmentation. In our function, the raw left-channel image is given as input to TernausNet-16 for binary segmentation: the left and right instruments are extracted from the background and assigned to two different labels. When the distal end of either instrument reaches a distance of about 160 pixels from the left or right edge of the 1340 × 1072 frame shown to the surgeon, or a distance of about 190 pixels from the top or bottom edge of the same frame, the relevant left or right attention symbol is overlaid on the zero-disparity plane (Figure 1D). To decrease the computing cost and take advantage of the wider captured view, we use only one channel of the 1920 × 1080 stereo-pair images to detect out-of-view instruments, and we assume that the right instrument is always visible in the right-most 200 pixels of the cropped image, and the left instrument in the left-most 200 pixels.
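A simplified sketch of the edge-proximity check is given below: given the binary mask of one instrument (on the full 1920 × 1080 frame) and the crop origin of the visible area, it reports whether the instrument's distal end comes within the warning margins of the 1340 × 1072 frame shown to the surgeon. This is a conceptual simplification of our implementation; the distal end is approximated here as the extreme mask pixel on the side away from the instrument's entry point.

```python
import numpy as np

def near_visible_edge(instrument_mask, side, crop_x0, crop_y0,
                      margin_x=160, margin_y=190,
                      visible_w=1340, visible_h=1072):
    """Sketch of the warning check for one instrument. `side` is 'left' or
    'right'; the distal end of the left instrument is taken as its right-most
    mask pixel and vice versa. Returns True when the distal end comes within the
    warning margins of the cropped frame shown to the surgeon (or when the
    instrument is not detected at all)."""
    ys, xs = np.nonzero(instrument_mask)
    if xs.size == 0:
        return True  # instrument not segmented at all: treat as out of view
    idx = np.argmax(xs) if side == "left" else np.argmin(xs)
    x_tip, y_tip = xs[idx] - crop_x0, ys[idx] - crop_y0
    return (x_tip < margin_x or x_tip > visible_w - margin_x or
            y_tip < margin_y or y_tip > visible_h - margin_y)
```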

The distance computation function recognises the tips of the robotic instruments to place virtual markers. Each placement consists of four steps, which are summarised in Figure 5 and described here.

FIGURE 5. Placing a virtual marker for the distance computation function consists of four steps. Step 1 rectifies the acquired raw images and corrects the lens distortions. In step 2, the tip (marked in blue) of the selected robotic instrument is identified in the left image plane. As shown on the left, the instruments are segmented with deep neural networks in the case of real tissues and with colour-based segmentation in the case of simulated tissues. Step 3 retrieves the tip in the right image plane through template matching. In step 4, the identified position of the tip is transformed to be shown in the raw stereo images that will be displayed in the surgeon console.

In the first step, the stereo images acquired with the toed-in cameras are rectified, and the lens distortions are corrected.
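This step can be implemented with standard stereo-rectification routines; a minimal OpenCV sketch is shown below, assuming intrinsic matrices (K1, K2), distortion coefficients (D1, D2) and stereo extrinsics (R, T) obtained from a prior calibration of the endoscope.

```python
import cv2

def rectify_stereo_pair(left_raw, right_raw, K1, D1, K2, D2, R, T,
                        size=(1920, 1080)):
    """Rectify a raw stereo pair and correct lens distortion using the
    parameters from a prior stereo calibration of the endoscope."""
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(
        K1, D1, K2, D2, size, R, T, flags=cv2.CALIB_ZERO_DISPARITY)
    map_lx, map_ly = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map_rx, map_ry = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    left_rect = cv2.remap(left_raw, map_lx, map_ly, cv2.INTER_LINEAR)
    right_rect = cv2.remap(right_raw, map_rx, map_ry, cv2.INTER_LINEAR)
    return left_rect, right_rect, P2, Q
```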

In the second step, the rectified left-channel image is given as input to the neural network TernausNet-16 for part segmentation: the robotic instruments are extracted from the background, and three articulated parts of each instrument are identified, that is, the rigid shaft, the articulated wrist and the claspers. We remove wrongly classified areas based on thresholds related to their sizes. The projection of the tip of an instrument (P) on the rectified left image plane is defined as the corner of the claspers that is farthest from the centre of the wrist. If two corners have a similar Euclidean distance from the centre of the wrist, tip P is defined as their midpoint. Which of these two conditions applies depends on how the surgeon holds the instrument tip (Figure 6). The aforementioned corners are detected with the Harris corner detector and refined in small search windows (best results were obtained with 11 × 11 search windows). Finally, an 81 × 61 region-of-interest (ROI) image centred on tip P is created. The selected dimensions guarantee that the ROI entirely includes tip P when the instrument is held at different distances from the stereo endoscope.
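A minimal sketch of the tip-selection logic is given below, operating on a binary mask of the claspers and an estimated wrist centre. The corner-similarity threshold is an assumed value, and the 11 × 11 sub-pixel refinement mentioned above is omitted for brevity.

```python
import cv2
import numpy as np

def locate_tip(clasper_mask, wrist_centre, similarity_px=5.0):
    """Sketch of step 2: the instrument tip is the clasper corner farthest from
    the wrist centre; if the two farthest corners lie at similar distances, the
    tip is their midpoint. `similarity_px` is an assumed threshold."""
    harris = cv2.cornerHarris(clasper_mask.astype(np.float32),
                              blockSize=2, ksize=3, k=0.04)
    ys, xs = np.nonzero(harris > 0.01 * harris.max())
    if xs.size == 0:
        return None
    corners = np.stack([xs, ys], axis=1).astype(np.float32)  # (x, y) pairs
    dists = np.linalg.norm(corners - np.asarray(wrist_centre, np.float32), axis=1)
    order = np.argsort(dists)[::-1]  # corners sorted by decreasing distance
    tip = corners[order[0]]
    if order.size > 1 and dists[order[0]] - dists[order[1]] < similarity_px:
        tip = (corners[order[0]] + corners[order[1]]) / 2.0  # midpoint of two corners
    return tuple(tip)
```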

FIGURE 6. Holding the instrument tip as shown in the left image produces one possible corner (blue circle). As such, the tip is considered to be that corner. Holding the instrument tip as shown in the right image produces two corners (green circles) with similar Euclidean distances from the centre of the wrist (yellow cross). As such, the tip is defined as the midpoint between these two possible corners (blue circle).

In the third step, the ROI image extracted from the left-channel image is used as a template to search for tip P in the rectified right-channel image using template matching. Since the images are rectified, corresponding points lie (approximately) on the same epipolar line in the left and right images, so the vertical search area is reduced. Consideration of the maximum feasible disparity decreases the horizontal search area.
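The restricted search could be sketched as follows, assuming rectified images and the 81 × 61 ROI from the previous step; the maximum-disparity bound and the vertical band half-height are assumed values chosen for illustration.

```python
import cv2

def match_tip_in_right(right_rect, roi_template, tip_left,
                       max_disparity=300, band_half_height=2):
    """Sketch of step 3: search for the left-image ROI in the rectified right
    image. The search is restricted to a narrow band around the epipolar line of
    the left tip and to disparities between 0 and `max_disparity` (assumed).
    Returns the matched tip position in the right image."""
    xL, yL = int(tip_left[0]), int(tip_left[1])
    h, w = roi_template.shape[:2]
    # Vertical band around the epipolar line, horizontal range bounded by disparity
    y0 = max(0, yL - h // 2 - band_half_height)
    y1 = min(right_rect.shape[0], yL + h // 2 + band_half_height + 1)
    x0 = max(0, xL - max_disparity - w // 2)
    x1 = min(right_rect.shape[1], xL + w // 2 + 1)
    search = right_rect[y0:y1, x0:x1]
    result = cv2.matchTemplate(search, roi_template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)
    return x0 + max_loc[0] + w // 2, y0 + max_loc[1] + h // 2
```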

In the fourth step, the pixel locations of tip P identified in the rectified stereo-pair image are retrieved in the raw stereo-pair image through the undistortion and rectification transformation maps. A white circle of radius 5 pixels centred on the retrieved tip positions is superimposed over each raw image through keying, thereby allowing the user to visualise the virtual marker placed with tip P in 3D space (Figure 1C). The x-, y- and z-coordinates of the virtual marker are computed by combining the identified disparity (d = xL − xR) with the parameters of the intrinsic right camera matrix for the rectified images (obtained with a stereo calibration of the stereo endoscope31), as follows:

$$x = \frac{(x_L - c_x)\,T_x}{d}, \tag{1}$$
$$y = \frac{(y_L - c_y)\,T_x}{d}, \tag{2}$$
$$z = \frac{f\,T_x}{d}, \tag{3}$$

where Tx is the translation along the x-axis of the optical centre of the right camera in the left camera's frame, xL and xR are the x-coordinates of the projections of tip P on the left and right image planes, yL is the y-coordinate of the projection of tip P on the left image plane (Figure 4), cx and cy are the optical centres (principal points) and f is the camera's focal length.
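A direct transcription of Equations (1)–(3), together with the resulting marker-to-marker distance, might look as follows; the parameter names mirror the symbols defined above.

```python
import math

def triangulate_tip(xL, yL, xR, f, cx, cy, Tx):
    """Equations (1)-(3): 3D coordinates of tip P from its projections in the
    rectified left and right images (disparity d = xL - xR)."""
    d = xL - xR
    x = (xL - cx) * Tx / d
    y = (yL - cy) * Tx / d
    z = f * Tx / d
    return x, y, z

def marker_distance(p1, p2):
    """3D Euclidean distance between two placed virtual markers, expressed in
    the same units as Tx (e.g. millimetres)."""
    return math.dist(p1, p2)
```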

The neural network TernausNet-16 produces good results when working on images acquired from real surgeries, similar to those it was trained on30; however, during this design phase and for the sake of reproducibility and live demonstration, we chose to evaluate the functions with phantom tissues. Thus, we replaced the TernausNet-16 with a binary colour-based segmentation that outputs the robotic instruments and the background. Having two classes instead of four, we had to modify the algorithm of the second step of the distance computation function, that is, the identification of the tip. When using phantom tissues, the corners are found in a window of 31 × 51 pixels centred on the extremity of the robotic instrument, that is, the right-most pixel belonging to the left instrument or the left-most pixel belonging to the right instrument, instead of on the claspers.
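For the phantom-tissue case, the colour-based segmentation can be sketched as a simple HSV threshold separating the grey metallic instruments from the red-tinted foam; the threshold values below are illustrative assumptions, not the values used in the study.

```python
import cv2
import numpy as np

def segment_instruments_by_colour(bgr_frame):
    """Sketch of the binary colour-based segmentation used with phantom tissue:
    low-saturation (grey) pixels are kept as instrument candidates and the mask
    is cleaned with a morphological opening. Threshold values are assumptions."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array((0, 0, 40)), np.array((180, 60, 255)))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return mask
```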

When developing this function, we also focussed on avoiding diplopia, that is, the simultaneous perception of two images of a single object. Once placed in its 3D position, the virtual marker becomes part of the real environment; thus, it should disappear if covered (e.g. by a surgical instrument) and reappear again once uncovered. A pixel is known to be covered when its computed disparity value increases above a threshold. The AR system must continuously update the disparity map to recognise such events. During testing, the virtual dots disappeared when covered after approximately 140 ms21; however, since the virtual markers are visible only for a short time, and because implementing real-time tracking of the virtual markers was outside the scope of this function, we decided to remove this feature during the user study.
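Conceptually, the occlusion test reduces to comparing the continuously updated disparity at the marker's pixel with the disparity at which the marker was placed; a minimal sketch follows, with an assumed threshold value.

```python
def marker_is_occluded(disparity_map, marker_xy, marker_disparity, threshold_px=8.0):
    """Sketch of the diplopia-avoidance check: the virtual marker is hidden when
    the current scene disparity at its pixel exceeds the disparity at placement
    by more than a threshold (assumed value), meaning tissue or an instrument
    now lies in front of the marker."""
    x, y = marker_xy
    return disparity_map[y, x] > marker_disparity + threshold_px
```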

2.4 Voice commands

The developed functions are controlled by voice commands, except for the instrument position warning system function, which is permanently active in our implementation to reinforce safety. We avoided control modes that shift the surgeon's attention away from the surgical field, for example activating the functions using external hardware, or that make the surgeon control the instruments in potentially unsafe ways, for example, activating the functions with specific gestures. As such, voice commands represented a promising option worthy of evaluation. The integration of voice commands in laparoscopic surgical robots began in 1996 with the AESOP 2000, which was used to manoeuvre the endoscopic camera.32 Voice control achieved quicker operation times than both human assistance and the hand and foot controls used in the previous version of AESOP. Nevertheless, it also had some disadvantages. The surgeon had to talk continuously during the procedure, potentially distracting the rest of the personnel in the OR.32 Additionally, the surgeon had to pre-record the voice commands33; the resulting voice recognition was accurate34, 35 but limited to a particular language dialect.32

In da Vinci systems, audio communication is facilitated by the microphone located under the viewport of the surgeon console.10 Our workstation computer accesses the audio signal acquired by this microphone through the Line Out on the back of the surgeon console.10 In robots not equipped with a microphone, an externally mounted or head-worn microphone can be directly connected to the computer. The acquired audio signal is chunked every three seconds and analysed through a lightweight speech recognition engine, PocketSphinx,36 with a statistical language model that contains the probabilities of the possible words and their combinations. Each command follows the pattern ‘da Vinci + verb + object of the action’ to create a natural interaction with the surgical robot (‘da Vinci’ is used as a wake word). A command is valid only if spoken correctly, for example, not forgetting the wake word at the beginning, not pausing for too long between the words, not changing the sequence of the words and not continuing to talk after saying the command. We expected the surgeon to stay briefly silent after giving the command to wait for the function to execute. To improve voice command recognition, the echo is cancelled and the background noise is filtered out. The list of the voice commands used during the experimental evaluation is reported in Table 1.
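Independently of the speech engine, the command grammar itself can be sketched as a simple parser that enforces the wake word and the ‘verb + object’ pattern against the commands of Table 1; this is an illustrative reconstruction, not our actual PocketSphinx configuration.

```python
# Valid (verb, object) pairs taken from Table 1
VALID_COMMANDS = {
    ("show", "commands"), ("remove", "commands"),
    ("show", "camera"), ("remove", "camera"),
    ("show", "x-ray"), ("show", "next"), ("show", "previous"),
    ("move", "top-left"), ("move", "top-right"),
    ("move", "bottom-left"), ("move", "bottom-right"), ("move", "centre"),
    ("remove", "x-ray"), ("compute", "distance"),
    ("mark", "left"), ("mark", "right"),
    ("show", "work area"), ("remove", "work area"), ("remove", "content"),
}

def parse_command(utterance: str):
    """Return (verb, object) if the utterance is a valid command, else None.
    A command must start with the wake word 'da Vinci' and contain nothing
    beyond 'verb + object'."""
    words = utterance.lower().strip().split()
    if len(words) < 4 or words[:2] != ["da", "vinci"]:
        return None
    verb, obj = words[2], " ".join(words[3:])
    return (verb, obj) if (verb, obj) in VALID_COMMANDS else None

assert parse_command("da Vinci show camera") == ("show", "camera")
assert parse_command("show camera") is None  # wake word missing
```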

TABLE 1. List of the voice commands used during the experimental evaluation

| Voice command | Explanation |
| --- | --- |
| da Vinci, show commands | To display the list of all voice commands |
| da Vinci, remove commands | To remove the list of all voice commands |
| da Vinci, show camera | To display the external camera view |
| da Vinci, remove camera | To remove the external camera view |
| da Vinci, show X-ray | To view the first image in the X-ray folder |
| da Vinci, show next | To view the next image in the X-ray folder |
| da Vinci, show previous | To view the previous image in the X-ray folder |
| da Vinci, move top-left | To reduce and move the X-ray image to the top-left corner |
| da Vinci, move top-right | To reduce and move the X-ray image to the top-right corner |
| da Vinci, move bottom-left | To reduce and move the X-ray image to the bottom-left corner |
| da Vinci, move bottom-right | To reduce and move the X-ray image to the bottom-right corner |
| da Vinci, move centre | To return to the large X-ray image in the centre |
| da Vinci, remove X-ray | To remove the X-ray image |
| da Vinci, compute distance | To measure the distance between the two instrument tips (method 1) |
| da Vinci, mark leftᵃ | To select a point with the tip of the left instrument (method 2) |
| da Vinci, mark rightᵃ | To select a point with the tip of the right instrument (method 2) |
| da Vinci, show work area | To see the outline of the area wherein the instrument tips should be placed |
| da Vinci, remove work area | To remove the area limits |
| da Vinci, remove content | To remove all content related to the distance computation function |

ᵃ In method 2, the distance is calculated after the selection of two points.

2.5 Experimental evaluation

We conducted an exploratory user study to evaluate the usability and utility of the four implemented AR functions and the voice commands by which they are triggered.

2.5.1 Experimental setup

Figure 7 shows the experimental setup. The workstation computer is an Alienware Aurora R7 customised as follows:

- CPU: Intel Core i7-8700K (six cores, 12-MB cache, overclocked up to 4.6 GHz across all cores).
- GPU: NVIDIA GeForce GTX 1080 Ti with 11-GB GDDR5X memory.
- Operating system: Ubuntu 16.04 LTS.

The surgical environment consists of a custom laparoscopic box trainer containing a piece of simulated tissue, all attached to a tilting table.

FIGURE 7. A clinical da Vinci Si HD is connected to the workstation computer, which recognises voice commands, processes the images and outputs the virtual content. The experiment is recorded by a laptop computer that captures the images acquired by the stereo endoscope as well as the view of the procedure seen by another camera (not shown) that can see the surgeon console, the vision cart and the patient cart of the da Vinci system. The experimenter's voice is recorded by the laptop computer and the surgeon's voice by the surgeon console microphone. The surgeon controls the robot from the console to perform a lymphadenectomy on a phantom tissue sample placed under the laparoscopic box trainer on the tilting table.

Each participant performed a physically simulated lymphadenectomy, that is, removal of lymph nodes. Their goal was to remove four simulated lymph nodes from the phantom tissue. The positions of the lymph nodes were provided using preoperative images (in particular, X-ray images acquired with a Kubtec XPERT 80-L).

We built the laparoscopic box trainer from an upside-down 360-mm-diameter rigid bowl from ROTH (Rotilabo mixing bowls PP, ref. YE52.1), which we painted black to reduce the dispersion of light. Then, five circular windows for internal access were drilled, and each was filled with a flexible membrane. A smaller-diameter hole was created at the centre of each membrane to allow instrument access. Input from surgeons was used to determine the optimal positions of the access points for the endoscope (E), the left and right robotic tools (TL and TR) and the left and right assistant instruments (AL and AR), as shown in Figure 8.

FIGURE 8. Top view of the CAD model of the laparoscopic box trainer. The endoscope (E) is at the centre of the bowl, the left and right robotic tools (TL and TR) are placed symmetrically about the sagittal plane and the left and right assistant instruments (AL and AR) are opposite from the centre of the bowl at lower elevation angles. The angular positions are given in parentheses (vertical-axis angle, elevation angle). ∅ indicates the diameter of each access point and s∅ indicates the spherical diameter of the box trainer.

We prepared the phantom tissue samples with Smooth-On Soma Foama 25, a soft two-component platinum-cure silicone casting foam. To give the samples an appearance similar to real tissue, we added a red colouring substance (UVO Colorants) to the material during its preparation. Mixing takes ∼1 min, and an additional hour is needed for the curing process. The material expands to about two to three times its original volume and has properties similar to real tissue, as judged by our clinical co-authors.

The lymph nodes (Figure 9A) were prepared using Smooth-On VytaFlex 20 urethane rubber, the red colouring substance and metallic particles (iron powder). This composite material was then poured into moulds of different shapes, and a layer of fabric was added in the joint plane to emulate real-life tissue adherence and resistance properties. The resulting lymph nodes were radiopaque, and they were darker and harder than the surrounding phantom tissue.

FIGURE 9. Task materials: (A) Lymph nodes: Two samples of simulated lymph nodes placed in the phantom tissue. The lymph nodes are made of rubber and fabric to emulate real tissue adherence properties, and they contain metallic particles to be opaque to X-rays. (B) Flat surface: Flat surface placed above the sample tissue. The surgeons were asked to measure and guess the dimensions of different sections of this flat surface. The ground truth measurements are labelled.

During the curing process of the tissue, we placed four lymph nodes at different locations inside the material, plus a radiopaque bendable cable visible on the surface. This cable was used by the participants as a landmark for finding the radiopaque lymph nodes through the X-ray images (Figure 1B).

2.5.2 Experimental protocol

Nine surgeons (eight male and one female) participated in the experiment. One participant was excluded from the analysis for not meeting our inclusion criterion for experience with RMIS. As summarised in Table 2, the other eight participants were expert surgeons with a median of 16.5 years of experience and 200 cases performed in the last year in open surgery, 9 years and 17.5 cases in MIS and 2 years and 62.5 cases in RMIS. Five of our participants specialise in urology, one in abdominal surgery, one in general/colorectal surgery and one in cardiology. One subject declared having no experience in MIS because non-robotic laparoscopic surgery is not used in cardiology.

TABLE 2. Information about the expertise of the eight surgeons who participated in the study

|  | Open surgery | MIS | RMIS |
| --- | --- | --- | --- |
| Years of experience |  |  |  |
| Minimum | 9 | 0 | 1 |
| Median | 16.5 | 9 | 2 |
| Maximum | 35 | 30 | 12 |
| Cases performed in the last year |  |  |  |
| Minimum | 20 | 0 | 10 |
| Median | 200 | 17.5 | 62.5 |
| Maximum | 350 | 200 | 130 |

Note: The table reports minimum, median and maximum years of experience and approximate number of cases performed last year in open surgery, MIS and RMIS.
Abbreviations: MIS, minimally invasive surgery; RMIS, robot-assisted minimally invasive surgery.

The experimental protocol was reviewed and approved by the Ethics Council of the Max Planck Society (Application Number: 2018_26). All participants provided informed consent to participate in the study prior to data collection. Subjects were offered payment of €8/h for participation.

The experimenter calibrated the stereo endoscope before the arrival of each participant. After informed consent, the experimenter collected the surgeon's demographic data and presented the functions being evaluated. The participant practiced the voice commands by reading every command once. The participant then performed a lymphadenectomy on the phantom tissue; two lymph nodes were extracted without our AR functions, and two were extracted with our AR functions. Half of the participants were randomly assigned to start with the AR functions and half without. When the task was performed without our technology, we asked participants to guess the size of the two extracted lymph nodes and of two desired sections of a flat surface that we placed above the sample tissue (Figure 9B). When the task was performed with our technology, we asked them not to guess but to measure the other two extracted lymph nodes and any two sections of the flat surface with our distance computation function. Before beginning the experiment, each participant was given the opportunity to practice the new functions and voice commands for as long as desired. At the end of the study, that is, after all four lymph nodes were extracted, the subject was asked to fill out a questionnaire regarding the utility and the usability of the technology. A full listing of all 29 questions (Q1 to Q29) is provided as Supporting Information S1 for this article. A total of 15 questions are answered on a 5-point Likert scale, nine are multiple-choice questions, and the remaining five are open-ended. Additional space for comments was given at the end of each subsection and at the end of the entire questionnaire.
