Immersive Virtual Reality–Based Methods for Assessing Executive Functioning: Systematic Review


Introduction

Background

Executive functioning (EF) has long been a focus of neuropsychological assessment because of the significant role it plays in everyday functioning. EF is an umbrella term for higher-order cognitive skills used to control and coordinate a wide range of mental processes and everyday behaviors [-], including “...mentally playing with ideas; taking the time to think before acting; meeting novel, unanticipated challenges; resisting temptations; and staying focused” []. Although a universally accepted definition of EF does not exist [], there is agreement on the attributes of 3 core executive functions: inhibition, cognitive flexibility, and working memory [,,]. These core executive functions support other higher-order executive functions such as reasoning, planning, and problem-solving [-]. As EF impairment has been linked to a variety of mental disorders [], it is often considered a transdiagnostic risk factor [].

Although traditional methods used to assess EF are popular [,] and well validated [], they have been criticized for their lack of ecological validity [,]. Ecological validity, within the scope of this study, is defined as the “functional and predictive relationship between the person’s performance on a set of neuropsychological tests and the person’s behavior in a variety of real world settings” []. Specifically, we interpret ecological validity as comprising 2 principal components: representativeness—the degree to which a neuropsychological test mirrors the demands of a person’s daily living activities that it aims to evaluate [], sometimes referred to as verisimilitude []—and generalizability—the extent to which test performance predicts an individual’s functioning in their daily living activities [], also known as veridicality [].

Traditional assessments tend to take a “construct-led” approach, with each test intended to isolate a single cognitive process in an abstract measure. This process of abstraction may limit the ecological validity of the measure by resulting in poor alignment between the test outcomes and real-world functioning. In turn, this produces a large amount of variance in EF that is unaccounted for by traditional tasks. For example, Chaytor et al [] noted that traditional EF tests accounted for only 18% to 20% of the variance in the everyday executive ability of participants. This lack of explained variance may be attributed to the nature of the testing environment, the constructs assessed in isolation, the participant’s affective state, and the compensatory strategies available to the participant []. A related methodological issue, known as the “task impurity problem” [,], indicates that the score on an EF task usually reflects not only the systematic variance attributable to the specific aspect of EF targeted by that task but also the (1) systematic variance across multiple types of EF tasks, (2) systematic variance attributable to non-EF aspects of the task, and (3) nonsystematic (error) variance (see the study by Snyder et al [] for a detailed review). Outside the testing environment, the process of making a decision or planning and eliciting goal-directed behavior in everyday life is often highly dynamic and influenced by numerous internal and external factors [,]. Therefore, an ecologically valid assessment tool will need to include relevant contextual, dynamic, and multidimensional features such as affect and physiological state, which traditional assessments cannot include.
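Expressed schematically (an illustrative formalization on our part rather than one drawn from the cited studies, and assuming additive, independent components), the task impurity problem can be written as

\sigma^{2}_{\text{task score}} = \sigma^{2}_{\text{target EF}} + \sigma^{2}_{\text{common EF}} + \sigma^{2}_{\text{non-EF}} + \sigma^{2}_{\text{error}}

where only the first term reflects the specific aspect of EF that the task is intended to isolate.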

Furthermore, although traditional EF assessment tools may be appropriate for clinical populations, they generate less information about functioning in relatively healthy individuals. For example, the Trail-Making Test (TMT) has routinely been administered as a neuropsychological assessment of driving performance. Although some studies have demonstrated a relationship between the two [,], others have shown no relationship [], particularly in nonclinical populations [,]. Thus, although traditional tools are adequate for detecting more severe EF impairments, they are less effective in detecting subtle changes in EF and early decline. Increased test sensitivity to detect subtle intraindividual changes may enable better detection of the prodromal stages of cognitive decline. Early detection is important as it enables early intervention, which may in turn improve prognosis. For example, sensitive detection can identify the prodromal stages of Alzheimer disease in seemingly healthy individuals [] and mild cognitive decline up to 12 years before clinical diagnosis []. Similarly, in a situation in which an individual requires a capacity assessment for an activity, traditional assessments may have limited utility for nonclinical populations. The triangulation of multiple data sources such as biosensors may increase sensitivity to better identify subtle changes in capacity.

To address the shortcomings of poor ecological validity and test sensitivity, research on psychological assessment has begun to investigate virtual reality (VR) technology as a means of providing a more naturalistic environment for evaluating EF in clinical neuropsychological assessments. VR enables the development of custom-designed simulated environments that can replicate real-life environments, potentially increasing its ecological validity through representativeness. In addition, VR could increase engagement [,], reduce test time, and better integrate data from biosensors with in-task events that facilitate assessment. The following sections will expand on these points and consider the importance of validating and assessing the reliability of VR for EF assessment.

Ecological Validity and Representative Tests

There is an increasing emphasis on conducting EF assessments using tasks that resemble situations experienced in everyday life []. For example, the Multiple Errands Test (MET) [] requires individuals to run errands in a real environment (eg, a shopping center). Empirical assessment of the MET has demonstrated its generalizability to daily functioning [] and carer reports of daily functioning []. However, given that the MET is designed to be performed in real-life locations, it is impractical for routine administration by clinicians [,] and susceptible to the variable features of real-world environments that are outside experimental control. VR can mitigate these difficulties by maintaining the real-world environment without requiring travel while enabling fine-tuned control and uniform presentation of environmental characteristics []. Several studies [-] have investigated and developed platforms for this purpose, commonly known as the virtual MET.

Engagement

VR has the potential to enhance individual engagement more effectively than traditional pencil-and-paper or computerized tasks by offering a fully immersive experience []. Recognized as a crucial aspect of cognitive assessment, engagement can be improved through gamification, thereby improving task performance []. “Serious games,” defined as games intended for a variety of serious purposes, such as training, learning, stimulation, or cognitive assessment [], have been shown to be more engaging than nongamified tasks [-]. The unique immersive environment of VR captures increased attention, leading to reduced average response times and response time variability []. Notably, recent studies using electroencephalography (EEG)-based metrics have shown greater attention elicited in immersive VR paradigms than in 2D computerized assessments []. This heightened immersion and engagement in VR may enhance the reliability of the measures by capturing a more accurate representation of an individual’s best effort.

Cybersickness

Despite their capacity to increase engagement, VR paradigms can induce cybersickness, which threatens the validity of the paradigm. Cybersickness (ie, dizziness and vertigo) is akin to motion sickness but occurs in response to exposure to VR []. Previous research suggests that there is a negative relationship between cybersickness and cognitive performance. For example, Nalivaiko et al [] found that reaction times were moderately correlated (r=0.5; P=.006) with subjective ratings of nausea. Similarly, Sepich et al [] found that participants’ accuracy on n-back task performance was weakly to moderately negatively correlated (r=−0.32; P=.002) with subjective cybersickness ratings. Therefore, there is reasonable concern that the potential benefits of engagement and ecological validity may be compromised if participants experience cybersickness.

Validity, Reliability, and Sensitivity

Arguably, the biggest threat to the utility of VR platforms is that many studies do not document their validity and reliability. A meta-analysis showed that VR assessment tools are moderately sensitive to cognitive impairment across neurodevelopmental, mental health, and neurological disorders [], demonstrating their promising application in clinical settings. Borgnis et al [] reviewed the VR-based tools for EF assessment that are currently available, illustrating the plethora of platforms developing in this field. The works by Neguț et al [] and Borgnis et al [] highlight the utility of VR assessment tools to detect dysfunction and present the various tools in the literature created to investigate EF. Kim et al [] provided an overview of the research trends using VR for neuropsychological tests and documented the cognitive functions assessed in each study. However, to the best of our knowledge, there is no overview or examination of the psychometric properties of these VR tools or how they are being evaluated.

Typically, novel measures and assessments are validated against current gold-standard tasks to establish concurrent validity []. Concurrent validity can be a reliable means of determining whether two assessments measure the same construct. However, high concurrent validity can also arise when two tests share the same problems, such as measuring a particular construct inaccurately in the same way. Notably, many VR tasks are being created from a “function-led” perspective but validated against “construct-led” tasks [,]. Given their different approaches, function-led and construct-led assessments should be validated in different ways or at least using several validation approaches. If function-led VR assessments improve upon the validity of current assessment methods, validation techniques may also need to go beyond comparisons with traditional assessments. For example, function-led VR assessments may be better validated against additional alternative methods, such as carer reports, real-life performance (eg, self-care, residence, transportation, and employment), and diagnostic trajectory [] rather than through traditional (construct-led) assessment alone. Without incorporating tests of ecological validity, the potential advantages of VR may go unrecognized. Given the increasingly rapid development of VR neuropsychological assessments, it will be imperative to maintain high validation standards for these tools [].

Establishing the reliability of novel VR EF assessments is also critical to the integrity of the outcomes. Reliability ensures that the measure yields consistent and repeatable results, a foundational element for test validity. Consequently, both reliability and validity ought to be evaluated for each measurement tool. Reports of test-retest reliability, which confirms consistency over time, should include the interval between assessments and the correlation between the two sets of results. Internal consistency, typically measured using the Cronbach α, should also be reported for each target construct or domain of assessment. Importantly, for immersive VR EF assessments that evaluate multiple EF constructs, it is essential to report the α for each distinct construct rather than a single collective coefficient. This is because the coefficient is intended to evaluate item consistency within a scale measuring a single construct; applying it across disparate constructs could be confusing and potentially misleading.
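To make the per-construct reporting recommendation concrete, the following minimal sketch computes the Cronbach α separately for each construct; the data, construct groupings, and all names in the code are hypothetical and do not come from any of the reviewed studies.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # items: participants x items score matrix for ONE construct
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Hypothetical VR metrics grouped by the construct they are assumed to measure
rng = np.random.default_rng(0)
scores_by_construct = {
    "inhibition": rng.normal(size=(30, 4)),      # 30 participants, 4 inhibition metrics
    "working_memory": rng.normal(size=(30, 3)),  # 30 participants, 3 working memory metrics
}

# Report one alpha per construct rather than a single pooled coefficient
for construct, items in scores_by_construct.items():
    print(construct, round(cronbach_alpha(items), 2))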

Consistency of Terminology

Finally, to ensure psychometric precision and build on previous research, EF assessment paradigms must adopt consistent terminology for their target assessment constructs. The field of EF, although of significant interest to both researchers and clinicians, is marked by varied terminology for identical constructs. This issue, longstanding in EF research (see the study by Suchy []; for a review, see the study by Baggetta and Alexander []), presents challenges to VR in the EF assessment field. For instance, inconsistent terminology hinders the synthesis of research findings. Diverse labels such as “impulsivity” and “impulse control” might, upon examination, refer to the same underlying construct. Consequently, researchers aiming to extend the literature on “impulsivity” might overlook pertinent studies or exclude valuable references because of terminological discrepancies.

This literature review sought to examine and discuss the development of the VR tools used to assess EF with a specific focus on evaluating their psychometric properties. The studies selected for inclusion in this review were those that developed assessment tools for EF either holistically or in part. The aims of this review were to (1) determine the components of EF assessed using VR paradigms, (2) investigate the methods used to validate VR assessments, and (3) explore the frequency and efficacy of reporting participants’ immersion in and engagement with VR for EF assessment.


Methods

Our review methodology followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement []. In line with the literature, EF was defined as a set of executive functions, including inhibition, cognitive flexibility, and working memory [,,], that support other higher-order executive functions, such as reasoning, planning, and problem-solving [,].

Inclusion Criteria

Before conducting the literature search, the inclusion criteria were established. First, only peer-reviewed articles and conference proceedings (complete manuscripts) written in English would be included. Second, articles that detailed an empirical, clinical, or proof-of-concept study in which an immersive virtual environment (ie, using a head-mounted display, not a 2D computer screen) was reported to broadly investigate EF or higher-order cognition or that examined EF via a selection of one or more subconstructs (eg, inhibitory control and working memory) would be included. Finally, only articles with an adult participant population published after 2013 would be included. This temporal limit was based on the release date of the Oculus Rift Development Kit 1 as it was one of the first accessible products for public use of VR. Articles were identified through the EBSCOhost, Scopus, and Web of Science (WoS) citation databases. Scopus and WoS were chosen because of their prominence as citation databases []. To compensate for the bias toward engineering and natural science articles found through Scopus and WoS [], EBSCOhost was searched for articles published in fields such as (clinical) psychology and medicine.

Search Strategy

Keywords were developed by identifying 3 main components that the relevant literature should include. The 3 components were based on “Virtual Reality,” “Neuropsychological Tests,” and “Executive Function.” It was decided not to search for specific components of EF because of the lack of consensus in the field regarding its components. Rather, it was assumed that, if an article addressed EF or a component of EF, it would include “executive functioning” as a keyword in the title, abstract, or keywords. Other reviews looking broadly at VR paradigms have used similar search strategies [].

In this study, key terms were developed by identifying synonyms for key components and concatenating them using the “AND” Boolean operator. The final keywords used for the search were as follows: ([“virtual” OR “artificial” OR “simulated”] AND [“realit*” OR “world” OR “environment”]) AND ([neuropsych* OR function* OR cognit*] AND [(executive AND function*) OR (high* AND order AND cognit*)] AND [assessment]).

Literature queries made through EBSCOhost were limited to the following databases: Academic Search Complete, AgeLine, AMED, Applied Science and Technology Source, CINAHL, E-Journals, Health Source Consumer and Nursing/Academic Edition, MEDLINE, Mental Measurements Yearbook, Psychology and Behavioral Sciences Collection, and all variations of the American Psychological Association databases. Furthermore, for the search, 3 topic fields (ie, title, abstract, and subject terms) were used to paste the keywords. The 3 topic fields were concatenated using the “OR” Boolean operator. Using the Scopus database, we implemented a basic search in the article title, abstract, or keywords using the keywords. No additional limitations were applied. Our search in WoS included all databases, and the advanced search method was used wherein keyword searches in the article title, abstract, and keyword topic fields were concatenated using the “OR” Boolean operator (ie, Title=(keywords) OR Abstract=(keywords) OR Keywords=(keywords)).

The results for each database were exported to Covidence systematic review software (Veritas Health Information) [], which removed duplicates. All abstracts were screened independently by the first author and the senior author to determine whether the contents met the inclusion criteria. Full-text screening was also performed by the same authors. Any disagreement was discussed by the first (RK), second (LK), and senior (KR) authors.

Data Extraction

The first and second authors completed the data extraction process by manually reviewing each manuscript; data items (see the following section) were recorded in a tabular format using Microsoft Excel (Microsoft Corp).

Data Items and Synthesis

Demographic details, qualitative descriptions of the VR paradigm, user experience, cybersickness, immersion and engagement details, and comparative measures for validation purposes were extracted ( [-,-]).

A qualitative evaluation of the studies included in the review was performed, meaning that the content of each manuscript was assessed based on the reported target constructs or constructs relevant to EF and the extent to which the reported VR task was related to the assessment of the target construct or constructs. To do this, studies were categorized based on the construct they targeted through their VR paradigm as reported by the authors of the respective articles. If multiple constructs were assessed in a single study, the study was included for each construct. No inferences were made about which cognitive construct or constructs were assessed based on the tasks that were reported in the manuscripts. For example, if an article indicated only that the authors used a VR version of the Stroop test (ST) but did not disclose which construct was assessed using this test, the study was not categorized under inhibitory control or cognitive flexibility but under the general factor “executive functioning.”

Next, it was indicated whether the articles explicitly or implicitly disclosed the way in which the comparative measures (such as particular metrics) were used to validate the VR paradigm. For instance, if the article directly stated a priori that they hypothesized a correlation between a VR task measuring inhibition and a validation task such as the ST, this was recognized as providing explicit validation for inhibition. Conversely, if an article indicated that participants completed the ST, which assessed inhibition and processing speed, and mentioned that the VR paradigm evaluated inhibition, it was considered to provide implicit validation for inhibition. Furthermore, traditional construct- and function-led assessments were identified from the text.

The (quantitative) results of the studies were screened to identify (1) the direction and strength of the relationship between traditional and VR assessments and (2) whether the results from all possible and a priori–defined comparisons were reported.

Finally, qualitative and quantitative tools used to evaluate beneficial and adverse effects of VR immersion were identified from the manuscripts and categorized in a tabulated format. The results of the studies were screened to identify whether they assessed the influence of the beneficial and adverse effects of VR immersion on task performance.


Results

Overview

Through WoS, EBSCOhost, and Scopus, 892 items were identified, from which the Covidence systematic review management platform [] filtered 337 (37.8%) duplicates. A total of 555 unique articles remained, of which 424 (76.4%) were deemed irrelevant through abstract screening. The remaining 131 articles underwent full-text screening, and 19 (14.5%) met the inclusion criteria. The systematic literature search process is shown in Figure 1.

Figure 1. Systematic review process and results from literature searches in EBSCOhost, Scopus, and Web of Science databases.

General EF

In total, 7 of the 19 reviewed studies (37%) assessed EF in general, meaning that the authors of these articles did not explicitly state which subconstruct of EF was targeted using the VR task. Table 1 shows which validation tasks were used in each study to measure EF.

Table 1. The validation tasks, authors, and total number of studies examining general executive functioning.

VRa target construct and validation task | Validation | Authors | Studies examining the construct, n (%)

Executive functioning: general | 7 (37)
D-KEFSb []
TMT-Ac and TMT-Bd
STe
Modified version of the SETf
HTTg
ZMTh
Implicit | Banville et al []i

Implicit | Davison et al []j

Explicit | Miskowiak et al []

Explicit | Pallavicini et al []

Groton Maze Learning Test (Cogstate)
Implicit | Porffy et al []

None specifically reported
N/An | Tan et al []

None specifically reported
N/A | Tsai et al []

aVR: virtual reality.

bD-KEFS: Delis-Kaplan Executive Function System.

cTMT-A: Trail-Making Test version A.

dTMT-B: Trail-Making Test version B.

eST: Stroop test.

fSET: Six Elements Test.

gHTT: Tower of Hanoi test.

hZMT: Zoo Map Test.

iThe VR task was predominantly a sorting task for executive functioning assessment. The comparative assessments that validated this assessment were detailed under “executive function” broadly as the paper did not specify which components of the VR task the comparative tasks aimed to validate.

jThe VR task was reported to assess executive functioning. The comparative assessments that validated this assessment were detailed under “executive function” broadly as the paper did not specify which components of the VR task the comparative tasks aimed to validate.

kOTS: One Touch Stockings of Cambridge.

lCANTAB: Cambridge Neuropsychological Test Automated Battery.

mVFT: verbal fluency test.

nN/A: not applicable.

Banville et al [] immersed participants in a Virtual Multitasking Test (VMT), which was in principle designed to measure prospective memory and executive functions by having participants perform multiple tasks in a virtual apartment. However, this paper reported specifically on the task in which participants had to store groceries as fast as possible while also being attentive to other tasks, such as answering the phone or closing a window. Although the authors hypothesized that VMT scores would be correlated with neuropsychological assessments, such as mental flexibility, planning, and inhibition, it was not explicitly stated which metric of the VMT would be correlated with which neuropsychological assessment. Nonetheless, the authors identified that grocery storing time was correlated with the rule-break score on the Six Elements Test (r19=−0.49; P=.04; P value as reported in the manuscript). Furthermore, the number of errors in storing fruits and vegetables was found to correlate with the perseveration score on the Zoo Map Test (r20=0.53; P=.02; P value as reported in the manuscript) and reading speed during the second condition of the ST (r20=0.44; P=.05; P value as reported in the manuscript).

Davison et al [] immersed participants in a parking simulator and a chemistry laboratory where they had to park a vehicle, sort chairs, or locate items. Before immersion, participants completed the ST and the TMT versions A (TMT-A) and B (TMT-B). The authors identified that the completion time of the second level (Kendall τ=−0.32; P=.01; P value as reported in manuscript) and the number of levels completed in the parking simulator (τ=0.43; P<.01; P value as reported in manuscript) were correlated with participants’ performance on the ST. In addition, the ST was correlated with seating arrangement metrics, such as time to place the first stool (τ=−0.33; P=.01; P value as reported in manuscript) and number of stools placed (τ=0.33; P=.02; P value as reported in manuscript), as well as with time to locate the first item in the chemistry laboratory (τ=−0.37; P=.01; P value as reported in manuscript). Correlations between the TMT-A or TMT-B and, for example, the number of completed parking levels (τ=−0.49; P<.01; P value as reported in the manuscript) or the number of items placed in the seating arrangement task in the chemistry laboratory (τ=−0.35; P=.01; P value as reported in the manuscript) were reported. However, reporting was limited to significant correlations only, and no a priori expectation of how performances on the VR and validation tasks were correlated was indicated in the study.

Miskowiak et al [] assessed executive functions by letting participants complete the TMT-B, One Touch Stockings of Cambridge mean choices to correct, and verbal fluency test versions S and D. The performance on these tests was compared with participants’ performance on a cooking task in VR. The authors hypothesized that the number of cooking tasks that were correctly placed on a to-do list and the latency to solve the task would be VR-equivalent measures of EF. The authors found that VR performance was correlated (r121=0.26; P=.004) with EF, which consisted of a correlation between the average performance on the VR subtasks and the average performance on the validation tasks. The correlations between the individual performances on the VR and validation tasks were not reported in the manuscript.

Pallavicini et al [] had participants play the Audioshield dance game, which the authors hypothesized could be closely related to EF constructs such as inhibition and working memory. However, the authors correlated participants’ performance on the Audioshield game with their performance on the TMT-A and TMT-B, which measure psychomotor speed (TMT-A) and mental flexibility (TMT-B). Nonetheless, the results showed that TMT performance was negatively correlated with Audioshield performance metrics.

Porffy et al [] had participants complete VStore, in which 2 tasks measured EF, namely the “Find” task and the “Coffee” task. Specifically, participants had to find 12 items from a list they had previously memorized. In addition, participants had to order a hot drink from the coffee shop after finding, bagging, and paying for the 12 remembered items they had found in the store. Notably, the authors indicated that the 2 VR tasks also tapped into navigation (ie, the “Find” task) and processing speed (ie, the “Coffee” task). Furthermore, the Groton Maze Learning Test from Cogstate, which the participants completed before the VR task, was used to evaluate general EF. Nonetheless, through their regression analysis, the authors identified that the Groton Maze Learning Test did not predict performance on the “Find” task (B=0.024; SE 0.029; P=.11; P value as reported in the manuscript) or the “Coffee” task (B=−0.003; SE 0.051; P=.96; P value as reported in the manuscript).

Tan et al [] had 100 participants complete 13 tasks in a virtual environment that were designed to measure 6 cognitive domains, such as EF and complex attention. Although differences in performance on VR tasks related to EF between age groups were found, no comparison was made with a traditional neuropsychological assessment of EF or any subconstructs of EF.

Tsai et al [] immersed 2 participant groups in a virtual shopping environment: one group with mild cognitive impairment (MCI) and one control group. The VR tasks assessed participants’ memory, EF, and calculation by having them memorize a shopping list, search for the listed items in the shop, and subsequently pay for them. The authors trained machine learning models on features extracted from the VR tasks to predict whether participants had MCI or were healthy controls, which was achieved with high accuracy. Nonetheless, no neuropsychological assessment of EF was reported as a validation for the VR tasks.

Targeted Constructs

The following subsections elaborate on the EF constructs and subconstructs addressed in the studies under review. A range of correlation coefficients were reported in these papers; however, because of the lack of uniformity in results reporting, these coefficients were omitted from the current synthesis. Typically, the papers reported only significant correlations between metrics without presenting all potential correlations. Furthermore, only 16% (3/19) of the studies specified an α level (ie, .05), with another 16% (3/19) of the studies indicating statistical significance at a P value of ≤.05. A total of 21% (4/19) of the studies did not indicate an α level but mentioned applying corrections for multiple comparisons, yet they did not detail the adjusted α level. In total, 5% (1/19) of the studies adopted Bayesian statistics, using a Bayes factor of >10 for statistical inference. Nonetheless, the reviewed studies did not consistently clarify which VR tasks were validated against which traditional tasks, hindering evaluation of the construct validity of the various EF components. Consequently, drawing consistent conclusions on how EF constructs or subconstructs were evaluated was not feasible without inferring the nature of the tests and assessment paradigms.
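For context, one common correction for multiple comparisons is the Bonferroni procedure (named here only for illustration; the reviewed studies did not state which procedure they applied), which divides the nominal α level by the number of comparisons m:

\alpha_{\text{adjusted}} = \frac{\alpha}{m}, \quad \text{for example, } \frac{.05}{20} = .0025 \text{ for } m = 20

Reporting m and the resulting adjusted threshold would make such corrections transparent.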

Core Executive Functions

Inhibition

Regarding the first of the 3 “core” executive functions, 37% (7/19) of the studies included in our review investigated inhibitory control, interference control, or impulsivity, either singly or in combination. Table 2 details the respective validation tasks and target constructs of each of these studies. For example, Chicchi Giglioli et al [] presented participants with 6 standardized tasks, 3 of which assessed inhibition (), before administering a serious game in which participants were required to perform tasks in outer space. In total, 10 of the 36 possible correlations between measures for the standardized tasks and the serious game tasks were reported as statistically significant and ranged from weak (0.20<r<0.39; relative P values indicated in the manuscript, eg, P<.05) to strong (0.60<r<0.79; relative P values indicated in the manuscript). For example, the latency metric of the dot-probe task (DPT) correlated positively (0.35<r<0.54; relative P values indicated) with the latency metric of the 3 VR tasks aimed at measuring inhibition, whereas no correlations were reported between the correct answer metric of the DPT and the correct answer metric of the 3 VR tasks aimed at measuring inhibition. None of the metrics from the ST correlated with those of the VR task (requiring participants to fight aliens); however, the correct answer and latency metrics of the ST correlated with those of the VR task (requiring participants to repair a valve).

Table 2. The validation tasks, authors, and total number of studies examining each construct.

VRa target construct and validation task | Validation | Authors | Studies, n (%)

Inhibition or inhibitory control | 6 (32)
Implicit | Chicchi Giglioli et al []

Explicit | Chicchi Giglioli et al []

Implicit | Marín-Morales et al []

Implicit | Voinescu et al []f

None specifically reported
N/Ag | Parsons and Carlew []

Implicit | Parsons and Barnett []
Interference control | 3 (16)
Implicit | Marín-Morales et al []h

The CW-ITi from the D-KEFSj
Automated neuropsychological assessment metrics
ST
Implicit | Parsons and Carlew []

Implicit | Parsons and Barnett []
Impulsivity | 1 (5)
None specifically reported
N/A | Chicchi Giglioli et al []

aVR: virtual reality.

bDPT: dot-probe task.

cGNG: Go/No-Go.

dST: Stroop test.

eCPT: continuous performance test.

fSome traditional tasks listed were included for divergent validity and, therefore, have been omitted from this table.

gN/A: not applicable.

hThe VR task involved 42 VR mini-games that assessed various cognitive constructs. A total of 4 mini-games and their target constructs were documented and included in this table; however, the comparative assessments were not provided, and an extensive list of all 42 mini-games was not provided.

iCW-IT: Color-Word Interference Test.

jD-KEFS: Delis-Kaplan Executive Function System.

Similarly, Chicchi Giglioli et al [] immersed participants in a virtual kitchen in which they had to cook different types of food. The activities were grouped into 4 subtasks of incremental difficulty where, in the third level, inhibition was assessed by determining whether the right dressing was added using a Go/No-Go (GNG)–type paradigm. The authors stated that the DPT, GNG, and ST were used as standard tasks to assess inhibition. The unspecified metric of “correct dressing” was shown to correlate moderately (r=0.527; P<.01; relative P value indicated in the manuscript) with the correct answer metric of the ST in one group, whereas in the second group, a moderate negative correlation (r=−0.486; P≤.05; relative P value indicated in the manuscript) was found between the execution time of the Tower of London task and the correct dressing metric. However, no other correlations between the VR task metric and those of the traditional assessments of inhibition were reported.

Marín-Morales et al [] had participants complete neuropsychological assessments, including the GNG task, as well as 42 mini-games in VR. An undisclosed set of variables from the mini-games was used to predict measures from neuropsychological batteries. The mini-game predictor variables were fed into different machine learning algorithms. The authors highlighted that games related to inhibition produced worse results than other games but did not report any specific results for inhibition. The authors did find that mini-game features of planning and attention could predict GNG hit proportions and mean time with 80% and 94% accuracy, respectively.
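As the pipeline used by Marín-Morales et al was not disclosed, the following sketch is only a generic illustration, using synthetic data and scikit-learn, of how VR task features might be used to predict a traditional neuropsychological metric; all variable names and modeling choices are assumptions made for this example.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: rows are participants, columns are VR mini-game features
rng = np.random.default_rng(1)
vr_features = rng.normal(size=(60, 10))    # eg, planning and attention mini-game metrics
gng_hit_proportion = rng.uniform(size=60)  # traditional Go/No-Go outcome to be predicted

# Cross-validated fit quantifies how well VR features predict the traditional metric
model = RandomForestRegressor(n_estimators=200, random_state=0)
r2_scores = cross_val_score(model, vr_features, gng_hit_proportion, cv=5, scoring="r2")
print(round(r2_scores.mean(), 2))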

Parsons and Carlew [] had participants perform the ST in a virtual classroom as well as complete a computerized and paper-and-pencil version of the task. The authors found that participants’ performance was lower for color naming and word reading in the VR paradigm than in the paper-and-pencil version but interference performance was better in the VR paradigm than in the paper-and-pencil version. Similarly, Parsons and Barnett [] had participants perform the ST in a virtual apartment as well as complete a computerized and paper-and-pencil version of the task. Here, the authors found that participants were more accurate in the ST in the paper-and-pencil version than in the VR paradigm.

Voinescu et al [] immersed participants in a virtual aquarium where they had to perform a variety of tasks. For example, participants had to respond when they saw a fish that was different from a clown fish or heard a fish name different from surgeonfish. After the VR aquarium, participants completed a variety of computerized tasks, among them a continuous performance test (CPT), which was hypothesized to measure sustained attention and inhibition. The authors found weak to moderate (0.22<r<0.49; relative P values indicated, eg, P<.05) correlations between CPT measures and VR measures.

Working Memory

Working memory was investigated in 21% (4/19) of the studies [,,,]. Table 3 details the respective validation tasks and target constructs of each of these studies. The working memory component from the study by Marín-Morales et al [] included a mini-game wherein participants had to recall the ingredients of a recipe seen before the mini-game and collect from a range of options only those ingredients found in the recipe. However, no correlations with neuropsychological tasks were presented. Miskowiak et al [] compared their VR paradigm with a traditional task that assessed working memory. In this study, participants were instructed to plan and cook a meal in a virtual kitchen. Performance metrics, such as the number of drawers opened and the latency until the task was completed, were used to assess working memory and were correlated with metrics from traditional tasks such as the Wechsler Adult Intelligence Scale Letter-Number Sequencing. The authors reported a significant positive correlation (r121=0.31; P=.001) between the VR task metrics and the traditional task metrics that evaluated working memory.

Table 3. The validation tasks, authors, and total number of studies targeting working memory.

VRa target construct and validation task | Validation | Authors | Studies, n (%)

Working memory | 4 (21)
WAIS-IVb
The Working Memory Index (Digit Span and Arithmetic)
Implicit | Marín-Morales et al []c

WAIS-IIId LNSe
SWMf CANTABg (error and strategy)
Explicit | Miskowiak et al []

1-back and 2-back test (Cogstate)
Implicit | Porffy et al []

None specifically reported
N/Ah | Robitaille et al []i

aVR: virtual reality.

bWAIS-IV: Wechsler Adult Intelligence Scale–IV.

cThe VR task involved 42 VR mini-games that assessed various cognitive constructs. In total, 4 mini-games and their target constructs were documented and included in this table; however, the comparative assessments were not provided, and an extensive list of all 42 mini-games was not provided.

dWAIS-III: Wechsler Adult Intelligence Scale–III.

eLNS: Letter-Number Sequencing.

fSWM: Spatial Working Memory.

gCANTAB: Cambridge Neuropsychological Test Automated Battery.

hN/A: not applicable.

iRobitaille et al [] used a VR paradigm with avatars to trial a dual-task walking protocol.

Porffy et al [] asked participants to operate a virtual store in which the working memory component was assessed at the “Pay” step, where participants had to select and pay for their items at a self-checkout machine providing the exact amount. The authors specified that the reaction time on the 1-back task and the accuracy of performance on the 2-back task were metrics from traditional tasks used to assess working memory. Using linear regression, the authors found that performance on the 2-back task was negatively associated (B=−0.085; SE 0.042; P=.047) with participants’ performance on the “Pay” step.

Robitaille et al [] assessed working memory during their simultaneous cognitive tasks, in which participants had to both recognize faces in windows that had been previously declared as “hostile” or “nonhostile” and complete a navigation task. However, no correlations between the traditional and VR tasks were reported.

Cognitive Flexibility

One study by Chicchi Giglioli et al [] investigated cognitive flexibility (termed “cognitive shifting” in the paper) through 3 VR tasks. The authors specified that the TMT was used as a traditional task to assess cognitive flexibility as a comparator for the first VR task (CF1, cultivating food) and the Wisconsin Card Sorting Test was used as a traditional task to evaluate cognitive flexibility as a comparator for the other 2 VR tasks (CF2, growing plants, and CF3, fueling a turbine). The total time metric of the first VR task correlated positively with the total time of the TMT-B (r=0.396; P<.01; P value as reported in the manuscript), and multiple metrics of VR tasks 2 and 3 correlated with the performance metrics of the Wisconsin Card Sorting Test.

Higher-Order Executive Functions: Planning

In total, 26% (5/19) of the studies [,,,,] identified planning as a target construct in their VR paradigms. Table 4 details the respective validation tasks and target constructs of each of these studies. The VR environment created by Chicchi Giglioli et al [] used a cooking task with 4 levels of difficulty. In the 3 more difficult levels, planning was required to complete the tasks as 2 burners were used. There was no clearly specified metric for the VR task that was used to evaluate planning, but the authors specified that the Tower of London task was used as a traditional assessment to evaluate planning. A variety of VR task metrics, such as total time to complete a difficulty level, were shown to correlate with various Tower of London task metrics.

Table 4. The validation tasks, authors, and total number of studies targeting planning.

VRa target construct and validation task | Validation | Authors | Studies, n (%)

Planning | 5 (26)
TOL-DXb | Implicit | Chicchi Giglioli et al []

TOLc | Explicit | Chicchi Giglioli et al []

None specifically reported | N/Ad | Davison et al []e

The Key Search task from BADSf [] | Explicit | Kourtesis et al []

None specifically reported | N/A | Kourtesis and MacPherson []

aVR: virtual reality.

bTOL-DX: Tower of London–Drexel test.

cTOL: Tower of London test.

dN/A: not applicable.

eThe VR task was used to assess executive function. The comparative assessments that validated this assessment were detailed under “executive function” broadly as the paper did not specify which components of the VR task the comparative tasks aimed to validate.

fBADS: Behavioral Assessment of the Dysexecutive Syndrome.

In another study, Chicchi Giglioli et al [] used a VR paradigm based on an outer-space environment. The paradigm contained 8 tasks, one of which assessed planning ability (task 7). The authors stated that the Tower of London task was the traditional assessment tool used to evaluate planning and explained that the total score, initial time, and execution time of the VR task were the outcome metrics. Moderate positive correlations were found between the execution time of the VR task and of the Tower of London task (r=0.463; P<.01; P value as reported in the manuscript) and between the initial time of the VR task and the total time of the Tower of London task (r=0.372; P<.05). Furthermore, the VR task correlated with some metrics of other traditional assessments used to assess planning ability, although these were not specified a priori.

The studies by Kourtesis et al [] and Kourtesis and MacPherson [] both used the same VR environment based on a variety of everyday tasks. One task assessing planning ability required participants to draw their route around the city (eg, visiting the bakery, supermarket, and library and returning home) on a 3D board. Kourtesis et al [] explained that the Key Search Test from the Behavioral Assessment of the Dysexecutive Syndrome was used as a traditional measure to assess planning and found a strong positive correlation between the traditional and VR tasks (r=0.80; Bayes factor=4.65 × 10⁸). Furthermore, Kourtesis and MacPherson [] noted in their results that planning explained a substantial 12% (P=.03) of the variance in time-based prospective memory, which was required in 10 of 17 tasks.

Davison et al [] assessed planning ability using a task involving the arrangement of a table and a chair. However, they did not explicitly mention the traditional task that was used to evaluate planning. Various correlations between the performance metrics of the VR task and the traditional task were reported. For example, the performance on the Stroop Color and Word Test was negatively correlated with the time participants took to place a blue chair in the seating arrangement task (Kendall τ=−0.39; P=.01; P value as reported in the manuscript).

Other Domains

Several studies (14/19, 74%) examined domains of functioning that did not align with the EF definition used in this review. Broadly, these domains fell under the categories of memory, attention, processing, task performance, and a variety of other uncategorized subconstructs. As the literature [,,,] does not relate these broad domains to EF, they are not discussed further but are presented in -.

Table 5. The validation tasks, authors, and total number of studies targeting constructs classified as uncategorized.

VRa target construct and validation task | Validation | Authors | Studies, n (%)

Memory | 11 (58)
Memory (general) | 1 (5)

None specifically reported
N/Ab | Tsai et al []

Verbal memory and verbal learning | 2 (11)

RAVLTc subtests: total, immediate recall, delayed recall, and recognition
Explicit | Miskowiak et al []

International Shopping List Test (Cogstate; verbal learning)
Implicit | Porffy et al []

Prospective memory | 4 (21)

None specifically reported
N/A | Banville et al []d

Explicit | Kourtesis et al []f

None specifically reported
N/A | Kourtesis and MacPherson []

Implicit | Parsons and McMahan []

Episodic memory | 3 (16)

Explicit | Kourtesis et al []f

Implicit | Parsons and McMahan []

Immediate recognition | 2 (11)

Explicit | Kourtesis et al []

None specifically reported
N/A | Kourtesis and MacPherson []

Delayed recognition | 2 (11)

Explicit | Kourtesis et al []f

None specifically reported
N/A | Kourtesis and MacPherson []
Attention | 13 (68)
General attention | 4 (21)

Implicit | Chicchi Giglioli et al []

Explicit | Chicchi Giglioli et al []

