Establishing a training plan and estimating inter-rater reliability across the multi-site Texas Childhood Trauma Research Network

The assessment of psychological symptoms by research raters across multi-site research studies presents several methodological and measurement challenges. Ensuring consistency and replicability of measures is paramount, and requires fostering and demonstrating strong inter-rater reliability (IRR; i.e., comparability in scoring/measurement between different raters using the same instruments). Investigators in Texas have undertaken the challenging task of developing a large statewide multi-site pediatric trauma network. Development of the Texas Childhood Trauma Research Network (TX-CTRN) required thoughtful navigation of a number of challenges including training research raters (hereafter “raters”) in the assessment protocol, guarding against rater drift, and estimating IRR, while doing so in a virtual context given challenges posed by the COVID-19 pandemic.
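To make the IRR construct concrete, the minimal sketch below computes Cohen's kappa, a chance-corrected agreement statistic commonly used for categorical ratings, for two hypothetical raters scoring the same dichotomous interview items. The data are invented for illustration and are not drawn from the TX-CTRN protocol.

```python
# Minimal illustration of inter-rater reliability for two raters scoring
# the same ten dichotomous items (1 = symptom present, 0 = absent).
# All data here are hypothetical, not TX-CTRN data.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

# Raw percent agreement can overstate reliability, because two raters
# can agree by chance alone.
percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Cohen's kappa corrects for chance agreement.
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"Percent agreement: {percent_agreement:.2f}")  # 0.90
print(f"Cohen's kappa:     {kappa:.2f}")              # 0.80
```

Here the raters agree on 90% of items, but the kappa of 0.80 is the more defensible reliability estimate because it discounts the agreement expected by chance.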

The TX-CTRN was established in 2020 as a multi-site collaboration to develop a statewide registry of youth, ages 8–20, who had experienced a traumatic event. This registry facilitates the analysis of population health outcomes related to trajectories of mental health following trauma and supports the development of predictive models of short- and long-term risks and resilience. The network uses a “hub-and-spoke” organization. The hub is the anchor site where the research plan is developed and monitored and where training, outreach, and support are provided to the other sites. The spokes are the 12 academic medical center sites across Texas where participants are recruited and data are collected. At all sites, recruitment is conducted in multiple settings, including hospitals, emergency departments, mental health inpatient and outpatient clinics, and primary care clinics.

Once informed consent from parents and/or legal guardians and assent from youth are obtained, baseline data are collected regarding trauma history; symptoms of PTSD, depression, and other psychiatric disorders; suicidal ideation and behavior; associated comorbidities; medical history; treatment history; service utilization; and social determinants of health. Follow-up assessments are then conducted at 1, 6, 12, 18, and 24 months, providing a rich portrayal of the trajectory of mental health outcomes and social supports. Data collection at these time points includes, among other measures, three rater-administered assessments: (1) the Traumatic Events Screening Inventory – Child (TESI-C; Ford et al., 2000), (2) the Clinician-Administered PTSD Scale for DSM-5, Child/Adolescent Version (CAPS-CA-5; Pynoos et al., 2015), and (3) the Major Depressive Episode (MDE) and Posttraumatic Stress Disorder (PTSD) modules of the MINI International Neuropsychiatric Interview for Children and Adolescents, English Version 7.0.2 for DSM-5 (MINI-KID; Sheehan et al., 2010). Although this longitudinal design is excellent for identifying predictors and outcomes associated with trauma, it also carries the challenge of establishing the robust IRR that is critical for ensuring the quality of the data.

There is limited guidance in the literature for establishing a training protocol or for estimating and monitoring IRR that could be applied to the TX-CTRN (Rosen et al., 2008). Over 30 years ago, Castorr et al. (1990) raised concern over how few observational studies reported IRR estimation when describing their methodology and acknowledged the lack of available information to guide researchers in these processes. More than a decade later, Mulsant et al. (2002) likewise bemoaned the lack of IRR reporting, particularly in large multi-site studies: in their review of clinical trials of depressive disorders, only three multi-site studies reported IRR, and the median number of total raters in these studies was five. This underreporting of IRR has persisted in psychological assessment studies in general, and in trauma studies in particular.

Assessment of PTSD symptoms in youth presents additional challenges. The CAPS-CA-5 is a semi-structured interview of both child and parent that provides a measure of the severity of PTSD symptoms in youth. It is widely considered the gold-standard assessment tool and has excellent psychometric properties (Weathers et al., 2018). However, excellent reliability presumes that raters are sufficiently trained to overcome the obstacles that make it difficult to assess PTSD in children. One such difficulty is that both parents and youth tend to under-report physical and sexual abuse, failing to disclose up to 50% of incidents (Grasso et al., 2009; Grant et al., 2020). Additionally, assessing symptom frequency and intensity can be difficult because parent and child reports can be quite discrepant (Scheeringa et al., 2006). This discrepancy can be partly explained by the fact that some symptoms of PTSD, such as an overgeneralized fear response, nightmares, and dissociation, are not readily observable by parents (Cohen and Scheeringa, 2009). Additionally, parental report of child symptoms has been shown to be more strongly associated with the parent's own reaction to the trauma than with the child's (Shemesh et al., 2005). While the child's report is sometimes a more accurate guide to symptom severity, it is limited by the trauma avoidance that is one of the hallmarks of the disorder (Cohen and Scheeringa, 2009). Finally, unlike many trauma studies that target the assessment of trauma symptoms in the aftermath of specific types of traumatic events (e.g., war, gun violence, natural disasters), this study's inclusion criteria encompass a wide range of trauma experiences.

In addition to the complexities associated with the assessment of children, the background and tenure of the raters play a role in IRR. While many studies of psychological assessment use clinician raters (e.g., Kobak et al., 2005), the TX-CTRN predominantly (approximately 80%) used non-clinician (lay) raters. This constraint posed challenges for the reliable assessment of psychological phenomena in interviews using the MINI-KID, TESI-C, and CAPS-CA-5, all scales in which clinical judgement is necessary. Typically, each of the 12 sites has between two and four raters at any one time, with some staff turnover over the course of the study, requiring the expeditious training and certification of newly onboarded raters. Other raters, however, remain in the study for extended periods. For these raters, guarding against rater drift was an important consideration; processes were therefore established to provide iterative training and longitudinal performance monitoring.
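As a purely illustrative sketch of what such longitudinal monitoring might look like in code (the function name, the 0.75 agreement threshold, and the consensus "gold standard" scores below are assumptions, not the TX-CTRN procedure), one can track each rater's chance-corrected agreement with consensus ratings across periodic calibration sessions and flag sessions falling below a pre-specified cutoff for retraining:

```python
# Illustrative rater-drift monitor: flags calibration sessions in which a
# rater's agreement with consensus "gold standard" ratings falls below a
# threshold. All names, data, and the 0.75 cutoff are hypothetical.
from sklearn.metrics import cohen_kappa_score

DRIFT_THRESHOLD = 0.75  # a conventional cutoff for "excellent" agreement

def flag_drift(sessions, threshold=DRIFT_THRESHOLD):
    """sessions: list of (session_label, rater_scores, consensus_scores)."""
    flagged = []
    for label, rater_scores, consensus_scores in sessions:
        kappa = cohen_kappa_score(rater_scores, consensus_scores)
        if kappa < threshold:
            flagged.append((label, round(kappa, 2)))
    return flagged

# Hypothetical calibration sessions for one rater over the study period:
sessions = [
    ("month_01", [1, 0, 1, 1, 0, 1, 0, 0], [1, 0, 1, 1, 0, 1, 0, 0]),
    ("month_06", [1, 0, 1, 0, 0, 1, 0, 0], [1, 0, 1, 1, 0, 1, 0, 0]),
    ("month_12", [1, 1, 0, 0, 1, 1, 0, 0], [1, 0, 1, 1, 0, 1, 0, 0]),
]

print(flag_drift(sessions))  # -> [('month_12', 0.0)]
```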

In summary, ensuring consistency and replicability by fostering strong IRR is essential when conducting research using multiple raters, particularly multi-site research in which groups of raters are geographically dispersed and vary in experience and turnover over time. However, there is currently no gold-standard method for training raters in multi-site psychological research that uses clinical interviews requiring clinical judgement of non-clinician raters. In addition, whereas there are psychometric reviews of the scales used in the study (e.g., Duncan et al., 2018; Ohan et al., 2002; Ribbe, 1996), the extant literature offers no comprehensive guidelines for developing a system for estimating and reporting IRR. The TX-CTRN represents an ambitious undertaking wherein a mixed team of (predominantly) non-clinician research raters was trained to adhere to a rigorous administration standard when conducting a psychological assessment battery across diverse and unique recruitment catchment areas throughout the state. The methods and approaches developed by the TX-CTRN may serve as a model for other multi-site projects. Thus, this paper aims to: (1) describe the TX-CTRN rater training curriculum and the certification process for raters, (2) describe the method for empirically evaluating and estimating IRR, (3) present results of IRR estimation and formal statistical inferences (e.g., confidence intervals) at both the item level and the scale level for the rater-administered scales used in the study, and (4) describe a process for longitudinal monitoring of rater drift.
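As a hedged sketch of the kind of scale-level estimation named in aim (3), the code below computes an ICC(2,1) (two-way random effects, absolute agreement, single rater, following Shrout and Fleiss conventions) with a percentile-bootstrap 95% confidence interval. The ratings matrix, sample sizes, and bootstrap settings are illustrative assumptions, not the network's actual analysis code.

```python
# Illustrative scale-level IRR estimate: ICC(2,1) with a percentile
# bootstrap 95% CI. The ratings matrix is hypothetical, not TX-CTRN data.
import numpy as np

rng = np.random.default_rng(0)

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x: (n_targets, k_raters) matrix of total scale scores.
    """
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between targets
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def bootstrap_ci(x, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI, resampling targets (rows) with replacement."""
    n = x.shape[0]
    stats = [icc_2_1(x[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical total severity scores: 12 interviews scored by 3 raters.
ratings = np.array([
    [34, 36, 33], [12, 10, 14], [45, 47, 44], [22, 25, 21],
    [8,   9,  7], [51, 49, 52], [30, 28, 31], [17, 16, 19],
    [40, 42, 39], [26, 27, 24], [5,   6,  4], [48, 50, 47],
])

est = icc_2_1(ratings)
lo_ci, hi_ci = bootstrap_ci(ratings)
print(f"ICC(2,1) = {est:.2f}, 95% CI [{lo_ci:.2f}, {hi_ci:.2f}]")
```

Item-level agreement for categorical items would instead use a chance-corrected statistic such as Cohen's kappa, as in the earlier sketch; resampling interviews (rather than raters) preserves the rater structure within each bootstrap replicate.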
