The Implementation of Recommender Systems for Mental Health Recovery Narratives: Evaluation of Use and Performance


Introduction

Background

Recommender systems create personalized recommendations within a specific domain, suggesting items that may be of use to a user and helping quickly narrow down a potentially overwhelming number of options []. Recommender systems are used on global platforms such as Netflix—a movie streaming service—which uses other people’s movie ratings to recommend movies, Amazon—an e-commerce company—which uses frequently-bought-together items to recommend purchases, and Pandora—a music streaming service—which uses 450 musical attributes to recommend songs [].

A range of health care applications for recommender systems have been examined, including the use of recommender systems to suggest prompts for counselors in a suicide prevention helpline chat [], tailor care preference assessments in nursing homes [], and identify expert physicians for specific diseases [].

In this paper, we present an evaluation of NarraGive, the first recommender system for providing web-based recommendations from a collection of mental health recovery narratives.

Lived Experience Narratives

Mental health recovery narratives are a subset of lived experience narratives, which are representations of a person’s experiences of physical or mental health and how that person has lived through and responded to those experiences []. The uses of lived experience narratives in health research have been extensively studied but with little focus on which narratives people engage with.

Studies have explored the use of lived experiences to encourage people to seek and sustain treatment, such as using narratives to improve health care participation in patients with breast cancer [], promote smoking cessation in the African American community [], and promote diabetes self-management [] and diabetes medication adherence []. The use of lived experiences in support groups has also been studied, such as sharing stories in diabetes education in minority ethnic groups []. Some studies have provided medical students with narratives to facilitate learning and improve subsequent medical practices, such as using patient stories during practice placements [] and learning about cancer pathology using narratives of patients who have experienced cancer [].

Other studies have explored the use of lived experiences as a therapeutic tool for individuals, such as student nurses creating digital stories to challenge the “reality shock” of beginning clinical practice [], young women telling their stories to reduce stress [], women with eating disorders accessing recovery stories [], service users with psychosis watching lived experience videos [], incarcerated women telling their stories [], patients with dementia using storytelling as a therapeutic tool [], adults with diabetes engaging in lived experience support groups to reduce diabetes-related distress [], painting trees to symbolize periods of one’s life as a starting point for telling a life story to treat depression and anxiety [], and young people watching digital stories to reduce the prevalence of binge drinking [].

Lived experience narratives have the potential to be used for a wide variety of purposes and, as a result—as documented previously—are frequently used in interventions. However, so far, the focus of health lived experience–based interventions has been solely on examining the effects of engaging with these narratives, with less focus on which specific narratives the participants are exposed to (though a few studies have placed emphasis on providing representative narratives [] or particularly engaging and high-quality narratives []). Thus, while there have been studies evaluating the use of recommender systems in health care settings and, separately, evaluating the use of lived experience narratives, there have not been any lived experience narrative recommender systems developed before this study.

The Problem Being Addressed

This is the first evaluation of a lived experience narrative recommender system. The design of such a recommender system has distinct challenges. For example, narratives are sensitive types of data that impose ethical requirements to protect both the narrator and the recipient. Therefore, the use of recommender systems needs to be informed by considerations about the curation and use of narratives [-]. The goal of our evaluation was to develop preliminary evidence to inform the future use and evaluation of recommender systems with lived experience narratives.

The Narrative Experiences Online Intervention

Overview

The Narrative Experiences Online (NEON) study [,] evaluated whether having web-based access to people’s real-life stories of recovery from mental illness can be helpful for people who are experiencing psychosis or other mental health problems. This builds on the evidence base that indicates that receiving recovery narratives can support mental health []. In the NEON intervention, participants interact with a web application through which they can access a web-based collection of mental health recovery narratives (henceforth, narratives)—the NEON Collection.

Narrative Characterization

The development of the NEON Collection, including the narrative inclusion criteria, has been reported elsewhere []. In brief, recorded recovery narratives were obtained, always with consent, from existing collections and individual donations to the study. Only narratives that could be presented on the web in a single electronic file (eg, PDF, JPEG, and WAV) were included. Within these files, narratives were presented in a range of forms, including prose, poetry, audio recordings, video recordings, individual images, and sequential art. Each was presented by a single narrator only—there were no composite narratives. The narratives were deliberately chosen to be diverse []. All narratives in the NEON Collection were characterized using the Inventory of Characteristics of Recovery Stories (INCRESE) [] to capture 77 different features of the narratives related to narrator characteristics, narrative content, and turning points. While we used selected INCRESE characteristics in our recommender system, the greater breadth of characteristics collected will support future secondary analyses. The trials opened with 348 narratives and closed with 659 narratives available.

Narrative Request Routes

There are 6 ways for participants to request narratives through the NEON intervention, each of which is logged internally as 1 of 8 request methods.

Textbox 1 summarizes the external and internal narrative request routes.

The NEON intervention home page has buttons corresponding to 4 of the 6 external narrative request routes: “Match me to a story (recommended),” “Get me a random story,” “Browse stories,” and “My stories.”

The first option uses NarraGive to recommend a single narrative that the participant has not seen before. NarraGive is a hybrid recommender system (meaning that it uses a combination of recommendation strategies []) that uses both content-based filtering (recommending narratives based on their content) and collaborative filtering (recommending narratives based on how other participants have rated them) to recommend narratives to participants.

The second option presents a randomly selected narrative that the participant has not seen before.

The third option allows participants to browse narratives grouped into categories based on the narratives’ INCRESE characteristics (Figures S1 and S2 in )—some categories are based on the value of a single characteristic (eg, the narrator’s gender is “female”), and some are based on the value of multiple characteristics (eg, a positive narrative, defined as having an “upbeat” tone and an “escape” or “enlightenment” genre; Table S1 in ). Not all narratives are accessible through the category option.

The fourth option allows participants to access narratives that they have previously bookmarked or rated highly.

In addition, the internal request routes include whether NarraGive produced the recommendation using content-based filtering or collaborative filtering and whether a narrative selected from the “My stories” page was previously rated highly for hopefulness or manually bookmarked by the participant. One important benefit of having different narrative request routes is to prevent exposure bias, a well-known issue in recommender systems where participants are only presented with a subset of the available items, so they only provide ratings for that subset, with recommender systems unable to distinguish between disliked and unrated items and unknown and unrated items []. For example, the “Get me a random story” button might allow participants to access narratives that they would not otherwise be exposed to but that nonetheless may be beneficial.

Textbox 1. Narrative request mechanisms that participants use to access narratives (external routes) and the corresponding logs made by the intervention (internal routes).

External and internal narrative request routes

Participant clicks on the “Match me to a story (recommended)” button
- Participant accesses a narrative recommended via content-based filtering.
- Participant accesses a narrative recommended via collaborative filtering.

Participant clicks on the “Get me a random story” button
- Participant requests a random narrative.

Participant clicks on the “Browse stories” button and selects a narrative
- Participant makes a category-based request for a narrative.

Participant clicks on the “My stories” button and selects a narrative
- Participant requests a narrative that they have rated as hopeful.
- Participant requests a narrative that they have marked as a favorite.

Participant uses the intervention for the first time and is presented with their first narrative
- Participant accesses their “first” narrative.

Participant clicks on a narrative from a Narrative Experiences Online (NEON) communication
- Participant accesses the suggested narrative in a reminder message aimed at prompting them to use the NEON intervention.

Narrative Feedback

After a participant has accessed a narrative through any request route, they are presented with 5 feedback questions (Table 1), and their responses to these questions are time-stamped and logged. The focus (hope, similarity, learning, and empathy) is based on the NEON Impact Model [] developed through a systematic review [] and qualitative [] and experimental studies []. The measurement approach has been previously validated []. To maximize response rates, the first question is marked as compulsory. The other 4 questions are marked as optional, and the participant can answer either all or none of the optional questions. A single rating comprises either a response to the compulsory question alone or a set of 5 response values (for the 1 compulsory and 4 optional questions). Ratings with the optional questions answered are also referred to as optional ratings. Table 1 shows the questions, answer options, and numerical ranges (not visible to participants) and whether each question is mandatory.

If a narrative is rerated, this overrides the previous rating (but the time-stamped logs of previous ratings are not deleted).

One benefit of recommender systems requiring a rating for each narrative is that this helps minimize selection bias, which occurs when participants are allowed to choose whether to rate the items, leading to ratings that are typically biased toward higher or more homogeneous ratings [,]. Selection bias is a well-known problem in recommender systems relying on explicit data.

Table 1. Questions, answer options, numerical ranges, and mandatory nature of narrative response data.

Question | Answer options | Range | Mandatory
How hopeful did the story leave you feeling? (hopefulness) | “Less hopeful than before,” “no change,” “a bit more hopeful,” and “much more hopeful” | −1 to 2 | Yes
How similar was the storyteller to you? (similarity to the narrator) | “Not at all,” “a bit,” “quite a lot,” and “very much” | 0 to 3 | No
How similar was the storyteller’s life to your life? (similarity to the narrative) | “Not at all,” “a bit,” “quite a lot,” and “very much” | 0 to 3 | No
How much did you learn from the story? (learning) | “Not at all,” “a bit,” “quite a lot,” and “very much” | 0 to 3 | No
How emotionally connected did you feel with the story? (empathy) | “Not at all,” “a bit,” “quite a lot,” and “very much” | 0 to 3 | No

The NarraGive Recommender System

NarraGive is a hybrid recommender system. It uses 1 content-based and 2 collaborative filtering algorithms, allowing the performance of the 3 algorithms and of the 2 distinct approaches to be compared to inform this new field of lived experience narrative recommendation. NarraGive was assembled using the Simple Python Recommendation System Engine library (SurPRISE; version 1.1.1; Nicolas Hug) for Python (version 3.6 and above), integrating implementations of filtering algorithms provided in this library []. NarraGive does not recommend previously requested narratives, types of narratives that a user has previously blocked, or individual narratives that a user has blocked.

The content-based filtering algorithm is based on the SurPRISE implementation of the k-nearest neighbor (kNN) algorithm. Although kNN is traditionally used as a collaborative filtering algorithm, NarraGive uses an adapted version to measure the similarity between narratives: their INCRESE characteristics are used to cluster narratives into “neighborhoods,” and participants are recommended unseen narratives that are similar to their other highly rated narratives. Narrative similarity is assessed using selected INCRESE characteristics, consisting of the INCRESE sections on narrator characteristics, narrative characteristics, narrative content, and turning points.

The selected collaborative filtering algorithms are the SurPRISE implementations of the singular value decomposition (SVD) and, to support comparison, SVD++. A broad introduction to these 2 algorithms is provided in the work by Hug []. These aim to capture the latent factors that determine how much a participant likes a narrative. NarraGive ran these 2 algorithms and selected the narrative with the highest predicted rating. Thus, the 2 algorithms served as distinct subsystems, so this evaluation will analyze the 2 subsystems separately to compare them. For the purposes of collaborative filtering, similarity between users is assessed using the demographic items collected in a “personal profile” created at first use and containing items describing participant demographics and format preferences. provides details on all items in the profile.
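The following is a minimal sketch, not the NEON implementation, of how the 3 algorithms can be assembled with the SurPRISE library. The ratings DataFrame, its column names, and the 0 to 3 rating scale are illustrative assumptions based on the feedback structure described earlier; the content-based adaptation of kNN to INCRESE characteristics is specific to NarraGive and is only approximated here by a stock item-based kNN.

```python
# Illustrative sketch only: assembling the 3 SurPRISE algorithms on toy ratings data.
import pandas as pd
from surprise import SVD, SVDpp, KNNBasic, Dataset, Reader

# Hypothetical participant-narrative ratings (IDs and values are made up).
ratings = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3"],
    "narrative":   ["n1", "n2", "n1", "n3", "n2"],
    "rating":      [3, 1, 2, 0, 3],
})

reader = Reader(rating_scale=(0, 3))  # optional feedback questions use a 0 to 3 scale
data = Dataset.load_from_df(ratings[["participant", "narrative", "rating"]], reader)
trainset = data.build_full_trainset()

algorithms = {
    # Stand-in for NarraGive's adapted content-based kNN (which uses INCRESE characteristics).
    "kNN": KNNBasic(sim_options={"name": "cosine", "user_based": False}),
    "SVD": SVD(),
    "SVD++": SVDpp(),
}
for name, algo in algorithms.items():
    algo.fit(trainset)
    # Predicted rating for a participant-narrative pair that has not been rated.
    print(name, round(algo.predict("p3", "n1").est, 2))
```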

When making a narrative recommendation, narrative feedback ratings are weighted (with the hopefulness rating twice as influential as each of the individual optional ratings) and combined. This weighting reflects the theory we developed about how narratives make an impact on recipients, which identifies hope creation as the most critical mechanism. When a participant requests a narrative from NarraGive, it internally generates 1 list per algorithm of the 10 narratives with the highest predicted rating. It then presents the highest-scoring narrative of these 30 to the participant. The participant is not shown the predicted rating, the other internally generated narratives, or which of the 2 filtering mechanisms was used to generate the recommendation.
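The combination formula is described only at this level of detail, so the sketch below is a hedged illustration of the 2 steps just described: collapsing a multi-question rating into 1 value with hopefulness weighted twice as heavily as each optional rating (modeled here as a weighted mean), and selecting the single highest-scoring narrative across the three 10-item lists. All names and values are hypothetical.

```python
# Illustrative sketch of the weighting and selection logic described above (assumed, not the NEON code).

def combined_score(hopefulness: float, optional: list) -> float:
    """Weighted mean in which hopefulness counts twice as much as each optional rating."""
    weights = [2.0] + [1.0] * len(optional)
    values = [hopefulness] + list(optional)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def pick_recommendation(per_algorithm_top10: dict) -> str:
    """Return the narrative ID with the highest predicted rating across all internal lists."""
    candidates = [pair for top10 in per_algorithm_top10.values() for pair in top10]
    return max(candidates, key=lambda pair: pair[1])[0]

# Hypothetical (truncated) 10-item lists of (narrative ID, predicted rating) per algorithm.
internal_lists = {
    "kNN":   [("n12", 2.4), ("n7", 2.1)],
    "SVD":   [("n3", 2.6), ("n12", 2.2)],
    "SVD++": [("n9", 2.5), ("n3", 2.3)],
}
print(combined_score(hopefulness=2, optional=[1, 3, 0]))  # -> 1.6
print(pick_recommendation(internal_lists))                # -> "n3"
```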

The NEON Trials

The NEON intervention has been evaluated in 3 pragmatic randomized controlled trials with different populations. The NEON trial (ISRCTN11152837; N=739) is a definitive trial for people with experience of psychosis. The NEON for other (eg, nonpsychosis) mental health problems (NEON-O) trial (ISRCTN63197153; N=1023) is a definitive trial for people experiencing any other type of mental health problem. The NEON-C trial (ISRCTN76355273; N=54) is a feasibility trial with people who informally care for people experiencing mental health problems, which is not within the scope of this study. The NEON intervention was identical in all 3 trials. A separate instance of NarraGive was used for each trial, and there was no pooling of narrative feedback or recommendations among the 3 trials.

Aims and Objectives

The aim of this study was to analyze the 3 recommender system algorithms used in NarraGive to inform future interventions using recommender systems in this new field of lived experience narrative recommendations. An evaluation of the impact of the NEON intervention using NarraGive has been reported elsewhere []. This study did not aim to provide an indication of NarraGive’s viability but rather to inform the development of future lived experience narrative recommender systems and guide design choices on collaborative versus content-based filtering algorithms.

The objectives were as follows:

1. To describe participant characteristics and patterns of narrative requests and feedback.
2. To evaluate the algorithms used in NarraGive by comparing collaborative-based and content-based narrative recommendations to inform future implementation approaches.

Objective 1 was addressed using data from the intervention version of NarraGive, and objective 2 was addressed using data from the final evaluated version.


Methods

Overview

An evaluation of NarraGive was conducted using data from the NEON and NEON-O trials, structured using the framework for evaluating recommender systems (FEVR), which was developed through a review of recommender system evaluation work []. The FEVR defines a set of components intended to guide the design of a recommender system evaluation.

After the NEON trials closed, logging files describing interactions with trial procedures and the NEON intervention were downloaded for analysis. These files included trial allocation, baseline demographic characteristics, personal profiles, narrative characteristics, narratives that the participants requested and the corresponding internal narrative request route, and participants’ ratings. All log entries were time-stamped.

Ethical Considerations

The NEON study trial protocol and an update have been published elsewhere [,]. Ethics approval was obtained in advance of trial start from a UK National Health Service Research Ethics Committee (Leicester Central Research Ethics Committee; 19/EM/0326). All participants provided web-based informed consent for the use of their data for research purposes, and all study data were pseudonymous, with each participant’s data linked by a unique ID. Some participants were compensated (£20 [US $25.59] vouchers) for some data collection rounds, as described in our trial protocol.

Participants

The NEON trial included participants who (1) had experience of psychosis in the previous 5 years, (2) had experience of mental health–related distress in the previous 6 months, (3) resided in England, (4) were aged ≥18 years, (5) were capable of accessing or being supported to access the internet on a PC or mobile device or at a community venue, (6) were able to understand written and spoken English, and (7) were capable of providing web-based informed consent.

The NEON-O trial included participants who (1) had experience of mental health problems other than psychosis in the previous 5 years, (2) had experience of mental health–related distress in the previous 6 months, (3) resided in England, (4) were aged ≥18 years, (5) were capable of accessing or being supported to access the internet on a PC or mobile device or at a community venue, (6) were able to understand written and spoken English, and (7) were capable of providing web-based informed consent. It excluded participants eligible for the NEON trial.

Our study included participants from the NEON trials’ intention-to-treat samples [].

Sample Size

Both trials were powered on the mean item score for the 12 subjective items in the Manchester Short Assessment of Quality of Life (MANSA) as collected at baseline and the 52-week follow-up [], and hence, the sample size was chosen on this basis.

For the NEON trial, a total sample size of 684 was chosen to provide 90% power to detect a minimal clinically important effect size (Cohen d) of 0.27 (SD 0.9 []; power=90%; P=.05), allowing for 20% attrition. The planned analyzable sample size was 546.

For the NEON-O trial, the SD of the MANSA scores for the study population was estimated from baseline data provided by the first 350 enrolled participants (see the study by Rennick-Egglestone et al [] for the rationale). A total sample size of 994 was selected to provide 90% power to detect a minimal clinically important effect size (Cohen d) of 0.27 (SD 0.94; power=90%; P=.05), allowing for 40% attrition, which was estimated from the completion rates for interim data. The planned analyzable sample size was 596.

Both trials recruited their planned samples and were allowed to overrecruit (N=739 for the NEON trial and N=1023 for the NEON-O trial). The final attrition rates were 23.5% (NEON trial) and 44.8% (NEON-O trial).

Evaluation Framework

Table 2 describes the FEVR components that were selected to define the evaluation.

Table 2. Framework for evaluating recommender systems (FEVR) components defining the NarraGive evaluation.

FEVR component | Brief description

Evaluation objectives
- Overall goal | To evaluate whether the recommender system NarraGive supported participants in finding helpful narratives
- Stakeholders | Participants in the NEONa and NEON-Ob trials’ ITTc samples
- Properties | Prediction accuracy, usage prediction, diversity, coverage, and unfairness across participants

Evaluation principles
- Hypothesis or research question | Objective 1: To describe participant characteristics and patterns of narrative requests and feedback. Objective 2: To evaluate the NarraGive recommender system by comparing collaborative-based and content-based narrative recommendations
- Control variables | Randomized data set that is split 75:25 between the training set (to train the algorithms) and the testing set (to evaluate the metrics)
- Generalization power | Use of real-world data from participants with mental health problems; limited due to variation in system use
- Reliability | Cross-validation with repeated initialization of collaborative filtering algorithms

Experiment type

Evaluation aspects
- Types of data
  - Data collection | Participant ratings (prompted after every narrative access)
  - Data quality and biases | Platform bias from suggested narratives
- Evaluation metrics | Normalized mean absolute error (for prediction accuracy), mean average precision per participant (for usage prediction), intralist diversity (for diversity), item space coverage (for coverage), and overestimation of unfairness (for unfairness across participants)
- Evaluation system | NEON intervention web application

aNEON: Narrative Experiences Online.

bNEON-O: NEON for other (eg, nonpsychosis) mental health problems.

cITT: intention to treat.

Recruitment

Participants were recruited across England from March 9, 2020 (both trials), to March 1, 2021 (NEON trial), or March 26, 2021 (NEON-O trial). The trials used a mixed web-based and offline approach to recruit participants. Recruitment was through paid web-based advertising on mental health websites; promotional messaging distributed by a range of community groups and health care practices; promotional messaging distributed on Facebook, Twitter (subsequently rebranded as X), and Google (with the reach of messages enhanced through payments); media appearances by the central study team; and the work of clinical research officers in 11 secondary care mental health trusts.

Clinical research officers approached participants in person and distributed promotional messaging through local authorized channels such as mailing lists of service users who had consented to be contacted about research studies. All promotional advertising and messaging conformed to principles approved in advance by the supervising research ethics committee [].

Registration

All recruitment approaches directed potential participants to a web-based eligibility checking interface that requested responses to a series of questions specified in the trial protocol. All responses were self-rated. No formal diagnosis of a mental health condition was required for participation. Trial allocation was determined through responses, and eligible potential participants were provided with access to a tailored web-based participant information sheet. Participants subsequently completed a web-based informed consent form by providing an email address and optional telephone number.

The consent process was concluded by clicking on a link in an auto-generated email to validate the email address. After confirming consent, participants completed web-based forms to collect baseline demographic and clinical data and were then randomized using a web-based system validated by a clinical trial unit to the intervention or control arm. Demographic items were age (in years), gender (female, male, or other), ethnicity, region of residence, highest educational qualification, lifetime use of primary care mental health services, lifetime use of specialist mental health services, current use of mental health services in relation to psychosis (NEON trial only), main mental health problem in the last month, best description of recovery status, residential status, and employment status.

Intervention arm users gained immediate access to the NEON intervention until trial end (September 22, 2022), whereas control arm users gained access after completing the 52-week follow-up questionnaires and until trial end. Data on NEON intervention use by both intervention and control group users are within the scope of this study.

Analysis

Objective 1: Describe Participant Characteristics and Patterns of Narrative Requests and Feedback

Participant Characteristics

The demographic and clinical characteristics of participants randomized to each trial were described using means and SDs for normally distributed data and counts with percentages for categorical data. Descriptive statistics were calculated for all baseline demographic data items.

Following UK Data Service guidance on statistical disclosure [], ethnicity responses were grouped into 2 categories (White British and other ethnicity) due to the small number of participants in most ethnicity categories, although recognizing that this could be perceived as a reductive approach to ethnicity data. “Current mental health problem” also comprised categories with low numbers of participants, so relevant rows were shown as “<5” with no percentage, and other rows were shown as “<10” with no percentage to avoid being able to infer other values.

Patterns of Narrative Requests and Feedback

Data on participant narrative requests and narrative feedback were taken from log files and used to calculate per-trial summary statistics for the number of participants, number of participants who requested at least one narrative, number of narratives at the start and end of the trial, number of narratives given at least one rating, number of narrative requests, number of narrative ratings, number of optional ratings, number of ratings per narrative, number of ratings per rated narrative, length of intervention use by participants, and narrative access routes.

While providing feedback on narratives was encouraged, it was possible for the participant to navigate away from the page and not submit any feedback; therefore, the number of narrative ratings may be smaller than the number of narrative requests, so these figures were reported separately.

Statistics for the number of ratings per narrative present 2 sets of figures with different selection criteria: those including only data for narratives that received at least one rating and those including data for all narratives. This breakdown shows how many ratings NarraGive had access to as it could only access rated narratives.

Nonparametric data were presented as medians and IQRs. Categorical data were presented as counts with percentages.

Objective 2: Evaluate the NarraGive Recommender System by Comparing Collaborative-Based and Content-Based Narrative Recommendations

Overview

The 3 algorithms (kNN, SVD, and SVD++) were trained and tested using all the available data, representing the point in time at which the trials closed. Training an algorithm involves providing it with a set of data that it can use to create predictions for missing data points. Testing an algorithm involves obtaining these predictions and measuring a feature of those predictions.

The results for objective 2 were obtained using the SurPRISE library (version 1.1.3) for Python (version 3.10.7). Only participants who provided at least one rating and narratives that were given at least one rating were included (as SurPRISE uses participant-item rating pairs as the basis for its predictions), which mirrors the information that NarraGive had access to during the intervention.

This study evaluated NarraGive using the metrics outlined in Textbox 2, applied separately to the content-based algorithm (kNN) and the collaborative filtering–based algorithms (SVD and SVD++).

There are 2 types of metrics: metrics that compare predicted ratings with actual ratings (prediction-based metrics) and metrics that measure a feature of the top-n predicted items (feature-based metrics). Prediction-based metrics include prediction accuracy, usage prediction, and unfairness across participants. Feature-based metrics include diversity and coverage. For prediction-based metrics, there is no standard data-splitting strategy [], so the data set is split into a training set (75%) and a testing set (25%). For feature-based metrics, the entire data set is used as the training set.
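As an illustration of this splitting strategy, the sketch below uses SurPRISE’s train_test_split to create a random 75:25 split and score 1 algorithm with the library’s built-in MAE; the toy ratings and column names are hypothetical.

```python
# Illustrative 75:25 random split for the prediction-based metrics (toy data, not trial data).
import pandas as pd
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

ratings = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
    "narrative":   ["n1", "n2", "n1", "n3", "n2", "n3", "n1", "n4"],
    "rating":      [3, 1, 2, 0, 3, 2, 1, 3],
})
data = Dataset.load_from_df(ratings[["participant", "narrative", "rating"]],
                            Reader(rating_scale=(0, 3)))

trainset, testset = train_test_split(data, test_size=0.25)  # 75% training, 25% testing
algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.mae(predictions)  # SurPRISE's built-in MAE (no square root)
```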

NarraGive only used the first 3 sets of ratings (hopefulness, similarity to the narrator, and similarity to the narrative) to inform its recommendations as these 3 questions had been validated in a feasibility study [] and the remaining 2 questions were added after the feasibility study. Therefore, only the first 3 sets of ratings were used in the evaluation.

The hopefulness ratings were normalized, which in this case involved shifting the ratings to use the same rating scale as that of the 4 optional questions (ie, adding 1 to each hopefulness rating so that the −1 to 2 scale becomes 0 to 3).

The evaluated version of NarraGive presented in this paper used the same training data as the intervention version of NarraGive with 3 minor modifications. First, where the narratives’ INCRESE characteristics were updated during the trials (eg, to correct human error in inputting characteristics), this evaluation only used the final set of uploaded characteristics. Second, during the intervention, NarraGive filtered out previously requested and blocked narratives. This evaluation included these narratives as the predictions themselves were not influenced by whether a narrative was blocked or previously requested (ie, blocked and previously requested narratives were filtered out after the prediction process in the trial implementation), which could affect, for example, coverage metrics. Third, during the NEON trials, some accounts were removed due to suspected repeat registrations []; this evaluation removed all ratings from those participants even though NarraGive may have initially used those ratings.

The logs that were recorded during the intervention did not include NarraGive’s internal recommendation lists and instead only recorded the single narrative that was selected to show to the participants. Therefore, using the intervention version of NarraGive would have prevented any comparison of its subsystems and would have allowed for only a limited analysis of its performance as a whole.

The results from objective 1 (about participants and their use of the system) used the data collected from the live intervention, whereas the results from objective 2 (about NarraGive and its subsystems) used the evaluation version of NarraGive.

During a previous feasibility study of NEON (N=25 mental health service users), 465 ratings were collected for the initial set of narratives in the NEON Collection []. NarraGive had access to these ratings in the NEON and NEON-O trials to reduce the “cold start” problem, where recommender systems perform poorly for new items and participants []. The evaluation excluded these ratings to ensure that NarraGive was only evaluated on data collected live during the NEON intervention.

The SVD and SVD++ algorithms were both randomly initialized according to a normal distribution [], and the 75:25 split between training and testing sets was also random and calculated using NumPy (a package for scientific computing with Python) [], where “fresh, unpredictable entropy will be pulled from the OS” []. To account for the randomness, cross-validation was performed. The data set was split into 4 folds, with a different fold used as the testing set each time, and the SVD and SVD++ algorithms were reinitialized each time. Medians and IQRs were reported.
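A minimal sketch of this procedure is shown below: the data are split into 4 folds, each algorithm is freshly reinitialized for every fold, and the median and IQR of the per-fold MAE values are reported. The toy data set and variable names are assumptions.

```python
# Illustrative 4-fold cross-validation with reinitialized algorithms (toy data, not trial data).
import numpy as np
import pandas as pd
from surprise import SVD, SVDpp, KNNBasic, Dataset, Reader, accuracy
from surprise.model_selection import KFold

ratings = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
    "narrative":   ["n1", "n2", "n1", "n3", "n2", "n3", "n1", "n4"],
    "rating":      [3, 1, 2, 0, 3, 2, 1, 3],
})
data = Dataset.load_from_df(ratings[["participant", "narrative", "rating"]],
                            Reader(rating_scale=(0, 3)))

fold_mae = {"kNN": [], "SVD": [], "SVD++": []}
for trainset, testset in KFold(n_splits=4).split(data):
    # Construct fresh (randomly reinitialized) algorithm instances for every fold.
    for name, algo in [("kNN", KNNBasic()), ("SVD", SVD()), ("SVD++", SVDpp())]:
        algo.fit(trainset)
        fold_mae[name].append(accuracy.mae(algo.test(testset), verbose=False))

for name, maes in fold_mae.items():
    q1, median, q3 = np.percentile(maes, [25, 50, 75])
    print(f"{name}: median MAE {median:.2f} (IQR {q1:.2f}-{q3:.2f})")
```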

An additional exploratory analysis was conducted to determine how the accuracy changed over time. For each month between June 2020 and July 2022 inclusive, data up to but not including the first day of each month were used for training and testing, and the accuracy was measured (using the same accuracy metric as for the main NarraGive evaluation).
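A sketch of this monthly analysis, under the assumption that each rating carries a timestamp in the logs, could look as follows; the DataFrame and column names are illustrative rather than the trial logging format.

```python
# Illustrative month-by-month accuracy analysis (assumed structure, not the trial logging format).
import pandas as pd
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

log = pd.DataFrame({  # hypothetical time-stamped ratings
    "participant": ["p1", "p2", "p1", "p3", "p2", "p3"],
    "narrative":   ["n1", "n1", "n2", "n3", "n2", "n1"],
    "rating":      [3, 2, 1, 3, 0, 2],
    "timestamp":   pd.to_datetime(["2020-06-15", "2020-09-02", "2021-01-20",
                                   "2021-08-05", "2022-03-11", "2022-06-30"]),
})

for month_start in pd.date_range("2020-06-01", "2022-07-01", freq="MS"):
    # Use only data logged before the first day of the month, as described above.
    subset = log[log["timestamp"] < month_start][["participant", "narrative", "rating"]]
    if len(subset) < 4:  # not enough data yet to form a 75:25 split
        continue
    data = Dataset.load_from_df(subset, Reader(rating_scale=(0, 3)))
    trainset, testset = train_test_split(data, test_size=0.25)
    algo = SVD()
    algo.fit(trainset)
    print(month_start.date(), round(accuracy.mae(algo.test(testset), verbose=False), 2))
```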

Textbox 2. The 5 metrics for evaluating NarraGive.

Metric and metric category

- Prediction accuracy: normalized mean absolute error
- Usage prediction: mean average precision per participant
- Diversity: intralist diversity
- Coverage: item space coverage
- Unfairness across participants: overestimation of unfairness

Prediction Accuracy

Prediction accuracy is the extent to which a recommender system can predict participant ratings []. The root-mean-square error (RMSE) and mean absolute error (MAE) [] are 2 of the most commonly used metrics for evaluating rating prediction accuracy. The MAE uses the absolute difference between the predicted and true ratings, whereas the RMSE squares this difference, which results in the RMSE penalizing inaccurate predictions more [].

The intervention was designed to be used over time rather than as a one-off, so the accuracy metric should primarily capture the overall accuracy rather than emphasizing occasional large inaccuracies (ie, an inaccurate prediction off by 2 points followed by a completely accurate prediction should be treated as no worse than 2 inaccurate predictions off by 1 point each), and this is better achieved using the MAE. Because the hopefulness ratings were normalized, the prediction accuracy metric was the normalized MAE (NMAE).
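A worked example of this contrast, using the standard formulas for the 2 metrics, makes the point concrete. For errors of 2 and 0 versus errors of 1 and 1:

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{r}_i-r_i\right|, \qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{r}_i-r_i\right)^2}$$

$$\mathrm{MAE}=\frac{2+0}{2}=\frac{1+1}{2}=1, \qquad \mathrm{RMSE}=\sqrt{\frac{2^2+0^2}{2}}\approx 1.41 > \sqrt{\frac{1^2+1^2}{2}}=1$$

The MAE treats the 2 cases as equally accurate, whereas the RMSE penalizes the single 2-point error more heavily, which is why the MAE better matches the intended use of the intervention over time.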

Different variations in the MAE have been reported in the literature. In particular, some versions square root the averaged summation [], whereas others do not [,]. This evaluation uses SurPRISE’s in-built MAE calculation, which does not use a square root.

A lower NMAE indicates greater prediction accuracy. For NarraGive, the scale ranges from 0 (greatest prediction accuracy) to 4 (equation 1 in ).

Usage Prediction

Usage prediction is the rate of correct recommendations in a setting where recommendations are classified as 1 of 2 options: relevant or nonrelevant []. An item is relevant to a participant when the participant’s rating for it meets a predefined numerical threshold (where the threshold is participant independent and defined per question).

There are 2 common metrics for measuring usage prediction: precision and recall. Precision measures how likely it is that a recommended item is relevant and is defined as the ratio of relevant selected items to the total number of selected items []. Recall, conversely, measures how likely it is that a relevant item is selected and is defined as the ratio of relevant selected items to the total number of relevant items [].
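In set notation, these 2 definitions can be written as follows, where relevance is determined by the per-question thresholds described later in this section:

$$\mathrm{precision}=\frac{|\,\text{relevant items}\cap\text{selected items}\,|}{|\,\text{selected items}\,|}, \qquad \mathrm{recall}=\frac{|\,\text{relevant items}\cap\text{selected items}\,|}{|\,\text{relevant items}\,|}$$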

As the length of the recommendation list increases, recall improves, whereas precision worsens [,]. The length of NarraGive’s internal recommendation list is 10, which is relatively short (compared to, for example, a search engine that recommends tens or hundreds of web pages), meaning that it is impossible to achieve a meaningfully high recall score, so the metric for usage prediction was precision.

As usage prediction is usually used for measuring how relevant a list of recommendations is, this evaluation used NarraGive’s internal recommendation list (consisting of a 10-narrative list produced using content-based filtering and two 10-narrative lists produced using collaborative filtering). As the participants do not see this list, only metrics that focus on the characteristics of the list as a whole—rather than focusing on the order within the list—were used (ie, where the list is treated more like a mathematical set than an ordered list as the ordering beyond the first item does not affect participants), and metrics that exclusively evaluate ranking order were not used.

The analysis of recommender system evaluations by Herlocker et al [] showed that accuracy metrics can be divided into equivalence classes. One of these classes comprises metrics that are averaged overall, and another comprises per-user correlation metrics and the mean average precision per-user metric. To ensure that this analysis of NarraGive captured its performance as widely as possible, a variation of precision that falls into a different equivalence class from that of the NMAE was used, namely, the mean average precision per participant (hereafter, precision).

As the ratings are on a 4-point scale, they need to be converted to a binary scale that classifies recommendations as either relevant or nonrelevant. For optional questions, relevance was defined as “a bit,” “quite a lot,” or “very much.” For hopefulness, relevance was defined as “no change,” “a bit more hopeful,” or “much more hopeful.”
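A minimal sketch of this binarization, using the numerical encodings from Table 1, is shown below; the function name is illustrative.

```python
# Illustrative mapping of 4-point ratings to the binary relevant/nonrelevant labels used for precision.

def is_relevant(question: str, rating: int) -> bool:
    """Apply the per-question relevance thresholds described in the text."""
    if question == "hopefulness":  # raw scale -1 to 2 (Table 1)
        # "no change", "a bit more hopeful", or "much more hopeful"
        return rating >= 0
    # optional questions, scale 0 to 3: "a bit", "quite a lot", or "very much"
    return rating >= 1

print(is_relevant("hopefulness", -1))  # False ("less hopeful than before")
print(is_relevant("learning", 0))      # False ("not at all")
print(is_relevant("learning", 1))      # True ("a bit")
```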

Higher precision indicates a greater proportion of relevant narratives. The scale ranges from 0 (least precision) to 1 (equation 2 in ).

Diversity

Diversity measures how varied the recommended items are []. The current metrics for diversity [,] are intralist diversity (ILD) and variations thereof. ILD was developed by Ziegler et al [], and variations include the rank-sensitive ILD metric by Vargas and Castells []. Similar to usage prediction, because the lists used to calculate diversity came from NarraGive’s internal recommendation list and the ILD by Ziegler et al [] is permutation insensitive (ie, the position of recommendations on the list does not affect the diversity score), this metric was used, with cosine similarity as the distance metric calculated using the narratives’ INCRESE characteristics.

The original study defined ILD on a per-list basis (ie, for the recommendation list of one participant). This metric has been expanded in this study to be averaged over all participants’ lists to produce an overall ILD value.

The lower the ILD value, the greater the diversity among the recommended items. The scale ranges from −1 (most diverse) to 1 (equation 3 in ).
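A sketch of this calculation is given below: the ILD of 1 list is the mean pairwise cosine similarity of the narratives’ INCRESE characteristic vectors, and the overall value averages this over all participants’ lists. The vector encoding of INCRESE characteristics and all values are illustrative assumptions.

```python
# Illustrative intralist diversity (ILD) calculation with cosine similarity (assumed encoding).
from itertools import combinations
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def intralist_diversity(item_vectors: list) -> float:
    """Mean pairwise cosine similarity within one recommendation list (lower = more diverse)."""
    pairs = list(combinations(item_vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

def overall_ild(per_participant_lists: list) -> float:
    """Average the per-list ILD over all participants' lists."""
    return sum(intralist_diversity(lst) for lst in per_participant_lists) / len(per_participant_lists)

# Hypothetical INCRESE characteristic vectors for 2 participants' (truncated) recommendation lists.
lists = [
    [np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])],
    [np.array([1.0, 1.0, 1.0]), np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])],
]
print(round(overall_ild(lists), 2))
```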

Coverage

Coverage can be split into participant space coverage and item space coverage []. Participant space coverage is the proportion of participants who can be provided with recommendations by the recommender system []. The threshold for being provided recommendations is low—a participant needs to have rated at least one narrative (which is achieved when they first access the intervention as it is compulsory to provide a response for the first narrative); thus, participant space coverage was not used. A variation of participant space coverage assesses the proportion of participants that can be recommended high-quality items (ie, items with a predicted rating above a predefined threshold). This notion of variable quality among participants is addressed more thoroughly using an unfairness across participants metric instead.

Item space coverage is the proportion of items that the recommender system can recommend []. Ge et al [] further split item space coverage into prediction coverage and catalog coverage. They defined prediction coverage as the proportion of items for which the recommender system can produce a predicted rating and catalog coverage as the proportion of items that are recommended in a series of recommendation lists. Because there is no predefined limit to when NarraGive can produce a predicted rating for a narrative, prediction coverage was used.

The definition of catalog coverage by Ge et al [] captures the set of recommended items produced over time for a single participant (ie, the items that would have been recommended to the participant if they had asked for recommendations at that time; this is different from the set of recommended items that the participant requested and was actually presented with over time).

To capture the overall coverage, the proportion of narratives that are recommendable is measured, where a narrative is recommendable if, for at least one participant, the narrative appears in NarraGive’s internal recommendation list.

Other versions of coverage use only the top recommendation, but as there are more narratives than there are participants, this would upper bound the item space coverage at approximately three-quarters for the NEON trial—total number of recommendations (which is equal to the number of participants who rated at least one narrative as there is 1 recommendation per participant) divided by the number of narratives that were rated at least once. For longer recommendation lists (such as 10), because recommender system algorithms cannot always produce a predicted rating for each item, a participant’s list may be less than the desired length. For this evaluation, a length of 10 was sufficient to ensure that the total number of recommendations being considered across all participants was greater than the number of narratives.

A higher item space coverage value indicates greater item coverage. The scale ranges from 0 (lowest item coverage) to 1 (equation 4 in ).
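The measure just described can be sketched as follows; the variable names and toy values are illustrative.

```python
# Illustrative item space coverage: proportion of rated narratives that appear in at least
# one participant's internal top-10 recommendation list.

def item_space_coverage(internal_lists: dict, rated_narratives: set) -> float:
    recommendable = {narrative for top10 in internal_lists.values() for narrative in top10}
    return len(recommendable & rated_narratives) / len(rated_narratives)

# Hypothetical (truncated) internal lists for 2 participants and 4 rated narratives.
internal_lists = {"p1": ["n1", "n2"], "p2": ["n2", "n3"]}
print(item_space_coverage(internal_lists, {"n1", "n2", "n3", "n4"}))  # -> 0.75
```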

Unfairness Across Participants

Unfairness across participants measures whether participants are treated fairly either at the group level (participants in the same group are treated fairly) or at the individual level (participants who are similar are treated fairly) [].

NarraGive is designed for use in a health care setting—a setting in which protected characteristics such as disability are critical to attend to. It would be crude to stipulate that, for example, all participants should have an equal probability of being recommended a narrative about wheelchair users as this would be far more relevant to some participants than others (and, indeed, a recommender system’s entire purpose is to provide personalized rather than generic recommendations). As acknowledged by Yao and Huang [], “in tasks such as recommendation, user preferences are indeed influenced by sensitive features such as gender, race, and age. Therefore, enforcing demographic parity may significantly damage the quality of recommendations.”

Thus, they proposed 4 metrics: value unfairness, absolute unfairness, underestimation of unfairness, and overestimation of unfairness. Value unfairness “occurs when one class of user is consistently given higher or lower predictions than their true preferences.” Absolute unfairness “measures inconsistency in absolute estimation error across user types.” Underestimation of unfairness “measures inconsistency in how much the predictions underestimate the true ratings.” Overestimation of unfairness “measures inconsistency in how much the predictions overestimate the true ratings.”

NarraGive is implemented in a health care context in which the principle of harm avoidance is crucial. Therefore, one of the most important factors to consider is whether NarraGive is recommending potentially harmful narratives to participants. The metric used to measure this aspect is the overestimation of unfairness.

Overestimation of unfairness measures how much NarraGive consistently overestimates the predicted rating of narratives (ie, how often a participant rates a narrative lower than NarraGive expected) within a disadvantaged subset of the participants and compares this to the overestimation in the nondisadvantaged group.

Participants were divided into groups based on their demographic characteristics. The first grouping was by ethnicity as having a minority ethnicity predicts mental health problems [], and the second grouping was by gender, informed by Sex and Gender Equity in Research guidelines [].

The disadvantaged group for the gender comparison was defined as either “Female” or “Other.” The disadvantaged group for the ethnicity comparison was defined as “Irish,” “Gypsy or Irish Traveller,” “Any other White background,” “White and Black Caribbean,” “White and Black African,” “White and Asian,” “Any other Mixed/Multiple ethnic background,” “Indian,” “Pakistani,” “Bangladeshi,” “Chinese,” “Any other Asian background,” “African,” “Caribbean,” “Any other Black/African/Caribbean background,” “Arab,” and “Any other ethnic group.”

The baseline demographic information was used for measuring unfairness between participants as the questions were compulsory, so there was higher completeness of the baseline data than of the personal profile as well as greater granularity with the range of possible answers. The overestimation of unfairness is defined according to the study by Yao and Huang [].

A lower overestimation of unfairness value indicates that there is less disparity between overestimation among disadvantaged participants and among nondisadvantaged participants. The scale ranges from 0 (least unfair) to 4 (equation 5 in ).
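For reference, following Yao and Huang [], the overestimation of unfairness can be written approximately as below (the study’s exact formulation is its equation 5). Here, $\mathbb{E}_{g}[\hat{y}]_j$ and $\mathbb{E}_{g}[r]_j$ are the mean predicted and mean true ratings of narrative $j$ among the disadvantaged group $g$, $\neg g$ denotes the nondisadvantaged group, and $n$ is the number of narratives:

$$U_{\mathrm{over}}=\frac{1}{n}\sum_{j=1}^{n}\Big|\max\!\big\{0,\ \mathbb{E}_{g}[\hat{y}]_j-\mathbb{E}_{g}[r]_j\big\}-\max\!\big\{0,\ \mathbb{E}_{\neg g}[\hat{y}]_j-\mathbb{E}_{\neg g}[r]_j\big\}\Big|$$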

Other Categories

Zangerle and Bauer [] detailed 10 categories of evaluation metrics that can be used in the FEVR. Of these, 5 (discussed previously) were used in evaluating NarraGive, and the other 5—ranking, novelty, serendipity, fairness across items, and business oriented—were not used for the reasons described in [,,].


Results

Objective 1: Describe Participant Characteristics and Patterns of Narrative Requests and Feedback

Participant Characteristics

The baseline sociodemographic and clinical characteristics of participants in the NEON (N=739) and NEON-O (N=1023) trials are shown in Table 3.

An exploration of the baseline differences has been reported elsewhere [].

Table 3. Baseline sociodemographic and clinical characteristics of Narrative Experiences Online (NEON) and NEON for other (eg, nonpsychosis) mental health problems trial (NEON-O) participants.

Characteristic | NEON baseline (N=739) | NEON-O baseline (N=1023)

Gender, n (%)
- Female | 443 (59.9) | 811 (79.3)
- Male | 274 (37.1) | 184 (18)
- Other | 16 (2.2) | 18 (1.8)

Age (years), mean (SD) | 34.8 (12) | 38.4 (13.6)

Ethnicity, n (%)
- White British | 561 (75.9) | 827 (80.8)
- Other ethnicity | 172 (23.3) | 185 (18.1)

Region of residence, n (%)
- East of England | 53 (7.2) | 61 (6)
- London | 166 (22.5) | 210 (20.5)
- Midlands | 112 (15.2) | 203 (19.8)
- North East and Yorkshire | 80 (10.8) | 102 (10)
- North West | 66 (8.9) | 98 (9.6)
- South East | 133 (18) | 214 (20.9)
- South West | 123 (16.6) | 125 (12.2)

Highest educational qualification, n (%)
- No qualification | 51 (6.9) | 30 (2.9)
- O-levels or GCSEa | 117 (15.8) | 116 (11.3)
- A-levels or ASb-levels or NVQc or equivalent | 278 (37.6) | 327 (32)
- Degree-level qualification | 207 (28) | 349 (34.1)
- Higher degree-level qualification | 80 (10.8) | 191 (18.7)

Living arrangement, n (%)
- Alone | 215 (29.1) | 229 (22.4)
- With others | 524 (70.9) | 794 (77.6)

Employment status, n (%)
- Employed | 277 (37.5) | 586 (57.3)
- Sheltered employment | 10 (1.4) | 6 (0.6)
- Training and education | 76 (10.3) | 106 (10.4)
- Unemployed | 356 (48.2) | 272 (26.6)
- Retired | 20 (2.7) | 53 (5.2)

Current mental health problem, n (%)
- I don’t want to say | 20 (2.7) | 14 (1.4)
- I did not experience mental health problems | 19 (2.6) | 31 (3)
- Developmental disorder such as learning disability | 15 (2) | 12 (1.2)
- Eating disorder | 15 (2) | 45 (4.4)
- Mood disorder | 265 (35.9) | 626 (61.2)
- Personality disorder | 138 (18.7) | 123 (12)
- Schizophrenia or other psychosis | 154 (20.8) | <5 (<1)
- Stress-related disorder | 82 (11.1) | 152 (14.9)
- Substance-related disorder | 25 (3.4) | <10 (<1)

Lifetime user of primary care mental health services, n (%)
- Yes | 698 (94.5) | 949 (92.8)
- No | 35 (4.7) | 64 (6.3)

Current use of mental health services for psychosis, n (%)
- No contact with any NHSd service | 100 (13.5) | N/Ae
- General practitioner | 234 (31.7) | N/A
- Primary care counselor | 59 (8) | N/A
- IAPTf | 56 (7.6) | N/A
- Specialist community mental health team | 261 (35.3) | N/A
