Lilobot: A Cognitive Conversational Agent to Train Counsellors at Children’s Helplines

Method

The experiment had a within-subject design with two conditions: a text-based intervention, consisting of a short text explaining the Five Phase Model, and an interaction-based intervention with our conversational agent, Lilobot. We evaluated Lilobot using four measures: (1) trainees’ self-efficacy in applying the Five Phase Model, (2) their perceived usefulness of the learning tool, (3) system usability, and (4) the conversation’s outcome (i.e., Lilobot’s end belief values). We also collected qualitative data through five open-ended questions to gain insight into participants’ experiences. In total, we invited 39 counselling volunteers from the Dutch child helpline to participate in the experiment by email. We used a counterbalanced design to control for order effects: we split participants into two groups, each of which experienced both interventions but in opposite order. After excluding 11 participants who did not complete the questionnaires, 28 helpline counsellors remained, with counselling experience ranging from 0 to 16 years (M = 3.54 years, SD = 3.95). Participants completed all questionnaires through the Qualtrics platform. Seven participants did not answer all self-efficacy questions; for six of them, we calculated the average score from the items they had answered, and one was excluded from the self-efficacy analysis for not providing any responses. As for the outcome of the conversation, we calculated the average belief values held by the agent at the end of a session.

We asked participants to complete the experiment in one sitting of about an hour. They signed an informed consent form and completed a pre-training questionnaire covering their counselling experience at the helpline and their initial counselling self-efficacy. This was followed by the two training interventions. After each intervention, participants completed questionnaires on their counselling self-efficacy, inspired by established measures [1, 26] and checked by supervisors at the children’s helpline. The questionnaire comprised eight items rated on an 11-point scale from -5 (‘strongly disagree’) through 0 (‘neutral’) to +5 (‘strongly agree’), and we analysed the mean of these items. During the intervention with Lilobot, participants engaged with the agent in three consecutive sessions, each lasting approximately 15 minutes. The goal of the first and third sessions was to counsel Lilobot according to the Five Phase Model, while the second session allowed participants to explore the agent freely. After each session, the agent provided feedback based on the BDI status of the simulated child help-seeker. Upon completing the study, participants rated Lilobot’s perceived usefulness on eight items ranging from -5 (‘negative’) to +5 (‘positive’), with 0 indicating neutral. These items, adapted from previous research [17, 27, 39], were analysed separately. Participants also filled out a Dutch version of the System Usability Scale (SUS) questionnaire [5] containing ten items [20, 41], each rated on a 5-point scale from 0 (‘strongly disagree’) to 4 (‘strongly agree’). To calculate an interpretive score out of 100, we reversed the scores of the four reverse-worded items, summed the scores of all ten items, and multiplied the sum by 2.5. For the analysis, we conducted a repeated measures ANOVA on the self-efficacy data to evaluate the main effects and the interaction effect of the two independent variables: the training intervention and the time of measurement (i.e., before or after a given training). For the remaining analyses, we used a one-sample Wilcoxon signed-rank test for perceived usefulness and a paired sample t-test on the conversational outcome.
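To make the SUS scoring concrete, the computation can be sketched in a few lines of R (the language we used for all analyses). This is a minimal illustration rather than our actual script: the item vector and the indices of the reverse-worded items below are hypothetical placeholders.

# Minimal sketch of the SUS scoring described above.
# `items` is a numeric vector of ten responses on the 0-4 scale;
# `reversed` holds hypothetical indices of the four reverse-worded items.
score_sus <- function(items, reversed = c(2, 4, 6, 8)) {
  items[reversed] <- 4 - items[reversed]  # flip reverse-worded items
  sum(items) * 2.5                        # rescale the sum to 0-100
}

score_sus(rep(2, 10))  # a uniformly neutral response pattern scores 50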

We analysed the responses to the three open questions through a thematic analysis [4] and used double-coding to check the reliability of the themes. The first author, with a background in computer science and artificial intelligence, identified the themes and the related coding scheme, which a second coder, a computer science graduate student, used to code responses independently. Beforehand, the second coder was trained on synthetic data generated by ChatGPT. The inter-rater reliability between the two coders showed substantial agreement for the first (Cohen’s κ = 0.63) and third (Cohen’s κ = 0.68) qualitative questions, and moderate agreement for the second (Cohen’s κ = 0.52), according to Landis and Koch [23]. The coders discussed cases of disagreement to reach a consensus.
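For reference, Cohen’s κ can be computed directly from the two coders’ label assignments. The R sketch below is an illustration under the assumption that the codes are stored as equal-length vectors (code1, code2); it is not the exact procedure we ran.

# Minimal sketch: Cohen's kappa for two coders' theme codes.
cohen_kappa <- function(code1, code2) {
  lv  <- union(code1, code2)                    # shared set of codes
  tab <- table(factor(code1, levels = lv),
               factor(code2, levels = lv))      # square confusion matrix
  po  <- sum(diag(tab)) / sum(tab)              # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
  (po - pe) / (1 - pe)
}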

The experiment was approved by the TU Delft Human Research Ethics Committee (HREC reference number: 1622), and its design was pre-registered on the Open Science Framework (OSF) ahead of data collection. All statistical analyses were done in R (version 4.1.2). The questionnaires, dataset, and analysis R script are available online through the 4TU research data repository.

Fig. 3. Comparing participants’ counselling self-efficacy across the text and conversational agent training interventions before and after training

Results

Quantitative Results

The analysis revealed no significant main effect of the type of intervention on counselling self-efficacy (F(1, 78) = 0.2, p = .65). However, we observed a significant main effect of time of measurement (F(1, 78) = 17.32, p < .001), with post-training counselling self-efficacy (M = 2.16, SD = 2.39) lower than pre-training counselling self-efficacy (M = 3.4, SD = 1.44). The analysis also found a significant two-way interaction effect between these two variables (F(1, 78) = 6.52, p = .01). A follow-up simple effect analysis revealed a significant difference (t(78) = 4.75, p < .001) in counselling self-efficacy before (M = 3.72, SD = 0.93) and after (M = 1.71, SD = 2.61) training for the conversational agent intervention, but no significant effect (t(78) = 1.14, p = .26) in the text intervention across the two time points of measurement (Fig. 3).
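For readers who want to reproduce this kind of analysis, a repeated measures ANOVA with this design can be specified in base R roughly as follows. The data frame d and its column names are hypothetical (one row per participant per intervention per time point), and the paired t-test shown is only one way to probe the interaction; the simple-effect tests reported above used the pooled ANOVA error term, so results may differ slightly.

# Hypothetical long-format data frame `d` with columns:
#   participant (factor), intervention ("text"/"agent"),
#   time ("pre"/"post"), self_efficacy (mean of the eight items)
fit <- aov(self_efficacy ~ intervention * time +
             Error(participant / (intervention * time)), data = d)
summary(fit)

# Probing the interaction: pre vs. post within the agent condition
# (assumes rows are ordered by participant within each time point)
agent <- subset(d, intervention == "agent")
t.test(agent$self_efficacy[agent$time == "pre"],
       agent$self_efficacy[agent$time == "post"], paired = TRUE)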

Fig. 4. Thematic map of participants’ most liked features about their experience of using Lilobot

In our analysis of Lilobot’s perceived usefulness, participants’ ratings deviated significantly from the neutral zero on two of the eight items. Specifically, mean ratings were negative for participants’ self-efficacy concerning the Five Phase Model (M = -1.06, SD = 1.71, Z = -1.98, p = .02) and for the usefulness of conversational agents as a learning tool (M = -1.62, SD = 2.56, Z = -2.29, p = .01). For usability, the average SUS score was 67 (SD = 6.44), which can be interpreted as “ok” on the adjective rating scale for the SUS by Bangor et al. [2]. For the conversational outcome, a paired sample t-test showed no significant difference (t(25) = -1.72, p = .10) between the first session with Lilobot (M = 6.36, SD = 1.36) and the third session (M = 6.68, SD = 1.24).
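The two remaining tests are standard base R calls. A minimal sketch, assuming hypothetical per-participant vectors usefulness_item (ratings on the -5 to +5 scale) and beliefs_s1 / beliefs_s3 (mean end-of-session belief values for sessions 1 and 3):

# One-sample Wilcoxon signed-rank test against the neutral midpoint (0)
wilcox.test(usefulness_item, mu = 0)

# Paired t-test comparing mean end belief values in sessions 1 and 3
t.test(beliefs_s1, beliefs_s3, paired = TRUE)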

Qualitative Results

The analysis identified two main themes for the question “What was the best thing about your experience using Lilobot?”: the conversation with Lilobot and the learning experience gained from the interaction. Some participants liked that the conversation realistically simulated a child’s language style and behaviour (n = 4, 14%). Others appreciated the fast response time of the agent (n = 6, 21%). Regarding learning, participants indicated that through their experience with Lilobot they could reflect on what they said and on the Five Phase Model (n = 4, 14%), and see how their actions affected the agent’s behaviour (n = 2, 7%). Participants also noted the opportunity for self-directed learning with Lilobot, as they did not have to depend on the involvement of other participants to role-play (n = 3, 10%). Figure 4 shows a thematic map of these responses.

Figure 5 shows a thematic map of participants’ responses to the question “What was the worst thing about your experience using Lilobot?”. The most common theme was issues with Lilobot’s understanding, which made it difficult to hold a natural conversation (n = 22, 79%): participants indicated that Lilobot did not understand their utterances or gave no response to questions they posed. Others mentioned that they received repetitive answers (n = 4, 14%), had difficulty understanding Lilobot’s use of emoticons (n = 2, 7%), or found the segmentation of utterances demotivating (n = 1, 4%).

Fig. 5. Thematic map of participants’ least liked features about their experience of using Lilobot

We also asked participants about the feedback given by Lilobot. Eight of the 28 stated they did not receive any feedback. Some participants found it insightful to see Lilobot’s reasoning process and how their actions influenced the agent’s responses (n = 9, 32%). On the other hand, some noted the feedback was of little value to them (n = 2, 7%), as they could not proceed in the scenario. Figure 6 shows a thematic map of participants’ responses to this question.

Fig. 6. Thematic map of participants’ positive and negative remarks on feedback from Lilobot

The final question asked which group of users participants would be most likely to recommend Lilobot to. The options included counsellors-in-training (n = 17, 61%), novice counsellors (n = 3, 11%), experienced counsellors (n = 3, 11%), and helpline supervisors (n = 0, 0%). For counsellors-in-training at the helpline, one reason given was that Lilobot would let them experiment and gain familiarity with the conversation model without real-life consequences if they did something wrong. Other participants suggested that the conversational agent might be better suited to experienced counsellors, who already understand how children behave and could use it to revise question-answering techniques and how these relate to the phases of the conversation model.
