Public Health-Driven Transformer for Social Skill Enhancement in Children with Autism

In recent years, social skill enhancement in children with Autism Spectrum Disorder (ASD) has garnered increasing attention due to the critical role these skills play in their cognitive, emotional, and behavioural development Rouhandeh et al. (1). Addressing these needs has become a public health priority, as improved social skills can significantly impact the quality of life, independence, and academic success of children with ASD Alharbi and Huang (2). Traditional interventions, such as Applied Behavior Analysis (ABA) and Social Skills Training (SST), though effective, often require intensive, time-consuming sessions with limited scalability, making them challenging for widespread implementation Loftus et al. (3). Advances in artificial intelligence (AI), particularly with Transformer-based deep learning models, offer new avenues to enhance these interventions by providing scalable, adaptive, and interactive social skill training. Leveraging these technologies can not only augment traditional methods but also enable new, personalized approaches that can reach a broader demographic, particularly through digital platforms that are increasingly accessible.
To address the limitations of conventional social skill enhancement methods, initial AI applications in this area were grounded in symbolic AI and knowledge representation Park et al. (4). These systems focused on rule-based decision-making to simulate socially appropriate responses, using predefined knowledge bases and if-then logic. Such methods allowed for the establishment of consistent, structured frameworks that attempted to emulate basic social interactions Lee et al. (5). However, these rule-based systems lacked flexibility, as they could not adapt to the complex and varied nature of social cues encountered in real-life scenarios. Consequently, the rigidity of symbolic AI limited its effectiveness in promoting dynamic social learning and was unable to cater to the individual learning needs of children with ASD, who often benefit from personalized feedback and varied social contexts Aldabas (6).
The evolution from symbolic AI to machine learning marked a significant step forward, as data-driven approaches enabled more adaptive social skill interventions Puglisi et al. (7). Machine learning models, particularly supervised learning techniques, allowed for pattern recognition from large datasets of social interactions, capturing more nuanced social behaviors and expressions Frolli et al. (8). Models trained on labeled data, such as facial expressions and verbal interactions, could identify social cues with greater accuracy and variation than rule-based systems Ioannou et al. (9). Nevertheless, these methods heavily relied on labeled data, which is costly and time-consuming to curate, and their performance was constrained by the quality and size of the datasets. Additionally, while they improved adaptability, they often struggled to generalize across diverse social settings and required extensive computational resources for real-time interactions, making them less accessible for public health implementations Hameed et al. (10).
The advent of deep learning, especially Transformer-based architectures and pre-trained models, has led to a substantial shift in the capabilities of AI for social skill enhancement Kouhbanani et al. (11). Transformer models, with their attention mechanisms and ability to process contextual information over long sequences, excel at modeling complex social interactions, as they can capture dependencies between varied social cues and context. Pre-trained models, such as BERT and GPT, have shown success in understanding nuanced language and behavioral patterns, enabling more context-aware and responsive interactions in ASD interventions Safi et al. (12). These models can be fine-tuned for specific social skill scenarios, which allows for personalization and adaptability without requiring massive labeled datasets. However, the computational intensity of Transformers and the risk of biases in pre-trained models remain challenges, as these limitations can hinder scalability and lead to inconsistent outputs in diverse social contexts Hernández-Espeso et al. (13).
Based on the aforementioned limitations, we propose a Public Health-Driven Transformer (PHDT) designed for scalable, personalized social skill enhancement in children with ASD. By integrating insights from both social skill development and deep learning, our approach addresses the drawbacks of traditional, rule-based, and machine learning methods by creating an adaptable, efficient, and publicly accessible solution.
● PHDT incorporates a novel attention-based module tailored for interpreting diverse social cues, such as facial expressions, gestures, and verbal tones, optimizing interaction specificity for children with ASD.
● The model is designed to operate efficiently across various scenarios, balancing performance with computational demand, making it accessible for broader use in public health interventions.
● We introduce a novel dynamic batch size adjustment mechanism during training, which accelerates convergence and enhances model generalization by effectively balancing computational efficiency and learning stability.
2 Related work

2.1 Public health approaches in autism intervention

Public health approaches have long been a focus of autism interventions due to their emphasis on scalable, community-wide solutions that address early diagnosis and intervention Terlouw et al. (14). These approaches view autism not solely as an individual developmental disorder but as a societal challenge with substantial public health implications. Population-based strategies in public health aim to ensure that children with autism, especially those from underserved communities, have access to early detection tools and intervention resources. By framing autism interventions within a public health context, researchers have pursued comprehensive methods that reduce barriers to access, often through community-based programs and policies Güler and Erdem (15). One promising area within this field includes community level frameworks that engage families, educators, and healthcare providers in identifying and addressing autism-related needs early on. Screening tools designed for early detection have demonstrated benefits in linking children to resources, but gaps in reaching diverse and rural populations remain. These frameworks have evolved to incorporate digital and AI-driven tools, capitalizing on the reach of technology to amplify detection and intervention access. Public health-driven models are thus shifting towards leveraging scalable digital platforms, aiming to integrate intervention approaches with other services in a holistic manner Ávila Álvarez et al. (16). Public health models increasingly prioritize collaborative, integrated systems that involve the community in recognizing early social skill deficits and facilitating social interaction enhancements Arora et al. (17). The inclusion of technology in public health approaches to autism intervention highlights how digital tools can extend the reach of social skills training, often a key area of developmental need.
Machine learning and AI models, like transformer-based architectures, provide a means to deliver interventions that adapt to individual children’s progress. The potential to detect social skill deficits and tailor intervention pathways for large populations enhances the ability to address disparities. Particularly, AI tools can support real-time adaptation to a child’s performance, creating responsive learning environments even in remote or underserved areas. Studies indicate that AI-enabled interventions are feasible in community health settings, enabling therapists, educators, and families to integrate such tools seamlessly. By aligning autism interventions with public health goals, transformative technology-driven solutions have the potential to bridge gaps in access and efficacy Doulah et al. (18).
2.2 Transformer models in autism-specific social skill training

Transformer models have recently demonstrated significant promise in advancing social skill training for children with autism, primarily due to their robust ability to process large-scale data and deliver individualized learning experiences. These deep learning models, initially developed for language tasks, have been adapted to understand complex social interactions, making them suitable for social skill development applications. Unlike traditional machine learning models, transformer architectures can capture nuanced relationships within social interaction data, learning to identify and enhance specific skills like eye contact, verbal reciprocity, and non-verbal communication Scarcella et al. (19). Research on transformer models in autism primarily focuses on their ability to analyze multimodal data (such as video, audio, and text) that reflect a child’s engagement in social scenarios Liu and Hu (20). This approach enables transformers to detect patterns in social behavior and adjust training content dynamically based on a child’s individual needs. Studies show that by training on diverse datasets of typical and atypical social interactions, transformers can learn effective intervention responses, simulating scenarios that encourage specific social behaviors. These models can analyze video interactions and suggest adjustments to a child’s social engagement strategies in real-time, providing a form of personalized feedback that can be particularly effective for autism therapy Mannion (21). Transformer-based models allow for enhanced adaptability in therapy, permitting flexible responses to various social challenges a child may encounter. They can also integrate feedback loops that continuously refine the training protocols based on the child’s progress, making these interventions highly responsive.
This adaptability can also extend to group settings where children with autism interact with peers, offering tailored suggestions that help them manage diverse social dynamics Soltiyeva et al. (22). Integrating these systems into socially assistive technologies has shown potential for fostering social engagement, as they can respond to the unique interaction patterns of each child. Given their ability to generalize from complex social datasets, transformers present a compelling solution for scalable social skill training tools that align with both therapeutic and educational needs Fernandez-Fabeiro et al. (23).
2.3 Social skill development in autism through AI-enhanced interventions

AI-enhanced interventions have expanded the scope of autism therapy, with a specific focus on social skills critical for daily interaction and independence. Social skill development often challenges children with autism due to their difficulties in interpreting social cues, initiating interactions, and responding appropriately to social stimuli. AI-driven models, particularly those utilizing machine learning algorithms, have provided structured, adaptive training environments that support skill acquisition in areas like conversational turn-taking, emotional recognition, and empathy. By integrating AI into social skill interventions, researchers have developed tailored, data-driven approaches that facilitate meaningful engagement in real-world settings Güeita-Rodríguez et al. (24). A critical aspect of AI-driven social skill enhancement is the utilization of real-time feedback, allowing for immediate corrections and positive reinforcement. AI models can simulate a range of social situations, allowing children to practice and develop skills at their own pace while receiving guidance tailored to their progress. For example, virtual agents powered by AI provide a safe, low-stress environment for practicing conversations, identifying emotions, and developing adaptive responses. Studies indicate that these virtual settings can effectively replicate many social scenarios encountered in daily life, offering children a structured approach to practicing and refining social interactions. The dynamic adaptability of AI-based models means that they can assess a child’s level of social skill proficiency, personalize the training tasks accordingly, and scale the complexity as the child’s skills improve Terlouw et al. (25).
Beyond individual sessions, AI-enhanced social skill interventions offer benefits in group contexts, enabling interactive exercises where children can develop skills alongside peers in controlled, simulated environments. Social robots equipped with AI algorithms further exemplify this trend, serving as mediators in group therapy by facilitating turn-taking, modeling appropriate social behaviors, and providing non-judgmental feedback Gengoux et al. (26). The data-driven approach of AI also provides valuable insights for therapists and educators, offering analytics on a child’s progress, specific skill deficiencies, and improvement areas. By incorporating these detailed insights into intervention strategies, AI-enhanced interventions support a more personalized and effective approach to social skill development for children with autism.
3 Method

3.1 Overview

In this work, we focus on enhancing social skills in children with Autism Spectrum Disorder (ASD) through technology-assisted interventions. Social skills are an essential component of social interaction and personal development, yet children with ASD often exhibit challenges in this area, particularly with skills such as initiating and maintaining conversation, social problem-solving, and recognizing social cues. Consequently, interventions in this domain aim to mitigate these challenges by introducing structured and evidence-based methods that foster communication and interaction skills. This section provides an overview of the proposed method to enhance these social skills through a novel framework of technology-aided instruction, structured into the following key segments.
In 3.2, we define the primary challenges in social skill acquisition faced by children with ASD, including a theoretical background on social communication deficits as identified in diagnostic criteria. Additionally, we analyze existing methods that employ technology to support social skill interventions, such as video modeling, audio prompting, and interactive digital environments. These methods demonstrate potential for effectively addressing ASD-related social difficulties by using digital solutions that simulate or reinforce social scenarios. The subsequent section, 3.3, outlines the mathematical foundation for modeling interactive learning environments tailored to the ASD population. Here, we formalize the problem by developing a set of models that quantify skill acquisition and engagement metrics across various technological interventions. Such a formulation is instrumental in tracking progress and adapting the instructional techniques based on real-time feedback and longitudinal data analysis, ensuring interventions remain personalized and effective. Finally, in 3.4, we introduce our unique model framework, which integrates the latest advancements in interactive digital media with adaptive feedback mechanisms to personalize instruction. This approach leverages elements like multi-modal engagement and reinforcement learning to cater to individual learning styles, allowing the intervention to dynamically adjust to each child’s responsiveness. This section will provide insights into the model architecture and the specific features designed to reinforce social behaviors, providing a structured pathway for skill generalization beyond the training environment.
3.2 Preliminaries

Children with Autism Spectrum Disorder (ASD) face notable challenges in developing social skills, a core aspect of social interaction, often characterized by difficulties in initiating interactions, understanding non-verbal cues, and maintaining reciprocal social exchanges. The goal of this study is to formalize these challenges into a structured mathematical framework, which allows for quantitative assessment and personalized intervention strategies. To address the multi-dimensional nature of social skill deficits, we introduce a set of notations and mathematical models that describe the problem space, with a focus on capturing the complex interactions involved in social skill acquisition and reinforcement.
Let $U = \{u_1, u_2, \ldots, u_N\}$ denote a sequence of interactions or social exchanges undertaken by a child with ASD, where each interaction $u_i$ represents an instance of social behavior, such as a greeting or response to a peer. Each interaction $u_i$ can be characterized by a set of features, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iM}\}$, where $M$ represents the total number of observable behavioral cues, such as eye contact, vocal tone, and body posture. Each feature $x_{ij}$ is a continuous or discrete variable representing the intensity or occurrence of that specific social cue.
To further model the quality of these interactions, we introduce a scoring function $f: U \to \mathbb{R}$, where $f(u_i)$ assigns a numerical score to the interaction $u_i$, quantifying its alignment with socially accepted norms. Let $S = \{s_1, s_2, \ldots, s_N\}$ be the set of scores corresponding to $U$, where $s_i = f(u_i)$. The cumulative social skill score over a series of interactions can then be formalized as:
$$\text{Skill\_Score} = \frac{1}{N}\sum_{i=1}^{N} s_i \tag{1}$$

where $N$ is the total number of scored interactions, each treated as one skill component, and $s_i = f(u_i)$ denotes the score of the $i$-th component. Averaging gives every component an equal contribution, so the result is a balanced representation of the overall skill level. This formulation is useful for aggregating multiple metrics into a single interpretable score while remaining simple and consistent.
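As a small illustration of Equation 1, the sketch below averages a list of per-interaction scores. The function name `skill_score` and the 0 to 1 score scale are assumptions made for the example; the source does not fix a score range.

```python
from statistics import fmean


def skill_score(scores):
    """Equation (1): unweighted mean of the per-interaction scores s_i."""
    if not scores:
        raise ValueError("at least one score is required")
    return fmean(scores)


# Hypothetical scores s_i = f(u_i) for four interactions (assumed 0-1 scale).
example_scores = [0.2, 0.5, 0.8, 0.5]
overall = skill_score(example_scores)
```

Because every score carries the same weight, a single very low or very high interaction shifts the aggregate by at most its share of 1/N.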
To understand the developmental trajectory, we define the learning rate function $g: U \times T \to \mathbb{R}$, where $T$ denotes the time sequence over which interventions are applied. Here, $g(u_i, t)$ measures the rate of skill acquisition over time for each interaction $u_i$, allowing us to capture improvements or regressions in behavior over time:
with $g(u_i, t) > 0$ indicating progress in skill acquisition.
Given the individualized nature of ASD, each child’s interaction sequence and response to interventions will differ, necessitating a personalized approach. To model the adaptation of the intervention based on individual performance, let $I$ be the intervention strategy space, and define a mapping $h: S \to I$, where $h(s_i)$ suggests a specific intervention (e.g., video modeling or feedback prompt) based on the score $s_i$:

$$h(s_i) = \arg\max_{j \in I} \, \text{Effectiveness}(j \mid s_i), \tag{3}$$

where $\text{Effectiveness}(j \mid s_i)$ represents the expected improvement in $s_i$ by applying intervention $j$. This allows the model to select an optimal intervention from the strategy set $I$, thus tailoring support based on observed performance.
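The selection rule in Equation 3 can be sketched as an argmax over a small strategy set. The effectiveness table below is invented purely for illustration; the intervention names echo the examples in the text, and the expected gains and the 0.5 threshold are assumptions, not values from the study.

```python
def select_intervention(score, effectiveness, interventions):
    """Equation (3): return the intervention j maximising Effectiveness(j | s_i)."""
    return max(interventions, key=lambda j: effectiveness(j, score))


# Hypothetical expected score gain per intervention, split by whether the
# current score is low (< 0.5) or high. These numbers are made up.
EXPECTED_GAIN = {
    "video_modeling": {"low": 0.30, "high": 0.05},
    "feedback_prompt": {"low": 0.10, "high": 0.15},
}


def toy_effectiveness(intervention, score):
    """Stand-in for Effectiveness(j | s_i): a lookup in the toy table above."""
    band = "low" if score < 0.5 else "high"
    return EXPECTED_GAIN[intervention][band]


choice_low = select_intervention(0.2, toy_effectiveness, EXPECTED_GAIN)
choice_high = select_intervention(0.8, toy_effectiveness, EXPECTED_GAIN)
```

In practice the effectiveness estimate would be learned from observed outcomes rather than read from a fixed table.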
For continuous tracking and adjustment, we introduce a reinforcement mechanism defined by a feedback loop $F: S \times I \to \mathbb{R}$ that updates the intervention choice based on real-time effectiveness:
where $\Delta s$ is the observed improvement post-intervention, ensuring the model dynamically reinforces effective strategies and adjusts less effective ones.
For $f(u_i)$, the choice of parameters is driven mainly by the distribution statistics of the input features and by the robustness required of the model. We use nonlinear activation functions, such as ReLU or Sigmoid, that suit the dynamic range of the input features and keep values stable. The parameters are tuned via grid search to balance computational complexity against fitting accuracy. For $g(u_i, t)$, the parameters govern how temporal correlation is modeled. We employ a design based on a weighted moving average, which mitigates short-term fluctuations and captures long-term trends. The weights are chosen according to empirical rules in the field, and their effectiveness is validated through experimental testing on various datasets.
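To make the weighted-moving-average idea for $g(u_i, t)$ concrete, the sketch below smooths the most recent score changes with geometrically decaying weights. This is one possible reading under stated assumptions: the window length, the decay factor, and the use of successive score differences are all choices made for the example, since the source does not give the exact form of $g$.

```python
def learning_rate(scores, window=3, decay=0.5):
    """Weighted moving average of recent score changes.

    Positive values indicate progress, negative values regression.
    `window` and `decay` are illustrative assumptions, not source values.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to measure change")
    # Change in score between consecutive time steps.
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    recent = deltas[-window:]
    # The newest change gets weight 1, the one before it `decay`, and so on.
    weights = [decay ** k for k in range(len(recent) - 1, -1, -1)]
    return sum(w * d for w, d in zip(weights, recent)) / sum(weights)


improving = learning_rate([0.2, 0.4, 0.6])
regressing = learning_rate([0.6, 0.4, 0.2])
```

A larger `decay` spreads weight over older changes and smooths more aggressively, while a smaller one tracks the latest change more closely.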
3.3 Adaptive interaction model

Our primary contribution in this study is the development of an adaptive interaction model, herein named the Social Engagement Network (SEN), designed to optimize social skill interventions for children with ASD. The SEN model employs a structured representation of social interactions, integrates real-time feedback, and dynamically adapts to each individual’s progress in social skill acquisition. The model structure includes a multi-layered architecture to account for both immediate responses and long-term social skill trajectories (as shown in Figure 1).
Figure 1. Architecture of the Social Engagement Network (SEN) model, featuring a multi-stage structure with embedding stems, spatial aggregation, dynamic transition mechanisms, and a real-time feedback mechanism. This adaptive model is designed to optimize social skill interventions for children with ASD by dynamically responding to individual progress.
The choice of a transformer architecture for our task is primarily driven by its ability to handle the complexity and multi-modal nature of social skill training for children with ASD. Social interactions involve intricate relationships between textual, auditory, and visual cues, requiring a model capable of capturing these dependencies dynamically. Transformers, with their self-attention mechanism, excel at identifying key features across modalities and assigning context-dependent importance to them. This is crucial for accurately interpreting nuanced social behaviors, such as recognizing emotions or understanding conversational tone, which are central to our task. While transformers are computationally intensive, their ability to model long-range dependencies without the limitations of sequential processing (as seen in RNNs) is critical for our task, where understanding temporal and contextual relationships is essential. Furthermore, transformers offer flexibility in fusing multi-modal inputs, enabling seamless integration of text, audio, and facial cues. This adaptability ensures that the PHDT model can effectively simulate and respond to real-world social scenarios, enhancing the learning experience for children with ASD. The use of pre-trained transformer models significantly reduces the computational overhead during fine-tuning, as these models already capture rich, general-purpose representations. This is particularly beneficial for our task, where training data is limited but must reflect diverse social contexts. Despite the computational demands, the transformer’s ability to generalize across modalities and contexts makes it an ideal choice for addressing the challenges of our task, ultimately leading to a more robust and effective framework for social skill development.
3.3.1 Latent interaction state representation

Let $Z = \{z_1, z_2, \ldots, z_N\}$ represent a sequence of latent interaction states, where each $z_i$ captures the underlying cognitive and affective response of the child during an interaction $u_i$. Each latent state $z_i$ holds complex information, encapsulating both the immediate response to current stimuli and residual effects from prior interactions. This dual influence is crucial to model the often-subtle dynamics of social engagement, which may involve delayed responses or evolving behavioral tendencies (as shown in Figure 2).
Figure 2. Diagram illustrating the latent interaction state representation within the Social Engagement Network (SEN). This model captures the sequence of latent interaction states to track both immediate responses and historical dependencies in social engagement. Key components include multi-order gated aggregation, convolutional layers with diverse dilations, and adaptive gating mechanisms. Together, these elements form a nuanced representation of each interaction, allowing SEN to model complex cognitive and affective responses in children with ASD.
These latent states are modeled as hidden variables that interact with both observable behavioral features $X_i$ and past interaction states, providing a robust foundation to infer the child’s cognitive and affective trajectory. This interplay can be expressed by expanding the original function into separate terms for immediate input processing and historical dependency:

$$z_i = \varphi_{\theta_1}(X_i) + \psi_{\theta_2}(z_{i-1}), \tag{5}$$

where $\varphi_{\theta_1}$ encodes current behavior, while $\psi_{\theta_2}$ maps the previous state to capture time-series dependencies. Here, $\theta_1$ and $\theta_2$ are parameter sets that can evolve independently to adjust the weight of immediate versus sequential influences.
To further refine these latent states, we introduce an auxiliary transformation $\kappa_{\theta_3}$ that adjusts the residual state contributions from a broader historical window:

$$z_i = \varphi_{\theta_1}(X_i) + \sum_{j=1}^{i-1} \kappa_{\theta_3}(z_j, i-j), \tag{6}$$

where $\kappa_{\theta_3}$ is a time-decay function modulated by $\theta_3$, weighting past interactions according to their temporal distance from $u_i$. This approach enhances the model’s capability to emphasize recent interactions, while progressively diminishing the impact of older interactions, allowing a flexible yet decaying memory structure.
Moreover, the model incorporates an adaptive gating mechanism $\Gamma_{\phi}$ to modulate the influence of latent states based on the interaction context, where:

$$z_i = \Gamma_{\phi}(X_i, z_{i-1}) \odot \Big(\varphi_{\theta_1}(X_i) + \sum_{j=1}^{i-1} \kappa_{\theta_3}(z_j, i-j)\Big), \tag{7}$$

and $\odot$ denotes element-wise multiplication. Here, $\Gamma_{\phi}$ is parameterized by $\phi$ and dynamically adjusts the contributions of immediate versus accumulated historical information. For instance, if $X_i$ reflects a high-stress interaction, $\Gamma_{\phi}$ can down-regulate the residual impact from prior states, allowing a more responsive adaptation to the child’s current state.
The final latent state representation combines the above elements, yielding a richly layered state model that supports the tracking of engagement patterns over time. Each state $z_i$ is thus fully defined as:

$$z_i = \Gamma_{\phi}(X_i, z_{i-1}) \odot \Big(\varphi_{\theta_1}(X_i) + \sum_{j=1}^{i-1} \kappa_{\theta_3}(z_j, i-j)\Big) + \epsilon \tag{8}$$

where $\epsilon$ represents a stochastic noise component that accounts for minor fluctuations in behavior. This comprehensive latent state model ensures that SEN can dynamically capture and adjust to complex interaction patterns, creating a foundation for accurate and adaptive intervention strategies.
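The recursion in Equation 8 can be traced with a minimal pure-Python sketch. The concrete forms chosen here are assumptions for illustration only: a tanh encoder stands in for $\varphi_{\theta_1}$, an exponential decay for $\kappa_{\theta_3}$, a sigmoid gate for $\Gamma_{\phi}$, Gaussian noise for $\epsilon$, and a zero vector for the state preceding the first interaction. The source does not specify any of these, and the sketch also assumes the feature and state dimensions match.

```python
import math
import random


def phi(x):
    """Stand-in for phi_theta1: encode the current feature vector."""
    return [math.tanh(v) for v in x]


def kappa(z, lag, rate=1.0):
    """Stand-in for kappa_theta3: exponentially decay a past state by its lag."""
    weight = math.exp(-rate * lag)
    return [weight * v for v in z]


def gate(x, z_prev):
    """Stand-in for Gamma_phi: element-wise sigmoid gate on X_i and z_{i-1}."""
    return [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(x, z_prev)]


def latent_states(features, noise_std=0.0, seed=0):
    """Compute z_1..z_N following the structure of Equation (8)."""
    rng = random.Random(seed)
    dim = len(features[0])
    states = []
    for i, x in enumerate(features):
        z_prev = states[-1] if states else [0.0] * dim  # assumed zero initial state
        # Sum of decayed contributions from all earlier states.
        history = [0.0] * dim
        for j, z_j in enumerate(states):
            decayed = kappa(z_j, i - j)
            history = [h + d for h, d in zip(history, decayed)]
        g = gate(x, z_prev)
        current = phi(x)
        z = [gk * (ck + hk) + rng.gauss(0.0, noise_std)
             for gk, ck, hk in zip(g, current, history)]
        states.append(z)
    return states


states = latent_states([[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]])
```

Note that the history sum makes the cost of computing all states grow quadratically with the number of interactions, so a truncated window may be preferable for long sequences.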
3.3.2 Dynamic transition mechanism

To effectively model the temporal evolution of social engagement states, we propose a transition function $T: Z \to Z$ that describes how each latent state $z_i$ transforms into the subsequent state $z_{i+1}$ given both the current state and the influence of new interaction features. This transition mechanism allows our Social Engagement Network (SEN) to capture the continuity of behavioral patterns and their adaptive shifts across interactions (as shown in Figure 3).

Figure 3. Diagram of the dynamic transition mechanism in the Social Engagement Network (SEN), which models the temporal evolution of social engagement states. This mechanism integrates convolutional layers, GELU activation, and channel aggregation to process both past latent states and new interaction features. By incorporating decay factors, attention-weighted transformations, and regularization, the model dynamically adapts to shifts in engagement patterns, ensuring smooth transitions and continuity across interactions.

Mathematically, the transition function can be expressed as:

$$z_{i+1} = T(z_i, X_{i+1}) = g_{\theta_4}(z_i) + h_{\theta_5}(X_{i+1}), \tag{9}$$

where $g_{\theta_4}$ and $h_{\theta_5}$ are separate functions parameterized by $\theta_4$ and $\theta_5$, respectively, allowing SEN to disentangle the effect of prior latent states from new interaction data.
This formulation enables SEN to dynamically adjust based on recent interactions and shifts in engagement patterns. The recurrent structure of $g_{\theta_4}$ captures temporal dependencies by evolving the latent state based on historical patterns, while $h_{\theta_5}$ brings in the influence of new interaction features $X_{i+1}$, which can significantly impact the trajectory of social engagement.
To incorporate more refined temporal adjustments, the transition function can further include a decay factor $\delta_i$ that modulates the persistence of previous states:

$$z_{i+1} = \delta_i \cdot g_{\theta_4}(z_i) + (1 - \delta_i) \cdot h_{\theta_5}(X_{i+1}), \tag{10}$$

where $0 \le \delta_i \le 1$ is dynamically computed based on the context of interaction $u_i$. This decay term enables the model to control the impact of past states on future states, with higher values of $\delta_i$ allowing more influence from prior interactions when the current interaction does not provide sufficient new information.
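Equation 10 is a convex blend of the carried-over state and the new input, which the short sketch below makes explicit. Identity maps stand in for $g_{\theta_4}$ and $h_{\theta_5}$, and $\delta_i$ is passed in directly; both are simplifying assumptions, since the source does not say how these functions or the decay factor are computed.

```python
def identity(v):
    """Placeholder for the learned maps g_theta4 and h_theta5 (assumption)."""
    return list(v)


def decayed_transition(z, x_next, delta, g=identity, h=identity):
    """Equation (10): blend the previous state and new features by delta."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return [delta * gk + (1.0 - delta) * hk
            for gk, hk in zip(g(z), h(x_next))]


keep_past = decayed_transition([1.0, 2.0], [0.0, 0.0], delta=1.0)
use_new = decayed_transition([1.0, 2.0], [0.0, 0.0], delta=0.0)
blended = decayed_transition([1.0], [0.0], delta=0.25)
```

Setting `delta` to 1 reproduces the previous state and setting it to 0 discards it, which matches the two extremes described in the text.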
To enhance real-time adaptability, we introduce an attention-weighted transformation for the transition, allowing SEN to emphasize or downplay different aspects of each interaction based on its relevance to the engagement trajectory. Define an attention vector $a_i$ as follows:

$$a_i = \sigma(W_a \cdot [z_i, X_{i+1}] + b_a), \tag{11}$$

where $W_a$ and $b_a$ are parameters, and $\sigma$ is a softmax function that normalizes attention weights across features in $X_{i+1}$ and $z_i$. The attention-modulated transition is then formulated as:

$$z_{i+1} = a_i \odot g_{\theta_4}(z_i) + (1 - a_i) \odot h_{\theta_5}(X_{i+1}), \tag{12}$$

where $\odot$ denotes element-wise multiplication, allowing selective focus on certain features based on attention weights, thus improving the predictive accuracy of SEN on engagement trends.
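Equations 11 and 12 can be sketched together in a few lines. As before, identity maps stand in for $g_{\theta_4}$ and $h_{\theta_5}$, and the weight matrix is supplied by hand rather than learned; these are assumptions for illustration. The sketch also assumes that $W_a$ has one row per state dimension so that the attention vector lines up element-wise with the state.

```python
import math


def identity(v):
    """Placeholder for the learned maps g_theta4 and h_theta5 (assumption)."""
    return list(v)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_transition(z, x_next, W_a, b_a, g=identity, h=identity):
    """Equations (11)-(12): attention-weighted blend of state and new input."""
    concat = list(z) + list(x_next)                      # [z_i, X_{i+1}]
    logits = [sum(w * c for w, c in zip(row, concat)) + b
              for row, b in zip(W_a, b_a)]
    a = softmax(logits)                                  # Eq. (11)
    z_next = [ak * gk + (1.0 - ak) * hk                  # Eq. (12)
              for ak, gk, hk in zip(a, g(z), h(x_next))]
    return z_next, a


# With all-zero parameters the attention is uniform over the two dimensions.
W_zero = [[0.0] * 4, [0.0] * 4]
z_next, attention = attention_transition([1.0, 0.0], [0.0, 1.0], W_zero, [0.0, 0.0])
```

Because softmax weights sum to one across dimensions, raising the attention on one state dimension necessarily lowers it on the others.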
To stabilize this learning process, we define a regularization term $\Omega$ in the transition function’s optimization that penalizes abrupt transitions in the latent space:

$$\Omega = \lambda \sum_{i=1}^{N-1} \lVert z_{i+1} - z_i \rVert^2, \tag{13}$$

where $\lambda$ is a regularization parameter. This term discourages large state jumps between consecutive interactions, promoting smoother transitions and continuity in engagement patterns.
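The smoothness penalty in Equation 13 is straightforward to compute from a list of states. The sketch below assumes the norm is the squared Euclidean norm and uses an arbitrary value for the regularization parameter; neither choice is confirmed by the source.

```python
def smoothness_penalty(states, lam=0.1):
    """Equation (13): lam times the sum of squared jumps between consecutive states.

    Assumes a squared Euclidean norm; `lam` is an illustrative value.
    """
    return lam * sum(
        sum((after - before) ** 2 for before, after in zip(z, z_next))
        for z, z_next in zip(states, states[1:])
    )


# One jump of length 5 (a 3-4-5 triangle) followed by no change.
penalty = smoothness_penalty([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]], lam=0.1)
```

A constant state sequence incurs zero penalty, so the term only affects training when consecutive states differ.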
The final transition update for each state $z_{i+1}$ combines all components, ensuring a balance between past influence, current interaction, and attention-weighted adjustment:

$$z_{i+1} = a_i \odot \big(\delta_i \cdot g_{\theta_4}(z_i) + (1 - \delta_i) \cdot h_{\theta_5}(X_{i+1})\big) + \epsilon, \tag{14}$$

where $\epsilon$ is a noise term that allows for minor variability in transitions, reflecting natural fluctuations in social engagement. This dynamic transition mechanism enhances SEN’s predictive capabilities, enabling it to anticipate the child’s engagement in future interactions effectively.
We design experiments to verify Equation 9 on multiple real-world datasets. These datasets cover different dynamic scenarios, including user behavior prediction and environmental variable change modeling. By fitting model predictions to actual observations, we quantify the statistical significance and goodness of fit of the key parameters in the equation. Furthermore, to evaluate the behavioral dependencies assumed by the model, we perform a sensitivity analysis on the core variable dependencies in the equation (such as the relationship between $t$ and $u_i$) and report the distribution of each parameter's impact on the model predictions.
In particular, to provide a basis for empirical validation, we introduce the following loss function to measure the deviation between model predictions and actual observed data:
where yi represents the observed value and