Cognitive Impairment Detection Based on Frontal Camera Scene While Performing Handwriting Tasks

Head movements can provide important information about a person's motor control, balance, and coordination. Some studies suggest that certain head movements, such as voluntary jerks [17] or tremors [18], may be indicative of neurological or movement disorders. However, it is important to note that head movements alone may not be sufficient to accurately diagnose a medical condition. Accurately diagnosing a pathological condition requires a comprehensive evaluation, including a series of physical and neurological tests. Nevertheless, the goal of this research is not to provide a medical diagnosis but to propose a simple method to raise a flag when there is a suspicious case of cognitive impairment.

Pentagon Copying Test

In the Pentagon Copying Test of the Montreal Cognitive Assessment (MoCA) by Nasreddine et al. [19], individuals are asked to draw a specific geometric figure, a pentagon, while following a set of instructions. The test assesses visuospatial abilities, executive function, and attention. The individual is given a piece of paper with a pre-drawn pentagon and is instructed to connect alternate corners of the pentagon with straight lines. The goal is to accurately complete the drawing according to the provided instructions.

The test is scored based on the correctness of the drawing and the adherence to the instructions. Errors such as incorrect line placement, extra lines, or failure to follow the instructions can indicate difficulties in visuospatial processing and executive function, cognitive domains often affected in certain neurological conditions. Most experts assign a score of zero or one depending on correctness. The drawing is considered correct if there are two pentagons, each pentagon has five sides, and the intersection between the two pentagons is correct; otherwise, the score is zero. This is a fast and easy-to-follow process.

In some scenarios, a more quantitative evaluation is performed, especially in computerized implementations. For instance, in Nagaratman et al. [20], a score from 1 to 10 is assigned to each portrayal. Table 1 summarizes the scoring system. Rotation of the figures and tremor were disregarded, in accordance with the original criteria.

Table 1 Scoring system on a 1 to 10 scale for the PDT

Figure 2 shows an example of the PDT for a user affected by AD in the basal situation and 6, 12, and 18 months after diagnosis.

Fig. 2

PDT produced by a user affected by AD in basal situation, 6, 12, and 18 months after diagnosis

In this paper, we have used the binary classification because our aim is cognitive impairment detection rather than assessment. We manually inspected the pentagons produced by all the users and assigned each user to the healthy control group or the cognitive impairment group.

Database

In this paper, we have acquired a new database called PECT-Tecnocampus. The acquisitions were carried out in several civic centers in the city of Mataró (Barcelona, Spain) and gathered 191 volunteers over the age of 60 from the Maresme region. All the donors signed an informed consent form in accordance with ethical regulations.

Individual physical abilities were assessed using questionnaires and physical tests (balance, mobility, cardiorespiratory fitness, among others), and an individualized report of the results was provided. Figure 3 shows the histogram of the donors' ages split into males and females. The mean (m) and standard deviation (std) of the ages are m = 71.76, std = 5.64 for males and m = 71.26, std = 6.28 for females.

Fig. 3

Histogram of ages for males (top) and females (bottom)

Of the 191 users, 174 finished all the handwritten tasks on a Wacom Cintiq digitizing tablet.

In our previously published databases, we acquired all the handwritten tasks on a single DIN A4 sheet. However, for this new database, we decided to use two different DIN A4 sheets: one for handwritten text and the signature and another for drawings. This permitted larger sizes and helped visually impaired people finish the tasks. Thus, the handwritten tasks can be classified into two groups:

Drawings: (a) two-pentagon copy test, (b) house copy test, (c) spring drawing, (d) Archimedes spiral, (e) concentric circles performed at regular speed, (f) straight line connecting two dots without touching the lower and upper black bars. Figure 4 shows the template used for these tasks.

Handwriting: (1) signature performed two times, (2) copy of words in capital letters, (3) copy of a sentence in cursive letters. For more information on the different handwritten tasks, see [10].

Fig. 4

Template used for drawing tests. It consists of six different tasks

The PECT-Tecnocampus database was acquired with a Wacom Cintiq 16 tablet and a modified version of the original HandAQUS software by Mucha [7]. This tablet offers a resolution of 5080 lpi and 8192 pressure levels. From this database, a manual classification by visual inspection of the pentagon copying test (PDT) [8] was performed by a lecturer of the health sciences faculty at Tecnocampus. Each user was assigned to the cognitive impairment group or the healthy group. Those who failed to pass the PDT were classified as users with some cognitive impairment.

Of the 174 users, 81 failed to pass the PDT (52 females and 29 males). The criteria used to evaluate the quality of the PDT were the following:

The PDT is successful if there are two pentagons that intersect at two points.

Each pentagon must have exactly five sides and five angles and must interlock at two points of contact.

The angles do not need to be equal, but the pentagons must not be open at any corner.

Small, almost imperceptible errors are allowed, as are evident tremors and lines that are not completely straight.

The MMSE alone does not provide a diagnosis, and although it is a useful tool when assessing a patient with memory problems, the diagnosis of mild cognitive impairment (MCI) or dementia is made by complementing it with a thorough medical history, a correct physical examination, and complementary tests. Even so, evolutionary monitoring of the patient is sometimes necessary to give a specific diagnosis.

Due to privacy issues and to the difficulty of knowing the pathologies of almost 200 users, each of them with a different family doctor, specialist, etc., we cannot know the exact pathology, if any, that affects each user. However, the goal of this study was not to diagnose patients but to study the health conditions of elderly people, including cognitive impairment.

The PDT is a sub-test of the Mini-Mental State Examination (MMSE) [8], used extensively in clinical and research settings as a measure of cognitive impairment. This manual classification is used as the "ground truth" for the automatic classification based on head movements acquired by the frontal camera of the eye-tracker system, described next.

While performing the handwriting tasks, the users wore the Tobii Pro Glasses 3® eye tracker (see Fig. 1). The glasses have several cameras pointing at the user's eyes (from the glasses to the eyes) and one camera pointing at the general scene seen by the user (from the glasses outward). The eye tracker provides a set of data that can be used to analyze visual behavior, reading habits, attention, interest, and other related metrics.

One key point is that, in a database acquisition, one of the most difficult tasks is the recruitment of donors. However, once donors are recruited, it is good practice to include as many non-invasive sensors as possible. For this reason, we added the eye tracker to the handwriting task acquisitions.

The initial idea was to use a wearable eye tracker while performing the handwriting tasks and to analyze the eye-tracking signals. This eye tracker not only acquires information related to gaze, pupil size, eye closing, etc., but also provides a video recording of the scene seen by the user wearing the glasses. However, during the database acquisition, we detected calibration problems for users wearing corrective lenses. In this case, the users had to wear the eye-tracker glasses over the corrective lenses used to overcome visual impairment. Unfortunately, wearing corrective lenses is quite common among elderly people when doing writing tasks.

Thus, using the eye-tracker information from the current database would require detecting and probably removing from the database and/or from the experiments those users with uncalibrated samples. However, we have not found acquisition failures with the frontal camera, which is also an interesting signal to analyze. We therefore decided to propose a system based on this signal rather than on the eye-tracking ones, which are left for future work (some acquired signals may exhibit low sensitivity to calibration errors).

It is worth mentioning that wearing eyeglasses such as the Tobii Pro Glasses 3 is quite comfortable for users and more convenient than some other specially designed devices mounted on the user's head.

For future work, we consider it important to use the corrective lenses kit available from Tobii. The kit includes individual lenses for the left and right eyes, ranging from −8.0 to +3.0 diopters in intervals of 0.5 diopters. This kit extends the applicability of the Tobii Pro Glasses 3 to people with the most common forms of near- and farsightedness. However, this would increase the acquisition time per user, as it requires an additional calibration for each user on top of the eye-tracker calibration.

The frontal camera provides useful information about head movements during the handwriting tasks. It records 25 fps at 1920 × 1080 pixels in RGB format.

Shot Boundary Detection with Background Subtraction

Shot boundary detection is a technique used to identify significant changes between shots in a video. The method relies on the pixel difference between two consecutive frames to determine whether a scene change has occurred. Background subtraction is used to separate the foreground (moving objects) from the static background in an image or sequence of images. The algorithm can be found in Candela et al. [21].
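The algorithm itself is described in [21]; the snippet below is only a minimal sketch of the general idea, combining OpenCV's MOG2 background subtractor with a threshold on the fraction of foreground pixels that change between consecutive frames. The threshold value and the change criterion are our own illustrative assumptions, not the parameters used in [21].

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path, diff_threshold=0.25):
    """Return frame indices where the foreground mask changes abruptly.

    A MOG2 background subtractor isolates moving regions; a large jump in
    the fraction of pixels that differ between consecutive foreground masks
    is taken as a shot (task) boundary. The threshold is an illustrative
    assumption, not a value taken from the original algorithm.
    """
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)
    boundaries, prev_mask, frame_idx = [], None, 0

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)              # foreground mask (0/255)
        mask = (mask > 0).astype(np.uint8)
        if prev_mask is not None:
            changed = np.mean(mask != prev_mask)    # fraction of changed pixels
            if changed > diff_threshold:
                boundaries.append(frame_idx)
        prev_mask = mask
        frame_idx += 1

    cap.release()
    return boundaries
```

The detected indices can then be used to cut the recording and save one sub-video per drawing task.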

Through this process, our method not only accurately identifies scene changes but also enables the extraction and saving of video segments corresponding to individual scenes, thereby providing a powerful tool for detailed analysis and segmentation of complex video content.

Video Dataset Creation with Shot Boundary Detection

To develop our training dataset, we employed the shot boundary detection technique to segment the continuous video capturing the entire user test into six separate sub-sequences. Each sub-sequence corresponds to a different task on the Cintiq tablet: drawing a pentagon, a house, a vertical spiral, an Archimedean spiral, concentric circles, and a straight line. The ability of shot boundary detection to pinpoint scene changes tied to shifts in activity enabled us to isolate each graphic task into an individual video.

Video analysis revealed that users with cognitive deficits exhibited irregular eye and head movements, particularly an inability to maintain a steady gaze on the pen while performing the tasks.

The data were subsequently labeled by hand into two macro-categories (pass/not pass the pentagon test), each including the six tasks.

As a result, we categorized the training data based on the six tasks performed by the two types of users: six classes for those who passed the pentagon test and another six for those who failed, making a total of twelve categories.
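As an illustration of this labeling scheme, a sub-video's class label can be built from its task name and the user's PDT outcome; the label strings and the helper function below are our own naming, not necessarily those used in the original dataset.

```python
TASKS = ["pentagon", "house", "spring", "archimedes_spiral",
         "concentric_circles", "straight_line"]

def class_label(task: str, passed_pdt: bool) -> str:
    """Map a (task, PDT outcome) pair to one of the 12 categories."""
    assert task in TASKS
    return f"{task}_{'pass' if passed_pdt else 'fail'}"

# Example: a house-drawing sub-video from a user who failed the PDT
print(class_label("house", passed_pdt=False))   # -> "house_fail"
```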

Shot Boundary Transformer Detection

Transformers, proposed by Vaswani et al. [22], are a particular type of deep learning model. The characteristic feature of these networks is self-attention, a process that differentially weights the importance of each part of the input data and works predominantly on sequential data.

For all sub-videos, a feature extractor is constructed using the DenseNet121 model [23] pre-trained on ImageNet [24], which is used to extract features from the video frames. Each sub-video input, labeled Si, is transformed into a three-dimensional matrix Xi with dimensions t × h × w (t = time, h = height, w = width). This conversion process is depicted in Fig. 5, where each array Xi is an element of the sub-video input Si.
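A minimal sketch of such a frame-level feature extractor is shown below, assuming a PyTorch/torchvision setup (the paper does not state the framework). It yields one pooled 1024-dimensional DenseNet121 descriptor per frame, whereas the tokens described above keep the t × h × w shape, so this illustrates the pretrained backbone rather than the exact pipeline.

```python
import torch
import torchvision.transforms as T
from torchvision.models import densenet121, DenseNet121_Weights

weights = DenseNet121_Weights.IMAGENET1K_V1
backbone = densenet121(weights=weights)        # pre-trained on ImageNet
backbone.classifier = torch.nn.Identity()      # keep the 1024-d pooled features
backbone.eval()

preprocess = weights.transforms()              # ImageNet resizing/normalization

@torch.no_grad()
def extract_video_features(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> (t, 1024) feature matrix."""
    batch = torch.stack([preprocess(T.ToPILImage()(f)) for f in frames])
    return backbone(batch)                     # one 1024-d token per frame
```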

Fig. 5

The input is parsed into i space-time tokens by an encoder. The encoder processes the input iteratively, one layer after another, allowing it to draw on the state at any previous point along the sequence. It consists of two main components: a self-attention mechanism and a feed-forward neural network.

As the first step, for each attention level, the transformer learns three weight matrices: the query weights WQ, the key weights WK, and the value weights WV.

For each token, the input embedding Xi is multiplied by each of the three weight matrices to produce the three corresponding vectors: the query qi = Xi·WQ, the key ki = Xi·WK, and the value vi = Xi·WV.

In the second step, the attention weights aij are calculated (Eq. 1) as the dot product between the query vector and the key vector for each token pair, normalized by the square root of the dimension of the key vectors (dk = t × h × w). This helps stabilize the gradients during training. The output for token i is then the weighted sum of the value vectors of all tokens, weighted by the normalized attention weights aij, which are computed as:

$$a_{ij}=\frac{q_{i}\cdot k_{j}}{\sqrt{d_{k}}}$$

(1)

In the third and final step, the encoder's output is used as the input to a multi-head attention layer, which then feeds the decoder. Each "attention head" includes its own set of matrices (WQ, WK, WV) and allows attention to be computed across different subspaces simultaneously, handling the relevant tokens according to various definitions of relevance. Following the method by LeCun et al. [25], after the decoder the output undergoes max pooling to reduce its spatial dimension, with this operation applied along the temporal axis of the frame sequence. Subsequently, as per Srivastava et al. [26], the max pooling output is processed through a dropout layer for regularization and to prevent overfitting during training. Finally, the processed output is sent to a densely connected feed-forward layer with a number of neurons equal to the number of classes in the classification problem. This layer applies a linear transformation followed by a softmax activation function, proposed by Bridle [27], to compute the classification probabilities for each class.
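A hedged sketch of this classification head is given below, assuming a PyTorch implementation; the feature dimension, number of classes, and dropout rate are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class VideoClassificationHead(nn.Module):
    """Max-pool over the temporal axis, regularize with dropout,
    then map to class probabilities with a dense softmax layer."""

    def __init__(self, d_model: int = 1024, num_classes: int = 12, p_drop: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, t, d_model) output of the attention layers
        pooled = tokens.max(dim=1).values          # max pooling along time
        logits = self.fc(self.dropout(pooled))     # linear transformation
        return torch.softmax(logits, dim=-1)       # class probabilities
```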

The computation of attention for all tokens can be expressed as a single matrix operation:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

(2)

where the softmax is taken along the horizontal axis and T denotes the transpose.
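Equations 1 and 2 can be sketched in code as a generic scaled dot-product attention (a standard formulation, not the authors' exact implementation); the toy dimensions below are arbitrary.

```python
import torch

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """X: (t, d) token embeddings; Wq, Wk, Wv: (d, d_k) learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k ** 0.5               # Eq. 1: scaled dot products
    weights = torch.softmax(scores, dim=-1)     # normalize along the horizontal axis
    return weights @ V                          # Eq. 2: weighted sum of values

# Toy usage with random data
t, d, d_k = 8, 16, 16
X = torch.randn(t, d)
out = scaled_dot_product_attention(X, torch.randn(d, d_k),
                                   torch.randn(d, d_k), torch.randn(d, d_k))
```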
