An Interpretable Neuro-symbolic Model for Raven’s Progressive Matrices Reasoning

Appendix 1. Code

The code for this study is available at https://github.com/scientific-lab/Toward_Intelligent_Semantic_Reasoning_on_Raven-s_Progressive_Matrices.

Appendix 2. Datasets, Models, and Other Resources

Datasets

We used the RAVEN [15], I-RAVEN [16], and RAVEN-fair [17] data generators to generate standard RPM problems.

The RAVEN dataset [15] (https://github.com/WellyZhang/RAVEN) uses a hierarchical generator to produce problems with different configurations, rules, and attributes. The dataset has 7 configurations (Center, 2 × 2Grid, 3 × 3Grid, L-R, U-D, O-IC, and O-IG) and 4 rules (constant, progression, distribute three, and arithmetic). The objects in the problems have 6 attributes: number, position, type, size, color, and orientation. Each wrong answer is generated by changing one attribute of the correct answer, which introduces an answer bias: the set of wrong answers alone indicates the right answer.
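As an illustration of this bias, the following minimal Python sketch (with hypothetical attribute readouts, not part of the released code) shows that a context-blind majority vote over the candidates' attribute values already identifies the correct answer:

```python
# Minimal illustration of the RAVEN answer bias (hypothetical attribute values):
# because every wrong answer differs from the correct answer in exactly one
# attribute, a context-blind majority vote over the candidates recovers the
# correct answer without looking at the context panels.
from collections import Counter

candidates = [
    {"type": 2, "size": 3, "color": 5},  # correct answer
    {"type": 4, "size": 3, "color": 5},  # type changed
    {"type": 1, "size": 3, "color": 5},  # type changed
    {"type": 2, "size": 5, "color": 5},  # size changed
    {"type": 2, "size": 1, "color": 5},  # size changed
    {"type": 2, "size": 3, "color": 7},  # color changed
    {"type": 2, "size": 3, "color": 2},  # color changed
    {"type": 2, "size": 3, "color": 9},  # color changed
]

def majority_vote(cands):
    # Score each candidate by how common its attribute values are in the answer set.
    counts = {k: Counter(c[k] for c in cands) for k in cands[0]}
    scores = [sum(counts[k][v] for k, v in c.items()) for c in cands]
    return scores.index(max(scores))

print(majority_vote(candidates))  # -> 0, the correct answer
```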

The I-RAVEN dataset [16] (https://github.com/husheng12345/SRAN) uses the same hierarchical generator but generates wrong answers differently. Each wrong answer is generated by changing each attribute of the correct answer with a 50% chance, so a wrong answer may differ from the correct answer in more than one attribute; thus, no answer bias exists.

The RAVEN-fair dataset [17] (https://github.com/yanivbenny/RAVEN_FAIR) also uses the same hierarchical generator but generates wrong answers with yet another method. The algorithm generates one wrong answer at a time: after producing the first wrong answer by changing one attribute of the correct image, it randomly selects one of the already generated candidates (correct or incorrect) and changes one of its attributes to produce a new incorrect answer. RAVEN-fair therefore also has no answer bias.
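A minimal sketch of this iterative procedure is shown below; the simplified attribute set is an assumption made only for illustration, as the real generator operates on the full RAVEN attribute hierarchy and rendering pipeline:

```python
import copy
import random

# Sketch of RAVEN-fair-style answer-set construction on a simplified,
# hypothetical attribute set.
ATTRIBUTES = {"type": range(5), "size": range(6), "color": range(10)}

def generate_fair_candidates(correct, n_wrong=7):
    pool = [dict(correct)]                            # start from the correct answer
    while len(pool) < n_wrong + 1:
        base = copy.deepcopy(random.choice(pool))     # pick any generated candidate
        attr = random.choice(list(ATTRIBUTES))        # change one of its attributes
        base[attr] = random.choice([v for v in ATTRIBUTES[attr] if v != base[attr]])
        if base not in pool:                          # keep candidates distinct
            pool.append(base)
    return pool

print(generate_fair_candidates({"type": 2, "size": 3, "color": 5}))
```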

β-VAE Module

The sVAE module was developed from the public code of the β-VAE model by Higgins et al. [30]. The original β-VAE code is available at https://github.com/AntixK/PyTorch-VAE.

Comparison Algorithms

We ran PrAE [21] on the RAVEN-fair dataset. The model is publicly available at https://github.com/WellyZhang/PrAE.

We ran Rel-base [19] on the I-RAVEN and RAVEN-fair datasets. The model is publicly available at https://github.com/SvenShade/Rel-AIR.

We ran MRNet [17] on the I-RAVEN dataset. The model is publicly available at https://github.com/yanivbenny/MRNet.

We ran SCL [20] on the RAVEN-fair dataset. The model is publicly available at https://github.com/dhh1995/SCL.

We ran SRAN [16] on the RAVEN and RAVEN-fair datasets. The model is publicly available at https://github.com/husheng12345/SRAN.

The model architectures and parameters used were the same as those provided in the corresponding repositories.

Appendix 3. Training Details

FCNN Training Details

We trained the FCNN model on a CPU platform (Windows 11 64-bit; Intel(R) Core(TM) i3-10110U CPU @ 2.10 GHz (4 CPUs)). The model contains 4 convolutional layers (16, 32, 64, and 128 channels; kernel size 5 × 5; stride 1; padding 1) and a feedforward layer (16,384 nodes). We generated 959 problems per configuration to train the FCNN module. For each problem, the model takes the first image as input and the configuration index of the problem as the label (cross-validation). The model was trained for up to 50 epochs and achieved 100% accuracy in classifying the configurations of the images.
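A minimal PyTorch sketch of such a configuration classifier is given below. The channel sizes, kernel size, stride, padding, and 16,384-node feedforward layer follow the description above, while the pooling layers, activation functions, classification head, and input resolution (one 160 × 160 grayscale panel) are assumptions made only to keep the sketch runnable:

```python
import torch
import torch.nn as nn

class ConfigClassifier(nn.Module):
    """Sketch of the FCNN configuration classifier (head and pooling are assumed)."""
    def __init__(self, n_configs=7):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=5, stride=1, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]              # assumed downsampling
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(16384),                    # feedforward layer from the text
            nn.ReLU(),
            nn.Linear(16384, n_configs))             # 7 RAVEN configurations

    def forward(self, x):
        return self.classifier(self.features(x))

model = ConfigClassifier()
logits = model(torch.randn(2, 1, 160, 160))          # first panel of two problems
print(logits.shape)                                  # torch.Size([2, 7])
```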

sVAE Training Details

We trained the sVAE module on the institute's GPU platform (NVIDIA-SMI 460.80; driver version 460.80; CUDA version 11.2).

Five hundred problems were generated per configuration to train and validate the sVAE module (approximately 375 training problems and 125 validation problems). Each problem consists of 16 NumPy array figures containing one or more objects. We segmented the figures into individual object images according to the structural organization of the figures. We used the cropped object images and 1 × 29 vectors describing the objects' meta-information (type, size, color, and angle), obtained from the corresponding problem XML file, to train the sVAE module. We trained one sVAE model for each configuration, except for the O-IC and O-IG configurations, for which we trained separate sVAE models for the "in" and "out" components. The model trained quickly, requiring less than 5 min and 10 epochs to obtain acceptable results. We trained the models up to the 100th epoch (or the 50th epoch for the "out" components) to obtain good reconstructions.

The model was trained with four losses: the object reconstruction loss, the latent variable reconstruction loss, the supervised loss (the difference between the semantic features and the labels), and the regularization loss (the divergence between the distribution of the latent variables and the assumed prior). The supervised loss used the smooth L1 loss, as in Eq. 3, while the other losses used the mean squared error (MSE) loss, as in Eq. 4:

$$L_{\mathrm{sup}}=\left\{\begin{array}{ll}\sum_{i=1}^{n}0.5\times {\left(y_{i}-f\left(x_{i}\right)\right)}^{2}, & \left|y_{i}-f\left(x_{i}\right)\right| < 1\\ \sum_{i=1}^{n}\left|y_{i}-f\left(x_{i}\right)\right|-0.5, & \mathrm{otherwise}\end{array}\right.$$

(3)

$$L_{\mathrm{mse}}=\frac{1}{n}\sum_{i=1}^{n}{\left(y_{i}-f\left(x_{i}\right)\right)}^{2}$$

(4)
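In code, the combined objective can be sketched as follows; the equal weighting of the four terms and the KL weight are assumptions, not the values used in the released training scripts:

```python
import torch
import torch.nn.functional as F

# Sketch of the four-term sVAE objective described above.
def svae_loss(x, x_hat, z, z_hat, sem_pred, sem_label, mu, logvar, kl_weight=1.0):
    recon_obj    = F.mse_loss(x_hat, x)                   # object reconstruction (Eq. 4)
    recon_latent = F.mse_loss(z_hat, z)                   # latent variable reconstruction (Eq. 4)
    supervised   = F.smooth_l1_loss(sem_pred, sem_label)  # semantic features vs. labels (Eq. 3)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # regularization term
    return recon_obj + recon_latent + supervised + kl_weight * kld
```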

We evaluated the performance of the sVAE model for different training sample sizes (object segments from 100 to 959 problems). The model achieved its top perceptual accuracy within 300 training problems and accurately answered RPM problems when combined with human-designed cognitive maps. The reported model was trained on 500 problems (see Table 7).

Table 7 Model perception and problem-solving performance for different sample sizes (number of object images)

The model did not need to see all objects to reason effectively. RAVEN contains 2400 distinct objects. In this study, we used only 240 to 2400 images, corresponding to 240 to 2400 objects (one image per object), to train the model. The performance is shown in Table 8.

Table 8 Perception and problem-solving performance of the model trained with different proportions of all 2400 objects

sVAE Image Generation

sVAE can generate clear answer images for RPM problems. To generate an answer image, the algorithm first generated object images according to the object features and then arranged them according to their predicted positions. There were two levels of prediction (panel-level and object-level) for both object features and positions. We first considered object-level predictions, which specify the attribute (feature or position) of a single object. If there was no object-level prediction, we assigned attributes according to the panel-level predictions. Because panel-level predictions do not specify which value corresponds to which object, the order of assignment was randomized when assigning values from panel-level predictions. If there were no predictions at either level, we assigned values to the attribute at random, as sketched below.
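The assignment hierarchy can be sketched as follows; the function and argument names are illustrative only:

```python
import random

def assign_attribute(n_objects, object_level, panel_level, value_range):
    """Sketch of the prediction hierarchy: object-level predictions fix the
    attribute of a specific object; panel-level predictions give an unordered
    list of values assigned in random order; otherwise values are random."""
    if object_level is not None:                 # {object index: value}
        return [object_level[i] for i in range(n_objects)]
    if panel_level is not None:                  # unordered list of values
        values = list(panel_level)
        random.shuffle(values)
        return values[:n_objects]
    return [random.choice(list(value_range)) for _ in range(n_objects)]

# e.g. three objects, only a panel-level size prediction available
print(assign_attribute(3, None, [2, 4, 4], value_range=range(6)))
```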

The semantic features in the sVAE altered the images in an understandable way, a result that was not possible with β-VAE (Fig. 17) [54].

Fig. 17

When we changed the latent variables in the bottleneck of β-VAE, the change in the generated image was usually unpredictable: the attributes (shape, angle, and size) tended to change together, and many dimensions did not change the image at all. Conversely, when we changed the semantic features in sVAE, the change in the generated images was predictable. For example, we can change the shape of an object from a triangle to a circle (second row) or change its color from light to dark (fourth row). We can also create objects with specified semantic features (last row)

Cognitive Map Training Details

The cognitive maps were trained on a CPU machine (Windows 10 64-bit; Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz (12 CPUs)).

There are three types of feature maps: attribute feature maps, 2 × 2 position feature maps, and 3 × 3 position feature maps. Each type has two subtypes: 9 × 9 feature maps and 9 × 9 × 9 feature maps.

To build feature maps for feature vectors containing the attributes of the first eight panels and the answer panel, we found categorical numerical relationships between two (or three) elements of the feature vectors. Position (a, b) of a 9 × 9 feature map (or (a, b, c) of a 9 × 9 × 9 feature map) stores the categorical relationship between the ath and bth elements (or among the ath, bth, and cth elements) of the feature vector. For attribute feature maps, the relationships in 9 × 9 feature maps are "+1," "−9," "=," etc., while the relationships in 9 × 9 × 9 feature maps are "a + b = c," "a − b − 2 = c," etc. The relationships for 2 × 2 and 3 × 3 position feature maps differ from those for attribute feature maps. For example, "+1" defines a position relationship in which all objects move one position to the right and the last object wraps around to the first position, and "a + b = c" means that the objects in the third panel occupy all the positions occupied in the first and second panels. A minimal sketch of how such relations can be read off a feature vector is given below.
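In this sketch, the relation labels follow the examples above, while the exact encoding used by the CMRB is an assumption made for illustration:

```python
import numpy as np

def pairwise_relation(vec, a, b):
    """Categorical relation stored at position (a, b) of a 9 x 9 attribute feature map."""
    diff = int(vec[b] - vec[a])
    return "=" if diff == 0 else f"{diff:+d}"      # e.g. "+1", "-9"

def triple_relation(vec, a, b, c):
    """Categorical relation stored at position (a, b, c) of a 9 x 9 x 9 feature map."""
    offset = int(vec[c] - (vec[a] + vec[b]))
    return "a + b = c" if offset == 0 else f"a + b {offset:+d} = c"

vec = np.array([1, 2, 3, 2, 3, 4, 3, 4, 5])        # attribute values of the nine panels
print(pairwise_relation(vec, 0, 1))                # "+1"
print(triple_relation(vec, 0, 3, 6))               # "a + b = c" (1 + 2 = 3)
```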

We generated 15,000 problems per configuration to train and validate the CMRB (10,000 for training and 5000 for validation). We used the "Center" configuration problems, with 16 images per problem, one object per image, and three attributes per object, to extract feature vectors and train the attribute feature maps. The sVAE model was used to acquire attributes from the objects and construct feature vectors; a total of 30,000 feature vectors were constructed from the 3 attributes of the 10,000 training problems. We provided additional candidate answers for attributes in the test set; if the predicted attribute value was not among the candidates, the algorithm was allowed to use other cognitive maps. The learned cognitive maps generalize to other configurations. Similarly, we used the positional information from 2 × 2 and 3 × 3 problems to train the 2 × 2 and 3 × 3 position feature maps.

We stored the acquired cognitive maps after each epoch (1000 training steps) and selected the best-performing model from the training epochs based on validation performance. The model trained quickly, reaching 99.5% validation accuracy in the 2nd training epoch with 2000 feature vectors (approximately 700 problems) and its best performance (99.7%) in the 10th training epoch with 10,000 feature vectors (approximately 3000 problems). The performance of the position cognitive maps on problems with "position" or "number/position" rules in the metadata was 1.0 (the metadata indicating which problems contain position relations was not available to the algorithm during training). The threshold similarity scores L0 (for 9 × 9 feature maps) and L1 (for 9 × 9 × 9 feature maps) are parameters of the CMRB. The validation performance of the model at different threshold similarity scores is shown in Tables 9, 10, and 11. The parameters that led to the best performance with the fewest cognitive maps (numbers in parentheses) were selected (shown in italics).

Table 9 Performance of attribute feature maps at different similarity thresholds

Table 10 Performance of 2 × 2 position feature maps at different similarity thresholds

Table 11 Performance of 3 × 3 position feature maps at different similarity thresholds

The LTM has a capacity of 30 for each subtype of feature map and can therefore store a maximum of 180 cognitive maps in total. The best model ultimately generated 49 cognitive maps: 7 (9 × 9) + 8 (9 × 9 × 9) attribute feature maps, 7 + 17 2 × 2 position feature maps, and 6 + 4 3 × 3 position feature maps. Many cognitive maps capture the underlying rules of RPM problems. The algorithm also discovered some rules that are not used by the RPM generators; these maps can solve some problems efficiently but can also lead to predictions that differ from those of the data generator (Tables 11 and 12).

Cognitive maps reflect how the algorithm sees RPM problems, and in some cases it sees them differently than humans do. For example, the algorithm sometimes uses the plus operation in the position feature maps to solve cases where three different positions have the same value. This view is logically correct, but humans rarely see the problem this way. The algorithm also frequently discovers relationships between nonadjacent panels, which is unusual for humans. This behavior is reminiscent of AlphaGo, which produced unusual strategies in the game of Go.

Model Testing

Using parameters from the trained FCNN and sVAE modules and cognitive maps from the CMRB module, we tested the algorithms on 70,000 new problems (10,000 problems per configuration), and the resulting performance is reported in the main text (Fig. 18).

Appendix 4. Designed Cognitive Maps

Fig. 18

We can also draw cognitive maps by hand and assign relationships between panels as shown above

We can draw a cognitive map by hand based on our understanding of the problem (e.g., Fig. 8) and represent it mathematically: (1) we draw the structure and define nine variables (x1, x2, …, x9) corresponding to the 9 positions in the 3 × 3 matrix; and (2) we draw the edges and describe them (the relations between the 9 variables) as mathematical equations, e.g., x2 = x1 + n and x3 = x2 + n (where n ranges from −9 to 9).

When given a new RPM problem, we fed its first 8 attribute values into the first 8 defined variables and checked whether the mathematical equations were satisfied. If they were, we computed the 9th variable from its relationships to the other variables, as sketched below. The performance of the hand-designed cognitive maps for the 7 configurations of the RAVEN, I-RAVEN, and RAVEN-fair datasets is shown in Table 12.
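A minimal sketch of applying such a hand-designed map (here the row-wise progression x2 = x1 + n, x3 = x2 + n for a single attribute) is shown below; the function name and input format are illustrative, not the released implementation:

```python
# Sketch of applying the hand-designed progression map to one attribute.
def apply_progression_map(attrs8):
    """attrs8: the attribute values of the first eight panels in row-major order.
    Returns the predicted ninth value if some n in [-9, 9] satisfies the map,
    otherwise None (the map does not describe this problem)."""
    rows = [attrs8[0:3], attrs8[3:6], attrs8[6:8]]
    for n in range(-9, 10):
        fits = all(r[1] == r[0] + n for r in rows) and \
               all(r[2] == r[1] + n for r in rows if len(r) == 3)
        if fits:
            return attrs8[7] + n                   # x9 = x8 + n
    return None

print(apply_progression_map([1, 2, 3, 2, 3, 4, 3, 4]))   # -> 5
```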

Table 12 Mean accuracy of the model with designed cognitive maps on the RAVEN, I-RAVEN, and RAVEN-fair datasets

Appendix 5. Generalization Experiment Datasets and Details

In the 3D Chairs dataset [39] (https://www.di.ens.fr/willow/research/seeing3Dchairs/), we selected 20 types of chairs with 21 images per chair (image size: cropped to the center 592 pixels and resized to 256 × 256; 3 channels), taken from left, right, front, and back angles (4–6 images per angle, marginally different from each other), and applied four types of transformations (zoom, stretch, shift, and color change, with 5, 5, 3, and 5 levels, respectively) to the images, creating a dataset of 157,500 images (7875 per chair type) with labels documenting the types, transformations, and angles. The transformation grid is illustrated below.
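The size of the dataset follows directly from the transformation grid; the short sketch below only verifies this arithmetic with the level counts given above:

```python
from itertools import product

# Enumerate the transformation grid: zoom, stretch, shift, and color change
# with 5, 5, 3, and 5 levels, respectively.
grid = list(product(range(5), range(5), range(3), range(5)))
print(len(grid))               # 375 transformed variants per source image
print(21 * len(grid))          # 7875 images per chair type
print(20 * 21 * len(grid))     # 157,500 images in total
```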

For the 3D face dataset [40] (https://faces.dmi.unibas.ch/bfm/bfm2019.html), we used the Basel Face Model to construct three-dimensional faces with 53,490 3D vertices. Each vertex has three position indices (x, y, z) describing its topologically corresponding position and three color indices (r, g, b) describing its texture. A new face was generated by randomly sampling from Gaussian distributions to determine the weights of the first 199 shape principal components (the principal components of the position indices) and the first 199 appearance principal components (the principal components of the color indices). We created a dataset of 25,000 images (image size: resized to 256 \(\times\) 256; 3 channels) with labels by rendering an image at a 60° left viewing angle for each constructed face and using the weights of the first 25 shape and appearance components as labels.
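A minimal NumPy sketch of the sampling step is shown below; the placeholder model arrays and the reduced vertex count are assumptions (the real Basel Face Model supplies the mean, principal components, and standard deviations for all 53,490 vertices):

```python
import numpy as np

# Placeholder data stands in for the Basel Face Model files; only the sampling
# logic is illustrated. The real model has 53,490 vertices, not 1000.
n_vertices, n_pcs = 1000, 199
rng = np.random.default_rng(0)

mean_shape  = rng.standard_normal(n_vertices * 3)            # mean vertex positions
shape_pcs   = rng.standard_normal((n_vertices * 3, n_pcs))   # shape principal components
shape_sigma = np.abs(rng.standard_normal(n_pcs))             # per-component std. deviations

alpha = rng.standard_normal(n_pcs)                           # sampled Gaussian weights
face  = (mean_shape + shape_pcs @ (alpha * shape_sigma)).reshape(n_vertices, 3)

labels = alpha[:25]                                          # first 25 shape weights as labels
print(face.shape, labels.shape)                              # (1000, 3) (25,)
```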

The CelebA dataset [41] (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains 202,599 images (image size: cropped to the center 148 pixels and resized to 128 \(\times\) 128; 3 channels; random horizontal flip) labeled with 40 binary attributes and ten landmark coordinates. We also trained an active appearance model (AAM, https://www.menpo.org/menpofit/aam.html) to place 68 landmark points carrying the shape information of the faces, from which we derived additional brain-like shape and appearance labels. Using the acquired shape information, we morphed the landmarks to the average landmark locations to produce images carrying shape-free appearance information, and we projected the shape information and the shape-free appearance onto 24 principal components each. The scores of the images on these shape and appearance principal components, together with the 50 semantic labels from the dataset, were used to train the model, resulting in 98-dimensional labels.

The LFW dataset [42] (http://vis-www.cs.umass.edu/lfw/) contains 13,233 images (image size: cropped to the center 148 pixels and resized to 128 \(\times\) 128; 3 channels; random horizontal flip) with 73-dimensional numerical labels. We generated additional 48-dimensional brain-like shape and appearance labels using the same active appearance model procedure as for the CelebA dataset.
