Shift-Equivariant Similarity-Preserving Hypervector Representations of Sequences

Experimental evaluations of the proposed approach were carried out in several diverse tasks: spellchecking (“Spellchecking” section), classification of molecular biology data (“Classification of Molecular Biology Data” section), modeling the identification of visual images of words by humans (“Modeling Visual String Identification by Humans” section). In the scope of this paper, the intention of the experiments was to demonstrate the feasibility of the proposed approach to hypervector representation of sequences and its applicability to diverse problem setups.

Spellchecking

For misspelled words, a spellchecker suggests one or more variants of the correct word. We used similarity search for this problem. Dictionary words and misspelled (query) words were transformed to hypervectors by the methods of “Method” section. A specified number of dictionary words with the HVs most similar to the HV of a query word were selected.

As in [72,73,74,75], the measure Top-n = tn/t was used as an indicator of quality or accuracy, where tn is the number of cases where correct words are contained among the n words of the dictionary most similar to the query, t is the number of queries (i.e., the size of the test set). Two datasets were used: aspellFootnote 1 and wikipedia.Footnote 2 The tests contain misspellings for some English words and their correct spelling. Our results are obtained with the corncobFootnote 3 dictionary containing 58,109 lowercase English words.

Figure 3 shows the aspell Top-n vs R for n =  and their average (Top-mean) for n = 1…10. Here and thereafter, the dimension of HVs is D = 10,000. R = 1 corresponds to the lack of similarity between letter HVs at adjacent positions. As R increases, the results improve upto R = 6–8, then deteriorate slowly.

Fig. 3figure 3

The Top-1, Top 10, and Top-mean spellchecking accuracy on the aspell dataset vs the similarity radius R. SimHV,cos, no shifts in (4)

Our results obtained using simHV and simSym for various parameters are shown in Table 1. For HVs, means and stds are given (over 50 realizations). The results of Word®, Ispell [75], Aspell [75], and the spellcheckers from [72,73,74] are also shown. All these spellcheckers work with single words, i.e., do not take into account the adjacent words. However, the results of [72, 73, 75] were obtained using methods specialized for English (using rules, word frequencies, etc.). In contrast to those results, our approach extends naturally to other languages.

Table 1 Spellchecking accuracy on the aspell and wikipedia tests

Only [74] worked with HV representations and corncob. However, they used the HVs of all 2-grams in the forward direction and in the backward direction, as well as all subsequences (i.e., non-adjacent letters) of two letters in the backward direction. This is in striking contrast to our HV representations, which reflect only the similarity of the same individual letters at nearby positions. Our results are at the level of [74]. Our best results were obtained for s = 0 (no string shifts, see “Hypervector Similarity of Strings without Shift” section). Increasing s did not lead to a noticeable result change. Note that the similarity search using distLev and distLev/max (divided by the length of the longer word) produced results that are inferior to ours.

Classification of Molecular Biology Data

Experiments were carried out on two Molecular Biology datasets from [76]. For hypervectors, we used the following classifiers: nearest neighbors kNN (mainly with simHV,cos), Prototypes, and linear SVM. In Prototypes, class prototypes were obtained by summing the HVs of all training samples from the class; their max similarity with the test HV was used. For SVM, in some cases, the SVM hyperparameters for a single realization of hypervectors were selected by optimization on the training set. The same SVM hyperparameters (or default) were used for multiple HV realizations.

Splice Junction Recognition

The Splice-junction Gene Sequences datasetFootnote 4 [76] contains gene sequences for which one needs to recognize the class of splice junctions they correspond to: exon–intron (EI), intron–exon (IE), and no splice (Neither). Each sequence contains 60 nucleotides. The database consists of 3190 samples; 80% of each class was used for training and 20% for testing. Recognition results (accuracy) are shown in Table 2. The results obtained using hypervectors are on a par with the results of other methods from [76,77,78,79,80,81]. Note that a direct comparison of the results is not fair due to different data partitioning into training and test sets.

Table 2 Accuracy on the splice-junction gene sequences dataset

The best results were obtained for R = 1, s = 0. This corresponds to an element-wise comparison of the sequences. We explain this by the fact that the sequences in the database are well-aligned and the recognition result in this problem depends on the presence of certain nucleotides at strictly defined positions. Nevertheless, the introduced hypervector representations and similarity measures demonstrate competitive results for the selected parameters.

Protein Secondary Structure Prediction

The Protein Secondary Structure datasetFootnote 5 [76] contains some globular proteins data from [82], and the task is to predict their secondary structure: random-coil, beta-sheet, or alpha-helix. As input data, a window of 13 consecutive amino acids was used, which was shifted over proteins. For each window position and for the amino acid in the middle of the window, the task was to predict what secondary structure it is a part of within the protein. The training/test sets contained 18,105/3520 samples. The prediction results are shown in Table 3. The results of HV and linear SVM for R = 1 are at the level of 62.7% [82] obtained by multilayer perceptron for the same experimental design. Using R = 2 slightly improved the results obtained with R = 1.

Table 3 Prediction accuracy on the protein secondary structure dataset

Note, that the results obtained on this dataset under this very setup using other methods are inferior to ours (see, e.g., [82]). To improve the results in this and similar tasks, some techniques after [82] used additional information, such as the "similarity" of amino acids, etc. This information can be taken into account in the varied similarity of HVs representing different amino acids; however, this is beyond the scope of this paper (see also Discussion).

The results of this section show that not all string processing tasks benefit from accounting for symbol insertions/deletions (in our approach, regulated by R) and string shifts (regulated by s). For instance, for the Splice-junction dataset, R > 1 worsened the results, and for the Secondary-structure dataset, R = 2 only slightly improved them. However, we see that the HV representations provide worthy classification results using linear vector classifiers.

The tasks in which the results depend significantly on both parameters R and s are considered in the next section.

Modeling Visual String Identification by Humans

Here we present the results of experiments on the similarity of words using their hypervector representation. The results are compared to those obtained by psycholinguists for human subjects and provided in [24, 26]. Those experiments investigated priming for visual (printed) words in humans.

Modeling Restrictions on the Perception of Word Similarity

In [25], the properties of visual word similarity obtained by psycholinguists in experiments with human subjects have been summarized and classified into 4 types of constraints, i.e., stability (similarity of a string to any other is less than to itself); edge effect (the greater importance of the outer letters coincidence vs the inner ones); transposed letter (TL) effects (transposing letters reduces similarity less than replacing them with others); relative position (RP) (breaking the absolute letter order while keeping the relative one still gives effective priming).

Table 4 shows which constraints on human perception of visual word similarity are satisfied in various models. Our results were obtained for simSym and simHV (D = 10,000, m = 11, 50 realizations). To reflect the edge effect, we used the "db" option: the HVs were formed in a special way equivalent to the HV representation of strings with doubled first and last letters.

Table 4 Constraints on human perception of visual word similarity that are satisfied by various models

For s > 2, the results coincided with s = 2. The best fit to the human constraints was for R = , s = 2. For simSym,cos and for simHV,Simp all constraints are satisfied for R = 3, s = 2.

For comparison, the results of other models are shown:

Hannagan et al. [25] used the HV representations of the BSC model [46], with the following options. Slot: superposition of HVs obtained by binding HVs of each letter and its (random) position HV. COB: all subsequences of two letters with a position difference of up to 3. UOB: all subsequences of two letters. LCD: a combination of Slot and COB. [25] also tested non-hypervector models: Seriol and Spatial [24].

Cohen et al. [58] used the BSC model and real- and complex-valued HVs of HRR [45]. Position HVs were correlated, and their similarity decreased linearly along the length of a word.

Cox et al. [26] proposed the “terminal-relative” string representation scheme. It used the representation of letters and 2-grams without position, as well as the representation of the positions of letters and 2-grams relative to the terminal letters of the word. This scheme was implemented in the HRR model and met all the constraints from [25].

Modeling the Visual Similarity of Words

In [27], the experimental data on the visual word identification by humans were adapted from [24]. 45 pairs of prime-target strings were obtained, for which there exist the times of human word identification under different types of priming.

Figure 4 shows the average value of the Pearson correlation coefficient Corr (between the simHV values and priming times) vs R for different s (D = 10,000, m = 111, 50 HV realizations). It can be seen that the value of Corr depends substantially on both R and s. The maximum values were obtained at R = 3 and s = 2.

Fig. 4figure 4

Mean values of the Pearson correlation coefficient Corr between word hypervector similarity values and the times of the word forward priming vs R. SimHV,cos, shift configurations s = 1, 2, 3, 4 in (4)

Table 5 shows the Corr between the times of human identification and the values of similarity. The results for HVs were obtained for R = 3, s = 2 (D = 10,000, m = 111, 50 HV realizations) and for similarity measures simHV,Jac, simHV,cos, simHV,Simp. The db option was used. We also provide results for distLev and simSym,Simp (R = 3, s = 2) without the db option.

Table 5 The Pearson correlation coefficient Corr between 45 word pairs similarity values and the times of the forward priming

The results from [27] are shown as well, where strings were transformed to vectors whose components correspond to certain combinations of letters, with the following variants. Spatial Coding: adapted from [24]. GvH UOB: all subsequences of two letters are used. Kernel UOB (Gappy String kernel): uses counters of all subsequences of two letters within a window. 3-WildCard (gappy kernel): kernel string similarity [27] (all subsequences of two letters are padded with * in all acceptable positions, the vector contains the frequency of each obtained combination of three symbols).

It can be seen that with the proper parameters, the results of hypervector similarity measures are competitive with other best results, such as distLev and simSym.

留言 (0)

沒有登入
gif