Fast and accurate protein structure search with Foldseek

Overview

Foldseek enables fast and sensitive comparison of large structure sets. It encodes structures as sequences over the 20-state 3Di alphabet and, thereby, reduces structural alignments to 3Di sequence alignments. The 3Di alphabet developed for Foldseek describes tertiary residue–residue interactions instead of backbone conformations and proved critical for reaching high sensitivities. Foldseek’s prefilter finds two similar, spaced 3Di k-mer matches in the same diagonal of the dynamic programming matrix. By not restricting itself to exact matches, the prefilter achieves high sensitivity while reducing the number of sequences for which full alignments are computed by several orders of magnitude. Further speed-ups are achieved by multi-threading and using single instruction, multiple data (SIMD) vector units. Owing to the SIMDe library (https://github.com/simd-everywhere/simde), Foldseek runs on a wide range of CPU architectures (x86_64, arm64 and ppc64le) and operating systems (Linux and macOS). The core modules of Foldseek, which build on the MMseqs2 framework6, are described in the following paragraphs.

Create database

The createdb module converts a set of PDB (ref. 29), macromolecular crystallographic information file (mmCIF) formatted files or Foldcomp compressed structure (FCZ (ref. 30)) files into an internal Foldseek database format using the Gemmi package (https://gemmi.readthedocs.io/en/latest/) or the Foldcomp library. The format is compatible with the MMseqs2 database format, which is optimized for parallel access. We store each chain as a separate entry in the database. The module follows the MMseqs2 createdb module logic. However, in addition to the amino acid sequence, it computes the 3Di sequence from the 3D coordinates of the backbone and Cβ atoms (see the ‘Descriptors for 3Di structural alphabet’ and ‘Optimize nearest-neighbor selection’ subsections). Backbone atom and Cβ coordinates are needed only for the nearest-neighbor selection. For Cα-only structures, Foldseek reconstructs backbone atom coordinates using PULCHRA31. Missing Cβ coordinates (for example, in glycines) are defined such that the four groups attached to the Cα are arranged at the vertices of a regular tetrahedron. The 3Di and amino acid sequences and the Cα coordinates are stored in the Foldseek database. To save disk space, we optionally compress the Cα coordinates losslessly, beginning with three uncompressed 4-byte floating-point Cα coordinates and storing all subsequent coordinates as 2-byte signed integer differences32. If any difference is too large to be represented with a 2-byte signed integer, we fall back to 4-byte floats for all Cα coordinates.
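The delta encoding just described lends itself to a brief illustration. Below is a minimal Python sketch, not Foldseek's actual code; the fixed-point scale and the one-byte format marker are assumptions made for the example (PDB coordinates carry three decimals, so differences in 1/1,000 Å steps lose no precision).

import struct

SCALE = 1000  # assumed fixed-point scale: 1/1,000 Å steps (PDB coordinates have three decimals)
INT16_MIN, INT16_MAX = -32768, 32767

def compress_ca(coords):
    """Delta-compress a list of (x, y, z) C-alpha coordinates.

    The first coordinate is kept as three 4-byte floats; every following
    coordinate is stored as three 2-byte signed integer differences to its
    predecessor. If any difference overflows int16, all coordinates are
    stored as plain 4-byte floats instead.
    """
    quant = [tuple(round(v * SCALE) for v in c) for c in coords]
    deltas = [tuple(b - a for a, b in zip(p, q)) for p, q in zip(quant, quant[1:])]
    if any(not (INT16_MIN <= d <= INT16_MAX) for delta in deltas for d in delta):
        # fallback: uncompressed 4-byte floats for every coordinate
        return b"F" + b"".join(struct.pack("<3f", *c) for c in coords)
    head = struct.pack("<3f", *coords[0])
    body = b"".join(struct.pack("<3h", *d) for d in deltas)
    return b"D" + head + body

# Example: three residues roughly 3.8 Å apart along the chain
print(len(compress_ca([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.1, 0.0)])))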

Prefilter

The prefilter module detects double matches of similar, spaced words (k-mers) that occur on the same diagonal. The k-mer size is dynamically set to k = 6 or k = 7 depending on the size of the target database. Similar k-mers are those with a 3Di substitution matrix score above a certain threshold, whereas MMseqs2 uses an amino acid substitution matrix to compute the similarity (see the ‘3Di substitution score matrix’ subsection). The gapless double-match criterion suppresses hits to non-homologous structures effectively, as they are less likely to have consecutive k-mer matches on the same diagonal by chance. To avoid FP matches due to regions with biased 3Di sequence composition, a compositional bias correction is applied in a way analogous to MMseqs2 (ref. 33). For each hit, we perform an ungapped alignment over the diagonals with double, consecutive, similar k-mer matches and sort those by the maximum ungapped diagonal score. Alignments with a score of at least 15 bits are passed on to the next stage. We implemented an optional taxonomy filter within the prefiltering step to help users search through taxonomic subsets of the target database. After the gapless double-diagonal matching stage and before the ungapped alignment stage, we reject all potential target hits that do not lie within a taxonomic clade specified by the user.
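To illustrate the double-diagonal match criterion, here is a toy Python sketch. Unlike Foldseek, it uses only exact, contiguous k-mers (Foldseek additionally generates similar, spaced k-mers above a 3Di substitution score threshold), but the bookkeeping of diagonals is the same idea.

from collections import defaultdict

def double_kmer_hits(query, targets, k=6):
    """Toy version of the double-diagonal k-mer prefilter."""
    # index all target k-mers: k-mer -> list of (target id, target position)
    index = defaultdict(list)
    for tid, tseq in enumerate(targets):
        for j in range(len(tseq) - k + 1):
            index[tseq[j:j + k]].append((tid, j))

    hits = set()
    last_match = {}  # (target id, diagonal) -> query position of previous match
    for i in range(len(query) - k + 1):
        for tid, j in index.get(query[i:i + k], ()):
            diag = i - j
            prev = last_match.get((tid, diag))
            if prev is not None and prev != i:
                hits.add(tid)  # second k-mer match on the same diagonal
            last_match[(tid, diag)] = i
    return hits

print(double_kmer_hits("DDVVLLQQAAPPDD", ["DDVVLLQQAA", "PPLLQQDDVV", "WWWWWWWW"]))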

Pairwise local structural alignments

After the prefilter has removed the vast majority of non-homologous sequences, the structurealign module computes pairwise alignments for the remaining sequences using an SIMD-accelerated Smith–Waterman algorithm34,35. We extended this implementation to support amino acid and 3Di scoring, compositional bias correction and 256-bit-wide vectorization. The score linearly combines amino acid and 3Di substitution scores with weights 1.4 and 2.1, respectively. We optimized these two weights and the ratio of gap extend to gap open penalty on ~1% of alignments (all-versus-all on 10% of randomly selected SCOPe40 domains). A compositional bias correction is applied to the amino acid and 3Di scores. To further suppress high-scoring FP matches, for each match we align the reversed query sequence against the target and subtract the reverse bit score from the forward bit score.
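As a sketch of how the two scoring terms and the reverse-sequence correction combine (hypothetical helper names; the compositional bias correction is omitted for brevity):

# Minimal sketch of the combined per-position substitution score and the
# reverse-sequence correction described above. `aa_mat` and `di_mat` stand for
# the amino acid and 3Di substitution matrices (dicts of dicts).
W_AA, W_3DI = 1.4, 2.1

def combined_score(q_aa, q_3di, t_aa, t_3di, aa_mat, di_mat):
    """Score of aligning one query position (amino acid + 3Di state)
    to one target position."""
    return W_AA * aa_mat[q_aa][t_aa] + W_3DI * di_mat[q_3di][t_3di]

def corrected_bit_score(bits_forward, bits_reversed_query):
    """Suppress high-scoring false positives by subtracting the bit score of
    the reversed query aligned to the same target."""
    return bits_forward - bits_reversed_query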

Structural bit score

We rank hits by a ‘structural bit score’, that is, the product of the bit score produced by the Smith–Waterman algorithm and the geometric mean of the average alignment LDDT and the alignment TM-score.
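Written out with the quantities just defined, the ranking score is

$$\mathrm{structural\ bit\ score}={S}_{\mathrm{bit}}\times \sqrt{\overline{\mathrm{LDDT}}\times \mathrm{TM}},$$

where \({S}_{\mathrm{bit}}\) is the Smith–Waterman bit score, \(\overline{\mathrm{LDDT}}\) is the average alignment LDDT and TM is the alignment TM-score.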

Fast alignment LDDT computation

To improve the LDDT score computation speed, we store the 3D coordinates of the query in a grid using spatial hashing. Each grid cell spans 15 Å, which is the default radius considered for the LDDT computation. For each aligned query residue i, we compute the distances to all Cα atoms within a 15 Å radius by searching all neighboring grid cells of the query residue’s grid cell. For each residue j, we compute the distance between the Cα atoms of i and j and the distance of the corresponding aligned target residues. Query and target distances for the aligned pairs are subtracted, and the differences d are transformed into LDDT scores s = 0.25 × ((d < 0.5) + (d < 1.0) + (d < 2.0) + (d < 4.0)). For each i, we obtain the means of the scores for all Cα atoms j within the 15 Å radius of i. The LDDT score is the mean of these means over all query residues i.
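The following Python sketch illustrates the grid-accelerated computation described above (a simplified reimplementation, not Foldseek's C++ code):

import math
from collections import defaultdict

R = 15.0  # inclusion radius in Å (default LDDT radius)
THRESHOLDS = (0.5, 1.0, 2.0, 4.0)

def alignment_lddt(q_ca, t_ca, alignment):
    """q_ca, t_ca: lists of (x, y, z) Cα coordinates of query and target.
    alignment: list of (query index, target index) pairs of aligned residues."""
    aligned = dict(alignment)  # query residue -> aligned target residue

    def cell_of(p):
        return tuple(int(math.floor(c / R)) for c in p)

    # spatial hash: 15 Å cells holding aligned query residue indices
    grid = defaultdict(list)
    for i in aligned:
        grid[cell_of(q_ca[i])].append(i)

    per_residue = []
    for i, ti in alignment:
        cx, cy, cz = cell_of(q_ca[i])
        scores = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        if j == i:
                            continue
                        dq = math.dist(q_ca[i], q_ca[j])
                        if dq > R:
                            continue
                        dt = math.dist(t_ca[ti], t_ca[aligned[j]])
                        d = abs(dq - dt)
                        scores.append(sum(d < t for t in THRESHOLDS) / 4.0)
        if scores:
            per_residue.append(sum(scores) / len(scores))
    return sum(per_residue) / len(per_residue) if per_residue else 0.0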

E values

To estimate E values for each match, we trained a neural network to predict the mean μ and scale parameter λ of the extreme value distribution for each query. The module computemulambda takes a query and database structures as input and aligns the query against a randomly shuffled version of the database sequences. For each query sequence, the module produces N random alignments and fits an extreme value (Gumbel) distribution to their scores. The maximum likelihood fitting is done using the Gumbel fitting function taken from HMMER3 (hmmcalibrate)36. To train the neural network, it is critical to use query and target proteins that include problematic regions, such as structurally biased, disordered or badly modeled regions that occur ubiquitously in full-length proteins or modeled structures. We, therefore, trained the network on 100,000 structures sampled from the AlphaFoldDB (version 1). We trained a neural network to predict μ and λ from the amino acid composition of the query and its length (so a scrambled version of the query sequence would produce the same μ and λ). The network has 22 input nodes, two fully connected layers with 32 nodes each (ReLU activation) and two linear output nodes. The optimizer Adam with learning rate 0.001 was used for training. When testing the resulting E values on searches with scrambled sequences, the log of the mean number of FPs per query turned out to have an accurately linear dependence on the log of the reported E values, albeit with a slope of 0.32 instead of 1. We, therefore, correct the E values from the neural network by taking them to the power of 0.32. We compared how well the mean number of FPs at a given E value agreed with the E values reported by Foldseek, MMseqs2 and 3D-BLAST (Supplementary Fig. 10; see Supplementary Fig. 11 for AlphaFoldDB). We considered a hit as FP if it was in a different fold and had a TM-score lower than 0.3. Furthermore, we ignored all cross-fold hits within the four-bladed to eight-bladed β-propeller superfamilies (SCOPe b.66–b.70) and within the Rossmann-like folds (c.2–c.5, c.27, c.28, c.30 and c.31) because of the extensive cross-fold homologies within these groups37.
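For illustration, a minimal sketch of turning a score into a corrected E value is shown below. The exact functional form used to combine μ, λ and the database size is not spelled out above; a standard extreme value (Gumbel) survival function multiplied by the number of targets, followed by the power-law correction just described, is assumed here.

import math

def evalue_from_score(score, mu, lam, n_targets, correction=0.32):
    """Sketch: E value from query-specific Gumbel parameters (mu, lam),
    followed by the empirical power-law correction."""
    p = 1.0 - math.exp(-math.exp(-lam * (score - mu)))  # P(S >= score) under the EVD
    e_raw = n_targets * p
    return e_raw ** correction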

Probability of TP match

Foldseek computes for each match a simple estimate of the probability that the match is a TP given its structural bit score. Here, hits within the same superfamily are TP; hits to another fold are FP; and hits to the same family or to another superfamily within the same fold are ignored. We estimate the structural bit score distributions of TP and FP hits, p(score∣TP) and p(score∣FP), which allow us to calculate the probability of a TP, \(p(\mathrm{TP}\,|\,\mathrm{score})=\frac{p(\mathrm{score}\,|\,\mathrm{TP})\,p(\mathrm{TP})}{p(\mathrm{score}\,|\,\mathrm{TP})\,p(\mathrm{TP})+p(\mathrm{score}\,|\,\mathrm{FP})\,p(\mathrm{FP})}\). Both score distributions were fitted on SCOPe40 with a mixture model consisting of two gamma distributions (resulting in five parameters for each distribution). For the fitting, the function gammamixEM from the R package mixtools38 was used. We excluded cross-fold hits between certain folds as in the E value estimation. For example, Foldseek finds around the same number of FPs and TPs with a score of 51 in SCOPe40. The probability for a hit with score 51 is, therefore, 50%.
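A minimal sketch of this Bayesian estimate, assuming each score distribution is given as a two-component gamma mixture (the fitted parameter values from mixtools are not reproduced here):

from scipy.stats import gamma

def p_tp_given_score(score, tp_mix, fp_mix, prior_tp, prior_fp):
    """tp_mix, fp_mix: lists of (weight, shape, scale) tuples, one per
    gamma component; priors are the overall TP and FP fractions."""
    def mixture_pdf(x, components):
        return sum(w * gamma.pdf(x, a=shape, scale=scale) for w, shape, scale in components)

    p_s_tp = mixture_pdf(score, tp_mix)
    p_s_fp = mixture_pdf(score, fp_mix)
    num = p_s_tp * prior_tp
    return num / (num + p_s_fp * prior_fp)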

Pairwise global structural alignments using TM-align

We also offer the option to use TM-align for pairwise structure alignment instead of the 3Di-based alignment. We implemented TM-align based on the Cα atom coordinates and made adjustments to improve (1) speed and (2) memory usage. (1) TM-align performs multiple floating-point-based Needleman–Wunsch (NW) alignment steps while applying different scoring functions (for example, secondary structure score or Euclidean distance of superposed structures or fragments). TM-align’s NW code did not take advantage of SIMD instructions; therefore, we replaced it with parasail’s39 SIMD-based NW implementation and extended it to support the different scoring functions. We also replaced the TM-score computation with fast_protein_cluster’s SIMD-based implementation40. Our NW implementation does not compute exactly the same alignment because we apply affine gap costs, whereas TM-align does not (Supplementary Fig. 12). (2) TM-align requires 17 bytes × query length × target length of memory; we reduced this constant factor from 17 bytes to 4 bytes. If Foldseek is used in TM-align mode (parameter --alignment-type 1), TM-align is used for the alignment stage after the prefilter step, and we replace the reported E value column with TM-scores normalized by the query length. The results are ordered by average TM-score in descending order by default.

Descriptors for 3Di structural alphabet

The 3Di alphabet describes the tertiary contacts between residues and their nearest neighbors in 3D space. For each residue i, the conformation of the local backbone around i, together with the local backbone around its nearest neighbor j, is approximated by 20 discrete states (Supplementary Fig. 3). We chose the alphabet size A = 20 as a tradeoff between encoding as much information as possible (large A; Supplementary Fig. 13) and limiting the number of similar 3Di k-mers that we need to generate in the k-mer-based prefilter, which scales with \(A^{k}\). The discrete single-letter states are formed from neighborhood descriptors containing 10 features encoding the conformation of the backbones around residues i and j, represented by the Cα atoms (Cα,i−1, Cα,i, Cα,i+1) and (Cα,j−1, Cα,j, Cα,j+1). The descriptors use the five unit vectors along the following directions:

$$\begin{array}{ll}{\mathbf{u}}_{1}:{\mathrm{C}}_{\alpha ,i-1}\to {\mathrm{C}}_{\alpha ,i}&{\mathbf{u}}_{2}:{\mathrm{C}}_{\alpha ,i}\to {\mathrm{C}}_{\alpha ,i+1}\\ {\mathbf{u}}_{3}:{\mathrm{C}}_{\alpha ,j-1}\to {\mathrm{C}}_{\alpha ,j}&{\mathbf{u}}_{4}:{\mathrm{C}}_{\alpha ,j}\to {\mathrm{C}}_{\alpha ,j+1}\\ {\mathbf{u}}_{5}:{\mathrm{C}}_{\alpha ,i}\to {\mathrm{C}}_{\alpha ,j}.&\end{array}$$

We define the angle between \({\mathbf{u}}_{k}\) and \({\mathbf{u}}_{l}\) as \({\phi }_{kl}\), so \(\cos {\phi }_{kl}={\mathbf{u}}_{k}^{\top }{\mathbf{u}}_{l}\). The seven features \(\cos {\phi }_{12},\cos {\phi }_{34},\cos {\phi }_{15},\cos {\phi }_{35},\cos {\phi }_{14},\cos {\phi }_{23},\cos {\phi }_{13}\) and the distance \(|{\mathrm{C}}_{\alpha ,i}-{\mathrm{C}}_{\alpha ,j}|\) describe the conformation between the backbone fragments. In addition, we encode the sequence distance with the two features \(\mathrm{sign}(i-j)\,\min (|i-j|,4)\) and \(\mathrm{sign}(i-j)\,\log (|i-j|+1)\).
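Putting the descriptor together, a short Python sketch (boundary residues and the exact internal feature ordering are not handled; the unit vectors and angle pairs follow the definitions above):

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def descriptor(ca, i, j):
    """Sketch of the 10-feature neighborhood descriptor for residues i and j.
    ca: (N, 3) array of Cα coordinates; requires i-1, i+1, j-1 and j+1 to exist."""
    u1 = unit(ca[i] - ca[i - 1])
    u2 = unit(ca[i + 1] - ca[i])
    u3 = unit(ca[j] - ca[j - 1])
    u4 = unit(ca[j + 1] - ca[j])
    u5 = unit(ca[j] - ca[i])

    def cos(a, b):
        return float(np.dot(a, b))

    angles = [cos(u1, u2), cos(u3, u4), cos(u1, u5), cos(u3, u5),
              cos(u1, u4), cos(u2, u3), cos(u1, u3)]  # cos φ12, φ34, φ15, φ35, φ14, φ23, φ13
    d = float(np.linalg.norm(ca[i] - ca[j]))
    seq = [np.sign(i - j) * min(abs(i - j), 4),
           np.sign(i - j) * np.log(abs(i - j) + 1)]
    return np.array(angles + [d] + seq, dtype=np.float32)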

Learning the 3Di states using a vector quantized variational autoencoder

The 10-dimensional descriptors were discretized into an alphabet of 20 states using a vector quantized variational autoencoder (VQ-VAE)28. In contrast to standard clustering approaches such as k-means, VQ-VAE is a nonlinear approach that can optimize decision surfaces for each of its states. In contrast to the standard VQ-VAE, we trained the VQ-VAE not as a simple generative model but, rather, to learn states that are maximally conserved in evolution. To that end, we trained it with pairs of descriptors \({\mathbf{x}}_{n},{\mathbf{y}}_{n}\in {\mathbb{R}}^{10}\) from structurally aligned residues, to predict the distribution of \({\mathbf{y}}_{n}\) from \({\mathbf{x}}_{n}\).

The VQ-VAE consists of an encoder and decoder network with the discrete latent 3Di state as a bottleneck in between. The encoder network embeds the 10-dimensional descriptor \({\mathbf{x}}_{n}\) into a 2D continuous latent space, where the embedding is then discretized by the nearest centroid, each centroid representing a 3Di state. Given the centroid, the decoder predicts the probability distribution of the descriptor \({\mathbf{y}}_{n}\) of the aligned residue. After training, only the encoder and centroids are used to discretize descriptors. Encoder and decoder networks are both fully connected with two hidden layers of dimension 10, a batch normalization after each hidden layer and ReLU activation functions. The encoder, centroids and decoder have 242, 40 and 352 parameters, respectively. The output layer of the decoder consists of 20 units predicting \({\boldsymbol{\mu }}\) and \({\boldsymbol{\sigma }}^{2}\) of the descriptor \(\mathbf{y}\) of the aligned residue, such that the decoder predicts \({\mathcal{N}}(\mathbf{y}\,|\,{\boldsymbol{\mu }},I{\boldsymbol{\sigma }}^{2})\) (with diagonal covariance).
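A compact PyTorch sketch of this architecture is shown below. It follows the layer sizes described above, but details such as the batch-normalization settings are assumptions, so parameter counts may differ slightly from those quoted.

import torch
import torch.nn as nn

class TinyVQVAE(nn.Module):
    """Sketch: a 10-D descriptor is embedded into a 2-D latent space, snapped
    to the nearest of 20 centroids (the 3Di states), and decoded into the
    mean and variance of the aligned residue's descriptor."""

    def __init__(self, n_states=20, d_in=10, d_hidden=10, d_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_latent),
        )
        self.centroids = nn.Parameter(torch.randn(n_states, d_latent))  # 20 x 2 codebook
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.BatchNorm1d(d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 2 * d_in),  # mu and log sigma^2 of the aligned descriptor
        )

    def encode_to_state(self, x):
        """Map descriptors x (B, 10) to discrete 3Di state indices (B,)."""
        z = self.encoder(x)                      # (B, 2)
        return torch.cdist(z, self.centroids).argmin(dim=1)

    def forward(self, x):
        z = self.encoder(x)
        states = torch.cdist(z, self.centroids).argmin(dim=1)
        z_q = self.centroids[states]
        z_q = z + (z_q - z).detach()             # straight-through estimator
        mu, log_var = self.decoder(z_q).chunk(2, dim=1)
        return mu, log_var, z, states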

We trained the VQ-VAE on the loss function defined in Equation (3) in ref. 28 (with commitment loss = 0.25) using the deep learning framework PyTorch (version 1.9.0), the Adam optimizer, with a batch size of 512, and a learning rate of 10−3 over four epochs. Using Kerasify (https://github.com/moof2k/kerasify), we integrated the encoder network into Foldseek. The domains from SCOPe40 were split 80%/20% by fold into training and validation sets. For the training, we aligned the structures with TM-align, removed all alignments with a TM-score below 0.6 and removed all aligned residue pairs with a distance between their Cα atoms of more than 5 Å. We trained the VQ-VAE with 100 different initial parameters and chose the model that was performing best in the benchmark on the validation dataset (the highest sum of ratios between 3Di AUC and TM-align AUC for family, superfamily and fold level).

3Di substitution score matrix

We trained a BLOSUM-like substitution matrix for 3Di sequences from the pairs of structurally aligned residues used for the VQ-VAE training. First, we determined the 3Di states of all residues. Next, the substitution frequencies among 3Di states were calculated by counting how often two 3Di states were structurally aligned. (Note that the substitution frequencies from state A to state B and in the opposite direction are equal.) Finally, the score \(\mathrm{score}(x,y)=2\,\log_{2}\frac{p(x,y)}{p(x)\,p(y)}\) for substituting state x by state y is the log-ratio between the substitution frequency p(x, y) and the probability that the two states occur independently, scaled by a factor of 2.
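A minimal sketch of deriving such a log-odds matrix from aligned 3Di state pairs (illustrative only; details such as pseudocounts or integer rounding are omitted):

import numpy as np

def substitution_matrix(aligned_state_pairs, n_states=20):
    """Compute 2 * log2(p(x, y) / (p(x) p(y))) from pairs of structurally
    aligned 3Di state indices."""
    counts = np.zeros((n_states, n_states))
    for x, y in aligned_state_pairs:
        counts[x, y] += 1
        counts[y, x] += 1                     # substitutions are symmetric
    p_xy = counts / counts.sum()              # joint substitution frequencies
    p_x = p_xy.sum(axis=1)                    # background state frequencies
    with np.errstate(divide="ignore"):
        scores = 2.0 * np.log2(p_xy / np.outer(p_x, p_x))
    return scores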

3Di alphabet cross-validation

We trained the 3Di alphabet (the VQ-VAE weights) and the substitution matrix by four-fold cross-validation on SCOPe40. We split the SCOPe40 dataset into four parts, such that all domains of a fold ended up in the same part. 3Di alphabets were trained on three parts and tested on the remaining part, selecting each of the four parts in turn as the test set. The 80:20 split between training and validation sets used to select the best alphabet out of the 100 VQ-VAE runs was made within the three parts forming the cross-validation training data. Supplementary Fig. 14 shows the mean sensitivity (black) and the standard deviation (gray area) in comparison to the final 3Di alphabet, for which we trained the 3Di alphabet on the entire SCOPe40 (red). No overfitting was observed, despite training 492 parameters (282 neural network parameters and 210 substitution matrix entries). In Fig. 2, we, therefore, show the benchmark results for the final 3Di alphabet, trained on the entire SCOPe40.

Nearest-neighbor selection

To select nearest-neighbor residues that maximize the performance of the resulting 3Di alphabet in finding and aligning homologous structures, we introduced the virtual center V of a residue. The virtual center position is defined by the angle θ (V-Cα-Cβ), the dihedral angle τ (V-Cα-Cβ-N) and the length l (∣V − Cα∣) (Supplementary Fig. 1). For each residue i, we selected the residue j with the smallest distance between their virtual centers. The virtual center was optimized on the training and validation structure sets used for the VQ-VAE training by creating alphabets for positions with θ ∈ [0, 2π] and τ ∈ [−π, π] in 45° steps and lengths l in multiples of 1.53 Å (the distance between Cα and Cβ). The virtual center defined by θ = 270°, τ = 0° and l = 2 performed best in the SCOPe benchmark.

This virtual center preferably selects long-range, tertiary interactions and only falls back to selecting interactions to i + 1 or i − 1 when no other residues are nearby. In that case, the interaction captures only the backbone conformation.
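For illustration, the virtual center can be placed from the N, Cα and Cβ coordinates with a standard internal-to-Cartesian (NeRF-style) construction. The sketch below is not Foldseek's code, and the dihedral sign convention is an assumption.

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def virtual_center(n, ca, cb, theta_deg=270.0, tau_deg=0.0, l=2.0):
    """Place V given the angle theta (V-Cα-Cβ), the dihedral tau (V-Cα-Cβ-N)
    and the length l in multiples of 1.53 Å (the Cα-Cβ distance)."""
    theta, tau = np.radians(theta_deg), np.radians(tau_deg)
    r = l * 1.53
    bc = unit(ca - cb)                  # Cβ -> Cα direction
    nrm = unit(np.cross(cb - n, bc))    # normal of the N-Cβ-Cα plane
    m = np.cross(nrm, bc)
    d = np.array([-r * np.cos(theta),
                  r * np.sin(theta) * np.cos(tau),
                  r * np.sin(theta) * np.sin(tau)])
    return ca + d[0] * bc + d[1] * m + d[2] * nrm

# toy example with idealized coordinates (Å)
n = np.array([1.46, 0.0, 0.0])
ca = np.array([0.0, 0.0, 0.0])
cb = np.array([-0.55, 1.43, 0.0])
print(virtual_center(n, ca, cb))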

SCOPe benchmark

We downloaded the SCOPe40 structures (available at https://wwwuser.gwdg.de/~compbiol/foldseek/scop40pdb.tar.gz).

The SCOPe benchmark set consists of single domains with an average length of 174 residues. In our benchmark, we compare the domains all-versus-all. Per domain, we measured the fraction of detected TPs up to the first FP. For family-level, superfamily-level and fold-level recognition, TPs were defined as hits from the same family; from the same superfamily but not the same family; and from the same fold but not the same superfamily, respectively. Hits from different folds are FPs.

Evaluation SCOPe benchmark

After sorting the alignment result of each query (described in the ‘Tools and options for benchmark comparison’ subsection), we calculated the sensitivity as the fraction of TPs in the sorted list up to the first FP, excluding self-hits. For comparison, we took the mean sensitivity over all queries for family-level, superfamily-level and fold-level classifications. We evaluated only SCOPe members with at least one other family, superfamily and fold member. We measure the sensitivity up to the first FP (ROC1) rather than, for example, up to the fifth FP, because ROC1 better reflects the requirements for low false discovery rates in automated searches.

Additionally, we plotted precision-recall curves for each tool (Fig. 2b and Supplementary Fig. 4). After sorting the alignment results by the structural similarity scores (as described in the ‘Tools and options for benchmark comparison’ subsection) and excluding self-matches, we generated a weighted precision-recall curve for family-level, superfamily-level and fold-level classifications (precision = TP / (TP + FP) and recall = TP / (TP + FN)). All counts (TP, FP and FN) were weighted by the reciprocal of their family, superfamily or fold size. In this way, folds, superfamilies and families contribute linearly with their size instead of quadratically36.
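A short sketch of the weighted precision-recall bookkeeping (the choice of which group's size supplies the weight is an assumption of the example):

def weighted_precision_recall(sorted_hits, group_size, total_weighted_positives):
    """sorted_hits: (is_tp, group_id) tuples over all queries, sorted by
    decreasing score; group_size: members per family/superfamily/fold.
    Each TP/FP counts with weight 1/group size, so groups contribute
    linearly rather than quadratically with their size."""
    tp = fp = 0.0
    curve = []
    for is_tp, gid in sorted_hits:
        w = 1.0 / group_size[gid]
        if is_tp:
            tp += w
        else:
            fp += w
        curve.append((tp / (tp + fp), tp / total_weighted_positives))  # (precision, recall)
    return curve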

Runtime evaluations on SCOPe and AlphaFoldDB

We measured the speed of structural aligners on a server with an AMD EPYC 7702P 64-core CPU and 1,024 GB RAM. On SCOPe40, we measured or estimated the runtime for an all-versus-all comparison. To avoid excessive runtimes for TM-align, Dali and CE, we estimated their runtimes by randomly selecting 10% of the 11,211 SCOPe domains as queries. We measured runtimes on AlphaFoldDB for searches with the same 100 randomly selected queries used for the sensitivity and alignment quality benchmarks (Fig. 2d,e). Tools with multi-threading support (MMseqs2 and Foldseek) were executed with 64 threads; tools without were parallelized by breaking the query set into 64 equally sized chunks and executing them in parallel.

Reference-free multi-domain benchmarks

We devised two reference-free benchmarks that do not rely on any reference structural alignments. We clustered the AlphaFoldDB (version 1)3 using SPICi22. For this, we first aligned all protein sequences all-versus-all with BLAST (2.5.0+)5 using an E value threshold of <10⁻³. SPICi produced 34,270 clusters from the search result. For each cluster, we picked the longest protein as representative. We randomly selected 100 representatives as queries and searched the set of remaining structures. The top five alignments of all queries are listed at https://wwwuser.gwdg.de/~compbiol/foldseek/multi_domain_top5_alignments/.

For the evaluation, we needed to adjust the LDDT score function taken from AlphaFold2 (ref. 1). LDDT calculates for each residue i in the query the fraction of residues in the 15 Å neighborhood that have a distance within 0.5, 1, 2 or 4 Å of the distance between the corresponding residues in the target23. The denominator of the fraction is the number of 15 Å neighbors of i that are aligned to some residue in the target. This does not properly penalize non-compact models in which each residue has few neighbors within 15 Å. We, therefore, use as denominator the total number of neighboring residues within 15 Å of i.

For the alignment quality benchmark (Fig. 2e), we classified each aligned residue pair as TP or FP depending on its residue-wise LDDT score—that is, the fraction of distances to its 15 Å neighbors that are within 0.5, 1, 2 and 4 Å of the distance to the corresponding residues in the query, averaged over the four distance thresholds. TP residues are those with a residue-wise LDDT score of at least 0.6 and FPs below 0.25, ignoring matches in between. For the search sensitivity benchmark (Fig. 2d), TP residue–residue matches are those with an LDDT score of the query-target alignment of at least 0.6 and FPs below 0.25, ignoring matches in between. (The LDDT score of the query-target alignment is the average of the residue-wise LDDT score over all aligned residue pairs.) The choice of thresholds is illustrated in Supplementary Fig. 6. The benchmark for other thresholds is shown in Supplementary Fig. 7.

All-versus-all search of AlphaFoldDB with Foldseek

We downloaded the AlphaFoldDB (version 1)3 containing 365,198 protein models and searched it all-versus-all using Foldseek with -s 9.5 --max-seqs 2000. For our second-best hit analysis, we considered only models with (1) an average predicted LDDT (pLDDT) over the Cα atoms of at least 80 and (2) non-fragmented domains. We also computed the structural similarity for each pair using TM-align (default options).

Tools and options for benchmark comparison

Owing to dataset overlap, we excluded methods from the benchmark that are likely to be overfitted on SCOPe. This applies to methods that trained many thousands of parameters (for example, deep neural networks) with strong data leakage among training, validation and test sets. For example, several tools allowed up to 40% sequence identity between sets. The following command lines were used in the SCOPe as well as the multi-domain benchmark:

Foldseek

We used Foldseek commit aeb5e during this analysis. Foldseek was run with the following parameters: --threads 64 -s 9.5 -e 10 --max-seqs 2000.

Foldseek-TM

For the Foldseek-TM benchmark, we first run a regular 3Di/AA-based Foldseek search using the following parameters: --threads 64 -s 9.5 -e 10 --max-seqs 4000 --alignment-mode 1. All passing hits are then aligned by Foldseek’s tmalign module with --tmalign-fast 1 --tmscore-threshold 0.0 -a. We used Foldseek commit aeb5e for this analysis. We expose Foldseek-TM in our command-line interface as a search mode that combines the regular Foldseek 3Di/AA-based workflow with our TM-align implementation in the tmalign module.

MMseqs2

We used the default MMseqs2 (release 13-45111) search algorithm to obtain the sequence-based alignment result. MMseqs2 sorts the results by E value and score. We searched with: --threads 64 -s 7.5 -e 10000 --max-seqs 2000.

CLE-Smith–Waterman

We used PDB Tool version 4.80 (https://github.com/realbigws/PDB_Tool) to convert the benchmark structure set to CLE sequences. After the conversion, we used SSW35 (commit ad452e) to align the CLE sequences all-versus-all. The following parameters were used to run SSW: (1) protein alignment mode (-p); (2) gap open penalty of 100 (-o 100); (3) gap extend penalty of 10 (-e 10); (4) CLE’s optimized substitution matrix (-a cle.shen.mat); and (5) returning the alignment (-c). The gap open and extend values were inferred from DeepAlign41. The results are sorted by alignment score in descending order.

ssw_test -p -o 100 -e 10 -a cle.shen.mat -c

3D-BLAST

We used 3D-BLAST (beta102) with BLAST+ (2.2.26) and SSW34 (version ad452e). We first converted the PDB structures to a 3D-BLAST database using 3d-blast -sq_write and 3d-blast -sq_append. We searched the structural sequences against the database using blastp with the following parameters: (1) 3D-BLAST’s optimized substitution matrix (-M 3DBLAST); (2) number of hits and alignments shown of 12,000 (-v 12000 -b 12000); (3) E value threshold of 1,000 (-e 1000); (4) disabling query sequence filter (-F F); (5) gap open of 8 (-G 8); and (6) gap extend of 2 (-E 2). 3D-BLAST’s results are sorted by E value in ascending order:

blastall -p blastp -M 3DBLAST -v 12000 -b 12000 -e 1000 -F F -G 8 -E 2

For Smith–Waterman, we used (1) gap open of 8; (2) gap extend of 2; (3) returning alignments (-c); (4) 3D-BLAST’s optimized substitution matrix (-a 3DBLAST); and (5) protein alignment mode (-p): ssw_test -o 8 -e 2 -c -a 3DBLAST -p. We noticed that the 3D-BLAST matrix with Smith–Waterman resulted in a performance similar to CLE: 0.717, 0.230 and 0.011 for family, superfamily and fold classification, respectively. We excluded 3D-BLAST from the multi-domain benchmark because it occasionally produced very high scores (>10⁷) for single-residue alignments.

TM-align

We downloaded and compiled the TMalign.cpp source code (version 2019/08/22) from the Zhang group website. We ran the benchmark using default parameters and -fast for the fast version. TM-align reports two TM-scores: (1) normalized by the length of the first chain (query) and (2) normalized by the length of the second chain (target). We evaluated TM-align’s performance by sorting the results by the query-normalized TM-score as well as by the minimum, maximum and average of the query- and target-normalized TM-scores; the average TM-score performed best in our single-domain benchmark, and we therefore used it in all our analyses.

Default: TMalign query.pdb target.pdb

Fast: TMalign query.pdb target.pdb -fast

Dali

We installed the standalone DaliLite.v5. For the SCOPe40 benchmark set, input files were formatted in DAT files with Dali’s import.pl. The conversion to DAT format produced 11,137 valid structures out of the 11,211 initial structures for the SCOPe benchmark and 34,252 structures out of 34,270 SPICi clusters. After formatting the input files, we calculated the protein alignment with Dali’s structural alignment algorithm. The results were sorted by Dali’s z-score in descending order:

import.pl --pdbfile query.pdb --pdbid PDBid --dat DAT

dali.pl --cd1 queryDATid --db targetDB.list --TITLE systematic --dat1 DAT --dat2 DAT --outfmt "summary" --clean

CE

We used BioJava’s42 (version 5.4.0) implementation of the combinatorial extension (CE) alignment algorithm. We modified one of the modules of BioJava under the shape configuration to calculate the CE value. Our modified CEalign.jar file requires a list of query files, the path to the target PDB files and an output path as input parameters. This Java module runs an all-versus-all CE calculation with unlimited gap size (maxGapSize -1) to improve alignment results14. The results were sorted by z-score in descending order. For the multi-domain benchmark, we excluded one query that ran for more than 16 d. The Jar file of our implementation of the CE calculation is provided (see ‘Code availability’).

java -jar CEalign.jar querylist.txt TargetPDBDirectory OutputDirectory

Geometricus

We included Geometricus20 in the SCOPe benchmark as a representative of alignment-free tools, which are fast but can find only globally similar structures. Geometricus discretizes fixed-length backbone fragments (shape-mers) using their 3D moment invariants and represents structures as a fixed-length count vector over the shape-mers. To calculate the shape-mer-based structural similarity of the benchmark set, we used Caretta-shape’s Python implementation (1e3adb0) of multiple structure alignment (https://github.com/TurtleTools/caretta/caretta/multiple_alignment.py), which computes the Bray–Curtis similarity between the Geometricus shape-mer vectors. Our modified version extracts structural information from the input files and generates all-versus-all pairwise structural similarity scores as output. We ran Geometricus on a single core because parallelizing it across multiple cores would require substantial engineering effort. With an efficient multi-core implementation, Geometricus might be as fast as MMseqs2 on 64 cores. The Python code of our implementation of Geometricus is provided:

python runGeometricus_caretta.py -i querylist.txt -o OutputDirectory

HOMSTRAD alignment benchmark

The HOMSTRAD database contains expert-curated homologous structural alignments for 1,032 protein families24. We downloaded the latest HOMSTRAD version (https://mizuguchilab.org/homstrad/data/homstrad_with_PDB_2022_Aug_1.tar.gz) and picked the pairwise alignments between the first and last members of each family, which resulted in structures of a median length of 182 residues. We used the same parameters as in the SCOPe and multi-domain benchmark. We forced Foldseek, MMseqs2 and CLE-Smith–Waterman to return an alignment by switching off the prefilter and E value threshold. With the HOMSTRAD alignments as reference, we measured for each pairwise alignment the sensitivity (fraction of residue pairs of the HOMSTRAD alignment that were correctly aligned) and the precision (fraction of correctly aligned residue pairs in the predicted alignment). Dali, CE and CLE-Smith–Waterman failed to produce an alignment for 35, 1 and 1 out of 1,032 pairs, respectively, which were rated with a sensitivity of 0. The mean sensitivity and precision are shown in Fig. 2e, and all individual alignments are listed in homstrad_alignments.txt at https://wwwuser.gwdg.de/~compbiol/foldseek/.

Limitations of benchmarks

The SCOPe benchmark to measure search sensitivity uses only single-domain proteins as queries and targets (Fig. 2a–c). It, therefore, cannot assess the ability of tools to find local similarities—for example, finding homologous domains shared between two multi-domain proteins. The alignment benchmark based on HOMSTRAD (Fig. 2e) has the same limitation, as the vast majority of proteins in HOMSTRAD have a single domain (median length = 182 residues). A drawback of our reference-free multi-domain benchmark is the need to choose thresholds for TPs and FPs (Supplementary Fig. 6).

Pre-built and ready-to-download databases

Foldseek includes the databases module to aid users with the download and setup of structural databases. Currently, we include four variants of the AlphaFoldDB (version 4): UniProt (214 million structures); UniProt50, clustered to 50% sequence identity and 90% bidirectional coverage using MMseqs2 (parameters -c 0.9 --min-seq-id 0.5 --cluster-reassign; 54 million structures); Proteome (564,000 structures); and Swiss-Prot (542,000 structures). Additionally, we regularly build and offer a PDB clustered at 100% sequence identity. The update pipeline is available in the util/update_webserver_pdb folder in the main Foldseek repository. These databases are hosted on Cloudflare R2 for fast downloading. We additionally link to and provide an automatic setup procedure for the ESM Atlas High-Quality Clu30 (ref. 4) database.

Webserver

The Foldseek webserver is based on the MMseqs2 webserver43. To allow for searches within seconds, we implemented MMseqs2ʼs pre-computed database indexing capabilities in Foldseek. Using these, the search databases can be fully cached in system memory by the operating system and instantly accessed by each Foldseek process, thus avoiding expensive accesses to slow disk drives. A similar mechanism was used to store and read the associated taxonomic information. The AlphaFoldDB/UniProt50 (version 4), AlphaFoldDB/Proteome (version 4), AlphaFoldDB/Swiss-Prot (version 4), CATH50, ESM Atlas High-Quality Clu30 and PDB100 databases require 191 GB, 3.8 GB, 3.4 GB, 1.4 GB, 110 GB and 2.0 GB of RAM, respectively. The databases are kept in memory using vmtouch (https://github.com/hoytech/vmtouch). Databases are only required to remain resident in RAM if Foldseek is used as a webserver; during batch searches, Foldseek adapts its memory use to the available RAM of the machine. We implemented a structural visualization using the NGL viewer44 to aid the investigation of pairwise hits. Because we only store the Cα traces of the database proteins, we use PULCHRA31 to complete the backbone of these structures, and also of the query if necessary, to enable a ribbon visualization45 of the proteins. For a high-quality superposition, we use TM-align11 to superpose the structures based on the Foldseek alignment. Both PULCHRA and TM-align are executed within the user’s browser using WebAssembly. They are available as pulchra-wasm and tmalign-wasm on the npm package repository as free open-source software.

Structure prediction in the webserver

We use the ESM Atlas API to predict structures of user-supplied sequences that are, at most, 400 residues long. This enables sequence-to-structure searches in the webserver.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
