Towards the accurate alignment of over a million protein sequences: Current state of the art

Multiple sequence alignment (MSA) methods explicitly model the relationships between evolutionarily related sequences at residue resolution. MSA is one of the most widely used modelling techniques in biology [1] and is critical for evolutionary, functional, and structural inference. MSA methods are now under pressure to deal with unprecedented amounts of data. Within ten years, when the Earth BioGenome project delivers 1.5 million eukaryotic genomes [2,3], MSAs for the largest superfamilies, such as the kinases or ABC transporters, will feature close to a billion homologous sequences, as opposed to the few million currently available. None of the existing MSA methods can handle such a scale-up. This issue is well recognised [4], and the purpose of this review is to provide a critical discussion of the methods designed over the last few years specifically to address MSA scale-up.

MSA computation is NP-hard under any useful formulation [5]. This limitation implies that no method can guarantee an optimal alignment of a realistic sequence dataset in practical time, which forces reliance on heuristic solutions. A plethora of such methods has been reported [6], the most popular being based on the Hogeweg progressive algorithm [7]. The progressive approach involves clustering the sequences into a rooted dendrogram, also known as a guide tree, and aligning them from leaf to root following the guide tree order. Despite its known shortcomings, the accuracy of the progressive algorithm was expected to scale well, as established on reference datasets featuring fewer than 100 sequences [8]. This expectation had to be questioned when larger datasets revealed the tendency of most aligners to decrease in accuracy when dealing with over 1000 sequences [4]. This observation suggested that the scaling-up problem was not merely a computational issue, and that the algorithms would have to be reconsidered as well as optimized. Owing to the highly modular nature of the MSA algorithmic framework, we will see that these improvements have focused separately on guide tree estimation and on the assembly procedure. This scale-up effort has also been complemented by the development of novel evaluation strategies suited to the new scales.
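
To make the two modular stages concrete, the progressive strategy described above can be sketched as a toy Python program. This is a minimal illustration under stated assumptions, not the implementation of any aligner discussed in this review: the k-mer Jaccard distance, the average-linkage clustering, the match/mismatch/gap scores, and all function names (`kmer_distance`, `build_guide_tree`, `align_profiles`, `progressive_align`) are hypothetical choices made for brevity. Its quadratic distance matrix and naive clustering loop are exactly the kind of cost that does not scale to large datasets.

```python
from itertools import combinations

GAP = "-"

def kmer_distance(a, b, k=2):
    """Assumed distance: 1 - Jaccard similarity of the two k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0

def build_guide_tree(seqs):
    """Average-linkage clustering; returns a rooted tree as nested index tuples."""
    clusters = {i: (i, [i]) for i in range(len(seqs))}  # id -> (subtree, members)
    dist = {(i, j): kmer_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

    def linkage(x, y):
        mx, my = clusters[x][1], clusters[y][1]
        total = sum(dist[min(p, q), max(p, q)] for p in mx for q in my)
        return total / (len(mx) * len(my))

    next_id = len(seqs)
    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        x, y = min(combinations(sorted(clusters), 2), key=lambda pair: linkage(*pair))
        (tx, mx), (ty, my) = clusters.pop(x), clusters.pop(y)
        clusters[next_id] = ((tx, ty), mx + my)
        next_id += 1
    return next(iter(clusters.values()))[0]

def column_score(col_a, col_b, match=1.0, mismatch=-1.0):
    """Average pairwise score between two profile columns; gaps contribute 0."""
    total = 0.0
    for x in col_a:
        for y in col_b:
            if x != GAP and y != GAP:
                total += match if x == y else mismatch
    return total / (len(col_a) * len(col_b))

def align_profiles(pa, pb, gap=-1.0):
    """Global dynamic-programming alignment of two profiles (lists of equal-length rows)."""
    cols_a, cols_b = list(zip(*pa)), list(zip(*pb))
    n, m = len(cols_a), len(cols_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner, emitting columns in reverse order.
    out_a, out_b = [[] for _ in pa], [[] for _ in pb]
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            ca, cb, i, j = cols_a[i - 1], cols_b[j - 1], i - 1, j - 1
        elif step == "up":
            ca, cb, i = cols_a[i - 1], (GAP,) * len(pb), i - 1
        else:
            ca, cb, j = (GAP,) * len(pa), cols_b[j - 1], j - 1
        for row, ch in zip(out_a + out_b, ca + cb):
            row.append(ch)
    return ["".join(reversed(r)) for r in out_a + out_b]

def progressive_align(seqs):
    """Align sequences from leaf to root following the guide tree order."""
    def visit(node):
        if isinstance(node, int):  # leaf: a single unaligned sequence
            return [node], [seqs[node]]
        (ia, pa), (ib, pb) = visit(node[0]), visit(node[1])
        return ia + ib, align_profiles(pa, pb)

    order, rows = visit(build_guide_tree(seqs))
    return [row for _, row in sorted(zip(order, rows))]  # restore input order
```

Calling `progressive_align(["MKTAYIAK", "MKTAYAK", "MKAYIAK", "MTAYIAKQ"])` returns one gapped row per input sequence, all of equal length and in input order. Because each merge is final, gaps introduced near the leaves are never revisited closer to the root, which is one of the known shortcomings of the progressive approach.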
