On the Design of Optimization Criteria for Multiple Sequence Alignment
Multiple sequence alignment (MSA) is important in functional, structural and evolutionary studies of sequence data. While MSA construction has traditionally been an interactive process, the rapid growth of genetic sequence data has created a need for fully automated sequence analysis. This requires more accurate methods based on rigorous mathematical models that reflect sequence biology realistically. Treating MSA as an optimization problem, we examine the challenge of reconciling mathematical tractability with biological accuracy in cost function design. In particular, we consider tree alignment, which is often viewed as the most “biological” of the rigorous approaches to MSA. We point out several important pitfalls in current optimization approaches to MSA and identify characteristics of good cost function design. Design issues specific to approximation algorithms are also addressed. We hope these ideas will lead to future research on a biologically realistic and mathematically rigorous approach to MSA.