Alignment: Mathematical Introduction

1. Alignments as Sequences of Deletions, Insertions, and Substitutions

Mathematically, the pairwise DNA-sequence alignment problem begins by providing two sequences S₁ and S₂ composed from the four characters A, C, G, or T. The following sequences present an example:

S₁: AGTGTTCCAG

S₂: AATCGTTACAG

An alignment of two sequences is a record of “edits” (or lack thereof) in the bases in S₁ that leads to the sequence S₂. Allowable edit operations are:

Deletion (D): delete a base

Insertion (I): insert a new base

Substitution (S): replace a base with another base

Notice that a substitution may not represent an actual “edit” as it does not rule out that a base may be “replaced” by exactly the same base. Biologically, the latter would correspond to a match. A substitution reflecting an actual change of the base would correspond to a mutation.

Example 1. Below are some possible alignments of the sequences S₁ and S₂ given as tables of corresponding pairs, together with the list of operations in L, describing the alignment. A hyphen in S₁ represents an insertion, while a hyphen in S₂ corresponds to a deletion. The sequence L represents the transformation of S₁ into S₂ by enacting the edits listed in L on S₁ form left to right.

a) S₁:A-GTGTTCCAG

S₂:AATCGTTACAG

L: SISSSSSSSSS (1 insertion, 9 substitutions, 0 deletions)

b) S₁:-AGT-GTTCCAG

S₂:AA-TCGTTACAG

L: ISDSISSSSSSS (2 insertions, 8 substitutions, 1 deletion)

c) S₁:---------AGTGTTCCAG

S₂:AATCGTTACAG--------

L: IIIIIIIIIIDDDDDDDDDD (9 insertions, 2 substitutions, 8 deletions)

□

Notice that not any sequence of edits represents an alignment. This only happens when the table of corresponding pairs will result in two sequences of equal length. For instance,

L: IISSSSSSSSSS cannot represent an alignment for S₁ and S₂, since two insertions at the beginning of S1 will result in a string of 12 characters that cannot be matched to a string of 11 characters. If #I, #D, and #S denote respectively the number of insertions, deletions, and substitutions in the sequence L, this observation leads to the following condition:

A sequence L represents an alignment for S₁ and S₂ only if #I + #S = 10 and #D + #S = 11.

For sequences S₁ and S₂ of arbitrary lengths n and m respectively, this condition generalizes to the following:

A sequence L represents an alignment for S₁ and S₂ only if #I + #S = n and #D + #S = m.

Exercise 1. Are all three alignments in Example 1 of equal biological importance? Explain why or why not.

Exercise 2. Verify that the conditions #I + #S = 10 and #D + #S = 11 are satisfied for all of the alignments presented in Example 1.

Exercise 3. Let S₁and S₂ are the same sequences as above. Does each of the sequences L represent an alignment? Explain why or why not. In case L is an alignment, give the table of corresponding pairs as in Example 1.

a) L: SSSISSSSSSS

b) L: ISSSSSSSSSS

c) L: SSSDSSSSSSII

d) L: SSIDSSSIDISS

2. Alignments as Paths on a Graph

Once an alignment is understood as a sequence L formed of the characters I, D, and S, a more visual way to represent that alignment is to view it as a path on the alignment graph of the sequences of S₁ and S₂. If S₁ has length n and S₂ has length m, the alignment graph is a given as triangular lattice of height n and width m such as that in Figure 1. The DNA bases in S₁ label the rows of the lattice while the DNA bases in S₂ label the columns. The alignment graph of our example sequences S₁ and S₂ is presented in Figure 1A.

The alignment graph is a directed graph. In directed graphs, access from one vertex to an adjacent vertex via a connecting edge is allowed only in the specified direction. For the alignment graphs we just described, the possible directions are: one horizontal step from left to right; one vertical step from top to bottom, and one diagonal step from the upper left to the lower right corner. These possibilities are depicted in Figure 1B.

A path between two vertices on the alignment graph is a walk along the edges of the graph in the permissible directions that connects these two vertices. In what follows we will only be interested in paths on the alignment graph from the upper left to the lower right vertices of the graph. In Figure 1A those vertices are marked by the symbol . Many such paths exist, of course. Figure 2 depicts three paths that connect these vertices.

Text Box:
Figure 2. Possible paths in the alignment graph. The blue path corresponds to the alignment presented in Exercise 1 – b).

Consider now Figure 1B again. If for any path on the alignment graph, a record is made for the directions followed on each step, using the labels H, I, and S, as depicted in the figure, each path on the alignment graph can be viewed as a sequence formed from these three characters. The black path in Figure 2 then corresponds to the sequence L: SSIDDSSSDIISSI and the red path corresponds to L: DDDISSSSSSIIIID. Notice that for these paths #I + #S = 10 and #D + #S = 11. The opposite is also true: any sequence formed of the characters H, I, and S for which #I + #S = 10 and #D + #S = 11, defines a path between the upper left and lower right vertices of the alignment graph. For instance, the blue path in Figure 2 represents the alignment of S₁ and S₂ defined in Example 1-b).

The following should now be clear: An alignment of two sequences S₁ and S₂ can be represented by a path from the upper left to the lower right vertices on the corresponding alignment graph.

Exercise 4. In Figure 2, trace the paths that correspond to the alignments from Exercise 1-a) and 1-c).

3. Alignments as Weighted Paths on a Graph

Although every path on the alignment graph between the upper left and lower right vertices corresponds to an alignment, there are many paths that correspond to biologically meaningless alignments. For instance, a path that goes straight down to the bottom of the graph and then all the way to the right can be used to align any two sequences, even when there may not be a single base-pair match. On the other hand, alignments corresponding to paths with more diagonal steps are more likely to produce biologically meaningful matches. The goal is to find alignments that have high biological likelihood.

One way to approach this problem is to assign a cost (also sometimes referred to as penalty or weight) to each of the edges in the alignment graph with exact matches incurring no cost. Insertions, deletions and actual substitutions of bases, on the other hand, are performed at a cost. The actual values of the penalties are based on biological considerations. In this project, we use the following very simple penalty scheme to illustrate the concept.

Text Box:

Panel A Panel B

Figure 3. Distinguishing between substitutions: an exact match is represented by a diagonal arrow (Panel A) and a mutation is depicted by and arc (Panel B).

Exact matches have cost zero;

There is a cost m for a substitution of one base with another (mutation);

There is a cost (gap penalty) x for an insertion or a deletion.

A good way to visualize these is to consider a refined version of the alignment graph where we can distinguish between exact matches and mutations within the group of substitutions. Figure 3 is a modified version of Figure 1B. Now, either a diagonal or an arc depicts a substitution. A diagonal edge represents an exact match (Figure 3A). A directed arc depicts a mutation (Figure 3B). As before, any vertical edges correspond to deletions and horizontal edges correspond to insertions.

Figure 4A depicts the alignment graph of S₁ and S₂ that utilizes straight diagonals for exact matches and arcs for mutations. Figure 4B depicts the red and blue paths from Figure 2 on this more refined graph.

Text Box:

Figure 4. Panel A: The weighted alignment graph for the sequences S1 and S2; Panel B: The weighted paths on the graph corresponding, respectively, to the red and blue paths from Figure 2.

The alignment graph with designated costs for each of its edges is the weighted alignment graph for the sequences S₁ and S₂. The cost of a path is defined as the sum of the costs of all of its edges.

Example 2. The cost of the red path in Figure 4B is 9x + 5p. To see this, notice that the path corresponds to an alignment that contains 5 insertions (corresponding to five vertical edges), 4 deletions (corresponding to the total of horizontal edges in the path), 5 mutations and 1 exact match. Recall that that the cost of any insertion/deletion is x, the cost of a mutation is p, and the cost of an exact alignment is zero. Adding the individual weights for all edges in the path establishes the claim.

□

Exercise 4. Find the costs of the blue and black paths depicted in Figure 2A.

Clearly, the optimal alignment will correspond to a path of minimal cost. Thus, finding the optimal alignment(s) is the same as determining all paths of minimal cost on the weighted alignment graph.

This problem may not seem too difficult at first glance, since the alignment graph has a finite number of edges. One might think that determining all paths on the graph, calculating their weights, and then singling those of minimal weight would not be unreasonable. However, since the number of different paths grows extremely rapidly with the increase of the sequences’ lengths, the task quickly becomes computationally overwhelming. It can be shown that there are 224,143 different paths from the upper left to the lower right verteces on the alignment graph of two sequences of length n = 7 and m = 9. When, as common in biology, we are attempting to align sequences that may have hundreds of DNA bases, the number of all paths becomes intractable.

The mathematical problem of fining the paths of minimal cost on weighted graphs is a classical problem and a number of computationally efficient solution strategies and algorithms are available. An important feature of these algorithms is that they do not need to compute the costs of all paths from the upper left to the lower right vertex of the alignment graph. One such algorithm is the Needelman-Wunsch algorithm. We will not discuss here the details of this and other similar algorithms as they can be found in most standard text of discrete mathematics such as (provide reference). Instead, we have included an interactive applet that implements this algorithm and allows for changeing the costs x and p and observing how these changes affect the optimal alignment.

In closing, we emphasize that we have now described four equivalent ways of representing an alignment of two sequences S₁ and S₂: 1) as tables of pairwise edits with hyphens in S₁ standing for insertions and hyphens in S₂ for deletions; 2) as sequences comprised of the letters I, D, and S; 3) as paths on the alignment graph; and 4) as paths on the weighted alignment graph. The last representation is one for which computationally efficient mathematical solution are availables.