1. Alignments as Sequences of Deletions, Insertions, and Substitutions
Mathematically, the pairwise DNA-sequence alignment problem begins by providing two sequences S1 and S2 composed from the four characters A, C, G, or T. The following sequences present an example:
S1: AGTGTTCCAG
S2: AATCGTTACAG
An alignment of two sequences is a record of “edits” (or lack thereof) in the bases in S1 that leads to the sequence S2. Allowable edit operations are:
Deletion (D): delete a base
Insertion (I): insert a new base
Substitution (S): replace a base with another base
Notice that a substitution may not represent an actual “edit” as it does not rule out that a base may be “replaced” by exactly the same base. Biologically, the latter would correspond to a match. A substitution reflecting an actual change of the base would correspond to a mutation.
Example 1. Below are some possible alignments of the sequences S1 and S2 given as tables of corresponding pairs, together with the list of operations in L, describing the alignment. A hyphen in S1 represents an insertion, while a hyphen in S2 corresponds to a deletion. The sequence L represents the transformation of S1 into S2 by enacting the edits listed in L on S1 form left to right.
a) S1:A-GTGTTCCAG
S2:AATCGTTACAG
L: SISSSSSSSSS (1 insertion, 9 substitutions, 0 deletions)
b) S1:-AGT-GTTCCAG
S2:AA-TCGTTACAG
L: ISDSISSSSSSS (2 insertions, 8 substitutions, 1 deletion)
c) S1:---------AGTGTTCCAG
S2:AATCGTTACAG--------
L: IIIIIIIIIIDDDDDDDDDD (9 insertions, 2 substitutions, 8 deletions)
□
Notice that not any sequence of edits represents an alignment. This only happens when the table of corresponding pairs will result in two sequences of equal length. For instance,
L: IISSSSSSSSSS cannot represent an alignment for S1 and S2, since two insertions at the beginning of S1 will result in a string of 12 characters that cannot be matched to a string of 11 characters. If #I, #D, and #S denote respectively the number of insertions, deletions, and substitutions in the sequence L, this observation leads to the following condition:
A sequence L represents an alignment for S1 and S2 only if #I + #S = 10 and #D + #S = 11.
For sequences S1 and S2 of arbitrary lengths n and m respectively, this condition generalizes to the following:
A sequence L represents an alignment for S1 and S2 only if #I + #S = n and #D + #S = m.
Exercise 1. Are all three alignments in Example 1 of equal biological importance? Explain why or why not.
Exercise 2. Verify that the conditions #I + #S = 10 and #D + #S = 11 are satisfied for all of the alignments presented in Example 1.
Exercise 3. Let S1 and S2 are the same sequences as above. Does each of the sequences L represent an alignment? Explain why or why not. In case L is an alignment, give the table of corresponding pairs as in Example 1.
a) L: SSSISSSSSSS
b) L: ISSSSSSSSSS
c) L: SSSDSSSSSSII
d) L: SSIDSSSIDISS
2. Alignments as Paths on a Graph
Once an alignment is understood as a sequence L formed of the characters I, D, and S, a more visual way to represent that alignment is to view it as a path on the alignment graph of the sequences of S1 and S2. If S1 has length n and S2 has length m, the alignment graph is a given as triangular lattice of height n and width m such as that in Figure 1. The DNA bases in S1 label the rows of the lattice while the DNA bases in S2 label the columns. The alignment graph of our example sequences S1 and S2 is presented in Figure 1A.
![]() |
The alignment graph is a directed graph. In directed graphs, access from one vertex to an adjacent vertex via a connecting edge is allowed only in the specified direction. For the alignment graphs we just described, the possible directions are: one horizontal step from left to right; one vertical step from top to bottom, and one diagonal step from the upper left to the lower right corner. These possibilities are depicted in Figure 1B.
A path between
two vertices on the alignment graph is a walk along the edges of the graph
in the permissible directions that connects these two vertices. In what
follows we will only be interested in paths on the alignment graph from
the upper left to the lower right vertices of the graph. In Figure 1A those
vertices are marked by the symbol
. Many such paths exist, of course.
Figure 2 depicts three paths that connect these vertices.
Consider now Figure 1B again. If for any path on the alignment graph, a record is made for the directions followed on each step, using the labels H, I, and S, as depicted in the figure, each path on the alignment graph can be viewed as a sequence formed from these three characters. The black path in Figure 2 then corresponds to the sequence L: SSIDDSSSDIISSI and the red path corresponds to L: DDDISSSSSSIIIID. Notice that for these paths #I + #S = 10 and #D + #S = 11. The opposite is also true: any sequence formed of the characters H, I, and S for which #I + #S = 10 and #D + #S = 11, defines a path between the upper left and lower right vertices of the alignment graph. For instance, the blue path in Figure 2 represents the alignment of S1 and S2 defined in Example 1-b).
The following should now be clear: An alignment of two sequences S1 and S2 can be represented by a path from the upper left to the lower right vertices on the corresponding alignment graph.
Exercise 4. In Figure 2, trace the paths that correspond to the alignments from Exercise 1-a) and 1-c).
3. Alignments as Weighted Paths on a Graph
Although every path on the alignment graph between the upper left and lower right vertices corresponds to an alignment, there are many paths that correspond to biologically meaningless alignments. For instance, a path that goes straight down to the bottom of the graph and then all the way to the right can be used to align any two sequences, even when there may not be a single base-pair match. On the other hand, alignments corresponding to paths with more diagonal steps are more likely to produce biologically meaningful matches. The goal is to find alignments that have high biological likelihood.
One way to approach this problem is to assign a cost (also sometimes referred to as penalty or weight) to each of the edges in the alignment graph with exact matches incurring no cost. Insertions, deletions and actual substitutions of bases, on the other hand, are performed at a cost. The actual values of the penalties are based on biological considerations. In this project, we use the following very simple penalty scheme to illustrate the concept.
Exact matches have cost zero;
There is a cost m for a substitution of one base with another (mutation);
There is a cost (gap penalty) x for an insertion or a deletion.
A
good way to visualize these is to consider a refined version of the alignment
graph where we can distinguish between exact matches and mutations within
the group of substitutions. Figure 3 is a modified version of Figure 1B.
Now, either a diagonal or an arc depicts a substitution. A diagonal edge
represents an exact match (Figure 3A). A directed arc depicts a mutation
(Figure 3B). As before, any vertical edges correspond to deletions and
horizontal edges correspond to insertions.
Figure 4A depicts the alignment graph of S1 and S2 that utilizes straight diagonals for exact matches and arcs for mutations. Figure 4B depicts the red and blue paths from Figure 2 on this more refined graph.
The alignment graph with designated costs for each of its edges is the weighted alignment graph for the sequences S1 and S2. The cost of a path is defined as the sum of the costs of all of its edges.
Example 2. The cost of the red path in Figure 4B is 9x + 5p. To see this, notice that the path corresponds to an alignment that contains 5 insertions (corresponding to five vertical edges), 4 deletions (corresponding to the total of horizontal edges in the path), 5 mutations and 1 exact match. Recall that that the cost of any insertion/deletion is x, the cost of a mutation is p, and the cost of an exact alignment is zero. Adding the individual weights for all edges in the path establishes the claim.
□
Exercise 4. Find the costs of the blue and black paths depicted in Figure 2A.
Clearly, the optimal alignment will correspond to a path of minimal cost. Thus, finding the optimal alignment(s) is the same as determining all paths of minimal cost on the weighted alignment graph.
This problem may not seem too difficult at first glance, since the alignment graph has a finite number of edges. One might think that determining all paths on the graph, calculating their weights, and then singling those of minimal weight would not be unreasonable. However, since the number of different paths grows extremely rapidly with the increase of the sequences’ lengths, the task quickly becomes computationally overwhelming. It can be shown that there are 224,143 different paths from the upper left to the lower right verteces on the alignment graph of two sequences of length n = 7 and m = 9. When, as common in biology, we are attempting to align sequences that may have hundreds of DNA bases, the number of all paths becomes intractable.
The mathematical problem of fining the paths of minimal cost on weighted graphs is a classical problem and a number of computationally efficient solution strategies and algorithms are available. An important feature of these algorithms is that they do not need to compute the costs of all paths from the upper left to the lower right vertex of the alignment graph. One such algorithm is the Needelman-Wunsch algorithm. We will not discuss here the details of this and other similar algorithms as they can be found in most standard text of discrete mathematics such as (provide reference). Instead, we have included an interactive applet that implements this algorithm and allows for changeing the costs x and p and observing how these changes affect the optimal alignment.
In closing, we emphasize that we have now described four equivalent ways of representing an alignment of two sequences S1 and S2: 1) as tables of pairwise edits with hyphens in S1 standing for insertions and hyphens in S2 for deletions; 2) as sequences comprised of the letters I, D, and S; 3) as paths on the alignment graph; and 4) as paths on the weighted alignment graph. The last representation is one for which computationally efficient mathematical solution are availables.