1.
Alignments as Sequences of Deletions, Insertions, and Substitutions
Mathematically,
the pairwise DNA-sequence alignment problem begins by providing two
sequences S1
and S2 composed from the four characters A, C, G, or T. The
following sequences present an example:
S1:
AGTGTTCCAG
S2: AATCGTTACAG
An
alignment
of two sequences is a record of “edits” (or lack thereof) in the bases
in S1
that leads to the sequence S2. Allowable edit operations are:
Deletion
(D):
delete a base
Insertion
(I): insert a new base
Substitution
(S): replace a base with another base
Notice
that a
substitution may not represent an actual “edit” as it does not rule out
that a
base may be “replaced” by exactly the same base. Biologically, the
latter would
correspond to a match. A substitution reflecting an actual change of
the base
would correspond to a mutation.
Example
1. Below
are some possible alignments
of the sequences S1 and S2 given as tables of
corresponding pairs, together with the list of operations in L,
describing
the alignment. A hyphen in S1 represents an insertion, while
a
hyphen in S2 corresponds to a deletion. The sequence L
represents
the transformation of S1 into S2 by enacting the
edits
listed in L on S1 form left to right.
a) S1:
A-GTGTTCCAG
S2: AATCGTTACAG
L: SISSSSSSSSS
(1 insertion, 9
substitutions, 0
deletions)
b) S1:
-AGT-GTTCCAG
S2: AA-TCGTTACAG
L: ISDSISSSSSSS
(2 insertions, 8
substitutions, 1
deletion)
c) S1:
---------AGTGTTCCAG
S2: AATCGTTACAG--------
L: IIIIIIIIIIDDDDDDDDDD
(9
insertions, 2 substitutions, 8 deletions)
□
Notice
that
not any sequence of edits represents an alignment. This only happens
when the
table of corresponding pairs will result in two sequences of equal
length. For
instance,
L:
IISSSSSSSSSS cannot
represent an alignment for S1
and S2, since two insertions at the beginning of S1 will
result in a
string of 12 characters that cannot be matched to a string of 11
characters.
If #I, #D, and #S denote respectively
the number of insertions, deletions, and substitutions in the sequence
L, this
observation leads to the following condition:
A
sequence
L represents an alignment for S1 and S2 only if
#I + #S = 10 and #D + #S = 11.
For
sequences
S1 and S2 of arbitrary lengths n and m
respectively, this condition generalizes to the following:
A
sequence
L represents an alignment for S1 and S2 only if
#I + #S = n and #D + #S = m.
Exercise
1. Are all
three
alignments in Example 1 of equal biological importance? Explain why or
why not.
Exercise
2. Verify
that the
conditions #I + #S = 10 and #D + #S = 11
are satisfied for all of the alignments presented in Example 1.
Exercise 3. Let S1 and S2 are the same sequences as above. Does each of the sequences L represent an alignment? Explain why or why not. In case L is an alignment, give the table of corresponding pairs as in Example 1.
a) L: SSSISSSSSSS
b) L: ISSSSSSSSSS
c) L: SSSDSSSSSSII
d) L: SSIDSSSIDISS
2.
Alignments as Paths on a Graph
Once
an
alignment is understood as a sequence L formed of the characters I, D,
and S, a
more visual way to represent that alignment is to view it as a path on
the alignment
graph of the sequences of S1 and S2. If S1 has length n and S2
has length m, the alignment graph is a given as triangular
lattice of
height n and width m such as that in Figure 1. The DNA
bases in S1
label the rows of the lattice while the DNA bases in S2
label the
columns. The alignment graph of our example sequences S1 and
S2
is presented in Figure 1A.
The
alignment
graph is a directed graph. In directed graphs, access from one
vertex to
an adjacent vertex via a connecting edge is allowed only in the
specified
direction. For the alignment graphs we just described, the possible
directions
are: one horizontal step from left to right; one vertical step from top
to
bottom, and one diagonal step from the upper left to the lower right
corner.
These possibilities are depicted in Figure 1B.
A path between two
vertices on
the alignment graph is a walk along the edges of the graph in the
permissible
directions that connects these two vertices. In what follows we will
only be
interested in paths on the alignment graph from the upper left to the
lower
right vertices of the graph. In Figure 1A those vertices are marked by
the
symbol
. Many such paths
exist,
of course. Figure 2 depicts three paths that connect these vertices.
Consider
now
Figure 1B again. If for any path on the alignment graph, a record is
made for
the directions followed on each step, using the labels H, I, and S, as
depicted
in the figure, each path on the alignment graph can be viewed as a
sequence
formed from these three characters. The black path in Figure 2 then
corresponds
to the sequence L:
SSIDDSSSDIISSI and
the
red path corresponds to
L:
DDDISSSSSSIIIID. Notice
that for these paths #I + #S = 10 and #D + #S = 11. The opposite is
also true:
any sequence formed of the characters H, I, and S for which #I + #S =
10 and #D
+ #S = 11, defines a path between the upper left and lower right
vertices of
the alignment graph. For instance, the blue path in Figure 2 represents
the alignment
of S1 and S2 defined in Example 1-b).
The
following
should now be clear: An alignment of two sequences S1
and S2
can be represented by a path from the upper left to the lower right
vertices on
the corresponding alignment graph.
Exercise 4. In Figure 2, trace the paths that correspond to the alignments from Exercise 1-a) and 1-c).
3.
Alignments as Weighted Paths on a Graph
Although
every path on the alignment graph between the upper left and lower
right
vertices corresponds to an alignment, there are many paths that
correspond to
biologically meaningless alignments. For instance, a path that goes
straight
down to the bottom of the graph and then all the way to the right can
be used to align any
two sequences, even when there may not be a single base-pair match. On
the
other hand, alignments corresponding to paths with more diagonal steps
are more
likely to produce biologically meaningful matches. The goal is to find
alignments that have high biological likelihood.
One
way to
approach this problem is to assign a cost (also sometimes
referred to as
penalty or weight) to each of the edges in the alignment
graph
with exact matches incurring no cost. Insertions, deletions and actual
substitutions
of bases, on the other hand, are performed at a cost. The actual values
of the
penalties are based on biological considerations. In this project, we
use the
following very simple penalty scheme to illustrate the concept.
Exact matches
have cost zero;
There
is a cost m for a substitution of one base with another
(mutation);
There is a cost (gap penalty) x
for an insertion or a deletion.
A
good way to
visualize these is to consider a refined version of the
alignment graph where we can distinguish between exact matches and
mutations
within the group of substitutions. Figure 3 is a modified version of
Figure 1B.
Now, either a diagonal or an arc depicts a substitution. A diagonal
edge
represents an exact match (Figure 3A). A directed arc depicts a
mutation
(Figure 3B). As before, any vertical edges correspond to deletions and
horizontal edges correspond to insertions.
Figure 4A depicts the alignment graph of S1 and S2 that utilizes straight diagonals for exact matches and arcs for mutations. Figure 4B depicts the red and blue paths from Figure 2 on this more refined graph.
The
alignment
graph with designated costs for each of its edges is the weighted
alignment
graph for the sequences S1 and S2. The cost of a path is defined as the
sum of the costs of all of its edges.
Example
2. The cost
of the red path in Figure
4B is 9x + 5p. To see this, notice that the path
corresponds to
an alignment that contains 5 insertions (corresponding to five vertical
edges),
4 deletions (corresponding to the total of horizontal edges in the
path), 5
mutations and 1 exact match. Recall that that the cost of any
insertion/deletion is x, the cost of a mutation is p, and
the cost of an exact alignment is zero. Adding
the individual weights for all edges in the path establishes the claim.
□
Exercise
4. Find
the costs of
the blue and black paths depicted in Figure 2A.
Clearly,
the
optimal alignment will correspond to a path of minimal cost. Thus, finding
the optimal alignment(s) is the same as determining all paths of
minimal cost
on the weighted alignment graph.
This
problem
may not seem too difficult at first glance, since the alignment graph
has a
finite number of edges. One might think that determining all paths on
the
graph, calculating their weights, and then singling those of minimal
weight
would not be unreasonable. However, since the number of different paths
grows
extremely rapidly with the increase of the sequences’ lengths, the task
quickly
becomes computationally overwhelming. It can be shown that there are
224,143
different paths from the upper left to the lower right verteces on the
alignment
graph of two sequences of length n = 7 and m = 9. When, as common in
biology,
we are attempting to align sequences that may have hundreds of DNA
bases, the
number of all paths becomes intractable.
The
mathematical problem of fining the paths of minimal cost on weighted
graphs is
a classical problem and a number of computationally efficient solution
strategies and algorithms are available. An important feature of these
algorithms is that they do not need to compute the costs of all paths
from the upper left to the lower right vertex of the alignment graph.
One such
algorithm is the Needelman-Wunsch algorithm. We will not
discuss here the
details of this and other similar algorithms as they can be found in
most
standard text of discrete mathematics such as (provide
reference). Instead, we have included an
interactive applet that implements this algorithm and allows for
changeing the costs x and p and observing how these
changes affect the
optimal alignment.
In closing, we emphasize that we have now described four equivalent ways of representing an alignment of two sequences S1 and S2: 1) as tables of pairwise edits with hyphens in S1 standing for insertions and hyphens in S2 for deletions; 2) as sequences comprised of the letters I, D, and S; 3) as paths on the alignment graph; and 4) as paths on the weighted alignment graph. The last representation is one for which computationally efficient mathematical solution are availables.