Aligning Sequences

We will refer to any member of the set {A, C, G, T, -} as a character and to any subset an alphabet. Throughout this paper σ denotes a sequence over the alphabet {A, C, G, T}. Sequences can be expanded by inserting the character at - anywhere in the sequence. This includes inserting - at the very beginning and at the very end. Expanded sequences are denoted by μ.

As an example, given the sequence ACCTGAC, the following are expanded sequences by inserting the character -: A - CCTGAC, and -ACCTGAC.

An alignment of two sequences σ¹ and σ² is to expand them by inclusion of - so that the expanded sequences are of same length and are in 1-1 correspondence. We will momentarily precisely define this notion, but first let us look at an example. Let σ¹ = ACTGT and σ² = CACCCG be two sequences. Let μ¹ = -ACT - GT, and μ² = CACCCG- be the corresponding expanded sequences. Now μ¹ and μ² can be put in 1-1 correspondence:

Definition 1.1. A substitution, denoted by S, is an assignment of a character from the set {A, C, G, T }to a character of the set {A, C, G, T }, i.e., it is an ordered pair in {A, C, G, T }×{A, C, G, T }.

Thus, for example, in the previous correspondence μ¹ ↔ μ², the second, third, fourth, and sixth assignments (read from left to right) are substitutions, the fourth substitution is a mismatch; others are identities.

Definition 1.2. An insertion, denoted by I, is an assignment -→{A, C, G, T }, i.e, an ordered pair in {-}×{A, C, G, T }.

Thus, for example, in μ¹ ↔ μ², the first and fifth assignments are insertions.

Definition 1.3. A deletion, denoted by D, is an assignment of a character from {A, C, G, T }to -, i.e, an ordered pair in {A, C, G, T }×{-}.

Definition 1.4. An edit string is a finite sequence of operations S, D, I.

Each edit string induce two expansions of a given sequence. One, literally an insertion operation, is to insert the character - where an I is found in the string. In order to execute all the edits and edit every member of the original sequence, the length of the sequence must be the sum of the number of S and the number of D in the edit sequence.

The other is a deletion operation formed by inserting - where a D is found (It is rather ironic that deletion operation is formed to expand a sequence, but note the name was used simply because D are used in someway to expand the sequence). In this case we tacitly assume that the sum of number of S and number of I in the edit string is length of the original sequence.

Definition 1.5. An alignment of an ordered pair of sequences (σ¹, σ²) is an edit string that induces an insertion expansion μ¹ on σ¹ and a deletion expansion μ² on σ² such that μ¹ is in one-to-one correspondence with μ².

Let n and m be the lengths of σ¹ and σ². From the discussion preceding the definition, it is clear that an edit string is an alignment iff

Example 1.1. Let σ¹ = ACGTAGC and σ² = ACCGAGACC. Then the edit string SSISISSIDS is an alignment of σ¹ and σ² because

S S I S I S S I D S μ1 = A C - G - T A - G C ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ 2 μ = A C C G A G A C - C

2. Alignment graph

Alignment of sequences can be made into a problem of determining paths in directed graphs.

Definition 2.1. The alignment graph _n,m is the directed graph on the set of nodes {0, 1, …, n}×{0, 1, …, m}and three classes of directed edges horizontal, vertical, and diagonal defined below:

Horizontal:: These directed edges [(i, j) → (i, j + 1)], 0 ≤ i ≤ n, 0 ≤ j ≤ m - 1.
Vertical:: These are the edges [(i, j) → (i + 1, j)] 0 ≤ i ≤ n- 1, 0 ≤ j ≤ m.
Diagonal:: These are the edges [(i, j) → (i + 1,j + 1)], 0 ≤ i ≤ n - 1, 0 ≤ j ≤ m - 1.

2.1. Identification of sequences on the graph. To align two sequences σ¹ and σ² of length n and m, the left most edges of the zeroth column of

_n,m are identified with the terms of the sequence σ¹, i.e., σ₁¹ ↔ [(0, 0) → (1, 0)], …σ_n¹ ↔ [(n - 1, 0) → (n, 0)]. Likewise, the m directed edges of the zeroth row of

_n,m are identified with the elements of σ². Notice we go with rows and columns rather than x and y coordinates.

2.2. Interpretation of S, I, and D on the graph. Insertions I are interpreted as the directed horizontal edges, more precisely, [(i, j) → (i, j + 1)], 1 ≤ i ≤ n - 1, 0 ≤ j ≤ m - 1 is the insertion of - between the ith and i + 1 characters of the first sequence σ¹. That - corresponds to σ_j² which was earlier identified with the edge [(0, j), (0,j + 1)]. Segments [(0, j) → (i, j + 1)], 0 ≤ j ≤ m- 1 are the insertions - at the very beginning of σ¹ and assigning that to σ_j². Similarly interpret the insertions along the very lower edges.

Deletions D are interpreted as directed vertical segments. More precisely, the directed vertical edge [(i, j) → (i + 1, j)], 0 ≤ i ≤ n - 1, 0 ≤ j ≤ m is the assignment of σ_i+1¹ (which was identified with the segment [(i, 0), (i + 1, 0)]) to -.

Directed diagonals [(i, j) → (i + 1, j + 1)], 1 ≤ i ≤ n, 1 ≤ j ≤ m are substitutions S. Thus the substitution [(i, j) → (i + 1, j + 1)], 0 ≤ i ≤ n - 1, 0 ≤ j ≤ m - 1 means the assignment σ_i+1¹ → σ_j+1².

Thus every edit string corresponds to a unique path from (0, 0 ) to (n, m) and conversely. In other words aligning two sequences is equivalent to determining a (continuous) path from (0, 0) to (n, m).

An immediate consequence of the identification is the following recurrence relation.

Corollary 2.1. Let a_n,m, n ≥ 1, m ≥ 1 be the total number of paths from (0, 0) to (n, m). Then

an,m = an -1,m + an,m-1 + an-1,m-1, a0,0 = 1,

and we have set a_n,-1 = 0 and a_-1,m = 0.

The proof is obvious once we look at the 3 nodes in

_n,m from which (n, m) can be reached.

Example 2.1.

σ21 σ22 σ23 σ24 σ25 σ26 σ27 σ28 σ29 o----|---|----|----|---|----|---|----|----| σ11 \\\ \\ \\ \\ \\ \\ \\ |\\ \\ | |-\\\---\----\----\-|-\-|--\-|--\|--\-|--\-| σ12 \\ \\ \\ \\ \\\ \\ \\ |\\ \\ | |--\-|-\-|--\-|--\-|\\-|--\-|--\|--\-|--\-| σ13 \\ \\ \\ \\ \\ \\\\ \\ |\\ \\ | |--\-|-\-|--\-|--\-|-\-|--\\-|--\|--\-|--\-| σ14 \\ \\ \\ \\ \\ \\ \\\ |\\ \\ | |--\-|-\-|--\-|--\-|-\-|--\-|-\\|--\-|--\-| σ15 \\ \\ \\ \\ \\ \\ \\ \\\ \\ | |--\-|-\-|--\-|--\-|-\-|--\-|--\|-\\\-|--\-| σ1 \\ \\ \\ \\ \\ \\ \\ |\\ \\ | 6 |--\-|-\-|--\-|--\-|-\-|--\-|--\|--\-|--\-| σ1 \\ \\ \\ \\ \\ \\ \\ |\\ \\\ | 7 |--\-|-\-|--\-|--\-|-\-|--\-|--\|--\-|-\\\-o

The above path is the alignment

S I I I S S S S D S μ1 = σ11 - - - σ12 σ13 σ14 σ15 σ16 σ17 ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ μ2 = σ2 σ2 σ2 σ2 σ2 σ2 σ2 σ2 - σ2 1 2 3 4 5 6 7 8 9

2.3. Weighted paths. As a means of distinguishing paths, we now assign weights to the edges of the graph, that is, a weight is assigned for each substitution, insertion, or deletion. The insertion [(i, j) → (i, j + 1)] is assigned the weight w(-, σ_j+1²). The deletion [(i, j) → (i + 1, j)] is assigned the weight w(σ_i+1¹, -). The substitution [(i- 1, j - 1) → (i, j)] is assigned the weight w(σ_i¹, σ_j²). These weight matrices are given, below; the (5, 5) entry is left blank since a correspondence insertion to deletion (-, -) is, of course, not used. The weight of a path is the total weight of edge of the path and it is viewed as the cost of the path.

Example 2.2. The following illustrate the path corresponding to the edit sequence SSISISSIDS and the weights of the first few edges.

A C C G A G A C C o-----|----|----|-----|----|----|-----|----|----| A \\ \\ \\ \ \ \ \ \\ \ \ \ \ \\ \ \ | |---\\-|--\-|--\-|---\-|--\-|--\-|---\-|--\-|--\-| \ \\ \ \ \ \ \ \ \ | C |-\-\-|\\\\---\\---\\--|\-\-|-\\-|-\-\-|\-\-|-\\-| \ \ \ \\ \ \ \ \\ \ | G | \ |\ | \ | \\ | \ | \ | \ |\\ | \ | |---\-|--\-|--\-|---\\----\----\-|---\-|--\\-|--\-| T \ \ \\ \ \ \ \ \\ |--- \ \ \\ \ \ | |---\-|--\-|--\-|---\-|--\-|--------\-|--\-|--\-| \ \ \ \ \ \ \ \ \ | A |-\-\-|\-\-|-\\-|-\-\-|\-\-|-\\-|-\\---\-\-|-\\-| \ \ \ \ \ \ \ \ \ | G | \ |\ | \ | \ |\ | \ | \ | \ | \ | |---\-|--\-|--\-|---\-|--\-|--\-|---\-|--\-|--\-| C \ \ \\ \ \ \ \ \\ \ \ \ \ \\ \\ \\ | |---\-|--\-|--\-|---\-|--\-|--\-|---\-|--\-|--\\-o

A C C |-----w(- ,A)------|-----w(-,C)------------w(- ,C)------| | | | | A w(A,- ) w(A,A) w(A,-) w(A,C) w(A,- ) w(A,C) w(A,-) | | | | | | | | |-----w(- ,A)------|-----w(-,C)------|-----w(- ,C)------| | | | | | | | | C w(C,-) w(C,A) w(C,-) w(C,C) w(C,- ) w(C,C) w(C,-) | | | | |------w(- ,A)------|-----w(-,C)------------w(-,C)------| | | | | | | | | G w(G,- ) w(G,A) w(G,-) w(G,C) w(G,- ) w(G,C) w(G,-) | | | | | | | | ------w(- ,A)------ -----w(-,C)------------w(-,C)------

The weight of the edit string (path) SSISISSIDS applied to σ¹ and σ² in example (1.1)

w(A, A) + w(C, C) + w( - , C) + w(G, G) + w( - , A) + w(T, G)+
w(A, A) + w(- , C) + w(G, - ) + w(C, C).

For simplicity, the diagonal entries are taken as zero. All deletions and insertions will be given the same weight. Thus the all fifth column entries and 5th row entries are the same. Finally all mismatch substitutions will have the same weight.

3. The algorithm

The algorithm provides the path(s) of minimal cost to reach all points on the graph from (0, 0).

The concept of the algorithm is that given a path of total minimal weight, the minimal weight path to any point on the graph is the sub path of the path reaching the point.

The weight of any path lying wholly on the top row (zeroth row) is the sum of weights of the horizontal segments of the graph. The only path that can reach a point on the top row has to lie wholly on the top row. Similarly for the vertical path lying wholly on the left most column (zeroth column).

Now consider the three paths that can reach (1, 1): S, ID, or DI. Pick the path with minimal (total) weight and color it in blue. Now consider the point (1, 2). It can be reach from and only from (1, 1), (0, 1), or ( 0, 2). The minimal weight path all those 3 points is now known. Hence obtain the minimal path to (1, 2) and color it blue. Continuing in this manner we will have the minimal path to every point.

Now we will formally write the algorithm. Here M(i, j) is the total weight of along the minimal path of reaching i the row jth column position. First initialize the algorithm by defining M(0, 0) = 0. Then M(0, j) = M(0, j - 1) + w(-, σ_j²), 1 ≤ j ≤ m and M(i, 0) = M(i - 1, 0) + w(σ_i¹, -), 1 ≤ i ≤ n.

ALIGNING SEQUENCES

1. Aligning two sequences

2. Alignment graph

3. The algorithm