This page is based on an older tutorial by Andreas Mattern.
When we align and look at two sequences we are looking for evidence that they have diverged from a common ancestor through mutation and selection (Durbin et al pg 13). For purposes of this page, the difference between global and local is not important, but please remember such difference.
There are a number of ways in which
sequences diverge from each other:
Substitution: changing one
translated product into another
Insertion: adding residues
to a sequence
Deletion: removing residues
from a sequence
Insertions and Deletions are at times called "indels" and have the complicating factor that they may or may not be "in frame".
Let's
learn through example; Let's say
your sequence looks like this:
gattacacanttccaagacct
So,
we have a 21 nt sequence -- base composition: g=2, a=7, c=6, t=5, n=1
Let's
look at this series of variations of this sequence and the alignments it
generates:
Query: gattacacanttccaagacct
Database: gattacacanttccaagacct -->
perfect match!
Translated: DYTXPRP
Translated: DYTXPRP
--> perfect match!
7/7 = 100% similarity
Now we make one change:
Query: gattacacanttccaagacct
Database: gattacacatttccaagacct
(remember
that the n might still be a t! )
20/21 = 95% similarity
Translated: DYTXPRP
Translated: DYTFPRP (maybe
the X = F ! )
6/7 = 86% similarity
One new ( silent) mismatch!:
Query: gattacacanttccaagacct
Database: gattacacctttccaagacct
19/21 = 90% similarity
Translated: DYTXPRP
Translated: DYTFPRP --> no
change in the amino acid base!
6/7 = 86% similarity
Then we produce more silent mutations:
Query: gattacacanttccaagacct
Database: gactatacgttcccccggccc
9 mismatches!
12/21 = 57% similarity
Translated: DYTXPRP
Translated: DYTFPRP --> no
change in the aa!
6/7 = 86% similarity
See how the DNA score degrades, yet the protein score remains the same
Now we add other types of mutations...
Query: gattacacanttccaagacct
Database: gattacaca---ccaagacct --> a
3 nt gap!
Translated: DYTXPRP
Translated: DYT-PRP --> producing a 1 aa gap!
...
Query: gattacacanttccaagacct
Database: gattaca---ttccaagacct -->
an out-of-frame 3 nt gap!
Translated: DYTXPRP --> which results in a 1 aa
substitution...
Translated: DYI-PRP ...and
a 1 aa gap!
...
Query:
gattacacan-----ttccaagacct
Database: gattacacanaatgcttccaagacct --> a
5 bp insertion!
Translated: DYTXPRP --> now things are getting..
Translated: DYTXCFQD ..a little
more complicated, the reading frame has been modified!
Even in this small nucleotide sequence with very simplistic changes, you
can see how quickly things can get messy in pairwise alignment strategies.
How
could we devise a scoring scheme that would:
Allow us to see the
sequences which are "pretty similar" to ours (i.e. putative
homologues)
Let us ignore those sequences
which are "pretty similar" just by chance.
Let's look at
the simplest case: aligning two nucleotide sequences.
Using our example...
Query: gattacacanttccaagacct
--> perfect match!
Database: gattacacanttccaagacct ... 21/21 =
100% similarity
...we
could use the percent similarity score, but this can be tricky because
similarity between two sequences is not taking into account the length of the
alignment, so how do we balance between an alignment of lenght = 30 with 60%
similarity and a second alignment of lenght 200 with 60% similarity?.... ...
A short discussion , of percentage similarity (or
identity) as a single reported statistic.
...so
we see that percent similarity alone is not a good statistic to report, but as
a part of the overall report, it is very helpful to have this information. Now
look at this case ...
Query: gattacacanttccaagacct
Database: gattacaca---ccaagacct a 3 nt
gap!
… how do we deal
with insertions or deletions (indels)?
Generally,
a score is determined by:
1. giving a positive score to "matches" and a negative score to "mismatches"
2. Examining a "substitution matrix" comprised of those scores
For arguments sake let's make a match = +2 and a mismatch = -1
The
substitution matrix would look like this:
|
A |
C |
T |
G |
A |
+2 |
-1 |
-1 |
-1 |
C |
-1 |
+2 |
-1 |
-1 |
T |
-1 |
-1 |
+2 |
-1 |
G |
-1 |
-1 |
-1 |
+2 |
So
without giving any values to gaps in...
Query: gattacacanttccaagacct
Database: gattacaca---ccaagacct --> a 3 nt gap!
..we
obtain:
Score
= +2+2+2+2+2+2+2+2+2-1-1-1+2+2+2+2+2+2+2+2+2 = 33
How
does that compare with:
Query: gattacacanttccaagacct
--> perfect match!
Database: gattacacanttccaagacct 21/21 = 100%
similarity
Score
= 21*2 = 42
or
with:
Query: gattacacanttccaagacct
Database: gactatacgttcccccggccc
--> 9 mismatches
Score
= +2+2-1+2+2-1+2+2-1-1+2-1+2+2-1-1+2-1+2+2-1 = 15
Pretty simple! isn't it?. But we know that this isn't really the
greatest scoring scheme in the world. The example with the gap for instance --
a larger gap (some 60 nucleotides, at least) could easily be explained by an
intron, and it should be fared better than a pair of sequences with the same
amount of mismatches; this does not apply in all gap vs. mismatch cases though,
and we might want to favor placing mismatches instead of gaps in other alignment
cases. So, depending on what we are looking for, we can also give a more or
less negative score to gaps relative to the value given to mismatches (and as
we will see later, to the extension of gaps). Modern local alignment programs (such as version 2 of BLAST
and all versions of FASTA) let’s us penalize gaps as well as modifying the
scoring matrix (some include the “-“ gap symbol as another letter whithin the
matrix).
Would the substitution scoring matrix apply for amino acids?
Also, but a substitution matrix for comparing 2 amino acid sequences
gets more complicated, as we know that some amino acids, while not identical in
chemical structure, are considered functionally conserved (i.e. D,
E, and K are all charged amino acids, while V, I, and L are all hydrophobic),
and might be functionally equivalent when replaced by members of the same
chemical properties. Remember the neutral
theory of mutation by kimura?
Here's an example of a commonly used
substitution matrix,
the PAM 250 matrix. Notice how some substitutions
don't have a negative score:
Now
let's return to out discussion of sequence comparisons. (close this window)