We can see that BLAST is able to align two sequences - but how does it pick the best alignment? To answer this, we have to look at scoring matrices, which assign a score to each gap or residue pairing.
Given the following alignment, what score would it receive? And how would it decide which residue pairs are similar rather than identical?
...ARFSGTWYAMAK...
    : .||||.:|
...QKVAGTWYSLAM...
One key point to notice is that some amino acid substitutions are more physicochemically acceptable than others. For example, arginine mutating into lysine isn't that bad, since both have positively charged side chains. However, if arginine mutated to glutamic acid, the charge would change from +1 to -1. Such a drastic change may render the protein useless!
We must also account for the fact that some amino acids are encoded by more codons than others. For example, serine is encoded by six different codons, while tryptophan has only one, making it statistically more probable for serine to show up.
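The codon counts above can be checked with a short sketch. It assumes the standard genetic code, written as the usual compact 64-letter string in T, C, A, G base order; the names `codon_table` and `codon_counts` are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code in compact form (assumed here): codons are
# enumerated with bases in the order T, C, A, G, first base varying slowest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its one-letter amino acid ('*' marks a stop).
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count how many codons encode each amino acid.
codon_counts = Counter(codon_table.values())

print("Serine codons:    ", codon_counts["S"])
print("Tryptophan codons:", codon_counts["W"])
```

Codon count alone does not determine how often a residue appears in real proteins, but it is one reason the background frequencies are not uniform.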
There are two amino acid substitution matrices that help score alignments. The first is the PAM substitution matrix, which is based on the rate of divergence between species. BLOSUM, an alternative to PAM, is based on the conservation of domains in proteins.
Let's first discuss PAMs.
In 1978, Dayhoff and her group introduced the concept of accepted point mutations (abbreviated PAM, since that is easier to pronounce than APM). In short, a PAM is the mutation of an amino acid residue that has been accepted by natural selection.
To see which amino acid changes are accepted in protein evolution, Dayhoff et al. examined 1572 changes in 71 groups of closely related proteins (sharing at least 85% identity) and recorded every amino acid substitution. Thus, the analysis was based solely on substitutions observed among closely related proteins.
The Dayhoff group calculated the relative mutability (R_ij) per amino acid:
R_ij = M_ij / f_i
Here, M_ij is the probability of residue j changing to residue i in a given evolutionary interval. The denominator f_i represents the frequency of residue i occurring by chance.
Thus, the mutability index is an odds ratio: it indicates how much more likely the substitution is than would be expected by chance.
With this, Dayhoff et al. generated a table listing the relative mutabilities of amino acids, normalized to integers:
They found that Asn, Ser, and Asp were most likely to mutate, while Leu, Cys and Trp were least likely.
Dayhoff also tallied the frequencies of amino acids across all proteins. If all amino acids were equally probable in protein sequences, they would all have frequencies of 1.00/20 = 0.05, but they don't.
As you can see, mutation rates vary across the amino acids, and Dayhoff found an empirical way to normalize for variations in AA composition and mutation rate. From Dayhoff's experiment, we can see the characteristics of mutable and less mutable residues.
More mutable residues:
Less mutable residues:
With these results, Dayhoff built a 20 x 20 mutation probability matrix that gives a probability for each amino acid change. This is known as the PAM1 matrix, and it describes proteins that have undergone 1% change (1 accepted point mutation per 100 amino acid residues). The PAM1 matrix is based on alignments of protein sequences that share at least 85% identity.
One important assumption made in generating PAM matrices is that each amino acid change is independent of previous mutational events at that site. This assumption makes PAM a Markov model.
So why is this assumption important? Because of it, we can generate further PAM matrices for sequences that are separated by longer periods of evolutionary history. For example, a PAM250 matrix is simply the PAM1 matrix raised to the 250th power (250 copies of PAM1 multiplied together). This matrix applies to alignments that share about 20% amino acid identity, which represents roughly 2500 million years of evolution.
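The extrapolation from PAM1 to PAM250 can be sketched with a toy example. The 3 x 3 matrix below is hypothetical and stands in for the real 20 x 20 PAM1 matrix; its values are made up for illustration, and `mat_mul` and `mat_pow` are helper names introduced here. Entry [i][j] is read as the probability of residue j changing to residue i, so each column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-residue "PAM1-like" matrix (NOT real Dayhoff data).
# Column j holds the probabilities of residue j becoming each residue i.
toy_pam1 = [
    [0.990, 0.004, 0.003],
    [0.006, 0.992, 0.002],
    [0.004, 0.004, 0.995],
]

# Under the Markov assumption, 250 steps of evolution is the 250th power.
toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print(["%.3f" % value for value in row])
```

After many steps the diagonal entries (the probability of a residue staying unchanged) fall well below their PAM1 values, which is why higher-numbered PAM matrices suit more divergent sequences.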
When BLAST scores an alignment, it doesn't use the probability matrix as seen above. It converts the matrix elements into integer numbers and produces a log-odds scoring matrix.
Here, s_ij is the score for aligning residues i and j. Simply put, a high value means the two residues align well. The numerator q_ij is the probability that residue i mutates to residue j, and the denominator is the probability of that pairing occurring by chance.
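As a rough sketch of the conversion, the function below turns a substitution probability and a chance probability into an integer log-odds score. It assumes the 10 x log10 scaling conventionally associated with Dayhoff-style matrices, followed by rounding to the nearest integer; the function name and the input values are hypothetical and are not taken from the real PAM250 matrix.

```python
import math


def log_odds_score(q_ij, chance_probability):
    """Integer log-odds score: 10 * log10(observed / expected), rounded.

    The 10 * log10 scaling is an assumption based on the usual
    Dayhoff convention; other matrices use different scalings.
    """
    return round(10 * math.log10(q_ij / chance_probability))


# Hypothetical inputs, chosen only to show the sign of the score.
print(log_odds_score(0.20, 0.10))  # substitution twice as likely as chance
print(log_odds_score(0.10, 0.10))  # exactly as likely as chance
print(log_odds_score(0.05, 0.10))  # half as likely as chance
```

A positive score means the pairing is seen more often than chance would predict, zero means it is neutral, and a negative score means it is seen less often.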
Let's take the residue M (methionine) and calculate the score for its mutation to L (leucine). Both amino acids have hydrophobic side chains, so they should align well - we expect a positive score. We will use the PAM250 mutation matrix.
This is the value found in the log-odds matrix for PAM250. With this matrix, we can now score our alignments! But wait - what about gaps? Let's see how to handle those on the next page!