05. Dayhoff Model & accepted point mutations (PAMs)

We can see that BLAST is able to align two sequences - but how does it pick the best alignment? To answer this, we have to look at scoring matrices, which assign a score to each gap or residue pairing in an alignment.

If we had the following alignment, what score would it have? How would it decide which residues are similar rather than identical?

...ARFSGTWYAMAK...
    : .||||.:|
...QKVAGTWYSLAM...
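As a first look at this alignment, before any matrix is involved, the sketch below simply counts identical positions. The function name `count_identities` is made up for illustration and is not part of BLAST. It ignores similar residues entirely, which is exactly the information a scoring matrix adds.

```python
def count_identities(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both sequences carry the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


# The two aligned segments from the example above (flanking "..." removed).
top = "ARFSGTWYAMAK"
bottom = "QKVAGTWYSLAM"

identities = count_identities(top, bottom)
percent_identity = 100 * identities / len(top)
print(f"{identities} identical positions out of {len(top)} ({percent_identity:.1f}%)")
```

Only the G, T, W, Y and second-to-last A positions are identical, so five of the twelve positions count. Pairs such as R/K contribute nothing here, even though they are chemically similar.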

Considerations for a scoring matrix

One key point to notice is that some amino acid substitutions are more physicochemically acceptable than others. For example, arginine mutating into lysine isn't that bad since both have positively charged side chains. However, if arginine mutated to glutamic acid, the charge would change from +1 to -1. Such a drastic change may render the protein useless!

Lysine and Arginine both have a positively charged side chain (+1).

We must also account for the fact that some amino acids are encoded by more codons than others. For example, serine has six different codons, while tryptophan only has one, making it statistically more probable for serines to show up.
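To check codon counts like these, the following sketch tallies how many codons encode each amino acid in the standard genetic code. It assumes the standard code (NCBI translation table 1) written in its usual compact form; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1) in compact form:
# codons are enumerated with bases in the order T, C, A, G, and each
# character of AMINO_ACIDS is the residue for the matching codon
# ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of distinct codons that encode each residue.
codons_per_residue = Counter(codon_table.values())

for aa in "SWLM":  # serine, tryptophan, leucine, methionine
    print(aa, codons_per_residue[aa])
```

In the standard code, serine (S) is encoded by six codons and tryptophan (W) by a single one, so unequal codon counts alone already skew the expected amino acid frequencies.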

Two widely used families of amino acid substitution matrices help score alignments. The first is the PAM substitution matrix, which is based on the rate of divergence between closely related proteins. BLOSUM, an alternative to PAM, is based on the conservation of domains in proteins.

Let's first discuss PAMs.

Accepted Point Mutations (PAM)

In 1978, Dayhoff and her group introduced the concept of accepted point mutations (abbreviated PAM, since it's easier to pronounce than APM). In short, a PAM is the mutation of an amino acid residue that is accepted by natural selection.

To see which amino acid changes are accepted in protein evolution, Dayhoff et al. examined 1572 changes in 71 groups of closely related proteins (sharing at least 85% identity), and observed all amino acid substitutions. Thus, this experiment was based solely on observations among closely related proteins.

Relative Mutability

The Dayhoff group calculated the relative mutability (Rij) per amino acid.

Rij = Mij / fi

Here, Mij is the probability of residue j changing to i in a given evolutionary interval. The denominator fi represents the frequency of residue i occurring by chance.

Thus, the mutability index is an odds ratio: it indicates how much more likely the mutation is than the same pairing occurring by chance.

With this, Dayhoff et al. generated a table listing the relative mutabilities of amino acids, normalized to integers:

Residue   Relative mutability
Asn       134
Ser       120
Asp       106
...       ...
Leu       40
Cys       20
Trp       18

They found that Asn, Ser, and Asp were most likely to mutate, while Leu, Cys and Trp were least likely.

Normalized frequencies of AA

Dayhoff also tallied up the frequencies of AA's across all proteins. If all amino acids were equally probable in protein sequences, they would all have frequencies of 1.00/20 = 0.05, but they don't.

Residue   Normalized frequency
Gly       0.089
Ala       0.087
Leu       0.085
...       ...
Tyr       0.030
Met       0.015
Trp       0.010
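To make the comparison with the uniform value of 0.05 concrete, the sketch below divides each listed frequency by 1/20. It uses only the six residues shown in the table; the elided rows are not filled in, and the variable names are illustrative.

```python
# Normalized frequencies copied from the table above (only the six
# residues shown there; the elided rows are deliberately left out).
observed = {
    "Gly": 0.089, "Ala": 0.087, "Leu": 0.085,
    "Tyr": 0.030, "Met": 0.015, "Trp": 0.010,
}

UNIFORM = 1.0 / 20  # frequency if all 20 amino acids were equally common

# Ratio > 1 means the residue is more common than the uniform expectation.
enrichment = {aa: freq / UNIFORM for aa, freq in observed.items()}

for aa, ratio in enrichment.items():
    label = "over" if ratio > 1 else "under"
    print(f"{aa}: {ratio:.2f}x uniform ({label}-represented)")
```

Glycine comes out at roughly 1.8 times the uniform expectation, while tryptophan sits at about one fifth of it.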

Mutation rates vary

As you can see, mutation rates vary across the amino acids, and Dayhoff found an empirical way to normalize for variations in AA composition and mutation rate. From Dayhoff's experiment, we can see the characteristics of mutable and less mutable residues.

More mutable residues:

  • are much easier to replace. Asparagine, for example, is part of the uncharged polar amino acids group, which contains six other amino acids with similar properties.

Less mutable residues:

  • serve important functions. If a residue carries a particular charge, the protein will most likely need that charge to stay at that position.
  • are difficult to replace. Take a look at tryptophan's unique structure below and you'll see why it'd be so difficult to replace it!
Tryptophan's unique amino acid structure.

PAM matrices

With these results, Dayhoff built a 20 x 20 mutation probability matrix that gives a probability for each amino acid change. This is known as the PAM1 matrix, which shows the probabilities of changes for proteins that have undergone 1% change (1 accepted point mutation per 100 amino acid residues). The PAM1 matrix is based on the alignment of protein sequences that share at least 85% identity.

PAM1 mutation probability matrix from the Dayhoff group. All values multiplied by 10,000.

One important assumption made in the generation of PAM matrices is that each amino acid change is independent of previous mutational events at that site. This assumption is what makes PAM a Markov model.

So why is this assumption important? From this, we can generate more PAM matrices to be used for sequences that are separated by longer periods of evolutionary history. For example, a PAM250 matrix is simply the PAM1 matrix raised to the 250th power, i.e. repeatedly multiplied by itself. This matrix applies to alignments that share about 20% amino acid identity, which represents about 2500 million years of evolution.
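Because each step depends only on the current residue, extrapolating to longer evolutionary distances amounts to repeated matrix multiplication. The sketch below illustrates this with a hypothetical 3-residue matrix; the values in `toy_pam1` are invented for the example and are not Dayhoff's data. Columns stand for the original residue, so each column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Start from the identity matrix, i.e. "zero PAMs of evolution".
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# HYPOTHETICAL 3-residue "PAM1": entry [i][j] is the probability that
# residue j is replaced by residue i after one PAM unit. Each residue
# stays unchanged 99% of the time, mirroring the 1% change of PAM1.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)

# Columns of a probability matrix must still sum to 1 after powering.
column_sums = [sum(row[j] for row in toy_pam250) for j in range(3)]

for row in toy_pam250:
    print(" ".join(f"{value:.3f}" for value in row))
print("column sums:", [round(s, 6) for s in column_sums])
```

In this toy case the diagonal drops from 0.99 toward 1/3, showing how the chance of still seeing the original residue decays with evolutionary distance while each column remains a valid probability distribution.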

A PAM250 matrix. Each column has been adjusted so that the columns sum to 100.

Probability to Log-Odds Scoring

When BLAST scores an alignment, it doesn't use the probability matrix as seen above. It converts the matrix elements into integer numbers and produces a log-odds scoring matrix.

si,j = 10 * log10 (qi,j / pi)

Here, si,j is the score for aligning residues i and j. Simply put, a high value means the pair aligns well. qi,j is the probability that residue i mutates to residue j, taken from the mutation probability matrix. The denominator pi is the normalized frequency of residue i, that is, the probability of seeing it by chance.

Let's take the residue M (methionine) and calculate the score for aligning it with another M, i.e. methionine being conserved. An identical match should align well, so we expect a positive score. We will use the PAM250 mutation probability matrix, where qM,M = 0.06, together with the normalized frequency of methionine, pM = 0.015.

sM,M = 10 * log10 (qM,M / pM)
sM,M = 10 * log10 (0.06 / 0.015)
sM,M = 10 * log10 (4) ≈ 6
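The same arithmetic can be reproduced in a few lines. This is a minimal sketch that assumes the logarithm is base 10 (the choice that reproduces the value above) and rounds to the nearest integer; `log_odds_score` is an illustrative helper, not a BLAST function.

```python
import math


def log_odds_score(q_ij: float, p_i: float) -> int:
    """Return 10 * log10(q_ij / p_i), rounded to the nearest integer."""
    return round(10 * math.log10(q_ij / p_i))


q_mm = 0.06   # PAM250 probability of M remaining M (value used in the text)
p_m = 0.015   # normalized frequency of methionine (from the frequency table)

score_mm = log_odds_score(q_mm, p_m)
print(score_mm)
```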

This is the value found in the log-odds matrix for PAM250. With this matrix, we can now score our alignments! But wait - what about gaps? Let's see how to handle those on the next page!

References

Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345–352. National Biomedical Research Foundation, Silver Spring, MD.
