An alternative to the PAM matrix is BLOSUM (BLOcks SUbstitution Matrix), which was derived by Henikoff and Henikoff in 1992. NCBI uses BLOSUM62 as the default matrix for protein BLAST.

BLOSUM matrices are derived from comparisons of blocks of sequences from the Blocks database.

A block is an ungapped multiple alignment of a short, highly conserved region.

The blocks database contains multiple alignments of conserved regions in protein families.

The Henikoffs developed a database of "blocks" based on sequences with shared motifs. More than 2,000 blocks of aligned sequence segments were analyzed from more than 500 groups of related proteins. Within each block, they counted the relative frequencies of amino acids and their substitution probabilities.

The Henikoffs used blocks for several reasons:

- Multiple alignments are required, and similar sequences are easier to align.
- They didn't want to complicate the calculations with insertions and deletions.
- They wanted to focus on conserved regions when computing the scoring matrix.

A BLOSUM tells us the likelihood of occurrence of each pairwise substitution, and we can use these values to score a pairwise comparison.

Each scoring matrix is constructed based on how identical the sequences in the ungapped multiple sequence alignments are. For example, BLOSUM62 is derived from blocks containing *at most* 62% identity in the ungapped sequence alignments.

Here we'll show you how to calculate a BLOSUM.

Before we start constructing a matrix BLOSUM r, we have to eliminate the sequences that are more than r% identical. This protects us from the bias introduced by databases that over-represent certain classes of proteins. To do this, we have two options:

- Remove sequences from the block.
- Replace the similar sequences with a new sequence that represents the cluster.

For example, consider the following block of seven sequences:

```
ACD
DCE
DCE
DCE
BCE
BCD
ACB
```

Since most databases today over-represent certain proteins, the two extra copies of DCE should be eliminated to make our block more representative.
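The first option (removing sequences) can be sketched in a few lines of Python. This is only an illustration, not the Henikoffs' actual procedure: the helper names `percent_identity` and `filter_block` are hypothetical, and the threshold `r=70` is an assumption chosen so that only the exact duplicates are dropped from this toy block (with 3-residue sequences, the only possible identities are 0%, 33%, 67% and 100%).

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences match."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def filter_block(block: list[str], r: float) -> list[str]:
    """Keep a sequence only if it is at most r% identical to every sequence kept so far."""
    kept: list[str] = []
    for seq in block:
        if all(percent_identity(seq, k) <= r for k in kept):
            kept.append(seq)
    return kept


block = ["ACD", "DCE", "DCE", "DCE", "BCE", "BCD", "ACB"]

# r=70 is an illustrative threshold (an assumption, not a value from the text)
reduced = filter_block(block, r=70)
print(reduced)
```

The greedy "keep the first representative" rule is the simplest choice. Replacing each cluster with a weighted representative, the second option above, would need more bookkeeping.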

Thus, after eliminating redundancies, we look at the first vertical column in our block:

```
A
D
B
B
A
```

Let's count how many times each possible pair of residues occurs within the column.

For the AA pair, we have 1 possible combination; for AB or BA we have 4. For AD we have 2. We continue these calculations, column by column, until the occurrences of all possible pairs are found.

| Pair | Column 1 count | Column 2 count | Column 3 count | Total |
|---|---|---|---|---|
| AA | 1 | 0 | 0 | 1 |
| AB or BA | 4 | 0 | 0 | 4 |
| AD or DA | 2 | 0 | 0 | 2 |
| BB | 1 | 0 | 0 | 1 |
| BD or DB | 2 | 0 | 2 | 4 |
| BE or EB | 0 | 0 | 2 | 2 |
| CC | 0 | 10 | 0 | 10 |
| DD | 0 | 0 | 1 | 1 |
| DE or ED | 0 | 0 | 4 | 4 |
| EE | 0 | 0 | 1 | 1 |

Note that the total sum is 30 (10 pairs in each of the 3 columns), which we can use to normalize our matrix.
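The pair counting can be checked with a short script. This is a minimal sketch that tallies every unordered pair of residues within each column of the reduced block. The function name `count_pairs` is hypothetical, and the reduced block is taken to be the five sequences left after dropping the duplicate DCE entries.

```python
from collections import Counter
from itertools import combinations

# Reduced block (duplicates of DCE already removed)
reduced = ["ACD", "DCE", "BCE", "BCD", "ACB"]


def count_pairs(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):  # zip(*block) iterates over the columns
        for x, y in combinations(column, 2):  # every pair of rows
            counts[tuple(sorted((x, y)))] += 1  # AB and BA share one key
    return counts


counts = count_pairs(reduced)
for pair, c in sorted(counts.items()):
    print("".join(pair), c)
print("total:", sum(counts.values()))
```

Sorting each pair before counting is what merges "AB or BA" into a single entry.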

| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 1 | | | | |
| B | 4 | 1 | | | |
| C | 0 | 0 | 10 | | |
| D | 2 | 4 | 0 | 1 | |
| E | 0 | 2 | 0 | 4 | 1 |

To obtain integer values for our scoring matrix, we need to find the score per cell. We can do this with the following equation:

s_{ij} = log_{2}(q_{ij}/e_{ij})

Where `e_{ij}` is the expected frequency of the pair (computed further below) and `q_{ij}` is its observed frequency:

q_{ij} = c_{ij} / T

`c_{ij}` is the cell value as calculated above.

To calculate the total `T`:

T = w * n(n-1) / 2

Where `w` is the number of columns and `n` is the number of sequences. With `T`, we can calculate `q_{ij}`, the observed frequency of each residue pair.

In our case, `w = 3` and `n = 5`, so `T = 3 * 5 * 4 / 2 = 30`. Let's divide all our cells by `30`.

| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0.0333 | | | | |
| B | 0.133 | 0.0333 | | | |
| C | 0 | 0 | 0.333 | | |
| D | 0.0667 | 0.133 | 0 | 0.0333 | |
| E | 0 | 0.0667 | 0 | 0.133 | 0.0333 |
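As a sanity check on the normalization, the sketch below recomputes `T` from the block dimensions and divides each pair count by it. The variable names are illustrative, and the counts are rebuilt from the reduced block so the snippet runs on its own.

```python
from collections import Counter
from itertools import combinations

block = ["ACD", "DCE", "BCE", "BCD", "ACB"]  # reduced block
n, w = len(block), len(block[0])  # n sequences, w columns

# T = w * n(n-1) / 2: number of residue pairs over all columns
T = w * n * (n - 1) // 2

# c_ij: unordered pair counts per column, summed over columns
counts = Counter(
    tuple(sorted(pair))
    for column in zip(*block)
    for pair in combinations(column, 2)
)

# q_ij = c_ij / T
q = {pair: c / T for pair, c in counts.items()}

print("T =", T)
for pair in sorted(q):
    print("".join(pair), round(q[pair], 4))
```

Because every pair is counted exactly once, the `q_{ij}` values sum to 1.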

Now `p_{i}`, the probability of occurrence of residue `i`, can be found with the following equation:

p_{i} = q_{ii} + ∑_{j≠i}(q_{ij}/2)

p_{A} = ( 1 + 6/2 ) / 30 = 0.133

p_{B} = ( 1 + 10/2 ) / 30 = 0.200

p_{C} = 10 / 30 = 0.333

p_{D} = ( 1 + 10/2 ) / 30 = 0.200

p_{E} = ( 1 + 6/2 ) / 30 = 0.133

As a check, the five values sum to 1.
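The formula for `p_{i}` can be applied mechanically: each diagonal count contributes fully to its residue, and each off-diagonal count is split evenly between the two residues involved. The following is a rough sketch under the same assumptions as before (reduced block of five sequences, illustrative variable names).

```python
from collections import Counter, defaultdict
from itertools import combinations

block = ["ACD", "DCE", "BCE", "BCD", "ACB"]  # reduced block
n, w = len(block), len(block[0])
T = w * n * (n - 1) // 2

counts = Counter(
    tuple(sorted(pair))
    for column in zip(*block)
    for pair in combinations(column, 2)
)

# p_i = q_ii + sum over j != i of q_ij / 2
p = defaultdict(float)
for (i, j), c in counts.items():
    if i == j:
        p[i] += c / T  # diagonal term q_ii
    else:
        p[i] += c / (2 * T)  # half of q_ij goes to residue i
        p[j] += c / (2 * T)  # and half to residue j

for residue in sorted(p):
    print(residue, round(p[residue], 3))
```

For an ungapped block like this one, the result is the same as the plain fraction of each residue in the block (for example, B appears 3 times among 15 residues).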


The expected frequencies:

e_{ii} = p_{i}^{2}

e_{ij} = 2p_{i}p_{j} (i ≠ j)


| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0.0178 | | | | |
| B | 0.0533 | 0.04 | | | |
| C | ? | ? | 0.111 | | |
| D | 0.0533 | 0.08 | ? | 0.04 | |
| E | ? | 0.0533 | ? | 0.0533 | 0.0178 |

Notice that I didn't calculate the cells (marked `?`) whose observed count was `0`: you'll see that we don't need these values in the actual scoring matrix.
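The expected frequencies follow directly from the residue probabilities. The sketch below takes a shortcut that is an assumption on my part rather than a step from the text: it obtains `p_{i}` as the fraction of each residue in the reduced block, which gives the same numbers as the pair-based formula for an ungapped block. Unlike the table, it fills in every cell, including those left as `?`.

```python
from collections import Counter
from itertools import combinations_with_replacement

block = ["ACD", "DCE", "BCE", "BCD", "ACB"]  # reduced block

# Residue probabilities p_i (shortcut: fraction of each residue in the block)
residues = Counter("".join(block))
total = sum(residues.values())
p = {aa: k / total for aa, k in residues.items()}

# e_ii = p_i^2 and e_ij = 2 * p_i * p_j for i != j
e = {
    (i, j): p[i] ** 2 if i == j else 2 * p[i] * p[j]
    for i, j in combinations_with_replacement(sorted(p), 2)
}

for pair in sorted(e):
    print("".join(pair), round(e[pair], 4))
```

The factor of 2 for `i ≠ j` accounts for the two orderings (ij and ji) folded into one unordered pair, so the expected frequencies also sum to 1.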

Now we have all we need! Just plug in values from the two matrices above into the equation below to obtain our scoring matrix.

s_{ij} = log_{2}(q_{ij}/e_{ij})

To obtain integer scores, we multiply the log-odds value by two and round to the nearest integer.

s_{ij} = round (2 * log_{2}(q_{ij}/e_{ij}))

| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 2 | | | | |
| B | 3 | -1 | | | |
| C | 0 | 0 | 3 | | |
| D | 1 | 1 | 0 | -1 | |
| E | 0 | 1 | 0 | 3 | 2 |

Cells whose observed count is `0` are left as `0` here, since `log_{2}(0)` is undefined.
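Putting the steps together, the scores for the toy block can be reproduced with the sketch below. It is an illustration under stated assumptions, not a reference implementation: it starts from the reduced five-sequence block, obtains `p_{i}` from residue fractions (equivalent to the pair-based formula for an ungapped block), and only scores pairs that were actually observed, because the log-odds is undefined when `q_{ij} = 0`.

```python
import math
from collections import Counter
from itertools import combinations

block = ["ACD", "DCE", "BCE", "BCD", "ACB"]  # reduced block
n, w = len(block), len(block[0])
T = w * n * (n - 1) // 2

# Observed pair counts c_ij
counts = Counter(
    tuple(sorted(pair))
    for column in zip(*block)
    for pair in combinations(column, 2)
)

# Residue probabilities p_i
residues = Counter("".join(block))
p = {aa: k / (n * w) for aa, k in residues.items()}

# s_ij = round(2 * log2(q_ij / e_ij)), for observed pairs only
scores = {}
for (i, j), c in counts.items():
    q_ij = c / T
    e_ij = p[i] ** 2 if i == j else 2 * p[i] * p[j]
    scores[(i, j)] = round(2 * math.log2(q_ij / e_ij))

for pair in sorted(scores):
    print("".join(pair), scores[pair])
```

A positive score means the pair was seen more often than chance would predict, and a negative score means less often. How unobserved pairs should be scored is a separate design decision that this sketch does not address.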

