03. Identity and Similarity - a quantitative measure

To assess the similarity between two proteins, we first perform a pairwise alignment. Pairwise alignment algorithms look for the best-scoring alignment between two sequences, allowing for gaps. Several tools perform this, including BLAST, FASTA and LALIGN.

After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.

Identity

Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.
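
As a rough sketch of this definition, the function below counts exact matches between two already-aligned sequences. The function name, the use of `-` as the gap character and the example sequences are assumptions made for illustration. The denominator is the shorter ungapped sequence length, which is one common convention among several.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two gapped alignment rows of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both rows hold the same residue (a gap never matches).
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )

    # Denominator: length of the shorter sequence once gaps are removed.
    shorter = min(
        len(aligned_a.replace("-", "")),
        len(aligned_b.replace("-", "")),
    )
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * matches / shorter


# Made-up aligned rows: 6 identical columns, shorter sequence has 8 residues.
print(percent_identity("MKT-AYIAK", "MRTLAY-AK"))  # 75.0
```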

What about residues that aren't an exact match, but are very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity quantifier.

Similarity (aka Positives)

When one amino acid is mutated to a similar residue such that the physicochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. Such a change is far more likely to be tolerated, since the two residues have similar properties and the substitution is less likely to compromise the translated protein.

Thus, the percent similarity of two sequences is based on the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements depend on the criteria used to decide how similar two amino acid residues are to each other.
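
To make the dependence on criteria concrete, here is a minimal sketch that counts identical plus conservatively substituted positions. The residue groups below are a hypothetical, simplified grouping chosen for illustration only; real tools such as BLAST instead count a pair as a positive when it scores above zero in a substitution matrix. Changing the groups changes the result.

```python
# Hypothetical residue groups (an assumption for illustration, not a standard):
# basic, acidic, polar uncharged, aliphatic, aromatic.
CONSERVATIVE_GROUPS = ["KRH", "DE", "STNQ", "AVLIM", "FWY"]


def is_conservative(x: str, y: str) -> bool:
    """True when two different residues fall in the same group above."""
    return x != y and any(x in g and y in g for g in CONSERVATIVE_GROUPS)


def percent_similarity(aligned_a: str, aligned_b: str) -> float:
    """Percent similarity: (identical + conservative) / shorter length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if "-" in (x, y):
            continue  # gapped columns count as neither identical nor similar
        if x == y:
            identical += 1
        elif is_conservative(x, y):
            similar += 1

    # Same denominator convention as for identity: the shorter ungapped length.
    shorter = min(
        len(aligned_a.replace("-", "")),
        len(aligned_b.replace("-", "")),
    )
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * (identical + similar) / shorter


# Made-up rows: 6 identical columns plus one K/R conservative substitution.
print(percent_similarity("MKT-AYIAK", "MRTLAY-AK"))  # 87.5
```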

Similarity = Positives

Keep in mind that in a BLAST search, similarity is also known as Positives!

What it looks like in BLAST

In a BLAST search, part of your results will come out like so:

From this diagram, we can see periods (.), colons (:) and vertical pipes (|). A period means the residues are somewhat similar, while a colon means they are very similar. A vertical pipe signifies a direct match.

Another commonly encountered notation uses a + sign instead of :, and the letter of the matching residue instead of |. For example:
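
The letter-and-plus notation can be imitated with a short sketch. The sequences and the residue groups here are invented for illustration, and the `Query`/`Sbjct` labels simply mimic the usual BLAST layout; this is not how BLAST itself decides which pairs earn a +.

```python
# Hypothetical residue groups used to decide which mismatches earn a "+".
GROUPS = ["KRH", "DE", "STNQ", "AVLIM", "FWY"]


def midline(aligned_a: str, aligned_b: str) -> str:
    """Build the middle line: letter for a match, + for similar, space otherwise."""
    out = []
    for x, y in zip(aligned_a, aligned_b):
        if x == y and x != "-":
            out.append(x)        # identical residues: repeat the letter
        elif "-" not in (x, y) and any(x in g and y in g for g in GROUPS):
            out.append("+")      # conservative substitution
        else:
            out.append(" ")      # gap or dissimilar residues
    return "".join(out)


query = "MKT-AYIAK"
sbjct = "MRTLAY-AK"
print("Query  " + query)
print("       " + midline(query, sbjct))
print("Sbjct  " + sbjct)
# Query  MKT-AYIAK
#        M+T AY AK
# Sbjct  MRTLAY-AK
```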

Calculations

Let's look at a quick example to see how identity and similarity are calculated.

Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical. Thus, we can say that our % identity is

Identity = 100 / 320 = 31.25%.

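Here we use the shorter sequence length as the denominator. Keep in mind that other conventions exist: BLAST output, for instance, reports identities relative to the length of the alignment.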

Additionally, we see that there are 23 amino acids that differ by conservative substitution, meaning that their chemical properties are maintained.

To calculate similarity (a.k.a. positives),

Similarity = (100 + 23) / 320 = 38.44%

Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than or equal to identity. Can you see why?
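
The arithmetic above can be checked with a few lines of Python. The variable and function names are chosen for illustration; the numbers come from the example. Because the count of conservative substitutions can never be negative, the similarity numerator is never smaller than the identity numerator.

```python
def percent(count: int, denominator: int) -> float:
    """Return count / denominator as a percentage rounded to 2 decimals."""
    return round(100.0 * count / denominator, 2)


len_a, len_b = 320, 450   # lengths of Sequence A and Sequence B
identical = 100           # identical residues in the alignment
conservative = 23         # conservative substitutions

# Convention used in this example: divide by the shorter sequence length.
shorter = min(len_a, len_b)

identity = percent(identical, shorter)
similarity = percent(identical + conservative, shorter)

print(identity)    # 31.25
print(similarity)  # 38.44
```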
