03. Identity and Similarity - a quantitative measure

To assess the similarity between two proteins, we first perform a pairwise alignment. Pairwise alignment methods look for the best-scoring alignment between two sequences, allowing for gaps. Several programs do this, including BLAST, FASTA and LALIGN.

After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.


Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.

What about residues that aren't an exact match, but are very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity measure.

Similarity (aka Positives)

When one amino acid is mutated to a similar residue such that the physicochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. Such a change is far more likely to be tolerated, since the two residues have similar properties and the substitution is less likely to compromise the translated protein.

Thus, the percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution), expressed as a percentage. Similarity measurements depend on the criteria used to decide how similar two amino acid residues are to each other.
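As a rough illustration, the two measures can be sketched in Python. Everything here is an assumption made for the example: the residue groups in `SIMILAR_GROUPS` are a simplified, hypothetical stand-in for the substitution matrix a real tool would use, the function names are invented, and the denominator is the length of the shorter sequence, the convention used in the worked example later in this tutorial.

```python
# Hypothetical, simplified grouping of amino acids by shared properties.
# Real tools decide similarity with a substitution matrix, not fixed groups.
SIMILAR_GROUPS = [
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("ST"),     # small, hydroxyl-containing
    set("NQ"),     # amide side chains
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FYW"),    # aromatic
]


def is_similar(a, b):
    """Return True if a and b are different residues from the same group."""
    return a != b and any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Both strings must have the same length, with gaps written as `gap`.
    Assumption: the denominator is the ungapped length of the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if gap in (a, b):
            continue  # a gapped column is neither identical nor similar
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1

    shorter = min(len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, "")))
    identity = 100.0 * identical / shorter
    similarity = 100.0 * (identical + similar) / shorter
    return identity, similarity


# Made-up alignment: 4 identical columns, 3 conservative substitutions.
print(identity_and_similarity("MKRLE-AVG", "MRKLDTAVW"))  # (50.0, 87.5)
```

Changing the groups changes the similarity value but not the identity value, which is why similarity always has to be quoted together with the criteria used.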

Similarity = Positives

Keep in mind that in BLAST search results, similarity is reported as Positives!

What it looks like in BLAST

In a BLAST search, part of your results will come out like so:

From this diagram, we can see periods (.), colons (:) and vertical pipes (|). A period means the residues are somewhat similar, while a colon means they are very similar. A vertical pipe signifies a direct match.

Another commonly encountered notation uses a + sign instead of :, and the letter of the matching residue instead of |. For example:
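To make the two notations concrete, here is a small Python sketch that tallies the characters of a match line under either convention. The match lines used are made up for illustration and are not real BLAST output, and the function names `classify` and `tally` are hypothetical.

```python
from collections import Counter

# Meaning of each character in the symbol-based notation.
SYMBOL_NOTATION = {
    "|": "identical",
    ":": "very similar",
    ".": "somewhat similar",
}


def classify(symbol):
    """Map one match-line character to a category under either notation."""
    if symbol in SYMBOL_NOTATION:
        return SYMBOL_NOTATION[symbol]
    if symbol == "+":
        return "similar"      # letter-based notation: conservative substitution
    if symbol.isalpha():
        return "identical"    # letter-based notation: the matching residue
    return "mismatch or gap"  # blank space in the match line


def tally(match_line):
    """Count how many alignment columns fall into each category."""
    return Counter(classify(symbol) for symbol in match_line)


# The same made-up alignment written in both notations.
print(tally("|::|: ||"))  # symbol-based
print(tally("M++L+ AV"))  # letter-based
```

In both cases the tally has four identical columns, three similar ones and one mismatch or gap; only the characters used to mark them differ.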


Let's look at a quick example to see how identity and similarity are calculated.

Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical. Thus, we can say that our % identity is

Identity = 100 / 320 = 31.25%.

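Here we use the length of the shorter sequence as the denominator. Other conventions exist: BLAST itself reports identities relative to the length of the alignment.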

Additionally, we see that 23 amino acids differ by conservative substitution, meaning that their chemical properties are maintained.

To calculate similarity (a.k.a. positives),

Similarity = (100 + 23) / 320 = 38.44%

Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than or equal to identity. Can you see why?
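The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the helper name `percent_of_shorter` is made up, and it assumes the shorter-sequence denominator used in this example. Because the conservative substitutions are added on top of the identical matches, the similarity value can never fall below the identity value.

```python
def percent_of_shorter(count, len_a, len_b):
    """Express `count` as a percentage of the shorter sequence length."""
    return 100.0 * count / min(len_a, len_b)


len_a, len_b = 320, 450  # lengths of Sequence A and Sequence B
identical = 100          # exact matches in the alignment
conservative = 23        # conservative substitutions

identity = percent_of_shorter(identical, len_a, len_b)
similarity = percent_of_shorter(identical + conservative, len_a, len_b)

print(f"Identity:   {identity:.2f}%")    # Identity:   31.25%
print(f"Similarity: {similarity:.2f}%")  # Similarity: 38.44%
```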
