03. Identity and Similarity - a quantitative measure

To assess the similarity between two proteins, we first perform a pairwise alignment. Pairwise alignment algorithms search for the best-scoring alignment between two sequences, allowing for gaps. Several tools do this, including BLAST, FASTA and LALIGN. (BLAST and FASTA rely on heuristics for speed, so the alignment they report is not guaranteed to be the optimal one.)

After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.


Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.

What about residues that aren't an exact match but are very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity measure.

Similarity (aka Positives)

When one amino acid is mutated to a similar residue such that the physicochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the positive (+1) charge. Such a change is far more likely to be tolerated, since the two residues have similar properties and the substitution is less likely to compromise the translated protein.

Thus, the percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements depend on the criteria used to decide how similar two amino acid residues are to each other.
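To make the dependence on the similarity criteria concrete, here is a minimal Python sketch that counts identical and similar columns in an alignment. The residue groups in `SIMILAR_GROUPS`, the helper names and the two short sequences are all assumptions made up for illustration. Real tools usually decide similarity from a substitution matrix rather than from fixed groups like these.

```python
# Illustrative grouping of residues with similar properties (an assumption,
# not a standard); real tools derive similarity from a substitution matrix.
SIMILAR_GROUPS = [
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("ST"),     # small, hydroxyl-containing
    set("NQ"),     # amide side chains
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FYW"),    # aromatic
]


def is_similar(a, b):
    """True if two different residues fall in the same illustrative group."""
    return a != b and any(a in g and b in g for g in SIMILAR_GROUPS)


def count_matches(aligned_a, aligned_b):
    """Count identical and similar columns in two gapped, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns count as neither identical nor similar
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1
    return identical, similar


# Toy aligned sequences ('-' marks a gap); not taken from a real search.
identical, similar = count_matches("MKR-LSDE", "MRRALTDQ")
print(identical, similar)  # 4 identical columns, 2 similar columns
```

Changing the groups changes the similar count but not the identical count, which is why two tools can agree on identity and still disagree on similarity.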

Similarity = Positives

Keep in mind that on a BLAST search, similarity is also known as Positives!

What it looks like in BLAST

In a BLAST search, part of your results will come out like so:

From this diagram, we can see periods (.), colons (:) and vertical pipes (|). A period means the residues are somewhat similar, while a colon means they are very similar. A vertical pipe signifies a direct match.

Another commonly encountered notation uses a + sign instead of :, and the letter of the matching residue instead of |. For example:
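As a rough sketch of how that letter-and-plus notation is built, the Python snippet below generates the middle line for two aligned sequences. The sequences, the `midline` helper and the `SIMILAR_PAIRS` set are hypothetical and chosen only for illustration. This is not real BLAST output, and BLAST decides which pairs count as positives from its scoring matrix rather than from a hand-written list.

```python
# Hypothetical set of "similar" residue pairs, for illustration only.
SIMILAR_PAIRS = {frozenset("KR"), frozenset("DE"), frozenset("ST"), frozenset("IL")}


def midline(aligned_a, aligned_b):
    """Build the middle line: letter for identity, '+' for similar, space otherwise."""
    chars = []
    for a, b in zip(aligned_a, aligned_b):
        if a == b and a != "-":
            chars.append(a)        # direct match: repeat the residue letter
        elif frozenset((a, b)) in SIMILAR_PAIRS:
            chars.append("+")      # conservative substitution
        else:
            chars.append(" ")      # mismatch or gap
    return "".join(chars)


# Toy aligned sequences, not taken from a real search.
query = "MKTLSD"
sbjct = "MRTISE"
print(query)
print(midline(query, sbjct))  # M+T+S+
print(sbjct)
```

Counting the letters in the middle line gives the identities, and counting letters plus + signs gives the positives.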


Let's look at a quick example to see how identity and similarity are calculated.

Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical. Thus, we can say that our % identity is

Identity = 100 / 320 = 31.25%.

Here we use the length of the shorter sequence as the denominator. (Conventions differ: BLAST output, for instance, reports identities relative to the length of the alignment.)

Additionally, we see that 23 amino acids differ by conservative substitution, meaning that their chemical properties are maintained.

To calculate similarity (a.k.a. positives),

Similarity = (100 + 23) / 320 = 38.44%

Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is never lower than identity. Can you see why?
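The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation as described here: the variable names and the `percent` helper are made up for illustration, and the shorter sequence length is used as the denominator to follow the convention in this example.

```python
def percent(count, length):
    """Express count as a percentage of length."""
    return 100.0 * count / length


len_a, len_b = 320, 450             # sequence lengths in amino acids
identical, conservative = 100, 23   # counts read off the alignment

# Convention used in this example: divide by the shorter sequence length.
denominator = min(len_a, len_b)

identity = percent(identical, denominator)
similarity = percent(identical + conservative, denominator)

print(f"Identity:   {identity:.2f}%")    # Identity:   31.25%
print(f"Similarity: {similarity:.2f}%")  # Similarity: 38.44%
```

Because the similarity numerator is the identity numerator plus a count that cannot be negative, similarity can equal identity but never fall below it.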
