03. Identity and Similarity - a quantitative measure

To assess the similarity between two proteins, we first perform a pairwise alignment. Pairwise alignment algorithms search for the best-scoring alignment between two sequences, allowing for gaps. Several tools perform this, including BLAST, FASTA and LALIGN.

After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.


Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.

What about some residues that aren't quite exact, but very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity quantifier.

Similarity (aka Positives)

When one amino acid is mutated to a similar residue such that the physicochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the positive (+1) charge. Such a change is far more likely to be tolerated, since the two residues have similar properties and the substitution is less likely to compromise the translated protein.

Thus, the percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements depend on the criteria used to decide how similar two amino acid residues are to each other.
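As a minimal sketch of these two definitions, the Python below counts identical and conservatively substituted columns in a pair of already-aligned sequences. The grouping in `CONSERVATIVE_GROUPS` is an illustrative assumption, not a standard; BLAST instead counts a pair as a positive when it has a positive score in the substitution matrix (BLOSUM62 by default for proteins). The function names and the choice of the shorter ungapped sequence length as the denominator are also assumptions made for this example.

```python
# Illustrative grouping of residues by physicochemical property (an assumption,
# not a standard; real tools use a substitution matrix instead).
CONSERVATIVE_GROUPS = [
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FYW"),    # aromatic
]


def is_conservative(a, b):
    """Return True if a and b are different residues from the same group."""
    return a != b and any(a in g and b in g for g in CONSERVATIVE_GROUPS)


def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) for two gapped, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    identical = 0
    conservative = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
        elif is_conservative(a, b):
            conservative += 1

    # Denominator: ungapped length of the shorter sequence (assumed convention).
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    identity = 100.0 * identical / shorter
    similarity = 100.0 * (identical + conservative) / shorter
    return identity, similarity


# Hypothetical aligned pair, made up for this example.
identity, similarity = identity_and_similarity("MKV-LDE", "MRVALEE")
print(f"identity={identity:.2f}% similarity={similarity:.2f}%")
```

Changing the groups changes the similarity figure but never the identity figure, which is why similarity depends on the criteria chosen.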

Similarity = Positives

Keep in mind that on a BLAST search, similarity is also known as Positives!

What it looks like in BLAST

In a BLAST search, part of your results will come out like so:

In output like this, you will see periods (.), colons (:) and vertical pipes (|). A period means the residues are somewhat similar, while a colon means they are very similar. A vertical pipe signifies a direct match.

Another commonly encountered notation uses a + sign instead of : and the letter of the matching residue instead of |.
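To make the letter-and-plus notation concrete, here is a small Python sketch that builds such a middle line for two already-aligned sequences: the letter itself for an identical pair, `+` for a conservative substitution, and a space otherwise. The helper `midline`, the `SIMILAR_PAIRS` set and the two sequences are hypothetical, chosen only for illustration; BLAST derives `+` from its scoring matrix rather than from a hand-written list.

```python
# Hypothetical set of residue pairs treated as conservative substitutions.
SIMILAR_PAIRS = {frozenset(p) for p in ("KR", "DE", "ST", "IL", "IV", "LV", "FY", "NQ")}


def midline(aligned_a, aligned_b):
    """Build a BLAST-style middle line for two equal-length aligned strings."""
    marks = []
    for a, b in zip(aligned_a, aligned_b):
        if a == b and a != "-":
            marks.append(a)      # identical residues: show the letter
        elif frozenset((a, b)) in SIMILAR_PAIRS:
            marks.append("+")    # conservative substitution
        else:
            marks.append(" ")    # mismatch or gap
    return "".join(marks)


# Made-up aligned sequences, not taken from a real search.
query = "MKVLDSF"
sbjct = "MRVIDTY"
print("Query  " + query)
print("       " + midline(query, sbjct))
print("Sbjct  " + sbjct)
```

Counting the letters in the middle line gives the identities, and counting letters plus `+` signs gives the positives.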


Let's look at a quick example to see how identity and similarity are calculated.

Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical. Thus, we can say that our % identity is

Identity = 100 / 320 = 31.25%.

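Here we use the length of the shorter sequence as the denominator. Be aware that conventions differ: BLAST itself reports identities and positives as a fraction of the alignment length.

Additionally, we see that there are 23 amino acids that differ by conservative substitution, meaning that their chemical properties are maintained.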

To calculate similarity (a.k.a. positives),

Similarity = (100 + 23) / 320 = 38.44%

Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than or equal to identity. Can you see why?
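The arithmetic of this example can be checked with a few lines of Python. The variable names are chosen for this sketch; the numbers are those of the worked example, with the shorter sequence length as the denominator.

```python
len_a, len_b = 320, 450      # sequence lengths from the example
identical = 100              # identical residues in the alignment
conservative = 23            # conservative substitutions

denominator = min(len_a, len_b)   # shorter sequence, as in the text
identity = 100 * identical / denominator
similarity = 100 * (identical + conservative) / denominator

print(f"Identity   = {identity:.2f}%")
print(f"Similarity = {similarity:.2f}%")
```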
