To assess the similarity between two proteins, we first perform a pairwise alignment. Pairwise alignment algorithms search for the best-scoring alignment between two sequences, allowing for gaps. Several tools perform this, including BLAST, FASTA and LALIGN.
After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.
Identity is the percentage of amino acids (or nucleotides) that match exactly in the alignment.
What about residues that aren't exact matches but are very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity measure.
When one amino acid is mutated to a similar residue such that the physicochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 charge. Such a change is far more likely to be tolerated, since the two residues have similar properties and the substitution is unlikely to compromise the translated protein.
Thus, the percent similarity of two sequences counts both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements depend on the criteria used to decide how similar two amino acid residues are to each other.
Keep in mind that on a BLAST search, similarity is also known as Positives!
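To make the idea concrete, here is a minimal Python sketch that counts identical and similar positions in two already-aligned sequences. The residue groups in SIMILAR_GROUPS and the helper names are assumptions made for illustration only; real tools such as BLAST decide what counts as a positive from a scoring matrix, not from fixed groups like these.

```python
# Hypothetical groups of residues with similar properties. This is an
# assumption for illustration; it is not the criterion BLAST uses.
SIMILAR_GROUPS = [
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("ST"),    # hydroxyl-containing
    set("NQ"),    # amide-containing
    set("ILVM"),  # aliphatic / hydrophobic
    set("FYW"),   # aromatic
]


def is_similar(a, b):
    """Return True if both residues fall into the same hypothetical group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def count_matches(aligned_a, aligned_b, gap="-"):
    """Count identical and similar (non-identical) positions in an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped positions are neither identical nor similar
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1
    return identical, similar


identical, similar = count_matches("MKRLE-V", "MRKIEAV")
print(identical, similar)
```

Changing the groups changes the similarity count but not the identity count, which is why similarity depends on the criteria chosen.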
In a BLAST search, part of your results will come out like so:
From this diagram, we can see periods (.), colons (:) and vertical pipes (|). A period means the residues are somewhat similar, while a colon means they are very similar. A vertical pipe signifies a direct match.
Another commonly encountered notation uses a + sign instead of :, and the letter of the matching residue instead of |. For example:
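The letter-and-plus notation can be sketched in a few lines of Python. The function below builds the middle line for two aligned sequences: the residue letter for a direct match, + for a similar pair, and a space otherwise. The SIMILAR_PAIRS set and the function name midline are hypothetical choices for this sketch, not the rules any particular program applies.

```python
# Hypothetical set of residue pairs treated as similar (an assumption for
# illustration; real programs derive this from a scoring matrix).
SIMILAR_PAIRS = {frozenset("KR"), frozenset("DE"), frozenset("IL")}


def midline(aligned_a, aligned_b, similar_pairs=SIMILAR_PAIRS, gap="-"):
    """Build the line shown between two aligned sequences."""
    chars = []
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            chars.append(" ")   # gap: nothing to report
        elif a == b:
            chars.append(a)     # direct match: show the residue letter
        elif frozenset((a, b)) in similar_pairs:
            chars.append("+")   # similar residues
        else:
            chars.append(" ")   # mismatch
    return "".join(chars)


seq_a = "MKDLA"
seq_b = "MREIG"
print(seq_a)
print(midline(seq_a, seq_b))
print(seq_b)
```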
Let's look at a quick example to see how identity and similarity are calculated.
Say Sequence A has 320 amino acids (AA), while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical. Thus, we can say that our % identity is
% identity = (100 / 320) × 100 = 31.25%
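Here, we use the length of the shorter sequence as the denominator. Be aware that other conventions exist; BLAST, for instance, reports identities and positives relative to the alignment length.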
Additionally, we see that 23 amino acids differ by conservative substitution, meaning that their chemical properties are maintained.
To calculate similarity (a.k.a. positives), we add the identical and the similar residues:
% similarity = ((100 + 23) / 320) × 100 = 38.44%
Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than or equal to identity. Can you see why?
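The arithmetic from this example can be checked with a short Python sketch. The function names are hypothetical, and the code assumes the convention used in this example, where the shorter sequence length is the denominator.

```python
def percent_identity(identical, len_a, len_b):
    """Percent identity, using the shorter sequence length as the denominator."""
    return 100.0 * identical / min(len_a, len_b)


def percent_similarity(identical, similar, len_a, len_b):
    """Percent similarity: identical plus similar residues over the shorter length."""
    return 100.0 * (identical + similar) / min(len_a, len_b)


# Values from the worked example: 320 AA and 450 AA sequences,
# 100 identical residues and 23 conservative substitutions.
identity = percent_identity(100, 320, 450)
similarity = percent_similarity(100, 23, 320, 450)
print(f"{identity:.2f}% identical, {similarity:.2f}% similar")
```

Because the similarity count is the identity count plus a non-negative number of similar residues, the second value can never fall below the first.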