01. Pairwise Alignment Introduction

What is Pairwise Alignment?

Pairwise alignment is the process of aligning two DNA, RNA or protein sequences such that the regions of similarity are maximized. This is often performed to find functional, structural or evolutionary commonalities.
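To make "maximizing similarity" concrete, here is a minimal sketch that compares two candidate alignments of the same pair of sequences by counting identical columns. The function name `count_identities`, the `-` gap symbol, and the example sequences are assumptions made for illustration; real alignment tools use richer scoring than a plain identity count.

```python
def count_identities(row_a, row_b, gap="-"):
    """Count columns where both rows hold the same residue (gaps never match)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)


# Two candidate alignments of ACGTTA against CGTTA (hypothetical sequences).
shifted = count_identities("ACGTTA", "CGTTA-")  # gap placed at the end
gapped = count_identities("ACGTTA", "-CGTTA")   # gap placed at the start
print(shifted, gapped)  # prints: 1 5
```

Placing the gap at the start lines up five residues instead of one, so that alignment would be preferred under this simple count.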

In most cases, scientists compare two protein sequences to quantify how related they are and to infer homology (shared ancestry). From the alignment, they can identify common domains and motifs and trace sequence ancestry.

Domains and Motifs

Domains are parts of a DNA or amino acid strand that code for a physicochemically similar feature as found in other sequences and proteins. Domains refer to specific functionalities. For example, you could have an ATP-binding domain or a polar domain.

Motifs are similar, but reference the structural characteristics rather than functional regions. Motifs are often found in domains, although that's not always the case.

Protein vs. DNA sequence alignment

Protein amino acid sequences are preferred over DNA sequences for several reasons.

  • Protein residues are more informative: a change in DNA (especially at the 3rd codon position) does not necessarily change the amino acid.
  • There are 20 amino acids but only 4 nucleotides, so chance matches are less likely and significance is easier to establish.
  • Some amino acids share related biochemical properties, which can be accounted for when scoring multiple pairwise alignments.
  • Protein sequence comparisons can link back to over a billion years ago, whereas DNA sequence comparisons can only go back up to 600 million years ago. Thus, protein sequences are far better for evolutionary studies.
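The first point above, that a DNA change need not change the amino acid, can be illustrated with a small excerpt of the standard genetic code. This is only a sketch: `CODON_TABLE` covers just eight codons chosen for the example, and `is_synonymous` is a hypothetical helper name.

```python
# Excerpt of the standard genetic code (DNA codons -> one-letter amino acids).
# Only the codons needed for this illustration are included.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # Alanine
    "GAT": "D", "GAC": "D",                          # Aspartic acid
    "GAA": "E", "GAG": "E",                          # Glutamic acid
}


def is_synonymous(codon_before, codon_after):
    """True if both codons translate to the same amino acid."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]


print(is_synonymous("GCT", "GCA"))  # True: both encode alanine
print(is_synonymous("GAT", "GAA"))  # False: aspartic acid becomes glutamic acid
```

Both substitutions change only the third base, yet a DNA alignment would report a mismatch in both cases while a protein alignment would report one only in the second.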

However, there are some obvious instances when DNA alignments are needed.

  • When confirming the identity of cDNA (forensic sequencing).
  • When studying noncoding regions of DNA. These regions evolve at a faster rate than coding DNA, while mitochondrial noncoding DNA evolves even faster.
  • When studying DNA mutations.
  • When studying very closely related organisms such as Neanderthals and modern humans.

Biochemistry 101 Review

Before we move on, let's quickly review some elementary biochemistry and notation.

Nucleotide Codes

We're all familiar with the four nucleotide bases. However, additional symbols are used when a nucleotide is ambiguous.

Symbol  Meaning              Explanation
A       A                    Adenine
C       C                    Cytosine
G       G                    Guanine
T       T                    Thymine
R       A or G               puRine
Y       C or T               pYrimidine
M       A or C               aMino
K       G or T               Keto
S       C or G               Strong interaction (3 bonds)
W       A or T               Weak interaction (2 bonds)
H       A, C or T (not G)    H is after G
B       C, G or T (not A)    B is after A
V       A, C or G (not T)    V is after T and U
D       A, G or T (not C)    D is after C
N       A, C, G or T         aNything
CG DNA interaction vs. AT interaction
A C-G base pair is stronger than an A-T base pair because it has three hydrogen bonds instead of two. Source: Wikipedia
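One practical use of the ambiguity codes is searching a sequence for a pattern that contains them. The sketch below turns such a pattern into a regular expression using Python's standard `re` module. The names `IUPAC`, `iupac_to_regex` and `matches` are chosen here for illustration, and the test pattern `GAWTC` is just an example.

```python
import re

# Ambiguity codes from the table above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate a pattern with ambiguity codes into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        # Single bases stay literal; ambiguous symbols become character classes.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def matches(pattern, sequence):
    """True if the whole sequence is consistent with the ambiguous pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None


print(iupac_to_regex("GAWTC"))    # GA[AT]TC
print(matches("GAWTC", "GATTC"))  # True, because W allows A or T
print(matches("GAWTC", "GAGTC"))  # False, because W does not allow G
```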

Amino Acid Residue Codes

Amino acids can be represented with one or three letters. Take some time to review these.

1-letter  3-letter  Amino Acid
A         Ala       Alanine
C         Cys       Cysteine
D         Asp       Aspartic Acid
E         Glu       Glutamic Acid
F         Phe       Phenylalanine
G         Gly       Glycine
H         His       Histidine
I         Ile       Isoleucine
K         Lys       Lysine
L         Leu       Leucine
M         Met       Methionine
N         Asn       Asparagine
O         Pyl       Pyrrolysine
P         Pro       Proline
Q         Gln       Glutamine
R         Arg       Arginine
S         Ser       Serine
T         Thr       Threonine
U         Sec       Selenocysteine
V         Val       Valine
W         Trp       Tryptophan
X         Xaa       Undetermined
Y         Tyr       Tyrosine
Z         Glx       Glutamic acid or glutamine
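If you want to check yourself while learning the codes, a lookup table is enough. The following is a small sketch; `ONE_TO_THREE` and `to_three_letter` are names made up for this example, and `Glx` is used for Z, the standard three-letter code for "glutamic acid or glutamine".

```python
# One-letter to three-letter residue codes, following the table above.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe", "G": "Gly",
    "H": "His", "I": "Ile", "K": "Lys", "L": "Leu", "M": "Met", "N": "Asn",
    "O": "Pyl", "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr",
    "U": "Sec", "V": "Val", "W": "Trp", "X": "Xaa", "Y": "Tyr",
    "Z": "Glx",  # ambiguous: glutamic acid or glutamine
}


def to_three_letter(sequence, sep="-"):
    """Spell out a one-letter protein sequence using three-letter codes."""
    return sep.join(ONE_TO_THREE[residue] for residue in sequence.upper())


print(to_three_letter("MKV"))  # Met-Lys-Val
```

A letter that is not in the table (for example a typo) raises a `KeyError`, which is a reasonable behaviour for a study aid but would need friendlier handling in a real tool.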

Amino Acids license plate game

A good way to memorize these is to play the amino acids license plate game! Keep a printout of the table above. When you and your cool friends are out for a drive, try to translate each license plate letter into an amino acid. It sounds nerdy, but it is a very effective way to learn. Bonus points for knowing the properties and/or structures!

Amino acids grouping

There are several ways to group amino acids, depending on their functionalities and biochemical properties.

Amino acids and their biochemical properties. From Wikipedia.

With nonpolar (hydrophobic) side chains: alanine, valine, leucine, isoleucine, proline, methionine, phenylalanine, tryptophan

With uncharged polar side chains: tyrosine, asparagine, glutamine, glycine, serine, threonine, cysteine

With positively charged side chains: histidine, lysine, arginine

With negatively charged side chains: aspartic acid, glutamic acid
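The four groups above can be written down as a lookup, which gives a rough way to ask whether swapping one residue for another keeps the side-chain character. This is only a sketch of the grouping as listed here (other groupings exist), and `GROUPS`, `GROUP_OF` and `same_group` are hypothetical names; it is not how alignment programs actually score substitutions.

```python
# Side-chain groups as listed above, using one-letter residue codes.
GROUPS = {
    "nonpolar": "AVLIPMFW",
    "uncharged polar": "YNQGSTC",
    "positively charged": "HKR",
    "negatively charged": "DE",
}

# Invert the mapping: residue -> group name.
GROUP_OF = {
    residue: name for name, residues in GROUPS.items() for residue in residues
}


def same_group(residue_a, residue_b):
    """True if both residues fall in the same side-chain group."""
    return GROUP_OF[residue_a.upper()] == GROUP_OF[residue_b.upper()]


print(same_group("D", "E"))  # True: both negatively charged
print(same_group("K", "D"))  # False: positive versus negative charge
```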
