02. FASTA format

FASTA (pronounced "fast-A") is a simple text-based format that bioinformaticians use to represent nucleotide or protein sequences. Because it is plain text, processing tools can parse it easily. A common generic file extension is .fas; .fasta and .fa are also widely used.

The FASTA file format originated with the FASTA sequence alignment software package for DNA and proteins, which grew out of FASTP, a protein similarity search program created in the mid-1980s. The format lets you precede each sequence with a comment line.

Each sequence record has two parts: 1) the identifier line (comments, annotations) and 2) the sequence itself, which may span several lines.

Sample FASTA sequence

Before we dig into a FASTA sequence, let's see what one looks like. Here is an example of a standard FASTA record. Pretty simple, right?

>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP
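
As a minimal sketch, the record above can be split into its two parts in a few lines of Python. The function name split_record is an assumption made for illustration; it is not part of any standard library.

```python
# The sample record from above, stored as a multi-line string.
record = """>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP
"""


def split_record(text):
    """Return (header, sequence) for a single FASTA record (hypothetical helper)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    header = lines[0][1:]            # drop the leading '>'
    sequence = "".join(lines[1:])    # join the wrapped sequence lines
    return header, sequence


header, sequence = split_record(record)
print(header)
print(sequence[:10], "...", len(sequence), "residues")
```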

1) Identifier

The top line holds information about the sequence below it. It begins with a ">" character. Without this informative first line, we just have a raw sequence.

Sequence identifiers

When a FASTA sequence comes from a biological database, the identifier indicates which database it came from. Here is a list of identifier formats for the major databases:

  • GenBank/EMBL/DDBJ: gi|gi_number|*|accession.version|locus (the * is gb, emb, or dbj depending on the database)
  • NCBI RefSeq: ref|accession|locus
  • PIR (Protein Information Resource): pir||entry
  • PRF (Protein Research Foundation): prf||name
  • SWISS-PROT: sp|accession|locus
  • PDB (Protein Data Bank): pdb|entry|chain
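
To illustrate how these pipe-delimited identifiers can be taken apart, here is a small sketch in Python. The helper names parse_header and parse_gi_header are hypothetical, and the second one only handles the five-field gi|...|db|accession|locus layout used by the sample record; other layouts would need their own handling.

```python
def parse_header(header):
    """Split a FASTA header into its ID fields and free-text description."""
    header = header.lstrip(">")
    # The identifier ends at the first space; the rest is the description.
    id_part, _, description = header.partition(" ")
    return id_part.split("|"), description


def parse_gi_header(header):
    """Name the fields of a gi|number|db|accession|locus header (assumed layout)."""
    fields, description = parse_header(header)
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("not a five-field gi-style header")
    _, gi_number, database, accession, locus = fields
    return {
        "gi": gi_number,
        "database": database,
        "accession": accession,
        "locus": locus,
        "description": description,
    }


info = parse_gi_header(
    ">gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor"
)
print(info["database"], info["accession"], info["locus"])
```

Here the database tag is sp, so the record comes from SWISS-PROT.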

2) Sequence

The lines immediately following the identifier hold the raw sequence. For both DNA and proteins, the standard IUB/IUPAC nucleic acid and amino acid codes are used.

Additionally, there are a few more notes to consider:

  • Lower-case letters are mapped to upper-case.
  • Hyphens represent a gap character.
  • In amino acid sequences, U and * are acceptable.
  • It is recommended that each line be shorter than 80 characters.
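
The notes above can be sketched as a small formatter that prepares a raw sequence for writing. The function name format_sequence and the default width of 70 are assumptions made for illustration; any width under 80 satisfies the recommendation.

```python
def format_sequence(raw, width=70):
    """Upper-case a sequence and wrap it into lines of at most `width` characters."""
    # Remove spaces and line breaks, then map lower-case letters to upper-case.
    cleaned = "".join(raw.split()).upper()
    # Hyphens (gap characters) are kept as they are.
    return [cleaned[i:i + width] for i in range(0, len(cleaned), width)]


for line in format_sequence("acgt-acgt\nnn", width=4):
    print(line)
```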

IUB/IUPAC DNA nucleic acid code

Here is a list of the standard IUB/IUPAC nucleic acid codes.

Code  Meaning
A     A
C     C
G     G
T     T
U     U
R     A or G (puRine)
Y     C, T or U (pYrimidines)
K     G, T or U (bases that are Ketones)
M     A or C (bases with aMino groups)
S     C or G (Strong interaction)
W     A, T or U (Weak interaction)
B     not A (B comes after A)
D     not C (D comes after C)
H     not G (H comes after G)
V     neither T nor U (V comes after U)
N     A, C, G, T or U (any Nucleic acid)
X     masked
-     gap of unknown length
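
One way to put the ambiguity codes to work is to expand a pattern written in IUPAC codes into a regular expression. The sketch below assumes the mapping in the table above; the names IUPAC_DNA and iupac_to_regex are hypothetical, and the masked (X) and gap (-) characters are deliberately left out.

```python
import re

# Each code maps to the set of bases it can stand for (per the table above).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGTU",   # not A
    "D": "AGTU",   # not C
    "H": "ACTU",   # not G
    "V": "ACG",    # neither T nor U
    "N": "ACGTU",  # any base
}


def iupac_to_regex(pattern):
    """Compile an IUPAC nucleotide pattern into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]  # raises KeyError for unsupported characters
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


motif = iupac_to_regex("GAYR")
print(bool(motif.fullmatch("GATA")))
print(bool(motif.fullmatch("GAAA")))
```

The first check succeeds because Y accepts T and R accepts A; the second fails because Y does not accept A.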

IUB/IUPAC amino acid residue code

Here is a list of the amino acid codes: 22 amino acids, 4 ambiguity codes (B, J, Z and X) and 2 special characters (* and -).

Code  Meaning
A     Alanine
B     Aspartic acid (D) or Asparagine (N)
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
J     Leucine (L) or Isoleucine (I)
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
O     Pyrrolysine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
U     Selenocysteine
V     Valine
W     Tryptophan
Y     Tyrosine
Z     Glutamic acid (E) or Glutamine (Q)
X     any
*     translation stop
-     gap of unknown length
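
A quick sanity check on a protein sequence is to look for characters that are not in the code table above. This is only a sketch: the names PROTEIN_CODES and invalid_residues are hypothetical, and since every letter of the alphabet has a meaning here, the check mainly catches digits, punctuation and other stray characters.

```python
# All characters listed in the amino acid code table, including * and -.
PROTEIN_CODES = set("ABCDEFGHIJKLMNOPQRSTUVWYZX*-")


def invalid_residues(sequence):
    """Return the sorted characters in `sequence` that are not valid codes."""
    # Lower-case letters are mapped to upper-case before checking.
    return sorted(set(sequence.upper()) - PROTEIN_CODES)


print(invalid_residues("MVLIRVIANLLILQLSNAQK"))
print(invalid_residues("mvl1r?"))
```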

Specific file extensions

The generic FASTA file extension is .fas. For more specific content types, the following extensions are used:

  • .fna (FASTA nucleic acid): nucleic acid sequences.
  • .ffn (FASTA nucleotide coding regions): coding regions for a genome.
  • .faa (FASTA amino acid): amino acid sequences.
  • .frn (FASTA non-coding RNA): non-coding RNA regions for a genome.

Multi-FASTA format

If we concatenate multiple FASTA records in one file, we get the multi-FASTA format. A single file then holds several sequences. This form is often used as input for multiple sequence alignment programs like ClustalW or multialign.
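
Reading a multi-FASTA file comes down to starting a new record at every ">" line. The generator below is a minimal sketch under that assumption; the name read_fasta and the two toy records are made up for illustration, and real-world files may need extra handling (for example, old-style ";" comment lines).

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open multi-FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


# Two toy records standing in for a real file.
example = io.StringIO(">seq1 first toy record\nACGT\nAC\n\n>seq2\nMVL\n")
for name, seq in read_fasta(example):
    print(name, seq)
```

With a file on disk, the same function can be used as read_fasta(open(path)), where path is whatever file you want to read.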

Obtaining FASTA-format

To get a FASTA-formatted sequence from the NCBI GenBank database, click the display option near the top of the record and choose FASTA.

Figure: obtaining FASTA format for the insulin protein from the NCBI protein database. Click Display Settings, then FASTA.
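
Records can also be retrieved without the web page. The sketch below assumes NCBI's E-utilities efetch service with the db, id, rettype and retmode parameters; the helper names are hypothetical, the accession is only an example taken from the sample record, and the network call itself is defined but not executed here. NCBI's usage guidelines (such as request rate limits) apply to real use.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the efetch service (assumed; check NCBI's documentation).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession, db="protein"):
    """Build an efetch URL that asks for a record in FASTA text format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return EFETCH_URL + "?" + query


def fetch_fasta(accession, db="protein"):
    """Download a record as FASTA text (requires network access)."""
    with urlopen(build_fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_fasta(...) to download the record.
print(build_fasta_url("Q9PTU8"))
```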

Converting FASTA sequences

Keep in mind that programs such as READSEQ can convert sequences to and from FASTA format.
