02. FASTA format

FASTA (pronounced "fast-A") is a simple text-based format that bioinformaticians use to represent nucleotide or protein sequences. Because it is plain text, processing tools can parse the data easily. The generic file extension is .fas; .fasta and .fa are also common.

The FASTA file format originated from a DNA and protein sequence alignment software package called FASTP, created in the mid-1980s. The format lets you precede each sequence with a comment line that describes it.

Each sequence record has two parts: 1) the identifier line (comments, annotations) and 2) the sequence itself, which may span several lines.

Sample FASTA sequence

Before we dig into a FASTA sequence, let's see what one looks like. Here is an example of a protein sequence in standard FASTA format. Pretty simple, right?

>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP

1) Identifier

The top line holds information about the sequence below it and always begins with a ">" character. Without this informative first line, we just have a raw sequence.
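As a minimal sketch of how the ">" line separates the identifier from the sequence, the function below splits a single record into its two parts. The name `split_record` and the toy record are hypothetical and only illustrate the idea.

```python
def split_record(text):
    """Split one FASTA record into its header and its sequence."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' line")
    header = lines[0][1:].strip()   # drop the leading '>'
    sequence = "".join(lines[1:])   # join the remaining sequence lines
    return header, sequence


# Hypothetical toy record, not a real database entry.
record = ">seq1 toy example\nACGTAC\nGGTTAA\n"
header, sequence = split_record(record)
print(header)    # seq1 toy example
print(sequence)  # ACGTACGGTTAA
```

Text without a leading ">" line is rejected, since there would be no way to tell which sequence it describes.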

Sequence identifiers

When the FASTA sequence comes from a biological database, the identifier indicates which database it came from. Here is a list of the identifier formats used by the major sequence databases:

GenBank/EMBL/DDBJ
    gi|gi_number|*|accession.version|locus
    The * is gb, emb, or dbj depending on the database.
NCBI RefSeq
    ref|accession|locus
PIR (Protein Information Resource)
    pir||entry
PRF (Protein Research Foundation)
    prf||name
SWISS-PROT
    sp|accession|locus
PDB (Protein Data Bank)
    pdb|entry|chain
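The pipe-delimited layouts above can be taken apart with ordinary string splitting. The following is a rough sketch, assuming an NCBI-style header with an optional leading gi field; `parse_header` and the dictionary keys are hypothetical names chosen for this illustration.

```python
def parse_header(header):
    """Break an NCBI-style FASTA header into its pipe-delimited parts."""
    header = header.lstrip(">")
    # The identifier ends at the first space; the rest is free-text description.
    identifier, _, description = header.partition(" ")
    fields = identifier.split("|")
    info = {"description": description}
    if fields[0] == "gi" and len(fields) >= 2:
        info["gi"] = fields[1]      # numeric GI number
        fields = fields[2:]
    if fields:
        info["db"] = fields[0]      # database tag such as sp, ref, pdb
        info["rest"] = fields[1:]   # accession, locus, and so on
    return info


info = parse_header(
    ">gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor"
)
print(info["gi"])    # 13959657
print(info["db"])    # sp
print(info["rest"])  # ['Q9PTU8', 'VSP3_BOTJA']
```

For the sample header, the sp tag marks a SWISS-PROT entry, followed by its accession and locus.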

2) Sequence

The lines immediately following the identifier hold the raw sequence. For both nucleic acids and proteins, the standard IUB/IUPAC nucleic acid and amino acid codes are used.

Additionally, there are a few more notes to consider:

  • Lower-case letters are mapped to upper-case.
  • Hyphens represent a gap character.
  • In amino acid sequences, U and * are acceptable.
  • It is recommended that each line be shorter than 80 characters.
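The notes above suggest a simple way to write a sequence out: map it to upper case and wrap it into short lines. Here is a minimal sketch; `format_fasta` is a hypothetical helper, and the default width of 70 is an assumption that matches the sample record and stays under the recommended 80 characters.

```python
def format_fasta(header, sequence, width=70):
    """Return one FASTA record with an upper-case, line-wrapped sequence."""
    # Remove any whitespace inside the sequence, then map to upper case.
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# A tiny width is used here only to make the wrapping visible.
print(format_fasta("seq1", "acgt acgtac", width=4), end="")
# >seq1
# ACGT
# ACGT
# AC
```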

IUB/IUPAC DNA nucleic acid code

Here is a list of the standard IUB/IUPAC nucleic acid codes.

A    A
C    C
G    G
T    T
U    U
R    A or G (puRine)
Y    C, T or U (pYrimidines)
K    G, T or U (bases with a Ketone group)
M    A or C (bases with an aMino group)
S    C or G (Strong interaction)
W    A, T or U (Weak interaction)
B    not A (B comes after A)
D    not C (D comes after C)
H    not G (H comes after G)
V    neither T nor U (V comes after U)
N    A, C, G, T or U (any Nucleic acid)
X    masked
-    Gap of unknown length
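One practical use of these ambiguity codes is matching a degenerate motif against a concrete sequence. The sketch below turns each code into a regular-expression character class. It assumes DNA only (U, X and the gap character are left out), and `IUPAC_DNA` and `iupac_to_regex` are hypothetical names.

```python
import re

# Each IUPAC code mapped to the DNA bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC DNA pattern into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]  # raises KeyError for unsupported codes
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = iupac_to_regex("GAYR")
print(regex)                              # GA[CT][AG]
print(bool(re.fullmatch(regex, "GATA")))  # True
print(bool(re.fullmatch(regex, "GAAA")))  # False
```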

IUB/IUPAC amino acid residue code

Here is a list of the amino acid codes: 22 amino acids, 3 ambiguity codes (B, J and Z), and 3 special codes (X, * and -).

A    Alanine
B    Aspartic acid (D) or Asparagine (N)
C    Cysteine
D    Aspartic acid
E    Glutamic acid
F    Phenylalanine
G    Glycine
H    Histidine
I    Isoleucine
J    Leucine (L) or Isoleucine (I)
K    Lysine
L    Leucine
M    Methionine
N    Asparagine
O    Pyrrolysine
P    Proline
Q    Glutamine
R    Arginine
S    Serine
T    Threonine
U    Selenocysteine
V    Valine
W    Tryptophan
Y    Tyrosine
Z    Glutamic acid (E) or Glutamine (Q)
X    Any
*    Translation stop
-    Gap of unknown length
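A quick sanity check on a protein sequence is to look for characters outside this code table. The following is a small sketch; `PROTEIN_CODES` and `invalid_residues` are hypothetical names, and lower-case input is mapped to upper case before checking.

```python
# Every code from the table above, including X, * and the gap character.
PROTEIN_CODES = set("ABCDEFGHIJKLMNOPQRSTUVWYZX*-")


def invalid_residues(sequence):
    """Return the sorted characters that are not valid amino acid codes."""
    return sorted(set(sequence.upper()) - PROTEIN_CODES)


print(invalid_residues("MVLIRVIANL"))  # []
print(invalid_residues("mvl1r?"))      # ['1', '?']
```

Whitespace is also reported as invalid, so line breaks should be removed before calling the function.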

Specific file extensions

A generic FASTA file has the .fas extension. For more specific content, the following extensions are used:

fna
    FASTA nucleic acid
    Specifies nucleic acids.
ffn
    FASTA nucleotide coding regions
    Contains coding regions for a genome.
faa
    FASTA amino acid
    Contains amino acids.
frn
    FASTA non-coding RNA
    Non-coding RNA regions for a genome.
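If a script needs to decide how to treat a file, the extensions above can serve as a lookup table. This is only a sketch: `FASTA_EXTENSIONS` and `describe_fasta_file` are hypothetical names, and an extension is a naming convention, so it does not guarantee what the file actually contains.

```python
from pathlib import Path

# Extension to content type, as listed above.
FASTA_EXTENSIONS = {
    ".fas": "generic FASTA",
    ".fna": "FASTA nucleic acid",
    ".ffn": "FASTA nucleotide coding regions",
    ".faa": "FASTA amino acid",
    ".frn": "FASTA non-coding RNA",
}


def describe_fasta_file(filename):
    """Guess the content of a FASTA file from its extension alone."""
    suffix = Path(filename).suffix.lower()
    return FASTA_EXTENSIONS.get(suffix, "unknown")


print(describe_fasta_file("genome.FAA"))  # FASTA amino acid
print(describe_fasta_file("notes.txt"))   # unknown
```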

Multi-FASTA format

If we simply concatenate multiple sequences in FASTA format, we get the multi-FASTA format: a single file containing several sequences. It is often used as input to multiple sequence alignment programs like ClustalW or multialign.
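Because every record starts with a ">" line, a multi-FASTA file can be read by starting a new record each time such a line appears. Here is a minimal sketch of that idea; `read_fasta` is a hypothetical name, and the two toy records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


# Hypothetical toy input, not real database entries.
text = ">seq1 first\nACGT\nACGT\n\n>seq2 second\nMVLI\n"
records = list(read_fasta(text.splitlines()))
print(records)
# [('seq1 first', 'ACGTACGT'), ('seq2 second', 'MVLI')]
```

The function accepts any iterable of lines, so an open file object can be passed in place of `text.splitlines()`.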

Obtaining FASTA-format

To get a FASTA-formatted sequence from the NCBI GenBank database, click the display option near the top of the record and choose FASTA.

Figure: Obtaining the FASTA format for the insulin protein from the NCBI protein database. Click Display Settings, then FASTA.
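The same record can usually be requested without a browser through NCBI's E-utilities EFetch service. This goes beyond the click-through steps described here, so treat it as a sketch under stated assumptions: the helper names are hypothetical, the accession is the one from the sample header, and NCBI's usage guidelines apply to any real request.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="protein"):
    """Build an EFetch URL that asks for a record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession, db="protein"):
    """Download the record (needs network access; not called below)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


url = efetch_fasta_url("Q9PTU8")
print(url)
```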

Converting FASTA sequences

Keep in mind that programs such as READSEQ can convert sequences to and from FASTA format.
