FASTA (pronounced "fast-A") format is a simple text-based format that bioinformaticians use to represent nucleotide or protein sequences. Because it is plain text, processing tools can parse the data easily. A common generic file extension is .fas; .fasta and .fa are also widely used.
The FASTA file format originated with FASTP, a protein sequence alignment program created in the mid-1980s, and with its successor, the FASTA software package, which handles both DNA and protein sequences. The format allows you to precede each sequence with a comment line.
Each record has two parts: 1) a header line holding the identifier (plus any comments or annotations) and 2) the sequence itself, which may be wrapped across several lines.
Before we dig into a FASTA sequence, let's see what one looks like. Here is an example of a standard FASTA format. Pretty simple, right?
>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP
The top line holds information about the sequence below it and always begins with a ">". Without this informative first line, we are left with just a raw sequence.
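To make the structure of the header line concrete, here is a minimal Python sketch that splits it into an identifier and a description. The function name `parse_header` is hypothetical, and the rule that the identifier runs up to the first whitespace is a widely used convention rather than something the format strictly requires.

```python
def parse_header(line):
    """Split a FASTA header line into (identifier, description)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Assumed convention: everything up to the first whitespace is the
    # identifier; the remainder, if any, is a free-text description.
    parts = line[1:].strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


header = ">gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor"
identifier, description = parse_header(header)

# Database-style identifiers like this one are pipe-delimited.
fields = identifier.split("|")
print(fields)
print(description)
```

For the example record above, `fields` holds the five pipe-separated parts of the identifier and `description` holds the human-readable name of the protein.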
When the FASTA sequence comes from a biological database, the identifier indicates which database it came from. Here is a list of major database sequence identifiers:
The lines immediately following the identifier line hold the raw sequence. For both DNA and proteins, the standard IUB/IUPAC nucleic acid and amino acid codes are used.
Additionally, there are a few more notes to consider:
Here is a list of the standard IUB/IUPAC nucleic acid codes.
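As a rough sketch, the IUPAC nucleic acid codes can be stored in a Python dictionary and used to check a sequence for unexpected characters. The names `IUPAC_NUCLEOTIDES` and `invalid_nucleotide_codes` are made up for this example, and treating "-" as an allowed gap character is an assumption you may want to change.

```python
# IUPAC nucleic acid codes mapped to the bases each code can stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "K": "GT",   # keto
    "M": "AC",   # amino
    "S": "CG",   # strong
    "W": "AT",   # weak
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def invalid_nucleotide_codes(sequence, allow_gap=True):
    """Return a sorted list of characters that are not IUPAC nucleotide codes."""
    allowed = set(IUPAC_NUCLEOTIDES)
    if allow_gap:
        allowed.add("-")  # assumption: "-" marks a gap
    # Lower-case letters are accepted by comparing in upper case.
    return sorted(set(sequence.upper()) - allowed)


print(invalid_nucleotide_codes("ACGTNRY-"))
print(invalid_nucleotide_codes("acgtxz"))
```

An empty list means every character in the sequence is a recognized code; anything else is the set of offending characters.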
Here's a list of the 24 amino acid codes and 3 special codes.
The generic form of a FASTA file has the .fas extension. For more specific types, we can use the following:
If we just append multiple sequences in FASTA format, we get multi-FASTA format. This is a single file with several sequences, and is often used for multi-alignment programs like ClustalW or multialign.
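Reading a multi-FASTA file comes down to starting a new record at every ">" line and joining the sequence lines in between. The generator below is a minimal sketch of that idea, not a full parser; `read_fasta` is a hypothetical name, and the two records in the example are invented for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA or multi-FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined; lines before any header are ignored.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Two made-up records standing in for a multi-FASTA file.
example = io.StringIO(
    ">seq1 first hypothetical record\n"
    "ACGTAC\n"
    "GGTA\n"
    ">seq2 second hypothetical record\n"
    "TTGACA\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

To read a real file, pass an open file object instead of the `io.StringIO` stand-in, for example `with open("sequences.fas") as handle:` (the filename here is only a placeholder). For production work, an established library such as Biopython is usually a safer choice than a hand-rolled parser.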
To get a FASTA-formatted sequence from the NCBI GenBank database, simply click the display option near the top of the record and choose FASTA.
Keep in mind that there are programs, such as READSEQ, that can convert sequences to and from FASTA format.