The FASTA format is extremely simple, with just two parts per sequence: a description line, followed by the raw sequence. That simplicity is nice when running a quick pairwise alignment, but limiting when we need more information per sequence.
With next-generation sequencing instruments pumping out millions of reads per run, scientists needed a way to check the quality of each base call. To record both the sequence and the probability of each base call being correct, they came up with the FASTQ format. The "Q" comes from quality, as in the quality of the read.
FASTQ files use the extension .fastq or .fq.
The FASTQ format was developed at the Wellcome Trust Sanger Institute and became the de facto standard for the output of high-throughput sequencing instruments.
In addition to storing the biological sequence, FASTQ adds a line of quality scores. Each score is encoded as a single ASCII character.
Let's take a look at an example FASTQ record, then go through it line by line.
@SEQ_ID
TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT
+
!'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65
The first line begins with an '@' character and contains the sequence identifier with an optional description. This is just like FASTA's first line, except that FASTA uses '>' as the leading character.
Here is an example sequence identifier from an Illumina instrument:
@HWUSI-EAS100R:6:73:941:1973#0/1
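An identifier like this one can be split into its parts with a regular expression. The sketch below assumes the older Illumina naming layout as it is commonly documented (instrument name, flowcell lane, tile number, x and y coordinates of the cluster, then an index number after '#' and the pair member after '/'). Newer Illumina software writes a different layout, so treat the pattern and the hypothetical helper parse_illumina_id as an illustration rather than a general parser.

```python
import re

# Assumed layout: @instrument:lane:tile:x:y#index/pair
ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>\d+)$"
)


def parse_illumina_id(header: str) -> dict:
    """Split an older-style Illumina identifier line into named fields."""
    match = ILLUMINA_ID.match(header.strip())
    if match is None:
        raise ValueError(f"not an older-style Illumina identifier: {header!r}")
    fields = match.groupdict()
    # Convert the purely numeric fields to integers; index stays a string.
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])
    return fields


fields = parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields["instrument"], fields["lane"], fields["tile"])
```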
The second line contains the raw sequence letters, as in a FASTA file. The third line begins with a '+' character and may optionally repeat the sequence identifier.
The fourth line encodes the quality score for each base call. It must be the same length as the sequence in line 2.
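The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes every record occupies exactly four lines (no wrapped sequences), and the names FastqRecord and read_fastq are made up for this example.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # raw base calls
    quality: str     # one ASCII character per base call


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' on line 3: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# A shortened version of the example record, used only for illustration.
example = io.StringIO("@SEQ_ID\nTTCAACTC\n+\n!'>>>>CC\n")
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

The length check mirrors the rule that line 4 must be as long as line 2, so a truncated record fails early instead of producing misaligned scores.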
Quality characters range from !, the lowest quality, to ~, the highest. They correspond to ASCII values 33 to 126.
Subtracting 33 shifts these values into the range 0 to 93, although we rarely see a Phred score above 60.
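The shift from ASCII values to scores is a single subtraction per character. Here is a small sketch, where decode_quality and PHRED_OFFSET are names chosen for this example. The offset is kept as a parameter because some older files use a different one; 33 is the offset described here.

```python
from typing import List

PHRED_OFFSET = 33  # ASCII code of '!', the lowest quality character


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 or score > 93 for score in scores):
        raise ValueError("quality character outside the expected range")
    return scores


print(decode_quality("!5I~"))  # [0, 20, 40, 93]
```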
To map a quality score to the probability that a base call is wrong, we use a bit of math:
Q = -10 × log10(p)
Here p is the probability that the corresponding base call is incorrect, and Q is the Phred quality score, which can range from 0 to 93. For example, Q = 30 corresponds to p = 0.001, or one wrong call in 1,000.
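The relationship between Q and p can be tried out in code. This sketch assumes the standard Phred definition, Q = -10 × log10(p), which rearranges to p = 10^(-Q/10). The function names are made up for this example.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for an error probability p, where 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q={q}: error probability {p:.4f}, accuracy {1 - p:.2%}")
```

Every 10 points of Q lowers the error probability by a factor of ten, so Q = 20 means about 1 wrong call in 100 and Q = 30 about 1 in 1,000.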
For a more complete guide on FASTQ, visit the FASTQ format Wikipedia page.