03. FASTQ format

The FASTA format is extremely simple with just two lines per sequence - the first is for the description, the other for the raw sequence. The simplicity is nice when running a quick pairwise alignment, but limiting when we need more information per sequence.

With next-generation sequencing instruments pumping out millions of reads per run, scientists needed a way to check the quality of each base call. To record both the sequence and the probability of each base call being correct, they came up with the FASTQ format. The "Q" stands for quality, as in the quality of the read.

FASTQ files use the file extension .fq or .fastq.

Original development

The FASTQ format was developed by the Wellcome Trust Sanger Institute, and became the de facto standard for high-throughput sequencing instrument outputs.

In addition to storing the biological sequence, FASTQ adds a line for the quality scores. Each score is encoded as a single ASCII character.

Characteristics

Let's take a look at an example FASTQ record, then go through it line by line.

@SEQ_ID
TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT
+
!'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65

1) Sequence identifier and description

The first line begins with an '@' character and contains the sequence identifier with an optional description. This is similar to FASTA's first line, which begins with a '>' character instead.

Illumina sequence identifiers

Here is an example sequence identifier from Illumina, followed by the meaning of each part:

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R: unique instrument name
6: flow cell lane
73: tile number within the flow cell lane
941: x-coordinate of the cluster within the tile
1973: y-coordinate of the cluster within the tile
#0: index number for a multiplexed sample
/1: member of a pair (/1 or /2)
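
As a rough illustration, the fields above can be pulled out of the identifier with a regular expression. This is only a sketch for the older Illumina layout shown here; the class name IlluminaId, the function name parse_illumina_id, and the field names are made up for this example and do not come from any library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IlluminaId:
    """Fields of an old-style Illumina identifier (names are illustrative)."""
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    sample_index: str
    pair_member: int


# Matches identifiers shaped like @INSTRUMENT:LANE:TILE:X:Y#INDEX/PAIR.
_ID_PATTERN = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>\d+)$"
)


def parse_illumina_id(line: str) -> IlluminaId:
    """Split an identifier line into its parts, or raise ValueError."""
    match = _ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not an old-style Illumina identifier: {line!r}")
    return IlluminaId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        sample_index=match.group("index"),
        pair_member=int(match.group("pair")),
    )


if __name__ == "__main__":
    print(parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

Identifiers from other instruments or newer software versions may use a different layout, so a real parser would need to handle more than this one pattern.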

2) Raw sequence letters

The second line contains the raw sequence letters of the read, just like the sequence line in a FASTA file.

3) + separator line

Line three starts with a + character and is optionally followed by the same sequence identifier as line one.

4) Quality scores

The fourth line encodes the quality score for each base call. This line must contain exactly as many characters as the sequence in line 2.
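
The four-line layout described above can be sketched as a small reader. This is a minimal example that assumes every record occupies exactly four lines (no wrapped sequences); FastqRecord and read_fastq are names invented for this illustration, not part of an existing library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines, checking the rules for each line."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError(f"truncated record after {header!r}")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    example = ["@r1\n", "ACGT\n", "+\n", "!!II\n"]
    for record in read_fastq(example):
        print(record.identifier, record.sequence, record.quality)
```

Because read_fastq accepts any iterable of lines, an open file handle can be passed to it directly. For real data, an established parser is usually the safer choice.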

Each score is a single character, ranging from ! (the lowest quality) to ~ (the highest). These characters correspond to ASCII codes 33 to 126.

Figure: an ASCII table, courtesy of Wikipedia.

Subtracting 33 from each ASCII code shifts the values down to the range 0 to 93, but we rarely see a Phred score above 60.
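
The shift from ASCII codes to scores can be sketched in a few lines of Python. This assumes the offset of 33 described above; the function name decode_quality is chosen only for this example.

```python
from typing import List

PHRED_OFFSET = 33  # ASCII code of '!', which encodes a score of 0


def decode_quality(quality: str) -> List[int]:
    """Convert a quality string into a list of Phred scores (0 to 93)."""
    scores = []
    for char in quality:
        code = ord(char)
        if not 33 <= code <= 126:
            raise ValueError(f"character {char!r} is outside ASCII 33-126")
        scores.append(code - PHRED_OFFSET)
    return scores


if __name__ == "__main__":
    print(decode_quality("!5I~"))  # [0, 20, 40, 93]
```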

Quality

To map a quality score to the probability that a base call is incorrect, we use a bit of math.

$$Q_{\text{sanger}} = -10 \log_{10} p$$

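Here p is the probability that the corresponding base call is incorrect, and Q is the Phred quality score, which can range from 0 to 93. For example, Q = 20 corresponds to p = 0.01 (a 1 in 100 chance of error), and Q = 30 corresponds to p = 0.001.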
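
The relationship between Q and p can be checked numerically in both directions. The function names below are made up for this sketch; the second function is simply the formula solved for p.

```python
import math


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(p), where p is the probability of a wrong base call."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def phred_to_error_probability(q: float) -> float:
    """Invert the formula: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q={q}: error probability {p:.4f}, accuracy {1 - p:.2%}")
```

Each step of 10 in Q lowers the error probability by a factor of 10.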

References

For a more complete guide on FASTQ, visit the FASTQ format Wikipedia page.
