01. Introduction to sequence file formats

Ever since biologists started curating biological data and storing it on computers, a multitude of file formats has arisen. The very first files contained raw DNA sequence reads in a regular .txt file, but as the range of information broadened, so did the types of files.

Some of the new file formats were made to be compatible with specific software, while others were made to improve efficiency. For example, binary formats are not human-readable, but are much more efficient to deal with when performing string searches or data manipulations. You'll see that some file formats have a corresponding binary version that makes data processing simpler.
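To make the text-versus-binary trade-off concrete, here is a minimal sketch that packs a DNA string into 2 bits per base instead of one byte per character. It is not any real file format: the function names and the bit assignments are assumptions chosen for illustration, and it only handles the four letters A, C, G and T.

```python
# Hypothetical 2-bit encoding, for illustration only.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking reads bases in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Reverse pack(); length is needed because the last byte may be padded."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


seq = "ACGTACGTAC"
packed = pack(seq)
print(len(seq), "characters as text,", len(packed), "bytes when packed")
print(unpack(packed, len(seq)) == seq)
```

The packed bytes are unreadable in a text editor, which is the price paid for the smaller size.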

Purpose of this lesson

In this series, we'll go over the most common sequence file formats. The purpose of this lesson isn't to learn the intricate details of each format, but simply to become familiar with the different file types that are used to store biological data. This will help calm that overwhelming feeling you get when sifting through bioinformatics forums, books and software tools.

We'll mainly go over DNA sequence file types, and save database file formats such as EMBL or SWISS-PROT for another lesson.

What is a file format?

A file format is a way for computers (and humans) to standardize how data is organized. For example, this page was written as an HTML file, which has an .html extension. HTML files contain special tags that tell the browser what each block of text is, and how to display it on the page.

Additionally, computers are able to check a file's format and immediately determine whether the file should be opened in a text editor (for editing), a modern browser (for viewing) or some other software.
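One simple way software makes this decision is by looking at the file extension. The sketch below uses Python's standard mimetypes module to guess a media type from a file name. The file names are made up for illustration, and the helper name guess_format is an assumption, not part of any library.

```python
import mimetypes


def guess_format(file_name: str):
    """Return the media type guessed from the extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type


# Hypothetical file names, used only to show the lookup.
for name in ["index.html", "quagga.png", "notes.txt"]:
    print(name, "->", guess_format(name))
```

Note that this looks only at the name, not at the file's contents, so a misnamed file would be guessed incorrectly.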

File types can also indicate which algorithm to use to view (or open) that file. For example, .gif, .jpg and .png files all store images, but they differ in level of compression, size and resolution.

Quagga images as examples of image file formats: .gif, .jpg, and .png
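Because each image format needs a different decoder, programs often confirm the format from the first few bytes of the file rather than trusting the extension. The following is a minimal sketch that assumes the well-known leading signatures of PNG, GIF and JPEG files. The function name sniff_image_format is hypothetical.

```python
from typing import Optional

# Leading byte signatures of three common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8\xff": "jpg",
}


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return 'png', 'gif' or 'jpg' based on the leading bytes, else None."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


# Fabricated byte strings that only imitate the start of each file type.
print(sniff_image_format(b"GIF89a" + b"\x00" * 10))
print(sniff_image_format(b"ACGTACGT"))
```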

Plain text files

Early on, scientists held sequence information in plain text (.txt) files with descriptive file names. Researchers soon found this limiting when they needed to include annotations and additional information about the sequences.

csv, tsv

More common (yet still primitive) file types include csv and tsv. The former stands for comma-separated values, meaning that there is a comma between adjacent values. The simplicity of this format allows researchers to easily exchange data among computers, a property known as portability.

Opening a .csv file in Excel, and in a regular text editor. These files are portable because of their simple structure.

A tsv (tab-separated values) file is similar, but the values are separated by tabs instead of commas. Many of the biological file types covered in this lesson have tab-separated values.
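As a rough illustration of how similar the two formats are, the sketch below parses the same small table once as csv and once as tsv with Python's standard csv module. The sample rows and the helper name parse_delimited are made up for this example.

```python
import csv
import io

# Made-up sample data: the same table in csv and tsv form.
csv_text = "sample,gene,count\ns1,BRCA1,12\ns2,TP53,7\n"
tsv_text = csv_text.replace(",", "\t")


def parse_delimited(text: str, delimiter: str):
    """Split delimited text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = parse_delimited(csv_text, ",")
tsv_rows = parse_delimited(tsv_text, "\t")

print(csv_rows[1])
print(csv_rows == tsv_rows)
```

Only the delimiter argument changes between the two calls. Every value comes back as a string, so numeric columns still need converting.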

What is a newline (EOL) character?

The newline (aka end of line or EOL) is a special character or sequence of special characters that signifies the end of a line in text. In a normal text editor such as Notepad++, these characters are hidden by default.

What's tricky about the EOL character is that it differs by platform: UNIX-like systems use a single line feed (LF, written \n), while MS Windows uses a carriage return followed by a line feed (CRLF, written \r\n). On the command line, you can convert files between the two conventions with the dos2unix and unix2dos commands.

Viewing end of line character ($) on Vim with the 'set list' command.
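The difference between the two conventions is easiest to see at the byte level. The sketch below imitates only the basic effect of dos2unix and unix2dos on a byte string. It is not how those tools are implemented, and the function names and sample sequences are assumptions made for illustration.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with UNIX line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def unix_to_dos(data: bytes) -> bytes:
    """Replace UNIX line endings (LF) with Windows line endings (CRLF)."""
    # Normalize first so existing CRLF pairs are not turned into CR CR LF.
    return dos_to_unix(data).replace(b"\n", b"\r\n")


windows_text = b"ACGT\r\nTTGA\r\n"
unix_text = b"ACGT\nTTGA\n"

print(dos_to_unix(windows_text) == unix_text)
print(unix_to_dos(unix_text) == windows_text)
```

Working on bytes rather than text avoids Python's own newline translation, which would otherwise hide the difference.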


Another common text-based format that is becoming more and more popular is the markdown format. These files are indicated with a file extension of .md.

The markdown format is a markup language, just like HTML. All it does is mark up the text within the document to indicate which lines are headers, paragraphs, and so on. For example, a line that begins with a # character is a header.
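To show what "marking up" means in practice, here is a toy sketch that turns markdown headers and plain lines into the equivalent HTML tags. It is far from a real markdown parser: it handles only # headers and one-line paragraphs, and the function name toy_markdown_to_html is hypothetical.

```python
import html
import re

# One to six '#' characters, at least one space, then the header text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def toy_markdown_to_html(text: str) -> str:
    """Convert '#'-style headers and plain lines to <hN> and <p> tags."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate blocks
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            out.append(f"<h{level}>{html.escape(match.group(2))}</h{level}>")
        else:
            out.append(f"<p>{html.escape(line)}</p>")
    return "\n".join(out)


print(toy_markdown_to_html("# Title\n\nSome text.\n\n## Usage"))
```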

There are command-line tools that convert markdown to HTML, such as pandoc (for example, pandoc README.md -o README.html).

GitHub view of a markdown file. The left picture shows a raw .md file, while the right picture shows how it renders in a browser.

You'll often see these files as README.md when you download source code from GitHub or another repository. Sites such as GitHub render the markdown so that the page displays with proper formatting.

Great! Now that we've covered basic text-based file formats, let's move into more bioinformatics-specific file types.
