01. Introduction to sequence file formats

Ever since biologists started curating biological data and storing it on computers, a multitude of file formats has arisen. The very first files contained raw DNA sequence reads in a regular .txt file, but as the range of information broadened, so did the types of files.

Some of the new file formats were made to be compatible with specific software, while others were made to improve efficiency. For example, binary formats are not human-readable, but they are much more efficient to work with when performing string searches or data manipulations. You'll see that some file formats have a corresponding binary version that makes data processing simpler.
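To get a feel for why a binary version can be more compact, here is a minimal sketch that packs DNA bases into 2 bits each. The encoding table and the pack/unpack helpers are assumptions made up for illustration; they are not the layout of any real binary sequence format.

```python
# Toy 2-bit encoding for DNA. This table is an assumption for
# illustration only, not the layout of any real binary format.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpack() can read it back.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


seq = "ACGTACGTAC"
packed = pack(seq)
print(len(seq.encode("ascii")), "bytes as plain text")
print(len(packed), "bytes when packed")
print(unpack(packed, len(seq)) == seq)
```

The packed bytes are no longer readable in a text editor, which is exactly the trade-off described above: less space, but you need a program that knows the encoding to read the data back.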

Purpose of this lesson

In this series, we'll go over the most common sequence file formats. The goal of this lesson isn't to learn the intricate details of each format, but simply to become familiar with the different file types that are used to store biological data. This will help calm that overwhelming feeling you get when sifting through bioinformatics forums, books and software tools.

We'll mainly go over DNA sequence file types, and save database file formats such as EMBL or SWISS-PROT for another lesson.

What is a file format?

A file format is a way for computers (and humans) to standardize how data is organized. For example, this page is stored as an HTML file with an .html extension. HTML files contain special tags that tell the browser what each block of text is, and how to display it on the page.

Additionally, computers are able to check a file's format and immediately determine whether the file should be opened in a text editor (for editing), a modern browser (for viewing) or some other software.

A file type also indicates which algorithm is needed to view (or open) that file. For example, .gif, .jpg and .png files all store images, but they differ in how the image data is compressed, and therefore in file size and image quality.

Some examples of image file formats: .gif, .jpg, and .png
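As a rough illustration of how software can pick a program based on a file's extension, here is a minimal sketch. The OPENERS table and the suggest_opener function are hypothetical names invented for this example; real operating systems keep their own, much larger, association tables.

```python
from pathlib import Path

# Hypothetical extension-to-application table (an assumption for
# illustration; real systems store these associations elsewhere).
OPENERS = {
    ".txt": "text editor",
    ".html": "web browser",
    ".gif": "image viewer",
    ".jpg": "image viewer",
    ".png": "image viewer",
}


def suggest_opener(filename: str) -> str:
    """Return the kind of program suggested by the file extension."""
    extension = Path(filename).suffix.lower()
    return OPENERS.get(extension, "unknown")


for name in ("reads.txt", "index.html", "quagga.PNG", "README"):
    print(name, "->", suggest_opener(name))
```

Note that this only looks at the file name. The extension is a hint about the format, so a renamed file can still fool a lookup like this one.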

Plain text files

Early on, scientists kept sequence information in plain text (.txt) files with descriptive file names. This became limiting once researchers needed to include annotations and additional information about the sequences.

csv, tsv

More structured (yet still simple) file types include csv and tsv. The former stands for comma-separated values, meaning that the values on each line are separated by commas. The simplicity of this format allows researchers to easily exchange data among computers, a property known as portability.

Opening a .csv file in Excel, and in a regular text editor. These files are portable because of their simple structure.

A tsv (tab-separated values) file is similar, but the values are separated by tabs instead of commas. Many of the biological file types covered in this lesson have tab-separated values.
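Here is a small sketch of reading both formats with Python's built-in csv module. The sequence records and the read_delimited helper are made up for illustration. The only thing that changes between csv and tsv is the delimiter.

```python
import csv
import io

# Made-up example records (an assumption for illustration).
csv_text = "id,sequence,length\nseq1,ACGTAC,6\nseq2,GGCTA,5\n"
# The same table with tabs instead of commas.
tsv_text = csv_text.replace(",", "\t")


def read_delimited(text: str, delimiter: str) -> list:
    """Parse delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

for row in csv_rows:
    print(row)

# Both files describe the same table.
print(csv_rows == tsv_rows)
```

Every value comes back as a string, so a column such as length still has to be converted with int() before you do arithmetic on it.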

What is a newline (EOL) character?

The newline (aka end of line or EOL) is a special character or sequence of special characters that signifies the end of a line in text. In a normal text editor such as Notepad++, these characters are hidden by default.

What's tricky about the EOL character is that it differs by platform: UNIX uses a single line feed (LF, written \n), while MS Windows uses a carriage return followed by a line feed (CRLF, written \r\n). On the command line, you can convert files between the two conventions with the dos2unix and unix2dos commands.

Viewing the end-of-line character ($) in Vim with the 'set list' command.
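To see the difference for yourself, here is a minimal sketch that detects and converts line endings in raw bytes. The function names are made up for this example. It only imitates the overall effect of dos2unix and unix2dos, not how those tools are actually implemented.

```python
def detect_eol(data: bytes) -> str:
    """Report the line-ending style found in the raw bytes."""
    # Any \r\n pair is reported as Windows style, even in mixed files.
    if b"\r\n" in data:
        return "CRLF (Windows)"
    if b"\n" in data:
        return "LF (UNIX)"
    return "no newline found"


def to_unix(data: bytes) -> bytes:
    """Replace Windows line endings (\\r\\n) with UNIX ones (\\n)."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data: bytes) -> bytes:
    """Replace UNIX line endings (\\n) with Windows ones (\\r\\n)."""
    # Normalize first so existing \r\n pairs are not doubled.
    return to_unix(data).replace(b"\n", b"\r\n")


windows_file = b">seq1\r\nACGT\r\n"
unix_file = to_unix(windows_file)

print(detect_eol(windows_file))
print(detect_eol(unix_file))
print(to_windows(unix_file) == windows_file)
```

The sketch works on bytes rather than text because Python may translate newlines for you when a file is opened in text mode, which would hide the very characters we want to look at.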


Another common text-based format that is becoming more and more popular is the markdown format. These files are identified by the .md extension.

The markdown format is a markup language, just like HTML. All it does is mark up the text within the document to indicate which lines are headers, paragraphs, and so on. For example, a line that begins with a # symbol is rendered as a top-level header.

There are command-line tools that convert markdown to HTML, such as pandoc.

GitHub view of a markdown file.
The left picture shows a raw .md file, while the right picture shows how it renders in a browser.

You'll often see these files as README.md when you download source code from GitHub or another repository. The markdown format lets the repository page display the file with proper formatting.
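To make the idea of markup concrete, here is a toy sketch that turns a tiny subset of markdown (# headers and plain paragraphs) into HTML. It is not how pandoc or GitHub render markdown; the toy_markdown_to_html function is a made-up name, and it assumes every header and paragraph is separated by a blank line.

```python
import html


def toy_markdown_to_html(text: str) -> str:
    """Convert '#' headers and plain paragraphs to HTML.

    Toy converter: assumes blocks are separated by blank lines
    and ignores every other markdown feature.
    """
    blocks = []
    for block in text.strip().split("\n\n"):
        block = block.strip()
        if not block:
            continue
        stripped = block.lstrip("#")
        level = len(block) - len(stripped)  # number of leading '#'
        if 1 <= level <= 6 and stripped.startswith(" "):
            title = html.escape(stripped.strip())
            blocks.append(f"<h{level}>{title}</h{level}>")
        else:
            # Join wrapped lines into one paragraph.
            paragraph = " ".join(line.strip() for line in block.splitlines())
            blocks.append(f"<p>{html.escape(paragraph)}</p>")
    return "\n".join(blocks)


sample = (
    "# Sequence formats\n\n"
    "This README describes\nthe files in this folder.\n\n"
    "## Usage\n\n"
    "Open the files in a text editor."
)
print(toy_markdown_to_html(sample))
```

The raw sample stays perfectly readable as plain text, which is a big part of markdown's appeal, while the converted output carries the tags a browser needs.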

Great! Now that we've covered basic text-based file formats, let's move into more bioinformatics-specific file types.
