Ever since biologists started curating biological data and storing it on computers, a multitude of file formats has arisen. The very first files contained raw DNA sequence reads in a regular .txt file, but as the range of information broadened, so did the types of files.
Some of the new file formats were made to be compatible with specific software, while others were made to improve efficiency. For example, binary formats are not human-readable, but are much more efficient to deal with when performing string searches or data manipulations. You'll see that some file formats have a corresponding binary version that makes data processing simpler.
In this series, we'll go over the most common sequence file formats. The purpose of this lesson isn't to learn the intricate details of each format, but simply to become familiar with the different file types that are used to store biological data. This will help calm that overwhelming feeling you get when sifting through bioinformatics forums, books and software tools.
We'll mainly go over DNA sequence file types, and save database file formats such as EMBL or SWISS-PROT for another lesson.
A file format is a way for computers (and humans) to standardize how data is organized. For example, this page was written as an HTML file, which has an .html extension. HTML files contain special tags that tell the browser what each block of text is, and how to display it on the page.
Additionally, computers are able to check a file's format and immediately determine whether the file should be opened in a text editor (for editing), a modern browser (for viewing) or some other software.
File types can also indicate which algorithm to use to view (or open) a file. For example, .gif, .jpg and .png files all store images, but they differ in how the image is compressed, and therefore in file size and image quality.
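As a rough illustration of how software can map a file extension to a file type, here is a minimal Python sketch that uses the standard library's mimetypes module. The file names and the describe helper are made up for this example, and real programs often inspect a file's contents as well rather than trusting the extension alone.

```python
import mimetypes


def describe(filename):
    """Guess a file's type from its extension alone (the contents are never read)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or "unknown"


# Hypothetical file names, used only for illustration.
for name in ["index.html", "figure.png", "photo.jpg", "animation.gif"]:
    print(name, "->", describe(name))
```

A program could then use the guessed type to decide which viewer or decoder to hand the file to.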
Early on, scientists held sequence information in plain text (.txt) files with descriptive file names. Researchers soon found this limiting when they needed to include annotations and additional information about the sequences.
More common (yet still primitive) file types include csv and tsv. The former stands for comma-separated values, meaning that the values on each line are separated by commas. The simplicity of this format allows researchers to easily exchange data among computers, a property known as portability.
A tsv (tab-separated values) file is similar, but the values are separated by tabs instead of commas. Many of the biological file types covered in this lesson use tab-separated values.
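To make the difference concrete, the sketch below parses the same small table once as csv and once as tsv with Python's built-in csv module. The sample ids and sequences are invented for illustration, and read_table is a helper name chosen for this example.

```python
import csv
import io

# Invented sample data: the same two-column table in both formats.
csv_text = "sample_id,sequence\ns1,ACGT\ns2,TTGA\n"
tsv_text = "sample_id\tsequence\ns1\tACGT\ns2\tTTGA\n"


def read_table(text, delimiter):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_table(csv_text, ",")
tsv_rows = read_table(tsv_text, "\t")

# Only the separator differs; the parsed rows are identical.
print(csv_rows == tsv_rows)
print(csv_rows)
```

Because only the delimiter changes, the same reading code works for both formats.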
The newline (also known as end of line, or EOL) is a special character, or sequence of special characters, that signifies the end of a line of text. In a text editor such as Notepad++, these characters are hidden by default.
What's tricky about the EOL character is that it differs by platform: UNIX-like systems use a single line feed (LF, written \n), while MS Windows uses a carriage return followed by a line feed (CR LF, written \r\n). On the command line, you can convert files between the two conventions with the dos2unix and unix2dos commands.
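The difference is easiest to see at the byte level: UNIX-like systems end a line with LF (\n), while Windows ends it with CR LF (\r\n). Below is a minimal Python sketch that converts between the two conventions in memory. It is not a replacement for dos2unix or unix2dos, and the function names and sample bytes are made up for this example.

```python
def to_unix(data: bytes) -> bytes:
    """Replace Windows line endings (CR LF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data: bytes) -> bytes:
    """Replace Unix line endings (LF) with Windows line endings (CR LF)."""
    # Normalize first so existing CR LF pairs are not turned into CR CR LF.
    return to_unix(data).replace(b"\n", b"\r\n")


windows_text = b">seq1\r\nACGT\r\n"  # invented two-line sample
unix_text = to_unix(windows_text)

print(unix_text)  # the \r bytes are gone
print(to_windows(unix_text) == windows_text)
```

Working on bytes rather than text avoids Python's own newline translation, which would otherwise hide the difference.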
Another common text-based format that is becoming more and more popular is the markdown format. These files are indicated by the .md extension.
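The markdown format is a markup language, just like HTML. All it does is mark up the text within the document to indicate which lines are headers, paragraphs, and so on. For example, a line that starts with # is a top-level header, and a line that starts with ## is a second-level header.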
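As a rough sketch of what marking up text means, the Python snippet below labels each line of a small markdown sample as a header or a paragraph, based on its leading # characters. It handles only this one rule and is not a real markdown parser; the sample text and the classify function are invented for illustration.

```python
sample = """# File formats
Plain text is portable.
## Markdown
It marks up headers and paragraphs."""


def classify(line):
    """Label a line as a header (with its level) or a paragraph."""
    stripped = line.lstrip("#")
    level = len(line) - len(stripped)
    # In markdown, a header is one to six # characters followed by a space.
    if 1 <= level <= 6 and stripped.startswith(" "):
        return f"h{level}", stripped.strip()
    return "p", line.strip()


for line in sample.splitlines():
    print(classify(line))
```

A converter applies rules like this one to every markdown construct, then writes out the matching HTML tags.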
There are also command-line tools, such as pandoc, that convert markdown to HTML.
You'll often see these files as README.md when you download source code from GitHub or another repository. The markdown format allows the repository page to display the README with proper formatting.
Great! Now that we've covered basic text-based file formats, let's move into more bioinformatics-specific file types.