07. GFF and GTF formats

GFF, or the General Feature Format, is used to describe genes and other features of DNA, RNA and protein sequences. Files in this format typically use the .gff extension.

What exactly is GFF?

GFF extends a basic record made up of name, start and end parameters (NSE). For example, the NSE (Chromosome2,2000,4000) specifies a segment of roughly two kilobases on chromosome 2. GFF allows such segments to be annotated.

Figure: name, start and end parameters (NSE).

GFF lets users perform common operations such as intersection, exclusion, union, filtration, sorting, transformation and dereferencing.
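As a rough illustration of a few of these operations, the sketch below models an NSE as a Python named tuple and shows intersection, filtration and sorting. The `NSE` type and the `intersect` helper are names invented for this example and do not come from any GFF library; coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple, Optional


class NSE(NamedTuple):
    """A name/start/end segment; coordinates assumed 1-based and inclusive."""
    name: str
    start: int
    end: int


def intersect(a: NSE, b: NSE) -> Optional[NSE]:
    """Return the overlapping part of two segments, or None if there is none."""
    if a.name != b.name:
        return None
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    if start > end:
        return None
    return NSE(a.name, start, end)


segments = [
    NSE("Chromosome2", 2000, 4000),
    NSE("Chromosome1", 500, 900),
    NSE("Chromosome2", 3500, 5000),
]

# Intersection of the two Chromosome2 segments.
overlap = intersect(segments[0], segments[2])

# Filtration: keep only the segments on Chromosome2.
on_chr2 = [s for s in segments if s.name == "Chromosome2"]

# Sorting: order by name, then by start position.
ordered = sorted(segments, key=lambda s: (s.name, s.start))

print(overlap)
print(ordered[0])
```

The other operations (union, exclusion and so on) could be built on the same tuple in a similar way.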

What types of software use GFF?

Several types of bioinformatics software use GFF. These include genome and sequence viewers such as GBrowse, Jalview and IGB.

Different versions

There are several versions of GFF. The ones used today are GFF2, GTF and GFF3.

GFF2 (General Feature Format version 2) was limited in that it could only handle two-level feature hierarchies, not three-level ones such as gene -> transcript -> exon. The Sequence Ontology and GMOD projects addressed this limitation in GFF3, which supports hierarchies of arbitrary depth.
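To make the three-level hierarchy concrete, the sketch below links features through the `ID` and `Parent` attributes that GFF3 uses for this purpose. The identifiers (`gene1`, `tx1` and so on) are made up, and the records are simplified to dictionaries rather than full feature lines.

```python
from collections import defaultdict

# Simplified features: only type, ID and Parent are kept (identifiers are made up).
features = [
    {"type": "gene", "ID": "gene1", "Parent": None},
    {"type": "mRNA", "ID": "tx1", "Parent": "gene1"},
    {"type": "exon", "ID": "exon1", "Parent": "tx1"},
    {"type": "exon", "ID": "exon2", "Parent": "tx1"},
]

# Map each parent ID to the IDs of its direct children.
children = defaultdict(list)
for feature in features:
    if feature["Parent"] is not None:
        children[feature["Parent"]].append(feature["ID"])


def depth(feature_id: str) -> int:
    """Number of levels in the hierarchy rooted at feature_id."""
    kids = children.get(feature_id, [])
    return 1 + max((depth(kid) for kid in kids), default=0)


print(children["tx1"])
print(depth("gene1"))
```

Here the gene has one transcript, which in turn has two exons, so the hierarchy rooted at the gene is three levels deep.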

GTF (General Transfer Format) is also known as GFF version 2.5, since it improves on version 2 but not as much as version 3 does.

Characteristics

GFF consists of one line per feature, each containing nine columns of data. The columns are separated by tabs, making it a tab-delimited file.
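Because the layout is plain tab-delimited text, a general-purpose reader is enough to split it into columns. The sketch below uses Python's `csv` module on two made-up feature lines (the names `chr2` and `example_source` are placeholders) and checks that every row has nine columns. Quoting is switched off so that quotation marks in the last column are left untouched.

```python
import csv
import io

# Two made-up feature lines; '.' marks an empty value.
text = (
    "chr2\texample_source\tgene\t2000\t4000\t.\t+\t.\tID=gene1\n"
    "chr2\texample_source\texon\t2000\t2500\t.\t+\t.\tParent=gene1\n"
)

reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [row for row in reader if row]  # skip blank lines

for row in rows:
    if len(row) != 9:
        raise ValueError(f"expected 9 columns, found {len(row)}")

print(len(rows))
print(rows[0][2], rows[0][3], rows[0][4])
```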

Optional track lines

Within the file, we can also include optional track definition lines. Each one is placed at the beginning of the list of features it applies to.
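The sketch below shows one way a reader might group feature lines under the track line that precedes them. It assumes that a track line begins with the word `track` followed by `key=value` settings, that lines starting with `#` are comments, and that features appearing before any track line belong to an unnamed default track. The track name and description are made up.

```python
import shlex

text = (
    'track name=genes description="Example genes"\n'
    "chr2\texample_source\tgene\t2000\t4000\t.\t+\t.\tID=gene1\n"
    "chr2\texample_source\texon\t2000\t2500\t.\t+\t.\tParent=gene1\n"
)

tracks = []  # list of (track settings, feature rows) pairs
for line in text.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blank lines and comments
    if line.startswith("track "):
        # shlex keeps quoted values such as description="..." together.
        settings = dict(item.split("=", 1) for item in shlex.split(line)[1:])
        tracks.append((settings, []))
    else:
        if not tracks:
            tracks.append(({}, []))  # features before any track line
        tracks[-1][1].append(line.split("\t"))

settings, rows = tracks[0]
print(settings["name"], len(rows))
```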

Fields

refseq name
Name of the chromosome or scaffold. Chromosome names can be given without the 'chr' prefix. For use with Ensembl, the name must be one that Ensembl uses.
source
Source of the annotation, usually the name of the program that generated the feature.
feature
Feature type name, for example gene, variation or similarity.
start
Start position of the feature, with sequence numbering starting at 1.
end
End position of the feature, with sequence numbering starting at 1.
score
Floating point value, used for scores such as similarity or identity.
strand
'+' for forward and '-' for reverse.
frame
Either 0, 1 or 2. 0 indicates that the first base of the feature is the first base of a codon, 1 that the second base of the feature is the first base of a codon, and 2 that the third base is.
attribute
Semicolon-separated list of tag-value pairs that provide additional information about each feature.
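The nine fields above can be mapped onto a small typed record. The sketch below is one possible way to do this, not a reference parser: the `Feature` class and both helper functions are invented for the example, attributes are assumed to use the GFF3-style `tag=value` form, and a lone `.` is treated as an empty value.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Feature:
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: Optional[str]
    frame: Optional[int]
    attributes: Dict[str, str]


def parse_attributes(text: str) -> Dict[str, str]:
    """Parse semicolon-separated 'tag=value' pairs (GFF3 style, assumed here)."""
    pairs = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item or item == ".":
            continue
        tag, _, value = item.partition("=")
        pairs[tag] = value
    return pairs


def parse_line(line: str) -> Feature:
    """Convert one tab-delimited feature line into a Feature record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    return Feature(
        seqname=seqname,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=None if strand == "." else strand,
        frame=None if frame == "." else int(frame),
        attributes=parse_attributes(attrs),
    )


# A made-up feature line for illustration.
record = parse_line(
    "chr2\texample_source\tCDS\t2000\t4000\t0.95\t+\t0\tID=cds1;Name=example"
)
print(record.feature, record.end - record.start + 1)
print(record.attributes)
```

Because start and end are both 1-based and inclusive, the length of a feature is `end - start + 1`.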

Validator

Validators allow us to ensure that a file is formatted properly. To validate a GFF3 file, go to the GFF3 validator.
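To give a feel for the kind of checks a validator performs, the sketch below tests a single line against a few basic rules: nine columns, integer coordinates with start no greater than end, and recognised strand and frame values. It is a deliberately small, hypothetical `validate_line` helper and covers far less than a real GFF3 validator does.

```python
from typing import List


def validate_line(line: str) -> List[str]:
    """Return a list of problems found in one feature line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 columns, found {len(cols)}"]

    problems = []
    start, end, strand, frame = cols[3], cols[4], cols[6], cols[7]

    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be at least 1 and not greater than end")

    # '.' marks an empty value; GFF3 also allows '?' for an unknown strand.
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"unexpected strand value: {strand!r}")

    if frame not in {"0", "1", "2", "."}:
        problems.append(f"unexpected frame value: {frame!r}")

    return problems


good = "chr2\texample_source\tgene\t2000\t4000\t.\t+\t.\tID=gene1"
bad = "chr2\texample_source\tgene\t4000\t2000\t.\tx\t.\tID=gene1"

print(validate_line(good))
print(validate_line(bad))
```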

References

Ensembl

Wellcome Trust Sanger Institute. GFF: an exchange format for feature description.
