07. GFF and GTF formats

GFF, or the General Feature Format, is used to describe genes and other features of DNA, RNA and protein sequences. GFF files use the .gff extension.

What exactly is GFF?

GFF is an extension of a basic record made up of name, start and end parameters (NSE). For example, the NSE (Chromosome2,2000,4000) specifies a two-kilobase segment of chromosome 2. GFF allows such segments to be annotated.

GFF lets users perform common operations on features, such as intersection, exclusion, union, filtration, sorting, transformation and dereferencing.
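A few of these operations can be sketched on plain NSE records. This is only an illustration: the `NSE` class, the `intersect` helper and the sample segments are made up for this example, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple, Optional


class NSE(NamedTuple):
    """Hypothetical name/start/end record (not part of any GFF library)."""
    name: str   # sequence name, e.g. a chromosome
    start: int  # first base of the segment (assumed 1-based, inclusive)
    end: int    # last base of the segment (assumed inclusive)


def intersect(a: NSE, b: NSE) -> Optional[NSE]:
    """Return the overlapping part of two segments, or None if there is none."""
    if a.name != b.name:
        return None
    start, end = max(a.start, b.start), min(a.end, b.end)
    return NSE(a.name, start, end) if start <= end else None


# Made-up segments, including the example from the text.
segments = [
    NSE("Chromosome2", 3000, 5000),
    NSE("Chromosome1", 100, 900),
    NSE("Chromosome2", 2000, 4000),
]

overlap = intersect(segments[0], segments[2])                # intersection
ordered = sorted(segments)                                   # sort by name, start, end
on_chr2 = [s for s in segments if s.name == "Chromosome2"]   # filtration
```

Real GFF tooling works on full nine-column records rather than bare triples, but the interval logic is the same.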

What types of software use GFF?

Several types of bioinformatics software use GFF, including genome viewers such as GBrowse, Jalview and IGB.

Different versions

There are several versions of GFF. The ones used today are GFF2, GTF and GFF3.

GFF2 (General Feature Format version 2) was limited in that it could only handle two-level feature hierarchies, not three-level ones such as gene -> transcript -> exon. The Sequence Ontology and GMOD projects therefore extended the format, which led to GFF3.

GTF (General Transfer Format) is also known as GFF version 2.5, since it improves on version 2, but not as much as version 3 does.

Characteristics

A GFF file consists of one line per feature, each containing 9 columns of data. The columns are separated by tabs, making it a tab-delimited file.
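As a minimal sketch of what this layout means in practice, a feature line can be split on tabs into nine named values. The key names in `GFF_COLUMNS`, the `parse_gff_line` helper and the sample line are assumptions chosen for this example; `.` is used here as the placeholder for an empty column.

```python
# Key names chosen for this sketch; they mirror the nine GFF columns.
GFF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def parse_gff_line(line: str) -> dict:
    """Split one tab-delimited feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    # start and end are the only columns that are always integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Made-up feature line for illustration only.
example = "\t".join(
    ["Chromosome2", "example_source", "gene", "2000", "4000", ".", "+", ".", "ID=gene0001"]
)
record = parse_gff_line(example)
```

Splitting strictly on tabs matters because values in the last column may themselves contain spaces.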

Optional track lines

Within the file, we can also include optional track definition lines. These go at the beginning of the list of features they are to affect.
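When reading such a file, track definition lines need to be told apart from feature lines. The sketch below assumes that a track line begins with the word `track` followed by a space, and that lines starting with `#` are comments; the `split_track_lines` helper and the sample lines are hypothetical.

```python
def split_track_lines(lines):
    """Separate track definition lines from feature lines.

    Assumes track lines start with 'track ' and comment lines with '#'.
    """
    tracks, features = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("track "):
            tracks.append(line)
        else:
            features.append(line)
    return tracks, features


# Made-up file content for illustration only.
sample = [
    'track name=example_track description="Example features"',
    "Chromosome2\texample_source\tgene\t2000\t4000\t.\t+\t.\tID=gene0001",
    "",
    "# a comment line",
]
tracks, features = split_track_lines(sample)
```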

Fields

refseq name
Name of chromosome or scaffold. Chromosomes can be given without the 'chr' prefix.
The name must be one used within Ensembl.
source
Source of the annotation, typically the name of the program that generated this feature.
feature
Feature type name.
Examples: gene, variation, similarity.
start
Start position of the feature; sequence numbering starts at 1.
end
End position of the feature; sequence numbering starts at 1.
score
Floating point value.
For scores such as similarity, identity, etc.
strand
'+' for forward and '-' for reverse.
frame
Either 0, 1 or 2.
0 indicates that the first base of the feature is the first base of a codon, 1 that the second base of the feature is the first base of a codon, and 2 that the third base of the feature is the first base of a codon.
attribute
Semicolon-separated list of tag-value pairs.
Provides additional information about each feature.
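The attribute column is the one that usually needs further parsing. The sketch below assumes the two common spellings of a tag-value pair: `tag=value` as in GFF3, and `tag "value"` as in GTF. The `parse_attributes` helper is hypothetical, and it does not handle semicolons inside quoted values or repeated tags.

```python
def parse_attributes(column: str) -> dict:
    """Parse a semicolon-separated attribute column into a dict.

    Handles 'tag=value' (GFF3 style) and 'tag "value"' (GTF style) pairs.
    """
    attributes = {}
    for part in column.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces, e.g. after a trailing semicolon
        eq, sp = part.find("="), part.find(" ")
        if eq != -1 and (sp == -1 or eq < sp):
            tag, value = part.split("=", 1)      # GFF3 style
        else:
            tag, _, value = part.partition(" ")  # GTF style
        attributes[tag.strip()] = value.strip().strip('"')
    return attributes


# Made-up attribute strings for illustration only.
gff3_attrs = parse_attributes("ID=gene0001;Name=example gene")
gtf_attrs = parse_attributes('gene_id "G1"; transcript_id "T1";')
```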

Validator

Validators allow us to ensure that a file is formatted properly. To validate a GFF3 file, go to the GFF3 validator.
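To give an idea of what such a check involves, here is a rough sketch of a per-line sanity check based on the field descriptions above. It is not the GFF3 validator's logic: the `check_gff_line` helper is hypothetical, and treating `.` as an allowed placeholder for score, strand and frame is an assumption of this example.

```python
def check_gff_line(line: str) -> list:
    """Return a list of problems found in one feature line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 columns, found {len(fields)}"]

    problems = []
    _, _, _, start, end, score, strand, frame, _ = fields

    try:
        start_pos, end_pos = int(start), int(end)
    except ValueError:
        problems.append("start and end must be integers")
    else:
        if start_pos < 1 or start_pos > end_pos:
            problems.append("start must be at least 1 and not greater than end")

    if score != ".":  # assumption: '.' marks a missing score
        try:
            float(score)
        except ValueError:
            problems.append("score must be a floating point value or '.'")

    if strand not in {"+", "-", "."}:
        problems.append("strand must be '+', '-' or '.'")
    if frame not in {"0", "1", "2", "."}:
        problems.append("frame must be 0, 1, 2 or '.'")
    return problems


# Made-up lines for illustration only.
good = "Chromosome2\texample_source\tgene\t2000\t4000\t.\t+\t.\tID=gene0001"
bad = "Chromosome2\texample_source\tgene\t4000\t2000\thigh\t?\t3\tID=gene0002"
good_problems = check_gff_line(good)
bad_problems = check_gff_line(bad)
```

A dedicated validator checks considerably more than this, so a file that passes this sketch is not guaranteed to be valid.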

References

Ensembl

Wellcome Trust Sanger Institute. GFF: an exchange format for feature description.
