07. GFF and GTF formats

GFF, or the General Feature Format is used to describe genes and other features of DNA, RNA and protein sequences. It comes with the .gff extension.

What exactly is GFF?

GFF is an extension of a basic file with the name, start and end parameters (NSE). For example, an NSE (Chromosome2,2000,4000) specifies two kilobases found on chromosome 2. GFF allows the annotation of these segments.

name, start and end parameters
Name, start and end parameters (NSE).

GFF allows for users to perform common operations such as intersection, exclusion, union, filtration, sorting, transformation and dereferencing.

What types of software use GFF?

Several types of bioinformatics software use GFF. This includes genome views such as GBrowse, Jalview and IGB.

Different versions

There are several versions of GFF. The ones used today are GFF2, GTF and GFF3.

GFF2 (General Feature Format version 2) was limited in that it could only handle three-level feature hierachies instead of three-level such as gene -> transcript -> exon. Thus the Sequence Ontology and GMOD projects expanded on this with features.

GTF (General Transfer Format) has also been known as GFF Version 2.5 since it improves on verison 2, but not as much as version 3.

Characteristics

GFF consists of one line per feature, each containing 9 columns of data. Each column is separated by a tab, making it a tabs-delimited file.

Optional track lines

Within the file, we can also include optional track definition lines. These go at the beginning of the list of features they are to affect.

Fields

refseq name
Name of chromosome or scaffold. Chromosomes can be given without the 'chr' prefix.
Must be one used within Ensembl.
source
Source of annotation, name of program that generated this feature.
feature
Feature type name.
Gene, variation, similarity
start
Start position, starting at 1.
end
End position, starting at 1.
score
Floating point value.
For scores such as similarity, identity, etc.
strand
'+' for forward and '-' for reverse.
frame
Either 0, 1 or 2.
0 indicates first base of the feature is first base of codon, 1 indicates second base of feature is the first base of a codon, etc.
attribute
Semicolon-separated list of tag-value pairs.
Provides additional information about each feature.

Validator

Validators allow us to ensure that a file is formatted properly. To validate a GFF3 file, go to the GFF3 validator.

References

Ensembl

Wellcome trust sanger institute. GFF: an exchange format for feature description

Take your Linux skills to the next level!

The Linux Command Line

Take your Linux skills to the next level! Try Linux & UNIX

The Linux Command Line takes you from your very first terminal keystrokes to writing full programs in Bash, the most popular Linux shell. Along the way you'll learn the timeless skills handed down by generations of gray-bearded, mouse-shunning gurus: file navigation, environment configuration, command chaining, pattern matching with regular expressions, and more.

$ Check price
39.9539.95Amazon 4.5 logo(274+ reviews)

More Linux & UNIX resources

Learn to be a Pythonista!

Learn Python in One Day

Learn to be a Pythonista! Try Python

Have you always wanted to learn computer programming but are afraid it'll be too difficult for you? Or you're familiar with some programming but are interested in learning Python fast? Then this book is for you. You no longer have to waste your time and money learning Python from lengthy books, expensive online courses or complicated Python tutorials.

$ Check price
11.9711.97Amazon 4 logo(185+ reviews)

More Python resources

Ad