1. Introduction
2. What Makes GOBASE Unique?
3. Data Sources
4. Usage and Search Capabilities
5. Software Implementation
6. History
7. Future Development
8. Journal references

1. Introduction

During the last decade the capacity of computers to retrieve, organise and analyse information has revolutionized the working habits of scientists. Few disciplines have benefitted from this revolution as much as the biological sciences, and in particular the genome research community. Recent advances in the development of computer networks and databases have made it possible for a single user or organization to share information easily with the global community. Today, the amount and diversity of biological information available through these networks is overwhelming, ranging from ultrastructural architecture to genetic make-up and inheritance. New protein and DNA sequences are being deposited at public repositories at an unprecedented rate. In order to manage and utilize the huge amount of sequence and gene mapping data generated by various genome sequencing projects, it is essential to develop specialised databases, capable of establishing relations between the data and the general biology of the organisms from which they are derived. The creation of the organelle genome database GOBASE, which began as a collaborative effort between two Canadian universities (Université de Montréal and Dalhousie University), is a step in this direction.

GOBASE collects biological information from a number of sources world-wide. For mitochondrial and chloroplast genomes it contains nucleotide and related protein sequences, genetic and physical maps, RNA secondary structures, and biochemical and organismal information, among other important biological data (see Fig. 1.1 below).


Fig. 1.1 GOBASE integrates diverse data types.

Why Organelle Genomes?

Organelles (mitochondria and chloroplasts) are of interest for many reasons:

The most important advantage that organelles offer to genomics is the number of completely sequenced genomes already available or currently being sequenced (e.g., by the Canadian Organelle Genome Megasequencing Program (OGMP) and the Fungal Mitochondrial Genome Project (FMGP)). Organelle genomes are information-rich, typically coding for 12-200 proteins, tRNAs and rRNAs. They are ideal for comparative genomic studies because the dataset of organelle genomes is large enough to allow meaningful comparative genome analysis in ways that are only beginning to become possible for smaller nuclear genomes. GOBASE currently contains ~4900 complete organelle genomes. The ability to examine a large number of genomes of the same type permits investigation of genome evolution and identification of complex gene rearrangements and their roles in evolutionary divergence and speciation.

Why an Organelle Genome Database?

The body of currently available organelle data is dispersed among a number of sources that are difficult to access or sometimes even to locate. In addition, cross-referencing biological information is complex and is rarely done. For example, given a GenBank record containing a ribosomal RNA sequence, there is no effective way to retrieve the corresponding secondary structures from another database without carrying out a separate search, and performing cross-genome comparisons is even more difficult. The available data sets are often incomplete or contain errors that are hard to identify and to rectify within the underlying data structure. GOBASE provides a solution to these problems by offering a central, comprehensive, well-validated, relational database, so that researchers can access all the relevant information associated with organelles.

Goals

GOBASE was created to address questions in the following areas of biology:

This was accomplished by integrating data collected from online sources with information generated by the GOBASE team and its collaborators.

The volume of organelle data now available is too large to allow for expert visual investigation of each individual record. We are therefore engaged in the ongoing development of a suite of programs in Perl and SQL to retrieve data from GenBank, standardise gene and product names, and populate GOBASE records automatically so far as that is possible, requiring human intervention only for cases when the appropriate annotations cannot be unambiguously assigned by automated means.
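The division of labour described above, automated assignment where possible and human review otherwise, can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Perl and SQL used by the GOBASE team, and it is not the project's actual code. The synonym table, the `Result` class and the `standardise` function are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical synonym table: lower-cased variant -> standard gene name.
# The entries are illustrative only, not GOBASE's curated list.
SYNONYMS = {
    "cox1": "cox1",
    "coi": "cox1",
    "coxi": "cox1",
    "cob": "cob",
    "cytb": "cob",
}


@dataclass
class Result:
    standardised: dict = field(default_factory=dict)  # raw name -> standard name
    needs_review: list = field(default_factory=list)  # names left for a curator


def standardise(raw_names):
    """Map raw gene names to standard names; queue unknown ones for review."""
    result = Result()
    for raw in raw_names:
        key = raw.strip().lower()
        if key in SYNONYMS:
            result.standardised[raw] = SYNONYMS[key]
        else:
            # No unambiguous automated assignment: leave it to a human expert.
            result.needs_review.append(raw)
    return result


result = standardise(["COI", "cytb", "orf172"])
```

In this sketch "COI" and "cytb" are resolved automatically, while "orf172" has no entry in the table and is left in `needs_review` for a curator.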

2. What Makes GOBASE Unique?

GOBASE is conceptually different from existing project-specific databases such as FlyBase and ACeDB, and from general-purpose archives such as GenBank. It is specifically a comparative organelle genome database integrating data from a wide range of sources. Furthermore, it is unique in its aim to bring together information from both mitochondrial and chloroplast genomes. We strive to excel in areas for which databases such as GenBank are not specifically designed. GOBASE is fundamentally different from GenBank in the following key areas:

Data accuracy

Although each record submitted to NCBI receives a rigorous and systematic check to ensure accuracy, the very high volume of data that needs to be processed means that errors and inconsistencies inevitably remain in the database. The most commonly found errors include incorrectly assigned biological features, gene and product names, and overlapping open reading frames (ORFs). These errors and inconsistencies often cause a large number of false positives or completely irrelevant records to appear in the search results. At GOBASE, we have implemented extended, semi-automated validation procedures to reduce the number of false positives. These procedures are designed to correct:

Records that the automated validation procedures cannot resolve are presented on a correction web page for expert inspection and, where necessary, modification.

The taxonomic information contained within GOBASE represents an exact duplication of NCBI's taxonomic tree. This dataset has certain internal inconsistencies of nomenclature owing to different practices among taxonomists working in different areas of biology - for example, taxonomic ranks between "order" and "family" may be assigned to any or all of "suborder", "infraorder", "parvorder" or "superfamily", and many such taxa do not have a rank associated with them. We maintain the tree structure exactly, but we have made no attempts to assign rankings to any taxon where NCBI have not done so.
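One way to picture a tree in which some taxa carry no rank is sketched below in Python. The `Taxon` class, the `lineage` function and all taxon names are invented placeholders and do not reflect the actual GOBASE or NCBI schema. The point is only that an unranked node stays in the lineage with its rank left empty rather than being given a guessed rank.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    name: str
    rank: Optional[str]  # None where the source tree assigns no rank
    parent: Optional["Taxon"] = None


def lineage(taxon):
    """Return (name, rank) pairs from the root down to the given taxon."""
    path = []
    node = taxon
    while node is not None:
        path.append((node.name, node.rank))
        node = node.parent
    return list(reversed(path))


# Placeholder taxa between "order" and "family"; the names are made up.
order = Taxon("OrderA", "order")
clade = Taxon("CladeB", None, parent=order)  # unranked in the source tree
superfamily = Taxon("SuperfamilyC", "superfamily", parent=clade)
family = Taxon("FamilyD", "family", parent=superfamily)

unranked = [name for name, rank in lineage(family) if rank is None]
```

Here `unranked` contains only "CladeB": the node keeps its position in the tree, and no rank is invented for it.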

Classification scheme

Organelle sequence records are extracted from GenBank and divided into 9 fundamental categories (SEQUENCE, GENE, PROTEIN, RNA, EXON, INTRON, GENE/PRODUCT CLASS, MAP, TAXONOMY). We have not adopted the classification scheme of the International Nucleotide Sequence Database Collaboration (NCBI, EMBL, DDBJ) as it does not rank high-level biological features (such as coding regions) and low-level features (such as repeat elements) distinctly enough for our needs. Our additional specialised classes of features, designed specifically with organelle genomic data in mind, allow much more focused and sophisticated biological queries. This substantially increases the reliability and accuracy of searches and allows users to distinguish closely related records that are unresolvable using more general definitions.
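As a rough illustration of how a fixed set of categories supports focused queries, the nine categories named above can be modelled as an enumeration and used to filter records. This is a sketch in Python under stated assumptions. The record identifiers and the `by_category` helper are hypothetical, and the real database stores these classes in relational tables rather than in memory.

```python
from enum import Enum


class Category(Enum):
    SEQUENCE = "SEQUENCE"
    GENE = "GENE"
    PROTEIN = "PROTEIN"
    RNA = "RNA"
    EXON = "EXON"
    INTRON = "INTRON"
    GENE_PRODUCT_CLASS = "GENE/PRODUCT CLASS"
    MAP = "MAP"
    TAXONOMY = "TAXONOMY"


# Hypothetical records: (identifier, category). Identifiers are placeholders.
records = [
    ("rec1", Category.GENE),
    ("rec2", Category.INTRON),
    ("rec3", Category.GENE),
    ("rec4", Category.RNA),
]


def by_category(items, category):
    """Return the identifiers of all records in the given category."""
    return [ident for ident, cat in items if cat is category]


genes = by_category(records, Category.GENE)
```

Because every record belongs to exactly one category, a query for genes cannot return intron or RNA records by accident.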

Data integration

Organelle research draws substantially on biochemical, physiological, ecological, taxonomic, genetic and other types of information. Although comprehensive data repositories for general biochemical information (such as enzymatic pathways) exist, other biological data pertinent to organelles (e.g. the composition of particular respiratory enzyme complexes, the ultrastructure of ciliate mitochondria) are widely dispersed over various sources. In order to provide a central location for quick and easy access to a large variety of relevant data, the following types of data have been integrated and cross-referenced in GOBASE:

Various types of biological information are bound tightly to the sequences so that they can be requested with ease once a sequence of interest is located in the database. This type of integrated online service for organelle genomic research cannot be found elsewhere.

3. Data Sources

Database content

The GOBASE database currently contains

GOBASE relies on the international federation of databases for much of its data, although the project also generates some of the data in-house. Our data sources are detailed below.

Data type                                Source(s)
RNA: tRNA multiple alignments            Sprinzl's collection at Bayreuth
RNA: tRNA secondary structures           OGMP
RNA: rRNA multiple alignments            GOBASE-RSU¹
RNA: rRNA secondary structures           GOBASE-RSU, OGMP
RNA: intron multiple alignments          OGMP
RNA: intron secondary structures         GOBASE-RSU
DNA: gene sequences                      GenBank, OGMP
DNA: genome sequences                    GenBank, OGMP
Protein: sequences                       GenBank, OGMP
Protein: tertiary structures             PDB
Protein: multiple alignments             OGMP
Phylogenies                              OGMP, PID, GOBASE
Genetic and physical maps                OGMP, GOBASE, books on genetic maps
Organism: morphology                     PID
Biochemistry                             Ecocyc, WIT (formerly PUMA)
Diseases                                 Online Mendelian Inheritance in Man
Citations: Medline                       Entrez
Citations: Current Contents              ISI
Citations: bioinformatics                Seqanalref database
Others: meeting abstracts                Bionet
Others: contacts

¹ Michael W. Gray and Murray N. Schnare, Dalhousie University, Halifax, NS, Canada

4. Usage and Search Capabilities

GOBASE is accessed through query forms. Each form allows the user to query the database by a particular object class (also referred to as entity). The main object classes currently contained in GOBASE are


5. Software Implementation

The GOBASE database is currently implemented using PostgreSQL. For GOBASE release 7, a redesigned user interface has been built in PHP to allow for cleaner and more powerful user access to the underlying data.

Update procedure

The procedure for generating a new release of GOBASE is as follows:

Most of the scripts currently used in the population process are written in Perl, while the expert verification interface is implemented in Java. The database and associated scripts are being redesigned, which we hope will allow more efficient and more frequent GOBASE updates in future.


6. History

Internet access to the first release of the GOBASE database (the prototype) was launched on June 22, 1996, for beta-testing by interested parties from the scientific community. This first release contained mitochondrial data extracted from GenBank records. It also implemented the NCBI taxonomy tree and included a few genetic maps supplied by the OGMP (Organelle Genome Megasequencing Program).

The second release (10 Jun 1997) of GOBASE was a fully functional research tool for gene and genome research. Errors in the original records had been corrected and many design issues had been resolved.

GOBASE release 3 (Jan 1998) implemented use of gene name synonyms, gene order information and links to tRNA databases, and added stored secondary structures for RNase P and 5S ribosomal RNAs.

The fourth release of GOBASE had no major changes in function or appearance.

GOBASE release 5 (Jan 2001) was the first to contain data for chloroplasts, and implemented access to this information in an interface separate from the mitochondrial database.

GOBASE release 6 (Summer 2002) contained information downloaded from GenBank on March 26, 2002, and contained roughly 35% more sequence data than the previous release.

GOBASE release 7 (January 2003) is the first release to be based on PostgreSQL rather than Sybase, which has proved faster and more powerful than the previous implementation, and much easier to maintain. It also includes a redesigned user interface implemented in PHP3 which addresses many of the shortcomings of the previous interface; it is more powerful, more flexible and looks better.

GOBASE release 8 (June 2003) has had many further enhancements to the interface, most notably a restored taxonomic search page which uses a new data architecture to allow for very much faster searches, and a new RNA structure page.

Release 8.1 (October 2003) combined the chloroplast and mitochondrial datasets into a single database accessible through a single interface, to which further enhancements have been made.

GOBASE release 9 (February 2004) combines some additional modifications to the interface, such as the ability to retrieve flanking residues for a sequence and further optimisation of the taxonomy queries, with data updates such as the addition of new RNA structure data, extensive expert correction of older mitochondrial data and the addition of 95,000 mitochondrial sequences.

GOBASE release 10 (summer 2004) includes a restored RNA interface with access to approximately 57,000 tRNA sequences, some from GenBank and some identified locally; the marking of deduced features, such as the introns implied by exons in a GenBank entry but not specifically marked in that entry; the addition of a Gene Distribution page showing the presence of genes in various functional classes across the taxonomic range of mitochondrial genomes contained within GOBASE; and the addition of just under 50,000 new mitochondrial sequences.

GOBASE release 11 (December 2004) contains the first bacterial sequence to be included in GOBASE, a genome of Rickettsia prowazekii. Also new in release 11 are 83,000 chloroplast sequences, bringing the chloroplast part of the database up to date; the addition of information from the Gene Ontology Consortium to the Genes and Products interface, providing a new means by which to sort and order the contents of GOBASE; and the marking of type examples from heavily sampled genes, so that large quantities of data from projects focusing on one gene in one species can be compensated for.

GOBASE release 12 (May 2005) includes a full re-annotation of the Rickettsia prowazekii genome sequence. 62,000 new mitochondrial sequences and 5700 chloroplast sequences have been added to the database, and the interface has been redesigned for internal consistency and clarity. Graphics have been added depicting gene structures and also the position and context of genes on sequences, and the documentation of the database has been updated and reorganised for ease of use.

GOBASE release 13 (September 2005) included nearly 10,000 new bacterial genes derived from genome sequences of Escherichia coli strain K12 and Nostoc sp. PCC 7120, and also 59,000 new mitochondrial sequences and 9500 new chloroplast sequences. Also, an internal intron numbering format was applied, including sub-numbers for sections of the same introns occurring on different sequences, to allow for consistent automatic representation of trans-spliced genes.

GOBASE release 14 (December 2005) included new RNA structures, to a total of 110, now available through the RNA page as well as the RNAStructure page; 49 intron structures available through the introns page; the addition of an initial dataset of RNA substitution editing information with a corresponding interface page; and some other interface enhancements, such as the addition of author search functionality to the sequence page, and Gene Ontology search terms to the Gene page.

GOBASE release 15 (April 2006) included a comprehensive reannotation of the 4000 Escherichia coli genes contained in GOBASE, an expanded RNA editing dataset covering about 4700 positions in 360 genes, and the processing of 70,000 new mitochondrial sequences and 20,000 new chloroplast sequences.

GOBASE release 16 (September 2006) included a comprehensive reannotation of the 5400 Nostoc genes contained in the database, and the addition of 98 new intron structures, accessible through the intron page. 90,000 new mitochondrial sequences and 17,000 new chloroplast sequences were added in this release.

GOBASE release 17 (December 2006) included 66,000 new mitochondrial sequences and 13,000 new chloroplast sequences.

GOBASE release 18 (April 2007) included 80,000 new mitochondrial sequences and 29,000 new chloroplast sequences. New features in this release include an updated database architecture to handle trans-splicing genes, which are now represented accurately in the gene structure diagrams, and the addition of more detailed literature reference information for each sequence including links to PubMed where they exist. Updated Gene Ontology identifiers have been added to all bacterial sequences.

GOBASE release 19 (September 2007) included an updated RNA editing interface and new editing data including insertion/deletion edits, and modified sequence download functionality to allow a choice between genomic and CDS downloads where appropriate. This release also included 68,000 new mitochondrial sequences and 20,000 new chloroplast sequences, taking the total number of sequences in GOBASE over 1,000,000.

GOBASE release 20 (December 2007) included a prototype human-specific sequence search page, allowing users to search complete human mitochondrial DNA sequences by haplotype or disease state, and showing point mutations on each sequence in an alignment with NCBI's reference human mitochondrial sequence. This release also included 55,000 new mitochondrial sequences and 16,000 new chloroplast sequences.

GOBASE release 21 (June 2008) included a prototype human-specific mutation page, allowing users to select a particular gene or region on the human mitochondrial genome sequence and retrieve information about point mutations contained in that gene or region. This release also included 83,000 new mitochondrial sequences.

GOBASE release 22 (November 2008) included an additional set of human mutation data derived from OMIM, including ~120 mutation sites in tRNA genes. This release also includes 142,000 new mitochondrial sequences and 20,000 new chloroplast sequences.

GOBASE release 23 (April 2009) includes selection of single reference whole-genome sequences for each species from which we have complete mitochondrial or chloroplast data. This release also includes 42,000 new mitochondrial sequences and 39,000 new chloroplast sequences.

GOBASE release 24 (October 2009) includes a link to a database of all complete organelle genome sequences in GenBank, updated daily; these sequences are presented exactly as in GenBank, without GOBASE validation procedures, as a complementary resource to the GOBASE database. This release also includes 82,000 new mitochondrial sequences and 52,000 new chloroplast sequences.

7. Future Development

The bacterial data currently included in GOBASE represent the first of many bacterial genomes which we plan to include and functionally annotate, to provide a basis for evolutionary comparisons with the existing organelle datasets.

ORFs are named according to the number of amino acids in the deduced protein sequence. Identical ORF names therefore do not reflect similarity or homology. For example, orf172 of liverwort is not a homolog of orf172 of Cyanidium caldarium, whereas it is a counterpart of orf234 of Prototheca. GOBASE will eventually address this problem by implementing the ymf or ycf nomenclature proposed by the CPGN (Plant Molecular Biology Reporter, Vol. 12, No. 2).
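The naming problem can be made concrete with a short Python sketch. The `orf_name` function and the two protein strings are invented for illustration and are not real organelle sequences. The sketch shows that a length-based name collides for any two proteins of equal length, whatever their sequence.

```python
def orf_name(protein):
    """Name an unidentified ORF after the length of its deduced protein."""
    return f"orf{len(protein)}"


# Two placeholder proteins of equal length but unrelated sequence.
# These are invented strings, not real organelle proteins.
protein_a = "M" + "A" * 171
protein_b = "M" + "K" * 171

name_a = orf_name(protein_a)
name_b = orf_name(protein_b)

same_name = name_a == name_b            # both proteins are 172 residues long
same_sequence = protein_a == protein_b  # the sequences themselves differ
```

Both proteins receive the name orf172 although they share nothing beyond their length. A homology-based scheme such as the ymf or ycf nomenclature avoids this collision by tying the name to the gene family rather than to the protein size.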

We are currently collaborating with scientists at NCBI to establish a database based on the content of GOBASE as an auxiliary to GenBank. This database will focus on the additional data that expert curation at GOBASE has generated, notably the curated gene and product names and synonyms and RNA secondary structure data, thus providing a permanent repository for two decades of curation of organelle genome data.

Current limitations and known bugs

© 2006 Département de Biochimie, Université de Montréal
Comments and questions to: gobase@BCH.UMontreal.CA