2. What Makes GOBASE Unique?
3. Data Sources
4. Usage and Search Capabilities
5. Software Implementation
7. Future Development
8. Journal references
GOBASE collects biological information from a number of sources worldwide and contains nucleotide and related protein sequences, genetic and physical maps, RNA secondary structures, and biochemical and organismal information for mitochondrial and chloroplast genomes, among other important biological data (see Fig. 1.1 below).
Organelles (mitochondria and chloroplasts) are of interest for many reasons:
The most important advantage to genomics offered by organelles is the number of completely sequenced genomes already available and currently being sequenced (e.g., by the Canadian Organelle Genome Megasequencing Program (OGMP) and Fungal Mitochondrial Genome Project (FMGP)). Organelle genomes are information-rich, typically coding for 12-200 proteins, tRNAs and rRNAs. They are ideal for comparative genomic studies because the dataset of organelle genomes is large enough to allow meaningful comparative genome analysis in ways that are only beginning to become possible for smaller nuclear genomes. GOBASE currently contains ~4900 complete organelle genomes. The ability to examine a large number of genomes of the same type permits investigation of genome evolution and identification of complex gene rearrangements and their roles in evolutionary divergence and speciation.
Why an Organelle Genome Database?
The body of currently available organelle data is dispersed among a number of sources and is difficult to access, or sometimes even to locate. In addition, cross-referencing biological information is complex and is rarely done. For example, given a GenBank record containing a ribosomal RNA sequence, there is no effective way to retrieve the corresponding secondary structures from another database without carrying out a separate search, and performing cross-genome comparisons is even more difficult. The available data sets are often incomplete or contain errors that are sometimes hard to identify and to rectify within the underlying data structure. GOBASE provides a solution to these problems by offering a central, comprehensive, well-validated, relational database, so that researchers can access all the relevant information associated with organelles.
GOBASE was created to address questions in the following areas of biology:
The volume of organelle data now available is too large to allow for expert visual investigation of each individual record. We are therefore engaged in the ongoing development of a suite of programs in Perl and SQL to retrieve data from GenBank, standardise gene and product names, and populate GOBASE records automatically so far as that is possible, requiring human intervention only for cases when the appropriate annotations cannot be unambiguously assigned by automated means.
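The name standardisation step described above can be illustrated with a minimal Python sketch. This is not the GOBASE Perl/SQL code: the synonym table, the function name and the "needs review" flag are all assumptions made for illustration. It maps known spelling variants to one standard gene name and leaves anything unrecognised for human attention.

```python
# Illustrative synonym table; these entries are examples only,
# not GOBASE's curated list of names and synonyms.
SYNONYMS = {
    "cox1": {"coi", "co1", "coxi"},
    "cob": {"cytb", "cyt b"},
}

# Invert the table: any known lowercase spelling -> standard name.
LOOKUP = {}
for standard, variants in SYNONYMS.items():
    LOOKUP[standard] = standard
    for variant in variants:
        LOOKUP[variant] = standard


def standardise_gene_name(raw_name):
    """Return (name, needs_review) for a gene name taken from a source record.

    Known names and synonyms are mapped to the standard name. Unknown
    names are returned unchanged and flagged for expert review.
    """
    # Normalise case and collapse runs of whitespace before the lookup.
    key = " ".join(raw_name.lower().split())
    if key in LOOKUP:
        return LOOKUP[key], False
    return raw_name, True
```

In this sketch an ambiguous or unknown name is never guessed at; it is passed through with a flag, which mirrors the idea of requiring human intervention only where automated assignment fails.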
Although each record submitted to NCBI is checked rigorously and systematically for accuracy, the very high volume of data to be processed means that errors and inconsistencies inevitably remain in the database. The most commonly found errors include incorrectly assigned biological features, gene names and product names, and overlapping open reading frames (ORFs). These errors and inconsistencies often cause a large number of false positives or completely irrelevant records to appear in the search results. At GOBASE, we have implemented extended, semi-automated validation procedures to reduce the number of false positives. These procedures are designed to correct:
Records that the automated validation procedures cannot resolve are presented on a correction web page, where an expert inspects them and modifies them as necessary.
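As an illustration of one check of this kind, the sketch below flags pairs of ORFs whose coordinate ranges overlap on the same strand. It is a hypothetical example, not the GOBASE validation code: the `Orf` record, its fields and the 1-based inclusive coordinates are assumptions. Because some overlaps are biologically genuine, the output is only a list of candidates for expert review.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    """Hypothetical minimal ORF record (not a GOBASE schema)."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str = "+"


def overlapping_orf_pairs(orfs):
    """Return (name, name) pairs of ORFs that overlap on the same strand."""
    flagged = []
    # Sort by strand, then by start, so overlaps can only involve
    # neighbours that begin before the current ORF ends.
    ordered = sorted(orfs, key=lambda o: (o.strand, o.start, o.end))
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Stop at the next strand, or once later ORFs start past this one.
            if second.strand != first.strand or second.start > first.end:
                break
            flagged.append((first.name, second.name))
    return flagged
```

For example, with ORFs at 1..300 and 250..500 on the same strand, the pair is reported; an ORF at 100..200 on the opposite strand is not paired with either.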
The taxonomic information contained within GOBASE represents an exact duplication of NCBI's taxonomic tree. This dataset has certain internal inconsistencies of nomenclature owing to different practices among taxonomists working in different areas of biology - for example, taxonomic ranks between "order" and "family" may be assigned to any or all of "suborder", "infraorder", "parvorder" or "superfamily", and many such taxa do not have a rank associated with them. We maintain the tree structure exactly, but we have made no attempts to assign rankings to any taxon where NCBI have not done so.
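A tree that keeps the source structure exactly, while leaving unranked taxa unranked, can be sketched as follows. This is an assumed in-memory model for illustration only, not the GOBASE schema; the class name, field names and example taxa are hypothetical. The key point is that `rank` is optional and is never filled in when the source tree gives none.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    """Hypothetical taxonomy node."""
    name: str
    rank: Optional[str] = None        # None when the source assigns no rank
    parent: Optional["Taxon"] = None  # None for the root of the tree


def lineage(taxon):
    """Return the path from the root to this taxon as (name, rank) pairs."""
    path = []
    node = taxon
    while node is not None:
        path.append((node.name, node.rank))
        node = node.parent
    return list(reversed(path))


# Placeholder taxa: an unranked clade sits between an order and a family.
order = Taxon("ExampleOrder", "order")
clade = Taxon("ExampleClade", parent=order)
family = Taxon("ExampleFamily", "family", clade)
```

Calling `lineage(family)` walks up through the unranked clade without inventing a rank for it, so queries that depend on rank can skip such nodes while queries on tree position still see them.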
Organelle sequence records are extracted from GenBank and divided into 9 fundamental categories (SEQUENCE, GENE, PROTEIN, RNA, EXON, INTRON, GENE/PRODUCT CLASS, MAP, TAXONOMY). We have not adopted the classification scheme of the International Sequence Database Consortium (NCBI, EMBL, DDBJ) as it does not rank high-level biological features (such as coding regions) and low-level features (such as repeat elements) distinctly enough for our needs. Our additional specialised classes of features, designed specifically with organelle genomic data in mind, allow much more focused and sophisticated biological queries. This substantially increases the reliability and accuracy of searches and allows users to distinguish closely related records that are unresolvable using more general definitions.
Organelle research draws substantially on biochemical, physiological, ecological, taxonomic, genetic and other types of information. Although comprehensive data repositories for general biochemical information (such as enzymatic pathways) exist, other biological data pertinent to organelles (e.g. the composition of particular respiratory enzyme complexes, the ultrastructure of ciliate mitochondria) are widely dispersed over various sources. In order to provide a central location for quick and easy access to a large variety of relevant data, the following types of data have been integrated and cross-referenced in GOBASE:
The GOBASE database currently contains
|RNA||tRNA||Multiple Alignments||Sprinzl's collection at Bayreuth|
|Genetic and Physical Maps||OGMP; Books on Genetic Maps|
WIT (formerly PUMA)
|Diseases||Online Mendelian Inheritance in Man|
1) Michael W. Gray, Murray N Schnare, Dalhousie University, Halifax NS Canada
GOBASE is accessed through query forms. Each form allows the user to query the database by a particular object class (also referred to as entity). The main object classes currently contained in GOBASE are
The procedure for generating a new release of GOBASE is as follows:
Most of the scripts currently used in the population process are written in Perl, while the expert verification interface is implemented in Java. The database and associated scripts are being redesigned, which we hope will allow more efficient and more frequent GOBASE updates in future.
The second release (10 Jun 1997) of GOBASE was a fully functional research tool for gene and genome research. Errors in the original records had been corrected and many design issues had been resolved.
GOBASE release 3 (Jan 1998) implemented use of gene name synonyms, gene order information and links to tRNA databases, and added stored secondary structures for RNase P and 5S ribosomal RNAs.
The fourth release of GOBASE had no major changes in function or appearance.
GOBASE release 5 (Jan 2001) was the first to contain data for chloroplasts, and implemented access to this information in an interface separate from the mitochondrial database.
GOBASE release 6 (Summer 2002) contained information downloaded from GenBank on March 26, 2002, and contained roughly 35% more sequence data than the previous release.
GOBASE release 7 (January 2003) is the first release to be based on PostgreSQL rather than Sybase, which has proved faster and more powerful than the previous implementation, and much easier to maintain. It also includes a redesigned user interface implemented in PHP3 which addresses many of the shortcomings of the previous interface; it is more powerful, more flexible and looks better.
GOBASE release 8 (June 2003) has had many further enhancements to the interface, most notably a restored taxonomic search page which uses a new data architecture to allow for very much faster searches, and a new RNA structure page.
Release 8.1 (October 2003) combined the chloroplast and mitochondrial datasets into a single database accessible through a single interface, to which further enhancements have been made.
GOBASE release 9 (February 2004) combines some additional modification to the interface, such as the ability to retrieve flanking residues for a sequence, and further optimisation of the taxonomy queries, with data updates such as the addition of new RNA structure data, much expert correction of older mitochondrial data and the addition of 95,000 mitochondrial sequences.
GOBASE release 10 (summer 2004) includes a restored RNA interface with access to approximately 57,000 tRNA sequences, some from GenBank and some identified locally; the marking of deduced features, such as the introns implied by exons in a GenBank entry but not specifically marked in that entry; the addition of a Gene Distribution page showing the presence of genes in various functional classes across the taxonomic range of mitochondrial genomes contained within GOBASE; and the addition of just under 50,000 new mitochondrial sequences.
GOBASE release 11 (December 2004) contains the first bacterial sequence to be included in GOBASE, a genome of Rickettsia prowazekii. Also new in release 11 are 83,000 chloroplast sequences, bringing the chloroplast part of the database up to date; the addition of information from the Gene Ontology Consortium to the Genes and Products interface, providing a new means by which to sort and order the contents of GOBASE; and the marking of type examples from heavily sampled genes, so that large quantities of data from projects focusing on one gene in one species can be compensated for.
GOBASE release 12 (May 2005) includes a full re-annotation of the Rickettsia prowazekii genome sequence. 62,000 new mitochondrial sequences and 5700 chloroplast sequences have been added to the database, and the interface has been redesigned for internal consistency and clarity. Graphics have been added depicting gene structures and also the position and context of genes on sequences, and the documentation of the database has been updated and reorganised for ease of use.
GOBASE release 13 (September 2005) included nearly 10,000 new bacterial genes derived from genome sequences of Escherichia coli strain K12 and Nostoc sp. PCC 7120, and also 59,000 new mitochondrial sequences and 9500 new chloroplast sequences. Also, an internal intron numbering format was applied, including sub-numbers for sections of the same introns occurring on different sequences, to allow for consistent automatic representation of trans-spliced genes.
GOBASE release 14 (December 2005) included new RNA structures, to a total of 110, now available through the RNA page as well as the RNAStructure page; 49 intron structures available through the introns page; the addition of an initial dataset of RNA substitution editing information with a corresponding interface page; and some other interface enhancements, such as the addition of author search functionality to the sequence page, and Gene Ontology search terms to the Gene page.
GOBASE release 15 (April 2006) included a comprehensive reannotation of the 4000 Escherichia coli genes contained in GOBASE, an expanded RNA editing dataset covering about 4700 positions in 360 genes, and the processing of 70,000 new mitochondrial sequences and 20,000 new chloroplast sequences.
GOBASE release 16 (September 2006) included a comprehensive reannotation of the 5400 Nostoc genes contained in the database, and the addition of 98 new intron structures, accessible through the intron page. 90,000 new mitochondrial sequences and 17,000 new chloroplast sequences were added in this release.
GOBASE release 17 (December 2006) included 66,000 new mitochondrial sequences and 13,000 new chloroplast sequences.
GOBASE release 18 (April 2007) included 80,000 new mitochondrial sequences and 29,000 new chloroplast sequences. New features in this release include an updated database architecture to handle trans-splicing genes, which are now represented accurately in the gene structure diagrams, and the addition of more detailed literature reference information for each sequence including links to PubMed where they exist. Updated Gene Ontology identifiers have been added to all bacterial sequences.
GOBASE release 19 (September 2007) included an updated RNA editing interface and new editing data including insertion/deletion edits, and modified sequence download functionality to allow a choice between genomic and CDS downloads where appropriate. This release also included 68,000 new mitochondrial sequences and 20,000 new chloroplast sequences, taking the total number of sequences in GOBASE over 1,000,000.
GOBASE release 20 (December 2007) included a prototype human-specific sequence search page, allowing users to search complete human mitochondrial DNA sequences by haplotype or disease state, and showing point mutations on each sequence in an alignment with NCBI's reference human mitochondrial sequence. This release also included 55,000 new mitochondrial sequences and 16,000 new chloroplast sequences.
GOBASE release 21 (June 2008) included a prototype human-specific mutation page, allowing users to select a particular gene or region on the human mitochondrial genome sequence and retrieve information about point mutations contained in that gene or region. This release also included 83,000 new mitochondrial sequences.
GOBASE release 22 (November 2008) included an additional set of human mutation data derived from OMIM, including ~120 mutation sites in tRNA genes. This release also included 142,000 new mitochondrial sequences and 20,000 new chloroplast sequences.
GOBASE release 23 (April 2009) includes selection of single reference whole-genome sequences for each species from which we have complete mitochondrial or chloroplast data. This release also includes 42,000 new mitochondrial sequences and 39,000 new chloroplast sequences.
GOBASE release 24 (October 2009) includes a link to a database of all complete organelle genome sequences in GenBank, updated daily; these sequences are presented exactly as in GenBank, without GOBASE validation procedures, as a complementary resource to the GOBASE database. This release also includes 82,000 new mitochondrial sequences and 52,000 new chloroplast sequences.
The bacterial data currently included in GOBASE represent the first of many bacterial genomes which we plan to include and functionally annotate, to provide a basis for evolutionary comparisons with the existing organelle datasets.
ORFs are named according to the number of amino acids in the deduced protein sequence. Identical ORF names therefore do not reflect similarity or homology. For example, orf172 of liverwort is not a homolog of orf172 of Cyanidium caldarium, whereas it is a counterpart of orf234 of Prototheca. GOBASE will eventually address this problem by implementing the ymf or ycf nomenclature proposed by the CPGN (Plant Molecular Biology Reporter, Vol.12, No.2).
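The length-based naming convention, and the reason it says nothing about homology, can be shown in a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, and the choice to ignore a trailing stop symbol ("*") when counting residues is an assumption rather than a documented GOBASE rule.

```python
def orf_name(protein_sequence):
    """Name an unidentified ORF by the length of its deduced protein."""
    # Assumption: a trailing "*" marks the stop codon and is not counted.
    residues = protein_sequence.strip().rstrip("*")
    return f"orf{len(residues)}"


# Two made-up, unrelated sequences that happen to have the same length.
first = "M" + "A" * 171
second = "M" + "K" * 171 + "*"
```

Both `orf_name(first)` and `orf_name(second)` give the same name even though the sequences share almost no residues, which is why identical names cannot be read as evidence of a relationship between the proteins.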
We are currently collaborating with scientists at NCBI to establish a database based on the content of GOBASE as an auxiliary to GenBank. This database will focus on the additional data that expert curation at GOBASE has generated, notably the curated gene and product names and synonyms and RNA secondary structure data, thus providing a permanent repository for two decades of curation of organelle genome data.
© 2006 Département de Biochimie, Université de Montréal