Nucleic Acids Research
Vertebrate MitBASE: a specialised database on vertebrate mitochondrial DNA sequences
Introduction
Data Sources
Data Structure
Treatment Of Variant Sequences
Data-Retrieval
Acknowledgements
References
ABSTRACT
INTRODUCTION
Over the last few years the importance of bioinformatics to the scientific community has grown with the increasing number of biosequences available. A few years ago the EMBL (1) and GenBank (2) databases were highly representative of the nucleic acid sequences then available for data retrieval and queries. The enormous amount of data now submitted daily to the primary databases is due mainly to bench techniques such as PCR (polymerase chain reaction), which enable the fast sequencing of DNA fragments, genes and genomes, and to the ease of use of network resources. The growth in database entries over the last decade, together with the parallel increase in redundancies and errors, has changed the way databases are produced.
These simple considerations lead to a question: how can specific data be easily retrieved from this ocean of information and, moreover, how can the retrieved information be free from errors and truly representative of the data available in the primary databanks? An expert in bioinformatics has an array of appropriate tools for answering these questions, but does a bench scientist?
A possible solution is the creation of specialized databases, each containing information on a limited set of data. Following this idea, a large number of specialized databases whose content is restricted to a well-defined subject and field have recently been created, e.g., SWISS-PROT (3) for proteins; GOBASE (4), MmtDB (5) and MITOMAP (6) for mitochondrial and human mitochondrial DNA data; YPD (7) for yeast sequences; and many more.
Here we present a database of vertebrate mitochondrial DNA that does not include the human data. This work is part of a more extensive EU-funded project, MitBASE (EU BIO4-CT950160; manuscript published in the Nucleic Acids Research Database Issue 1999), co-ordinated by Prof. C. Saccone. The database is structured in the ORACLE database management system, is resident at the EBI in Hinxton, UK, and is now available on the Web at URL: http://www.ebi.ac.uk/htbin/Mitbase/mitbase.pl
DATA SOURCES
Data are mainly derived from the primary databases, EMBL data library and GenBank. The data from these main sources have been enhanced through the addition of information from the available literature and personal communications.
Table 1.
Queries over the full range of the primary databases were first optimised using different retrieval systems, SRS (8), ACNUC (9) and Entrez (10), so that the retrieved data could be cross-matched. We then standardized a set of queries covering all vertebrate sequences, subdivided into taxa: Fish, Aves, Reptiles, Amphibians, Mammals and Chordata (not inclusive of the previous five groups). Table 1 summarises the vertebrate data available in MitBASE at present.
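The cross-matching step can be illustrated with a minimal sketch. It assumes that each retrieval system returns a set of accession numbers for the same query; the function name `cross_match` and the accession numbers are hypothetical and are not taken from MitBASE.

```python
from typing import Dict, Set


def cross_match(results: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Compare the accession sets returned by several retrieval systems."""
    all_hits = set().union(*results.values())      # found by at least one system
    common = set.intersection(*results.values())   # found by every system
    report = {"all": all_hits, "common": common}
    for system, hits in results.items():
        # Entries that other systems retrieved but this one did not.
        report[f"missed_by_{system}"] = all_hits - hits
    return report


# Hypothetical accession numbers, for illustration only.
hits = {
    "SRS": {"ACC0001", "ACC0002", "ACC0003"},
    "ACNUC": {"ACC0001", "ACC0002"},
    "Entrez": {"ACC0001", "ACC0003", "ACC0004"},
}
report = cross_match(hits)
```

Entries listed under a `missed_by_...` key are candidates for refining the corresponding query until all systems agree on the same set.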
DATA STRUCTURE
Data have been organised into two well-defined groups:
(i) citations, cross-references to other databases and nucleotide sequences. This information is already resident in the EMBL data library and only has to be checked, properly revised and, often, integrated or modified;
(ii) value-added information derived from publications or personal communications.
Our purpose has been to integrate the information in the primary databases with details on bench techniques (such as PCR, sequencing and cloning methods), population localization, the identification and classification of conserved sequence blocks (e.g., TAS, ETAS, CSB) (11), and the computational methods and programs applied to the sequences described in the entries.
To avoid time-consuming procedures such as re-entering existing data from group (i), we have designed a Web interface, developed at the EBI within the MitBASE project and available under password from the MitBASE home page through the option 'Submission of Data'. The interface enables entries to be loaded from the EMBL data library into an 'editable' form. Entries can be selected by entry name or accession number. Once the fields of interest have been chosen from a pull-down menu, the program outputs to the screen the strings of characters related to those fields in manageable blocks ready for editing. After the data have been checked and revised, this information is parsed and loaded into the centralized MitBASE database structured under the ORACLE management system.
Data related to group (ii) are stored locally in an MS Access database. It presents a single form with all the necessary fields, linked to a complete set of lookup tables containing repetitive strings of characters to enable fast data entry. Pull-down menus are also available, minimizing syntax errors. This information, periodically released to the EBI, is parsed and loaded into the ORACLE database. Table 2 gives a schematic representation of the data-set structure containing data from both groups (i) and (ii).
Table 2.
TREATMENT OF VARIANT SEQUENCES
When the content of the selected vertebrate data is explored, it becomes evident that there is a large number of sequences (Table 3) related to the same organism and the same gene or fragment. All these entries can be defined as variants of a sequence, where the reference sequence can be a real fragment with a biological meaning or a synthetic sequence created by combining sequences (Figure 1).
Figure 1. The reference sequence. The diagram shows all the possible cases of reference sequence creation: (a) the simpler case where all the available fragments (numbered from 1 to 8) related to a single species can be referred to a complete mitochondrial genome (A); (b) the creation of a consensus sequence among three fragments (A, B and C) able to cover the complete set of sequences (numbered from 1 to 8) related to a single species.
Table 3.
The treatment of variant sequences is one of the main goals of this project: users should be able to retrieve a complete set of entries related to variants within a single organism, with all the mutation details clearly described. The variant sequence data have been extracted with the program VARCLU (S.B.Malladi, manuscript in preparation), written in C and running under a Windows environment. VARCLU processes a file in Clustal V or W (12) format and outputs a text file listing the point mutations, insertions and deletions relative to a reference sequence included in the alignment. We could therefore codify most of the variants, leaving aside the more complex cases represented by overlapping fragments. We are now producing a new program able to generate a consensus sequence automatically from a set of fragments. This program will work together with VARCLU, allowing the detection of all the possible variants and reference sequences.
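The kind of comparison described for VARCLU can be sketched as follows. This is not the VARCLU implementation (which is written in C and is not described in detail here); it is a minimal Python illustration that assumes a simple Clustal-style layout and reports differences of each aligned sequence relative to a chosen reference. The function names, the sequence names and the example alignment are all hypothetical.

```python
from typing import Dict, List, Tuple


def parse_clustal(text: str) -> Dict[str, str]:
    """Minimal Clustal-style parser: joins the blocks of each named sequence."""
    seqs: Dict[str, str] = {}
    for line in text.splitlines():
        # Skip blank lines, the header and conservation lines (leading space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        seqs[parts[0]] = seqs.get(parts[0], "") + parts[1]
    return seqs


def call_variants(reference: str, variant: str) -> List[Tuple[str, int, str, str]]:
    """List differences of an aligned variant relative to an aligned reference.

    Positions are 1-based on the ungapped reference. An insertion is reported
    at the reference position preceding it. Leading and trailing gaps of the
    variant are ignored, since a fragment need not span the whole reference.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must come from the same alignment")
    covered = [i for i, base in enumerate(variant) if base != "-"]
    if not covered:
        return []
    start, end = covered[0], covered[-1]
    events: List[Tuple[str, int, str, str]] = []
    ref_pos = 0
    for i, (r, v) in enumerate(zip(reference.upper(), variant.upper())):
        if r != "-":
            ref_pos += 1
        if r == v or not (start <= i <= end):
            continue
        if r == "-":
            events.append(("insertion", ref_pos, "-", v))
        elif v == "-":
            events.append(("deletion", ref_pos, r, "-"))
        else:
            events.append(("substitution", ref_pos, r, v))
    return events


# Hypothetical alignment, for illustration only.
ALIGNMENT = """CLUSTAL W multiple sequence alignment

ref     ACGTAC-GTA
var1    ACATACGGT-
var2    --GT-C-GTA
"""

aligned = parse_clustal(ALIGNMENT)
variants = {name: call_variants(aligned["ref"], seq)
            for name, seq in aligned.items() if name != "ref"}
```

Ignoring terminal gaps is a design choice of this sketch: without it, every fragment shorter than the reference would be reported as carrying long deletions at both ends.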
DATA-RETRIEVAL
Vertebrate data can be retrieved using the 'Simple Query System' option available from the MitBASE home page. The vertebrate Simple Query System is part of a more comprehensive tool resident in the MitBASE home page. This retrieval system can query the vertebrate database following two pathways: (i) by species or (ii) by gene. Once the vertebrate section and the pathway have been chosen, data can be displayed in two formats: (a) a flat file version or (b) a descriptive format. The flat file version is based on the most common EMBL flat file standard but uses a specific organisation of the identifiers common to all MitBASE outputs, plus several new identifiers created specifically for vertebrate flat files. This output can be easily exported for data analyses and will soon be implemented in SRS (8). Clickable cross-referencing is also being set up. When the descriptive output format is chosen, the same information is displayed with each field identified by a full descriptive string of characters. This format is intended to encourage the use of databases by non-expert users, who can now enjoy a descriptive and user-friendly output.
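The difference between the two output formats can be illustrated with a short sketch. It uses four standard EMBL line codes (ID, AC, OS, DE); the MitBASE-specific identifiers are not listed in this paper, so none are shown, and the entry content and the `render` function are hypothetical.

```python
from typing import Dict

# Standard EMBL line codes mapped to descriptive labels.
# MitBASE-specific identifiers are not known here and are deliberately omitted.
LABELS: Dict[str, str] = {
    "ID": "Identification",
    "AC": "Accession number",
    "OS": "Organism species",
    "DE": "Description",
}


def render(entry: Dict[str, str], descriptive: bool = False) -> str:
    """Render one entry as flat-file lines or as descriptive lines."""
    lines = []
    for code, value in entry.items():
        if descriptive:
            # Full descriptive string instead of the two-letter code.
            lines.append(f"{LABELS.get(code, code)}: {value}")
        else:
            # EMBL-style layout: two-letter code, three spaces, then the data.
            lines.append(f"{code}   {value}")
    return "\n".join(lines)


# Hypothetical entry, for illustration only.
entry = {
    "ID": "EXAMPLE01",
    "AC": "ACC0001",
    "OS": "Bos taurus",
    "DE": "mitochondrial D-loop region (illustrative)",
}
flat = render(entry)
described = render(entry, descriptive=True)
```

Both outputs carry the same information; only the field labels change, which is what makes the descriptive format easier to read for non-expert users.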
ACKNOWLEDGEMENTS
Thanks to Dr Matteo di Tommaso for informatics support in the design of the database and to Dr D. G. Bembo for his suggestions during the preparation of the manuscript. This work has been funded under the EU Biotechnology programme, contract number: BIO4 CT950160.
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.