
Nucleic Acids Research Pages 150-152  


Vertebrate MitBASE: a specialised database on vertebrate mitochondrial DNA sequences



A. Carone*, S. B. Malladi1, M. Attimonelli and C. Saccone

Dipartimento di Biochimica e Biologia Molecolare, Università degli Studi di Bari, Via E. Orabona 4, 70126 Bari, Italy and 1EBI, Hinxton Hall, Hinxton, Cambridge, UK

Received September 11, 1998; Revised October 15, 1998; Accepted October 27, 1998

ABSTRACT

Vertebrate MitBASE is a specialized database in which all the vertebrate mitochondrial DNA entries from the primary databases are collected, revised and integrated with new information emerging from the literature. Variant sequences are also analyzed, aligned and linked to reference sequences. Data related to the same species and fragment can be viewed over the WWW. The database has a flexible interface and a retrieval system designed to help non-expert users, and it contains information not currently available in the primary databases. Vertebrate MitBASE is now available through the MitBASE home page (http://www.ebi.ac.uk/htbin/Mitbase/mitbase.pl). This work is part of a larger project, MitBASE, which is a network of databases covering the full panorama of knowledge on mitochondrial DNA from protists to human sequences.

INTRODUCTION

Over the last few years the importance of bioinformatics to the scientific community has grown with the increasing number of biosequences available. A few years ago the EMBL (1) and GenBank (2) databases were highly representative of the nucleic acid sequences then available for data retrieval and queries. The enormous amount of data now submitted daily to the primary databases is due both to bench techniques, mainly PCR (Polymerase Chain Reaction), which enable the fast sequencing of DNA fragments, genes and genomes, and to the ease of use of network resources. The growth in database entries during the last decade, together with the parallel increase in redundancies and mistakes, has changed the way databases are produced.

These simple considerations lead to two questions: how can specific data be easily retrieved from this ocean of information, and how can the retrieved information be kept free from errors and truly representative of the data available in the primary databanks? An expert in bioinformatics has an array of appropriate tools for answering these questions, but is the same true for a bench scientist?

A possible solution is the creation of specialized databases, each containing information related to a limited set of data. Following this idea, a large number of specialized databases whose content is restricted to a well-defined subject and field have recently been created, e.g., SWISS-PROT (3) for proteins; GOBASE (4), MmtDB (5) and MITOMAP (6) for mitochondrial and human mitochondrial DNA data; YPD (7) for yeast sequences; and many more.

Here we present a vertebrate mitochondrial DNA database that does not include the human data. This work is part of a more extensive EU-funded project, MitBASE (EU BIO4-CT950160; manuscript published in the Nucleic Acids Research Database Issue 1999), co-ordinated by Prof. C. Saccone. The database is structured in the ORACLE database management system, is resident at the EBI in Hinxton, UK, and is now available on the Web at URL: http://www.ebi.ac.uk/htbin/Mitbase/mitbase.pl

DATA SOURCES

Data are mainly derived from the primary databases, EMBL data library and GenBank. The data from these main sources have been enhanced through the addition of information from the available literature and personal communications.


Table 1. Data source
Number of vertebrate mitochondrial DNA entries available in the EMBL data library at release 54 (1998), subdivided by taxon. The first column represents raw data which have not been proofread for redundancies and errors, while the second shows the corrected data already available in our database. The number of bases corresponding to the entries is also reported.

After optimising queries over the full range of the primary databases with different retrieval systems, SRS (8), ACNUC (9) and Entrez (10), and cross-matching the data retrieved by each, we standardized a set of queries covering the full set of vertebrate sequences subdivided by taxon: Fish, Aves, Reptiles, Amphibians, Mammals and Chordata (not inclusive of the previous five groups). Table 1 summarises the vertebrate data available in MitBASE at present.
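The cross-matching step can be pictured as a comparison of the accession lists returned by each retrieval system for the same taxon query. The following is only a minimal sketch under stated assumptions: the function name `cross_match`, the input layout and the accession strings are hypothetical, and the code does not reproduce the actual queries used to build Vertebrate MitBASE.

```python
def cross_match(results):
    """Compare accession lists retrieved by several systems.

    `results` maps a retrieval-system name to an iterable of accession
    numbers (a hypothetical layout chosen for this sketch).
    Returns (common, missed): `common` is the set of accessions found by
    every system; `missed` maps every other accession to the sorted list
    of systems that did NOT return it.
    """
    sets = {name: set(accessions) for name, accessions in results.items()}
    if not sets:
        return set(), {}
    all_accessions = set().union(*sets.values())
    common = set.intersection(*sets.values())
    missed = {
        acc: sorted(name for name, found in sets.items() if acc not in found)
        for acc in sorted(all_accessions - common)
    }
    return common, missed


if __name__ == "__main__":
    # Made-up accession numbers, for illustration only.
    retrieved = {
        "SRS": ["ACC0001", "ACC0002"],
        "ACNUC": ["ACC0001", "ACC0003"],
        "Entrez": ["ACC0001", "ACC0002", "ACC0003"],
    }
    common, missed = cross_match(retrieved)
    print("found by all systems:", sorted(common))
    for accession, systems in missed.items():
        print(accession, "missing from", ", ".join(systems))
```

Entries missed by one system are the ones worth inspecting by hand, since they point either to a query that needs refining or to an inconsistency between the primary databases.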

DATA STRUCTURE

Data have been organised into two well-defined groups:

(i) citations, cross-references to other databases and nucleotide sequences. This information is already resident in the EMBL data library and only has to be checked, properly revised and, often, integrated or modified;

(ii) value-added information derived from publications or personal communications.

Our purpose has been to integrate the information in the primary databases with details on bench techniques (such as PCR details or sequencing and cloning methods), population localization, the identification and classification of conserved sequence blocks (e.g., TAS, ETAS, CSB) (11), and the computational methods and programs applied to the sequences described in the entries.

To avoid time-consuming procedures such as re-entering existing data from group 1, we have designed a Web interface, developed at the EBI within the MitBASE project and available under password from the MitBASE home page through the option 'Submission of Data'. The interface enables entries to be loaded from the EMBL data library into an 'editable' form. Entries can be selected by entry name or accession number. After the fields of interest are chosen from a pull-down menu, the program outputs to the screen the character strings for those fields in manageable blocks ready for editing. Once the data have been checked and revised, this information is parsed and loaded into the centralized MitBASE database structured under the ORACLE management system.
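To illustrate how an entry might be split into editable field blocks, the sketch below groups the lines of an EMBL-style flat-file entry by their two-letter line code. It relies on the standard flat-file layout, in which each line starts with a two-letter code followed by blanks, 'XX' lines are spacers and '//' closes the entry. The function names and the sample entry are hypothetical, and this is not the code behind the 'Submission of Data' interface.

```python
def parse_entry(text):
    """Group the lines of one EMBL-style entry by two-letter line code."""
    fields = {}
    for line in text.splitlines():
        code = line[:2]
        if code == "//":            # entry terminator
            break
        if code == "XX" or not code.strip():
            continue                # spacer, blank or sequence-data line
        # Field content starts after the code and its padding blanks.
        fields.setdefault(code, []).append(line[5:].rstrip())
    return fields


def select_fields(fields, codes):
    """Return only the requested fields, each as one editable text block."""
    return {code: "\n".join(fields[code]) for code in codes if code in fields}


# A made-up entry, for illustration only.
SAMPLE = "\n".join([
    "ID   EXAMPLE01  standard; DNA; VRT; 12 BP.",
    "XX",
    "AC   DUMMY01;",
    "XX",
    "DE   Hypothetical mitochondrial fragment",
    "OS   Example species",
    "XX",
    "SQ   Sequence 12 BP;",
    "     acgtacgtac gt                                            12",
    "//",
])

if __name__ == "__main__":
    entry = parse_entry(SAMPLE)
    for code, block in select_fields(entry, ["AC", "OS"]).items():
        print(code, "->", block)
```

Keeping each field as a list of lines preserves multi-line fields, so a revised block can be written back in the same order.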

Data related to the second group are stored locally in an MS Access database. It presents a single form with all the necessary fields, linked to a complete set of lookup tables containing frequently repeated character strings to enable fast data entry. Pull-down menus are also available, minimizing syntax errors. This information, periodically released at the EBI, is parsed and loaded into the ORACLE database. Table 2 gives a schematic representation of the data-set structure containing data from both groups 1 and 2.


Table 2. Data-set general structure
For the sake of clarity, fields have been grouped into blocks. The 'value-added' information is concentrated in the computational methods block, individual block, conserved sequence block and bench methods block. The two-letter codes used in the flat file production are also reported, giving an indication of this final output format.

TREATMENT OF VARIANT SEQUENCES

When the content of the selected vertebrate data is explored, it becomes evident that a large number of sequences (Table 3) relate to the same organism and the same gene or fragment. All these entries can be defined as variants of a sequence, where the reference sequence can be a real fragment with a biological meaning or a synthetic sequence created by combining sequences. Figure 1 shows the possible cases of reference sequence creation. The simpler case (Fig. 1a) is the presence of a complete mitochondrial genome that covers all the possible variants. At present we have only 48 complete mitochondrial genomes available on the Web and over 1500 species with partial sequence data. This gives a clear overview of the bias in this research area, which concentrates on a restricted number of species (e.g., Homo sapiens or Rattus norvegicus) and genes (e.g., ribosomal genes and the cytochrome b gene). Therefore most of the sequences cannot be referred to a complete genome but must be referred to fragments containing part of a gene or several genes (Fig. 1b).


Figure 1. The reference sequence. The diagram shows all the possible cases of reference sequence creation: (a) the simpler case where all the available fragments (numbered from 1 to 8) related to a single species can be referred to a complete mitochondrial genome (A); (b) the creation of a consensus sequence among three fragments (A, B and C) able to cover the complete set of sequences (numbered from 1 to 8) related to a single species.


Table 3. Variant versus non-variant sequences
Comparison between variant sequences (see text) and non-variant sequences, subdivided by vertebrate taxon. This sample list refers to a set of data which have been proofread for redundancy and shows the importance of a proper coding of the variant sequences, which represent over half of the total number of vertebrate entries.

The treatment of variant sequences is one of the main goals of this project: users should be able to retrieve the complete set of entries related to variants within a single organism, with all the mutation details clearly described.

The variant sequence data have been retrieved with the program VARCLU (S.B.Malladi, manuscript in preparation), written in C and running under a Windows environment. VARCLU processes a file in Clustal V or W (12) format and outputs a text file listing the point mutations, insertions and deletions relative to a reference sequence included in the alignment. We could therefore codify most of the variants, leaving aside the more complex cases represented by overlapping fragments. We are now producing a new program able to generate a consensus sequence automatically from a set of fragments. This program will work together with VARCLU, allowing the detection of all the possible variants and reference sequences.
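The kind of comparison described above can be sketched as follows. This is not VARCLU, which is written in C and whose algorithm is not described here. It is a minimal Python illustration that assumes the two aligned rows (reference and variant) have already been read from the alignment file as equal-length strings with '-' marking gaps. The function name `call_variants` and the choice to ignore terminal gaps in the variant, treating them as regions the fragment does not cover, are assumptions of this sketch.

```python
def call_variants(ref, var, gap="-"):
    """List differences between an aligned variant and its reference.

    Returns tuples (kind, ref_position, ref_base, var_base), where
    ref_position is the 1-based position of the last reference base
    seen (gaps in the reference do not advance it).
    """
    if len(ref) != len(var):
        raise ValueError("aligned sequences must have equal length")
    covered = [i for i, base in enumerate(var) if base != gap]
    if not covered:
        return []
    first, last = covered[0], covered[-1]

    variants = []
    ref_pos = 0
    for i, (r, v) in enumerate(zip(ref.upper(), var.upper())):
        if r != gap:
            ref_pos += 1
        # Skip columns outside the fragment and columns that agree.
        if i < first or i > last or r == v:
            continue
        if r == gap:
            variants.append(("insertion", ref_pos, gap, v))
        elif v == gap:
            variants.append(("deletion", ref_pos, r, gap))
        else:
            variants.append(("point mutation", ref_pos, r, v))
    return variants


if __name__ == "__main__":
    # Made-up aligned rows, for illustration only.
    for variant in call_variants("ACGT-ACGTA", "--GTTACCT-"):
        print(variant)
```

An insertion is reported at the position of the reference base that precedes it, which keeps every coordinate expressed on the reference sequence.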

DATA-RETRIEVAL

Vertebrate data can be retrieved using the 'Simple Query System' option available from the MitBASE home page. The vertebrate Simple Query System is part of a more comprehensive tool resident in the MitBASE home page. This retrieval system can query the vertebrate database following two pathways: (i) by species or (ii) by gene. Once the vertebrate section and the path are chosen, data can be displayed in two formats: (a) a flat file version or (b) a descriptive format. The first is based on the most common EMBL flat file standard but contains a specific organisation of the identifiers common to all MitBASE outputs, plus several new identifiers created for the requirements of vertebrate flat files. This output can be easily exported for data analyses and will soon be implemented in SRS (8). Clickable cross-referencing is also being set up. When the descriptive output format is chosen, the same information is displayed with each field identified by a full descriptive label. This format has been proposed to encourage the use of databases by non-expert users, who can now enjoy a descriptive and user-friendly output.
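The two query pathways and the two output formats can be illustrated with a small in-memory sketch. Everything in it is hypothetical: the record layout, the two-letter codes, the descriptive labels and the function names are placeholders chosen for the example and are not the identifiers listed in Table 2 or the implementation of the Simple Query System.

```python
# Placeholder mapping from two-letter codes to descriptive labels.
DESCRIPTIVE_LABELS = {
    "ID": "Entry name",
    "OS": "Organism",
    "GN": "Gene",
}


def query(records, species=None, gene=None):
    """Select records by species and/or gene (case-insensitive match)."""
    hits = []
    for record in records:
        if species is not None and record.get("OS", "").lower() != species.lower():
            continue
        if gene is not None and record.get("GN", "").lower() != gene.lower():
            continue
        hits.append(record)
    return hits


def render(record, style="flat"):
    """Render one record as flat-file lines or as descriptive lines."""
    if style not in ("flat", "descriptive"):
        raise ValueError("style must be 'flat' or 'descriptive'")
    lines = []
    for code, value in record.items():
        if style == "flat":
            lines.append(f"{code}   {value}")
        else:
            lines.append(f"{DESCRIPTIVE_LABELS.get(code, code)}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"ID": "REC1", "OS": "Example species A", "GN": "cytb"},
        {"ID": "REC2", "OS": "Example species B", "GN": "cytb"},
        {"ID": "REC3", "OS": "Example species A", "GN": "12S rRNA"},
    ]
    for hit in query(records, gene="cytb"):
        print(render(hit, "descriptive"))
        print()
```

Both formats are produced from the same record, so the descriptive view differs from the flat file only in how each field is labelled.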

ACKNOWLEDGEMENTS

We thank Dr Matteo di Tommaso for informatics support in the design of the database and Dr D. G. Bembo for his suggestions during manuscript preparation. This work has been funded under the EU Biotechnology programme, contract number: BIO4 CT950160.

REFERENCES

1. Stoesser,G., Moseley,M.A., Sleep,J., McGowran,M., Garcia-Pastor,M. and Sterk,P. (1998) Nucleic Acids Res., 26, 8-15.

2. Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J. and Ouellette,B.F. (1998) Nucleic Acids Res., 26, 1-7.

3. Bairoch,A. and Apweiler,R. (1998) Nucleic Acids Res., 26, 38-42.

4. Korab-Laskowska,M., Rioux,P., Brossard,N., Littlejohn,T.G., Gray,M.W., Lang,B.F. and Burger,G. (1998) Nucleic Acids Res., 26, 138-144.

5. Attimonelli,M., Calo,D., De Montalvo,A., Lanave,C., Sasanelli,D., Tommaseo Ponzetta,M. and Saccone,C. (1998) Nucleic Acids Res., 26, 120-125.

6. Kogelnik,A.M., Lott,M.T., Brown,M.D., Navathe,S.B. and Wallace,D.C. (1998) Nucleic Acids Res., 26, 112-115.

7. Hodges,P.E., Payne,W.E. and Garrels,J.I. (1998) Nucleic Acids Res., 26, 68-72.

8. Etzold,T., Ulyanov,A. and Argos,P. (1996) Methods Enzymol., 266, 114-128.

9. Gouy,M., Gautier,C., Attimonelli,M., Lanave,C. and Pesole,G. (1985) Comput. Applic. Biosci., 3, 167-172.

10. Schuler,G.D., Epstein,J.A., Ohkawa,H. and Kans,J.A. (1996) Methods Enzymol., 266, 141-162.

11. Sbisa,E., Tanzariello,F., Reyes,A., Pesole,G. and Saccone,C. (1997) Gene, 205, 125-140.

12. Higgins,D.G. and Sharp,P.M. (1988) Gene, 73, 237-244.


*To whom correspondence should be addressed. Tel: +39 080 548 2180; Fax: +39 080 548 4467; Email: areasc15@area.ba.cnr.it


This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.
