Nucleic Acids Research
Update of MmtDB: a Metazoa mitochondrial DNA variants database
ABSTRACT
INTRODUCTION
Mitochondria are subcellular organelles under the control of both the nuclear and the mitochondrial genome. The mitochondrion is the only organelle in Metazoa that contains its own DNA (1). The idea of creating a specialised database of Metazoa mtDNA variants (MmtDB) arose from the awareness that a large body of information associated with mtDNA sequences is not stored in the primary databases, which instead contain redundant information (e.g., bibliographic and taxonomic). MmtDB has therefore been designed and implemented as a subset of the primary databases, accurately revised and enriched with specific information on the biological features of each entry, with the aim of providing a new data structure and generating new cross-references between sets of data that were completely unlinked until now.
MmtDB is a collection of variants and not simply a collection of Metazoa mtDNA sequences. For each metazoan species, a variant is a fragment in which nucleotide differences (variations) are detected by comparison with a reference sequence.
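As a rough illustration of this definition, the sketch below lists the positions at which a fragment differs from a reference sequence. It assumes the fragment is already aligned to the reference without gaps and starts at a known 1-based position, so only substitutions are found. The function name `point_variations` and the toy sequences are hypothetical and are not part of MmtDB.

```python
def point_variations(reference, fragment, start):
    """Return (position, reference_base, variant_base) for each mismatch.

    `start` is the 1-based position in the reference where the fragment
    begins. Assumes an ungapped alignment, so only substitutions are found.
    """
    region = reference[start - 1:start - 1 + len(fragment)]
    if len(region) != len(fragment):
        raise ValueError("fragment extends past the end of the reference")
    pairs = zip(region.upper(), fragment.upper())
    return [
        (start + offset, ref_base, var_base)
        for offset, (ref_base, var_base) in enumerate(pairs)
        if ref_base != var_base
    ]


# Toy data, not real mtDNA.
reference = "GATCACAGGTCTATCACC"
fragment = "CAGGTTTATC"  # aligned to reference positions 6-15
events = point_variations(reference, fragment, start=6)
print(events)
```

A fragment identical to the reference yields an empty list, which corresponds to the case in which no variation is detected in the analysed region.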
The database has been updated and several improvements have been implemented: (i) a more detailed codification for subject and pedigree; (ii) a new section of MmtDB, the Aligned Metazoan mitochondrial biosequences (AMmtDB), has been added; (iii) the human data have been implemented in SRS (2).
MmtDB DATA
MmtDB was originally designed as a Metazoan database, and our group started with the management of vertebrate and, in particular, human data. We then realized that a much greater effort was needed to cover all Metazoa, and collaboration was sought. MmtDB is now part of the MitBASE project, a comprehensive and integrated mitochondrial database funded by the EU BIOTECHNOLOGY Programme. MitBASE is developed by a network of six nodes, each collecting and editing data on a different group of organisms (protists, plants, fungi, vertebrates, invertebrates and humans), together with a bioinformatic node (EBI) and a node running a pilot project on nuclear genes related to mitochondria.
The vertebrate mitochondrial genome is a closed circular molecule of between 15 000 and 20 000 bp (3). The peculiar features of this molecule are its high compactness and simplicity: mitochondrial DNA (mtDNA) is dense with information, has no introns, contains only orthologous single-copy genes, and lacks recombination. It is generally composed of 13 protein-coding genes, 22 tRNA genes and two rRNA genes. It has one major non-coding region, called the D-loop in vertebrates, which is involved in regulatory processes. The D-loop is the most variable part of the genome and is therefore used as a marker in human diversity studies (4,5).
The mtDNA has unusual genetics. It is maternally inherited (6) and is polyploid, meaning that it is present in high copy number in both the organelle and the cell. The mtDNA in the same organelle, cell, tissue, organ or subject may therefore not be homogeneous: when a new mutation occurs, it creates a mixed intracellular population of mutant and normal mtDNAs, a condition known as heteroplasmy. When a heteroplasmic cell divides, the mutant and the normal mtDNAs are randomly distributed to the daughter cells (mitotic segregation) (7). At some stages of oogenesis, the number of mtDNA molecules is reduced to a relatively small value (bottleneck hypothesis) (8). Following these stages, over-replication brings the number of molecules in each cell back to its normal high level, and this can lead to a relatively pure population of each genome that pre-existed in the original parental organelle (9).
Moreover, the location of mtDNA, which is attached to the mitochondrial inner membrane and close to the respiratory chain, its lack of protective proteins (e.g., histones) and a very poor DNA repair system (10) make mtDNA particularly prone to mutation.
Because of its reduced size and because mtDNA is used in several fields of applied biology (11), the number of sequences of mitochondrial genes and complete genomes is growing exponentially. In particular, much information on polymorphic regions of mtDNA is now available in the literature. For human mtDNA, several of the sequences managed in MmtDB come from evolutionary studies and cover the hypervariable segments (HVI, HVII) of the D-loop (12-15), and as many others come from pathology studies on alterations of the mtDNA (deletions, insertions and point mutations) (16,17). The human data are coded using as reference the nucleotide sequence published by Anderson et al. in 1981 (18), which, despite being a hybrid (derived in part from placental mtDNA and in part from HeLa cell mtDNA), remains an important reference in human variability studies.



MmtDB DATA SOURCE
The Metazoa mitochondrial sequences are retrieved from the EMBL (19) and GenBank (20) primary databases.
The data are extracted from the primary databases using the GCG (21), ACNUC (22) and SRS packages. Each potential variant is compared with a reference sequence using the GCG program BESTFIT. Published sequence data that are not included in the primary databases are extracted from bibliographic databases (Medline, Current Contents), from Entrez or from other information systems. Conference proceedings and unpublished data kindly provided by the authors are also included.
MmtDB STRUCTURE
The data in MmtDB are organised into two large classes: SPECIES and VARIANTS (Fig. 1). Each of these classes is further organised into subclasses or objects (an example of the subclasses associated with each human variant is shown in Figure 2). Each species has n associated variants, and each variant is an entry in the MmtDB database. The SPECIES class covers the items in the database that can be associated with a biological species of Metazoa for which mtDNA data are available.
To the class SPECIES the following objects are associated: the Reference Sequence, the Gene and Restriction Endonuclease Maps, the Taxonomic Classification and the Bibliography. The reference sequence(s) is represented by the nucleotide sequence of the complete mitochondrial genome, if the genome of that species has been fully sequenced, or otherwise if one or more fragments have been sequenced, by the longest sequence of each fragment. In any case, the reference sequence is also considered as a variant.
The VARIANTS class includes as objects the information items specific to the fragment under consideration, such as: (i) the location of the fragment in the reference sequence (analysed region); (ii) the experimental method used to detect the variant, e.g., Sanger, Maxam and Gilbert, RFLP, Southern or PCR; (iii) the pattern of variation events with respect to the reference sequence, i.e., the nucleotide position in the reference sequence where the variation occurs, the type of variation (point mutation, deletion or insertion), the gene involved and the loss or gain of a restriction site caused by the variation; (iv) bibliographic references; (v) the tissue or cell line from which the DNA was extracted; (vi) population data on the geographic and linguistic origin of the subjects from which the DNA was extracted; (vii) the age and sex of the subjects and their pathological or normal status.
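The items listed above can be pictured as a record with one field per object. The following is only a minimal sketch of such a record: the class names, field names and example values are assumptions made for illustration and do not reproduce the actual MmtDB schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VariationEvent:
    position: int                      # nucleotide position in the reference sequence
    kind: str                          # "point mutation", "deletion" or "insertion"
    gene: Optional[str] = None         # gene involved, if known
    site_change: Optional[str] = None  # restriction site lost or gained, if any


@dataclass
class Variant:
    accession: str                     # hypothetical MmtDB accession number
    species: str
    analysed_region: Tuple[int, int]   # (start, end) in the reference sequence
    method: str                        # e.g. "Sanger", "RFLP", "PCR"
    events: List[VariationEvent] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    tissue: Optional[str] = None
    population: Optional[str] = None
    age: Optional[int] = None
    sex: Optional[str] = None
    status: Optional[str] = None       # pathological or normal


# Made-up values, for illustration only.
example = Variant(
    accession="EX000001",
    species="Homo sapiens",
    analysed_region=(1, 100),
    method="Sanger",
    events=[VariationEvent(position=42, kind="point mutation", gene="D-loop")],
    tissue="blood",
    sex="f",
    status="normal",
)
print(example.events[0].position)
```

Keeping the variation events in a list of their own reflects the fact that one variant entry can carry several events, each with its own position and type.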
Population data are coded according to geographical (Continental groups) and anthropological (Population groups) classifications. A linguistic classification according to Ruhlen (23) (Linguistic phylum, Language group and Language) has also been added. These classifications are often limited by the scarcity of information reported in the original papers. A more precise coding of subject origin, based on geographical coordinates, has been devised and will soon be implemented.
When a variant has been extracted from the primary databases, it is cross-referenced through the accession number and the entry name, which uniquely identify each entry in the primary databases (DR field in Fig. 3). The entries are then internally cross-referenced through their MmtDB accession numbers (AC fields in Fig. 3) in order to link different but correlated entries. The correlation can be based on the Tissue type (T), the Haplotype (A), the Family (F) or the Heteroplasmic status (H).
The family correlation pertains mainly to human data, and in particular to mitochondrial disease studies. The family code defined in MmtDB has been improved by adding the subject pedigree, where known, in order to establish whether the pathology has been maternally inherited. Each studied pedigree is assigned a progressive family code, e.g., AB, followed by a Roman numeral identifying the generation and an Arabic numeral identifying the subject within the generation. The spouses of direct descendants have a different family code, e.g., AC, AD, because they belong to a different family. If no information on spouses is reported in the molecular study, no code is assigned. Starting from the second generation, direct descendants are coded as follows: family code, then in brackets the codes of the ancestors with sex information (m for male and f for female), then the subject code, e.g. AB(m I.1/f I.2).II.1. Whenever the parents' mating is consanguineous, the / symbol in brackets is replaced by the = symbol. In further generations only the parent belonging to the family pedigree is reported, with the sex code, as follows: AB[(m I.1/f I.2):f II.4:m III.2)].IV.5. If the direct ancestor has been studied, the ancestor code is followed by * to mark subjects whose information is present in the database. The same rules apply to any further generation and to any consanguineous mating in the pedigree.
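The coding rules for the first two generations can be sketched as follows. This is an illustrative reading of the rules stated above, not the code used to build MmtDB: the function names are hypothetical, and later generations and the * marker are deliberately left out.

```python
def to_roman(number):
    """Convert a positive integer to a Roman numeral."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result


def subject_code(generation, subject):
    """Generation as a Roman numeral, subject as an Arabic numeral."""
    return f"{to_roman(generation)}.{subject}"


def founder_code(family, subject):
    """Code of a first-generation subject, e.g. AB.I.1."""
    return f"{family}.{subject_code(1, subject)}"


def second_generation_code(family, father, mother, subject, consanguineous=False):
    """Code of a second-generation subject.

    `father` and `mother` are subject numbers in generation I. The parents
    are joined by "/" or, for a consanguineous mating, by "=".
    """
    separator = "=" if consanguineous else "/"
    parents = f"m {subject_code(1, father)}{separator}f {subject_code(1, mother)}"
    return f"{family}({parents}).{subject_code(2, subject)}"


print(founder_code("AB", 1))
print(second_generation_code("AB", 1, 2, 1))
print(second_generation_code("AB", 1, 2, 1, consanguineous=True))
```

The second call reproduces the example AB(m I.1/f I.2).II.1 given in the text.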
For subjects whose pedigree is not reported, the general code AA.I.1 is used. Whenever special notations, such as letters, are used in the paper, they are kept in the family code, e.g. AB.I.A. If information on the generation or on the number of the subject within the generation is missing, a question mark is used in the code.
Figure 3 shows an example of an MmtDB entry in the flatfile format, which is used by most biological databases, such as the EMBL data library, GenBank, SWISS-PROT (24) and many others.
AMmtDB stores carefully aligned Metazoa mitochondrial sequences of complete genes coding for proteins and tRNAs, which can be retrieved by gene name and species name. The sequences have been aligned using different programs, CLUSTAL (25) and PILEUP from the GCG package (26), and the alignments have then been optimized manually. Figure 4 shows the number of aligned sequences stored in AMmtDB to date. Alignments of both nucleotide and amino acid sequences can be viewed.


MmtDB DATA DISTRIBUTION AND DATABASE ACCESS
A World Wide Web site has been developed to allow easy access to the information in the MmtDB at the following address: http://WWW.ba.cnr.it/~areamt08/MmtDBWWW.htm . The MmtDB home page is shown in Figure 5.
MmtDBWWW is a query system that provides a point-and-click interface for selecting lists of entries, on which the following tasks can be performed:
- flatfile generation for each variant entry (Fig. 3)
- nucleotide sequence extraction of variant sequences based on a reference sequence with nucleotide variations in capital letters
- analysis of variation events (Fig. 6).
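The second task in the list above, writing a variant sequence with its variations in capital letters, can be sketched as follows. The sketch assumes that only point substitutions are present and that they are given as a mapping from 1-based reference positions to variant bases; the function name `render_variant` and the toy sequence are hypothetical.

```python
def render_variant(reference, substitutions, start=1, end=None):
    """Return the variant sequence in lower case, substituted bases in capitals.

    `substitutions` maps 1-based reference positions to the variant base.
    Only point substitutions are handled in this sketch.
    """
    end = len(reference) if end is None else end
    bases = []
    for position in range(start, end + 1):
        if position in substitutions:
            bases.append(substitutions[position].upper())
        else:
            bases.append(reference[position - 1].lower())
    return "".join(bases)


reference = "GATCACAGGTCTATCACC"  # toy sequence, not real mtDNA
rendered = render_variant(reference, {11: "t"}, start=6, end=15)
print(rendered)
```

Deletions and insertions would need an explicit representation of gaps, which is not attempted here.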
The selection can be performed using a Boolean combination of up to four of the following criteria: gene code, source, technique name, continental group, linguistic family, linguistic group, language, family code, sex code, age, pathology acronym, variation event code, variation position, analysed region, restriction enzyme name and all text, i.e. a search for a specific word in the entire flatfile entry. As part of the collaboration with HmutDB (a federated single human mutation database), the human mtDNA variants have been implemented in SRS by Heikki Lehväslaiho in the section 'Mutation Databases'. In particular, the 'View' option allows data to be retrieved into tables, which can then be processed by any graphics software to obtain statistical views of the data according to user requirements.
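A Boolean selection over at most four criteria could look like the sketch below. It is not the MmtDBWWW implementation: the field names and entries are made up, and a single operator (AND or OR) is applied to all criteria, which is a simplifying assumption.

```python
def select_entries(entries, criteria, operator="AND"):
    """Return the entries matching a Boolean combination of criteria.

    `criteria` maps a field name to the required value. At most four
    criteria may be combined, mirroring the limit described in the text.
    """
    if not 1 <= len(criteria) <= 4:
        raise ValueError("between one and four criteria are required")
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    return [
        entry for entry in entries
        if combine(entry.get(name) == value for name, value in criteria.items())
    ]


# Made-up entries; the field names are illustrative only.
entries = [
    {"accession": "EX1", "gene_code": "DLOOP", "sex_code": "F", "continental_group": "Europe"},
    {"accession": "EX2", "gene_code": "DLOOP", "sex_code": "M", "continental_group": "Asia"},
    {"accession": "EX3", "gene_code": "ND1", "sex_code": "F", "continental_group": "Asia"},
]
both = select_entries(entries, {"gene_code": "DLOOP", "sex_code": "F"})
either = select_entries(
    entries, {"gene_code": "ND1", "continental_group": "Europe"}, operator="OR"
)
print([entry["accession"] for entry in both])
print([entry["accession"] for entry in either])
```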

CONCLUSION
MmtDB is a complete database and its structure is suitable for the flexible organization of several different information items, which can then be easily retrieved.
In MmtDB redundancy has been minimised by comparing each new sequence against the whole set of stored sequences before it is entered as a new entry. Every effort has been made to ensure the accuracy of the data.
Database users are encouraged to provide comments and, where possible, new data to include in the database. We believe that the contribution of the mitochondrial research community, including biochemists, clinicians, geneticists and taxonomists, is essential to the implementation and growth of the project.
ACKNOWLEDGEMENTS
This work has been partially supported by MPI (Italy), and by CNR Research Area of Bari, Italy.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.