ABSTRACT
The present paper describes the structure of MmtDB, a specialized database designed to collect Metazoa mitochondrial DNA
variants. Priority in data collection is given to the Metazoa species for
which a large number of variants are available, as is the case for human
variants. Starting from the sequences available in the Nucleotide Sequence
Databases, redundant sequences are removed and new sequences from other
sources are added. Value-added information is associated with each variant sequence, e.g. analysed
region, experimental method, tissue and cell lines, population data, sex, age,
family code and information about the variation events (nucleotide position,
gene involved, gain or loss of a restriction site). Cross-references to the EMBL Data Library are provided, as well as internal
cross-references among MmtDB entries according to their tissue, heteroplasmic,
familial and haplotypic correlation. MmtDB can be accessed through the World
Wide Web at URL
http://WWW.ba.cnr.it/~areamt08/MmtDBWWW.htm.
Mitochondria are subcellular organelles under the control of both nuclear and
mitochondrial genomes. The mitochondrion is the only organelle in Metazoa that
contains its own DNA (
1
). The idea of creating a specialized database of Metazoa mtDNA variants (MmtDB)
originated from the awareness that a large body of information associated with
mtDNA sequences is not stored in the primary databases, which instead contain
redundant information (e.g. bibliographic and taxonomic). MmtDB has therefore
been designed and implemented as a subset of the primary databases, carefully
revised and enriched with specific information pertaining to the biological
features of each entry, with the aim of providing a new data structure and generating new
cross-references between sets of data that were previously completely unlinked.
MmtDB was originally designed as a Metazoan database, and our group started with
the management of vertebrate and particularly human data. We soon realized that a
much greater effort was needed to cover all Metazoa, and collaboration was
sought. MmtDB is now part of the MitBase project, a comprehensive and
integrated mitochondrial database recently funded by the EU BIOTECHNOLOGY
Programme. MitBase will be developed by a network of six nodes, each collecting
and editing data on a different group of organisms (protists, plants, fungi,
vertebrates, invertebrates and humans), by a bioinformatics node (EBI) and by a
node dealing with a pilot project on nuclear genes related to mitochondria. The
role of our group in the project is to continue the work started on
vertebrates.
The vertebrate mitochondrial genome is a closed circular molecule ranging in size
from 15 000 to 20 000 bp (
2
). The peculiar features of this molecule are its high compactness and
simplicity; mitochondrial DNA (mtDNA) is dense with information, does not have
introns and is generally composed of 13 genes coding for proteins, 22 tRNA
genes and two rRNA genes. It shows a major non-coding region, called the D-loop in vertebrates, which is involved in regulatory processes. The
D-loop is the most variable part of the genome and is thus used as a marker
for human diversity studies (
3
,
4
).
mtDNA has unusual genetics. It is maternally inherited (
5
) and is polyploid, which means that it is present, in both the cell and the organelle,
at a high copy number. Therefore the mtDNA in the same organelle, cell, tissue,
organ or individual may not be homogeneous: when a new mutation occurs,
it creates a mixed intracellular population of mutant and normal mtDNAs, a condition known
as heteroplasmy. When a heteroplasmic cell divides, the mutant and the normal
DNAs are randomly distributed to the daughter cells (mitotic segregation
process) (
6
). At some stages in oogenesis, the population of mtDNA molecules is reduced to a
relatively small number (bottleneck hypothesis) (
7
). Following these stages, over-replication brings the number of molecules in each cell to its normal
high level, and this can lead to a relatively pure population of each genome
that pre-existed in the original parental organelle (
8
).
Moreover, the location of mtDNA, which is attached to the mitochondrial inner
membrane and close to the respiratory chain, its lack of protective proteins
(e.g. histones) and a very poor DNA repair system (
9
) make mtDNA particularly prone to mutation.
Because of the small size of mtDNA and its use in several fields of
applied biology (
10
), the number of sequences of mitochondrial genes and complete genomes is
growing exponentially. In particular, much information on polymorphic regions
of mtDNA is now available in the literature. For human mtDNA, several of the
sequences managed in MmtDB are related to evolutionary studies and pertain
to the hypervariable segments (HVI, HVII) of the D-loop (
11
-
14
), and as many others relate to pathology studies on alterations of
the mtDNA (deletions, insertions and point mutations) (
15
,
16
).
The human data are coded using as reference the nucleotide sequence published by
Anderson et al. in 1981 (
17
), which, despite being a hybrid (it was derived from placental mtDNA and in part
from HeLa cell mtDNA), represents an important reference in human variability
studies.
The Metazoan mitochondrial sequences are retrieved from the EMBL
(
18
) and GenBank
(
19
) primary databases.
The data are extracted from the primary databases using the GCG (
20
), ACNUC (
21
) and SRS (
22
) packages. Each potential
variant is compared with a reference sequence using the GCG program BESTFIT. Published
sequence data that are not included in the primary databases are extracted
from bibliographic databases (Medline, Current Contents), from Entrez or from other
information systems. Conference proceedings and unpublished data kindly provided by the
authors are also included.
The data in MmtDB are organized into two large classes: SPECIES and variants
(Fig. 1). Each of these classes is further organized into subclasses or objects
(an example of the subclasses associated with each human variant is shown in
Fig. 2). Each species has n associated variants, and each variant is an entry in
the MmtDB database. The SPECIES class refers to the items in the database that
can be associated with a biological species of Metazoa for which mtDNA variants
are available.
The following objects are associated with the SPECIES class: the Reference
Sequence, the Gene and Restriction Endonuclease Maps, the Taxonomic
Classification and the Bibliography. The reference sequence is
the nucleotide sequence of the complete mitochondrial genome, if the genome
of that species has been fully sequenced; otherwise, if only one or more fragments
have been sequenced, the reference is either the longest sequence of each
fragment or a virtual sequence made by combining overlapping
fragments from different sources. Such a reference sequence can be regarded as
a tool that keeps the information compact: for each species the entire
sequence is reported only once, and for each variant it is sufficient
to code only the differences in the 'pattern of the variation events'.
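The compactness argument above can be illustrated with a minimal Python sketch that rebuilds a variant sequence from a reference sequence and a list of variation events. The event kinds ("sub", "del", "ins"), the class name and the convention that an insertion follows the stated position are assumptions made for illustration only; they are not the MmtDB codes of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariationEvent:
    position: int    # 1-based position in the reference sequence
    kind: str        # hypothetical codes: "sub", "del" or "ins"
    bases: str = ""  # new base(s) for "sub" and "ins"; unused for "del"
    length: int = 1  # number of reference bases removed by "del"


def apply_events(reference: str, events: list[VariationEvent]) -> str:
    """Rebuild a variant sequence from the reference and its variation events."""
    seq = list(reference)
    # Apply events from the highest position down, so that an insertion or
    # deletion does not shift the coordinates of events still to be applied.
    for ev in sorted(events, key=lambda e: e.position, reverse=True):
        i = ev.position - 1
        if ev.kind == "sub":
            seq[i:i + len(ev.bases)] = ev.bases
        elif ev.kind == "del":
            del seq[i:i + ev.length]
        elif ev.kind == "ins":
            # Assumption: inserted bases follow the given reference position.
            seq[i + 1:i + 1] = ev.bases
        else:
            raise ValueError(f"unknown event kind: {ev.kind}")
    return "".join(seq)
```

Under these assumptions only the event list needs to be stored for each variant; the full sequence is regenerated on demand from the single reference held for the species.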
The variants class includes as objects information items specific to the
fragment under consideration, such as: (i) the location of the fragment in the
reference sequence (analysed region); (ii) the experimental method used for the
detection of the variant, e.g. Sanger, Maxam and Gilbert, RFLP, Southern or PCR;
(iii) the pattern of the variation events with respect to the reference
sequence, i.e. the nucleotide position in the reference sequence where the
variation occurs, the type of variation (point mutation, deletion or insertion)
according to the codes reported in Table 1, the gene involved and the loss or
gain of a restriction site following the variation; (iv) bibliographic
references; (v) the tissue or cell lines from which the DNA was extracted;
(vi) population data, relevant to the geographic and linguistic origin of the
individuals from which the DNA was extracted; (vii) the age and the sex of the
individuals and their pathological or normal status.
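As a rough illustration of how the items (i) to (vii) could be held together in one record, the following Python sketch defines a variant entry as a data class. All field names, types and the accession format are hypothetical; the actual MmtDB storage format is the compact format produced by MITO-INS, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VariantEntry:
    accession: str                    # hypothetical MmtDB accession number
    species: str
    analysed_region: tuple[int, int]  # (start, end) on the reference sequence
    experimental_method: str          # e.g. "Sanger", "RFLP", "PCR"
    # Coded variation events; the real codes are those of Table 1.
    variation_events: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)
    tissue_or_cell_line: Optional[str] = None
    population: Optional[str] = None
    age: Optional[int] = None
    sex: Optional[str] = None
    status: Optional[str] = None      # "pathological" or "normal"

    def region_length(self) -> int:
        """Length in bp of the analysed region, both ends included."""
        start, end = self.analysed_region
        return end - start + 1
```

Optional fields default to None because, as noted for the population data, the original papers often do not report every item.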
Population data are coded according to geographical (Continental groups) and
anthropological (Population groups) classifications. A linguistic
classification according to M. Ruhlen (
23
) (Linguistic phylum, Language group and Language) has also been added. These
classifications are often limited by the scarcity of information reported in
the original papers.
When the variant has been extracted from the primary databases, it is
cross-referenced through the Accession Number and the Entry Name, which uniquely
identify each data entry in the primary databases (DR field in Fig.
3
). The entries are then internally cross-referenced through their accession
number in MmtDB (AC fields in Fig.
3
) in order to link different but correlated entries. The correlation can be
based on the tissue type (T), the haplotype (A), the family (F) or the
heteroplasmic status (H).
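The internal cross-referencing described above can be sketched in Python as a grouping step: entries that share the same correlation code and the same grouping key are linked to one another. The input triples, the grouping keys and the accession strings are assumptions for illustration; only the four codes T, A, F and H come from the text.

```python
from collections import defaultdict

# Correlation codes from the text: tissue (T), haplotype (A),
# family (F) and heteroplasmic status (H).
CORRELATION_CODES = {"T", "A", "F", "H"}


def build_cross_references(entries):
    """Link entries sharing a correlation.

    entries: iterable of (accession, code, group_key) triples, where
    group_key is a hypothetical label shared by correlated entries.
    Returns {accession: sorted list of (code, other_accession)}.
    """
    groups = defaultdict(set)
    for accession, code, key in entries:
        if code not in CORRELATION_CODES:
            raise ValueError(f"unknown correlation code: {code}")
        groups[(code, key)].add(accession)

    links = defaultdict(set)
    for (code, _key), members in groups.items():
        for accession in members:
            for other in members:
                if other != accession:
                    links[accession].add((code, other))
    return {accession: sorted(refs) for accession, refs in links.items()}
```

An entry that shares no correlation with any other entry simply does not appear in the returned mapping.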
The family correlation pertains mainly to human data related to mitochondrial
disease studies. When complete pedigrees have been analysed, a family code
is defined in MmtDB, composed of two letters for the family, followed by a Roman
numeral identifying the generation and an Arabic numeral identifying the
individual within the generation. Figure
3
shows an example of an MmtDB entry in the flatfile format, which is commonly used
by most biological databases, such as the EMBL Data Library, GenBank,
SWISS-PROT (
24
) and many others.
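The family code described above (two letters, a Roman numeral for the generation, an Arabic numeral for the individual) lends itself to a small parser. The Python sketch below assumes upper-case family letters and tolerates an optional hyphen or space between the three parts; the exact layout used in MmtDB entries is not specified in the text, so both choices are assumptions.

```python
import re

_ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Assumed layout: two letters, Roman numeral, Arabic numeral,
# optionally separated by a hyphen or a space.
_FAMILY_CODE = re.compile(r"^([A-Z]{2})[- ]?([IVXLCDM]+)[- ]?(\d+)$")


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'IV' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = _ROMAN[ch]
        # A smaller numeral placed before a larger one is subtracted (IV = 4).
        total += -value if nxt in _ROMAN and _ROMAN[nxt] > value else value
    return total


def parse_family_code(code: str) -> tuple[str, int, int]:
    """Return (family, generation, individual) for a family code."""
    match = _FAMILY_CODE.match(code)
    if match is None:
        raise ValueError(f"not a valid family code: {code!r}")
    family, generation, individual = match.groups()
    return family, roman_to_int(generation), int(individual)
```

Because the first two characters are always read as the family, a code whose family letters are themselves Roman-numeral letters is still parsed unambiguously.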
At present, data are entered by annotators skilled in biology. We plan to
prepare a submission form allowing authors to submit data directly. The
data in MmtDB are stored in a compact format, as shown in Figure
4
using the program MITO-INS.
MITO-INS is written in C and is structured interactively to allow the
annotator to insert, modify and display the data of an entry in the compact
format. A menu lets the annotator choose the operation. The input data are saved into a database made up of a set of files
relationally structured to allow complete and fully flexible querying
and retrieval of data.
A World Wide Web site has been developed to allow easy access to the
information in MmtDB at the following address: http://WWW.ba.cnr.it/~areamt08/MmtDBWWW.htm. The MmtDB home page is shown in Figure
5
.
MmtDBWWW is a query system that provides a point-and-click interface for
selecting lists of entries on which the following tasks can be performed:
flatfile generation for each variant entry (Fig. 3); nucleotide sequence
extraction of variant sequences based on a reference sequence, with nucleotide
variations in capital letters; and analysis of variation events (Fig. 6).
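The display convention mentioned for nucleotide sequence extraction, with variations shown in capital letters, can be sketched as follows in Python. The sketch covers point mutations only and assumes that unchanged bases are printed in lower case; the function name and the input dictionary are illustrative and do not describe the MmtDBWWW implementation.

```python
def highlight_variant(reference: str, substitutions: dict[int, str]) -> str:
    """Return the variant sequence with substituted bases in capitals.

    reference: reference sequence (any case).
    substitutions: {1-based position: new base}, point mutations only.
    """
    bases = list(reference.lower())
    for position, base in substitutions.items():
        if not 1 <= position <= len(bases):
            raise IndexError(f"position {position} is outside the sequence")
        bases[position - 1] = base.upper()
    return "".join(bases)
```

For example, `highlight_variant("ACGTACGT", {2: "t", 7: "A"})` returns `aTgtacAt`, so the two variant positions stand out against the reference background.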
The selection can be performed using a Boolean combination of up to four
of the following criteria: gene code, source, technique name, continental
group, linguistic family, linguistic group, language, family code, sex code,
age, pathology acronym, variation event code, variation position, analysed
region, restriction enzyme name and all text, i.e. a search for a specific
word in the entire flatfile entry.
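A selection of this kind can be sketched in Python as a filter over entry records. The sketch assumes that entries are dictionaries keyed by hypothetical field names and that all criteria are joined by a single operator (AND or OR); how MmtDBWWW actually combines operators is not detailed in the text.

```python
MAX_CRITERIA = 4  # the interface combines at most four criteria


def select_entries(entries, criteria, operator="AND"):
    """Select the entries matching a Boolean combination of criteria.

    entries: list of dicts (field names here are hypothetical).
    criteria: list of (field_name, value) pairs, one to four of them.
    operator: "AND" (all criteria must match) or "OR" (any may match).
    """
    if not 1 <= len(criteria) <= MAX_CRITERIA:
        raise ValueError("between 1 and 4 criteria are required")
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    return [
        entry for entry in entries
        if combine(entry.get(name) == value for name, value in criteria)
    ]
```

A query such as `select_entries(entries, [("gene_code", "ND1"), ("sex_code", "F")])` would then return only the entries satisfying both conditions.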
MmtDB is not the only specialized mitochondrial DNA database available to the
scientific community; two other databases contain similar data, GOBASE and
Mitomap (
25
,
26
).
The major purpose of GOBASE is to organize, collect and cross-reference the dispersed information concerning organelle genomes, both
mitochondria and chloroplasts, with the aim of making it available to the
scientific community as an organized set, free from errors.
Mitomap is a comprehensive collection of human mitochondrial data and includes
essentially clinical data on pathogenic mutations. Of the two, Mitomap is the one more similar
in content to MmtDB, but it does not offer the same flexibility in structure
and potential for data correlation. Indeed, Mitomap is made up of text tables
which do not allow an easy and prompt correlation between data. No cross-reference to other databases is provided in Mitomap. Furthermore,
selection is possible for only one of four set fields: function, disease,
polymorphisms and restriction sites. Despite the good clinical information
provided in Mitomap, it does not include information on the analysed region and
sequence, the analysis technique used, the source of the mtDNA, the
restriction enzymes or the individual.
Therefore, MmtDB is a more complete database and its structure is suitable for
the flexible organization of several different information items, which can
then be easily retrieved.
In MmtDB, redundancy has been minimized by comparing each new sequence against
the whole set of stored sequences before it is entered as a new entry. Every
effort has been made to ensure the accuracy of the data.
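The redundancy check described above can be illustrated with a minimal Python sketch in which a new sequence is accepted only if it is not already stored. The sketch assumes that two sequences are redundant when they are identical after removing whitespace and ignoring case; the comparison criteria actually applied in MmtDB are not specified in the text, and the class and method names are hypothetical.

```python
def normalise(sequence: str) -> str:
    """Strip all whitespace and upper-case the sequence before comparison."""
    return "".join(sequence.split()).upper()


class SequenceStore:
    """Hypothetical store that refuses sequences it already holds."""

    def __init__(self):
        self._by_sequence = {}  # normalised sequence -> accession

    def add(self, accession: str, sequence: str) -> bool:
        """Store the sequence; return False if it is already present."""
        key = normalise(sequence)
        if key in self._by_sequence:
            return False
        self._by_sequence[key] = accession
        return True

    def __len__(self) -> int:
        return len(self._by_sequence)
```

Keying the store on the normalised sequence makes each check a single dictionary lookup, whatever the number of sequences already stored.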
Database users are constantly encouraged to provide comments and possibly new
data to include in the database. We believe that the contribution of the
mitochondrion community, of biochemists, clinicians, geneticists and
taxonomists is essential to allow the implementation and growth of the project.
We are grateful to Professor Alberto Mioni of the University of Padova for his
help in population linguistic classifications. This work has been partially
supported by MPI (Italy), by the 'Comitato Biotecnologie e Biologia Molecolare'
of the CNR, Italy and by the EU-Biotechnology Programme.




REFERENCES
