| Nucleic Acids Research | Pages |
DNA Data Bank of Japan dealing with large-scale data submission
Introduction
A Large-Scale Data Submission System
Overview Of Data Flow In DDBJ
Data Submission Procedures And System Overview For Large-Scale Data
References
ABSTRACT
INTRODUCTION
When we began data bank activity at the DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) more than 10 years ago, we expected that the amount of submitted data would double over the following five years or so. However, the doubling time is presently less than two years, far shorter than we expected. There are two reasons for this, neither of which was anticipated at that time.
The first is the commencement of cDNA projects, which have produced a large number of partially sequenced DNA fragments, each a few hundred nucleotides long, from expressed genes; these are called expressed sequence tags (ESTs) (1). EST data have been sent to DDBJ as mass submissions, sometimes including more than 10 000 sequences in a single submission. We allocate an accession number to each submitted EST. EST data continue to be produced by many projects worldwide and submitted to DDBJ, GenBank and the EMBL Nucleotide Sequence Database. Today, ESTs account for more than 75% of the total data that the three data banks have collaboratively collected, processed and released.
Since EST data tell our users only the source organism and that the corresponding gene, of unknown function, is expressed in the cell, one might consider that they have a narrower range of use than ordinary sequence data. One can, however, take advantage of their sheer number when using them for biological study. For example, we recently carried out gene hunting in an HLA genome region and found EST data quite useful (2). Namely, we ran homology searches of possible exons, detected by a gene-finding tool in this region, against dbEST, and picked out ESTs almost identical to the exons. Some of the ESTs were then found to form a gene similar to an extant functional gene.
The second reason is the beginning of genome projects, particularly for prokaryotes. These projects have produced the complete genome sequences of 10 or more prokaryotic species (3,4 and others) and that of Saccharomyces cerevisiae (5). Furthermore, the complete genome sequences of other eukaryotic species such as Caenorhabditis elegans and Arabidopsis thaliana will soon be available, and the completion of the human genome sequencing projects will follow. One use of complete genome sequence data is to investigate the evolution of genome structure. For S.cerevisiae, it has been indicated that the whole genome was duplicated about 100 million years ago (6). Similar observations and discussion have been made for the human genome (7), though the complete genome sequence is not yet available.
In this article we report modifications to our data submission tool, Sakura (8), and our database management system, Yamato II (9). The modified tools handle and process mass submissions of ESTs and long stretches of genome sequence more efficiently than the originals. We also briefly discuss extending large-scale data submission and processing on the basis of an object-oriented database management system.
A LARGE-SCALE DATA SUBMISSION SYSTEM
DDBJ originally developed two World Wide Web (WWW) interface-oriented systems, Sakura and Yamato II: Sakura is used for data submission, whereas Yamato II is used for data annotation and management. Both tools have been successful over the past few years in terms of service and operation; however, they lacked the ability to handle a large number of sequences in a set and/or very long sequences. Recently, DDBJ has experienced a dramatic increase in massive data submissions (Table 1). Therefore, DDBJ has introduced a new system, modified from Sakura and Yamato II, which has the capacity to accept and annotate this influx of data.
OVERVIEW OF DATA FLOW IN DDBJ
When submitting ordinary data, a submitter has several options (Fig. 1).
Figure 1. Overview of data flow from data registration by submitters to data dissemination at DDBJ.
Besides these submission methods, there are other systems suited to large-scale data submissions, which have been developed by DDBJ and are explained in detail below. These systems allow a submitter to verify file formats before submitting the data by Email. They were also designed to detect invalid characters automatically and to check the consistency between sequence and annotation. These systems save labor and time in the validation of submitted data at DDBJ.
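As an illustration of the automatic invalid-character check mentioned above, the following Python sketch reports every character in a nucleotide sequence that falls outside an allowed alphabet. It is only a sketch and not DDBJ's implementation: the article does not state which characters the tools accept, so the use of the IUPAC nucleotide codes here, and the names `IUPAC_NUCLEOTIDES` and `find_invalid_characters`, are assumptions.

```python
# Assumption: the accepted alphabet is the set of IUPAC nucleotide codes.
# The article does not say which characters DDBJ's tools treat as valid.
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def find_invalid_characters(sequence):
    """Return (position, character) pairs for every character outside the alphabet.

    Positions are 1-based, as is conventional for sequence coordinates.
    The comparison is case-insensitive.
    """
    return [
        (position, char)
        for position, char in enumerate(sequence, start=1)
        if char.upper() not in IUPAC_NUCLEOTIDES
    ]


if __name__ == "__main__":
    # A clean sequence yields an empty report.
    print(find_invalid_characters("acgtnacgt"))
    # A sequence with stray symbols yields one pair per offending character.
    print(find_invalid_characters("acgt-xacg*"))
```

Reporting positions rather than silently dropping characters lets a submitter correct the file before sending it, which is the labor saving the text describes.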
DATA SUBMISSION PROCEDURES AND SYSTEM OVERVIEW FOR LARGE-SCALE DATA
DDBJ's new system for managing large-scale data submissions consists of four parts: (i) the WWW data submission system, (ii) the large-scale data submission system/off-line (installed at the submitter's site), (iii) the data submission management system and (iv) the data storing system (Fig. 2).
Figure 2. Overview and modules of the large-scale data submission system.
The large-scale data submission system/off-line is publicly available and can be downloaded from an ftp server at DDBJ. The system rigorously checks the files and annotations, automatically excluding any invalid characters. It also allows a submitter to verify the file formats and the consistency between annotation and sequence before actually sending the data files by Email. Two types of file are used for a submission: one for annotation and the other for sequences. The annotation file is in a tabular format that popular word processors, spreadsheets and database management systems can handle, so a submitter does not need to purchase an expensive platform in addition to their existing system. Nucleotide sequences are submitted in the FASTA file format. This automatic verification is particularly important for large-scale genome data, because it greatly reduces the effort required of reviewers and annotators at DDBJ.
The third system is the data submission management system, which manages the submitted files and monitors the progress of each submission. After the Excel and FASTA files sent by a submitter are received, a record is created in the system for the submission, and a message is sent to other DDBJ staff members by Email.
The fourth is the data storing system. In addition to checking the consistency between annotation and sequence, this system automatically performs more rigorous checks before loading the files into the master database. It also loads the template information, ranging from the locus, definition and accession number to the feature information and the source organism, into the master database. Finally, the system issues an accession number to the DDBJ administrator, who notifies the submitter of the number by Email.
Using the large-scale data submission system, DDBJ has recently handled several major genome data submissions from institutes and universities in Japan. For example, Kazusa DNA Research Institute has submitted Arabidopsis data of 7 472 343 nucleotides, the longest sequence data ever submitted to DDBJ in a single submission. The Institute has also submitted Homo sapiens data of 67 914 nucleotides. Kitasato University has submitted Homo sapiens data of 4 842 948 and 5 561 026 nucleotides in separate submissions. In addition, the Product Assessment Technology Center of the Ministry of Industry and Trade has submitted Pyrococcus horikoshii data of 1 738 505 nucleotides. All of these data are now available at DDBJ, GenBank and the EMBL Nucleotide Sequence Database.
In addition to the new systems described above, three more systems serve data management. The first is the master database system, called ddbj, which stores the files regardless of the size of the data per entry. The database is built on Sybase and runs under the UNIX operating system on a Sun server. The second is the so-called group manager, which updates the annotations after the files have been loaded into the database. The last is the distribution manager, which releases the new data to the public.
There was a special case for submitting GSS (Genome Survey Sequence) data (Fig. 3).
Figure 3. Submission system for GSS (Genome Survey Sequence).
Although the large-scale data submission system is not well known, its ability to process data efficiently has led it to be regarded as the flagship of DDBJ. As shown previously, over the last few years there has been a substantial increase in sequence submissions to DDBJ.
From our experience with data management, it has become clear that we need to develop more efficient and effective data submission tools. We aim to offer efficient, cost-effective and user-friendly services, thereby making DDBJ more competitive. DDBJ is now considering upgrading the data submission and processing systems by introducing a new type of architecture such as CORBA, an object-oriented distributed platform. The new systems will manage submitted data more efficiently by reducing data-handling labor and time.
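To make the off-line consistency check between the tabular annotation file and the FASTA sequence file more concrete, the following Python sketch shows one way such a check could work. It is not DDBJ's actual implementation: the tab-separated layout, the column names `entry` and `length`, and the function names `read_fasta` and `check_consistency` are assumptions made purely for illustration.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {entry name: sequence}.

    The entry name is taken to be the first word after '>' (an assumption).
    """
    parts_by_name = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an entry name")
            name = fields[0]
            if name in parts_by_name:
                raise ValueError(f"duplicate FASTA entry: {name}")
            parts_by_name[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            parts_by_name[name].append(line)
    return {name: "".join(parts) for name, parts in parts_by_name.items()}


def check_consistency(annotation_handle, fasta_handle):
    """Return a list of problems; an empty list means the two files agree.

    Hypothetical annotation columns: 'entry' (name) and 'length' (nucleotides).
    """
    sequences = read_fasta(fasta_handle)
    problems = []
    annotated = set()
    for row in csv.DictReader(annotation_handle, delimiter="\t"):
        entry = row["entry"]
        annotated.add(entry)
        if entry not in sequences:
            problems.append(f"{entry}: annotated but has no sequence")
        elif int(row["length"]) != len(sequences[entry]):
            problems.append(
                f"{entry}: declared length {row['length']}, "
                f"actual length {len(sequences[entry])}"
            )
    for entry in sequences:
        if entry not in annotated:
            problems.append(f"{entry}: sequence present but not annotated")
    return problems


if __name__ == "__main__":
    annotation = (
        "entry\torganism\tlength\n"
        "EST001\tHomo sapiens\t8\n"
        "EST002\tHomo sapiens\t5\n"
        "EST003\tHomo sapiens\t4\n"
    )
    fasta = ">EST001\nacgtacgt\n>EST002\nacgt\n>EST004\nacgt\n"
    for problem in check_consistency(io.StringIO(annotation), io.StringIO(fasta)):
        print(problem)
```

Collecting all problems in one pass, rather than stopping at the first, suits mass submissions of thousands of entries, where a submitter would want a complete report before resending the files.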
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.