Nucleic Acids Research, 2004, Vol. 32, Database issue D468-D470
© 2004 Oxford University Press
Ensembl 2004
Wellcome Trust Sanger Institute and European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
*To whom correspondence should be addressed. Tel: +44 1223 494983; Fax: +44 1223 494919; Email: th{at}sanger.ac.uk
Received September 16, 2003; Accepted September 18, 2003
ABSTRACT
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organize biology around the sequences of large genomes. It is a comprehensive and integrated source of annotation of large genome sequences, available via an interactive website, web services or flat files. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. The facilities of the system range from sequence analysis to data storage and visualization, and installations exist around the world both in companies and at academic sites. With a total of nine genome sequences available from Ensembl and more genomes to follow, recent developments have focused mainly on closer integration between genomes and external data.
INTRODUCTION
Genome sequences provide a natural framework about which to organize biological data. In the short time in which they have been available, genome databases have proved invaluable resources to researchers. Ensembl provides one of the most popular sources of automatic analysis and integration of large genome sequence data and is a joint project between the EBI and the Sanger Institute. It now contains nine genomes: five vertebrates (human, mouse, rat, fugu and zebrafish), two worms (Caenorhabditis briggsae and Caenorhabditis elegans) and two insects (Drosophila melanogaster and Anopheles gambiae). Ensembl has been involved in the continued analysis of human data, analysis of the mouse genome (1), analysis of the A.gambiae genome (2) and the C.briggsae genome. Ensembl gene predictions have also formed the core set of annotations for the forthcoming rat genome analysis. Ensembl remains an entirely open project with all data freely available and code openly licensed. Ensembl has developed a strong developer network of users in both academia and industry and is being installed both to mirror Ensembl-generated data and to be used as a software foundation for user projects. Several papers describing specific aspects of Ensembl have recently been submitted (3–6). This paper briefly outlines some of the developments of the project since the report last year (7).
NEW DEVELOPMENTS
Regular update cycle
To streamline the handling of an ever-changing and increasing amount of data, Ensembl adopted a monthly release cycle from February 2003. This allows improvements to the web interface and database schema to be released monthly, with new data being incorporated as they become available. Database dumps and flat files are released in sync with updates to the website.
Pre-ensembl website
A full Ensembl annotation of a genome takes some weeks to complete. To provide users with immediate access to newly released genome assemblies, Ensembl now offers a pre-ensembl website (http://pre.ensembl.org/) with limited functionality. This can be made available only a few days after the release of the genome and provides BLAST and SSAHA searching, placement of all known proteins, repeat masking and ab initio gene predictions.
Otter: an extended Ensembl schema for gene curation
During the year, Ensembl developed a new software component called Otter. Otter is an Ensembl database, but with an extended schema and an associated client/server system to support manual gene annotation. The Sanger Institute vertebrate annotation system is being migrated to use Otter, which will then put both automatic (Ensembl) and manual annotation under a single software framework and help greatly with subsequent data integration. The Otter server communicates with annotation clients via an XML format, which allows easy exchange and verification of annotation generated with different systems.
The Apollo genome browser (4), a GMOD component (http://www.gmod.org/) under joint development by Ensembl and the Berkeley Drosophila genome project (http://www.bdgp.org/), can be used as an annotation client for Otter. Apollo has also been extended to display data from DAS (distributed annotation system) servers. As an editor, Apollo has the advantage of being able to view and edit annotation in a comparative genomic context: by connecting to two Otter servers (e.g. human and mouse) and an Ensembl compara database containing pre-calculated synteny information between the two genomes, it is possible to view annotation for both genomes and edit each in the context of the synteny with the other.
ENHANCEMENTS
Other than these new developments, there have been continuous enhancements to existing features of Ensembl over the year. Users are recommended to read the "What's new" pages accompanying every release, as user interface improvements are frequently subtle but can save researchers considerable time. Some of the more significant improvements are listed here.
Ensembl genome annotation and comparative analysis
The quality of the annotation produced by the core automatic gene building system has continued to improve, with builds delivered on seven genome assemblies during the year. The most recent is the first version of the finished human genome sequence (NCBI33) announced in April, which also has pseudogenes automatically predicted. In parallel with gene building, comparative analysis is now routinely carried out for each new assembly. DNA synteny is generated between human, mouse and rat, and putative gene orthologues are automatically generated between all five vertebrates, between the two worms and between the two insects.
Ensembl website
Last year's move to the new schema enabled the development of significant enhancements to the Ensembl webviews. These include the addition of a fourth, basepair-level panel to Contigview, showing nucleotide sequence, six-frame amino acid translation and restriction enzyme site features. Additional pre-processing of SNP data with respect to other annotation, carried out during the building of the Ensembl-lite database (a denormalized database to speed web access), has allowed Contigview, Transview and Protview to be extended to show SNPs against transcripts and their protein products, including labelling of synonymous and non-synonymous coding SNPs. Other enhancements to Contigview include labelled syntenic blocks shown on the overview panel and access to a new interface, Dotterview, from DNA conservation tracks on the detailed view panel. Dotterview is a web interface to the program Dotter, showing a dotplot of DNA similarity between two genomes (by default over a 10 kb window), together with Ensembl annotation. The interface for adding DAS (8) sources to Contigview has continued to be developed, giving the user much greater control over display of each source.
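The labelling of coding SNPs as synonymous or non-synonymous can be illustrated with a minimal sketch. It assumes the standard genetic code and an in-frame coding sequence; the function name and inputs are hypothetical, and this is not the Ensembl implementation.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds, position, alt_base):
    """Label a single-base substitution in a coding sequence.

    cds      -- coding sequence, in frame from its first base (hypothetical input)
    position -- 0-based index of the substituted base within cds
    alt_base -- the alternative allele (one of A, C, G, T)

    Returns (label, reference amino acid, alternative amino acid).
    """
    cds = cds.upper()
    alt_base = alt_base.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    if not 0 <= position < usable_length:
        raise ValueError("position is outside the complete codons of the CDS")
    if alt_base not in BASES:
        raise ValueError("alt_base must be one of A, C, G, T")

    offset = position % 3
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa
```

For the toy sequence `ATGGCCTAA`, changing the third base of the `GCC` codon to `T` leaves alanine unchanged, whereas changing its first base to `C` yields proline.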
EnsemblMart: data mining for genomes
Ensembl has continued to import new externally generated data sets and resources into its system. These are frequently available in Contigview via the DAS source menu; however, many are also being incorporated into EnsemblMart as additional data mining indices. Examples include the STACK expression database eVOC nomenclature (in collaboration with SANBI), rat QTLs, and microarray identifiers from Affymetrix and others. All of these data types are queryable via the Mart data mining interface, which has increased substantially in functionality over the year. It now has its own "What's new" web pages and includes such functionality as integration with the ArrayExpress microarray repository at EBI.
Ensembl software system
The flexibility of components of the Ensembl software system is increasingly leading to their reuse elsewhere. Within the Sanger Institute alone, the Ensembl pipeline is being used to support gene curation by both the Wormbase and Havana (vertebrate annotation) groups. Havana is also in the process of making use of the Otter database for storing its gene annotation. The Ensembl website code has been reused to power the Vega website (http://vega.sanger.ac.uk/), which shows curated annotation of vertebrate genomes collected from a number of annotation groups into a single database. The fact that Ensembl data are also being served via DAS servers (8) is encouraging data to be combined in novel ways to provide specialist data displays. The website code has already been reused to build Contigview-like webviews of a virtual database composed entirely of different DAS sources.
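As an illustration of how a client might ask a DAS server for annotation on a region, the sketch below builds a features request URL. It assumes the `features?segment=reference:start,stop` request form of the DAS 1.x specification (8) and a `<server>/das/<source>/` path layout. The server address, data source name and coordinates are hypothetical placeholders, not actual Ensembl endpoints.

```python
from urllib.parse import urlencode


def das_features_url(server, data_source, reference, start, stop):
    """Build a DAS 'features' request URL for one segment.

    server      -- base URL of the DAS server (hypothetical placeholder)
    data_source -- name of the DAS data source (hypothetical placeholder)
    reference   -- reference sequence identifier, e.g. a chromosome name
    start, stop -- 1-based inclusive coordinates on the reference sequence
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    # Keep ':' and ',' unescaped so the segment reads reference:start,stop.
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{data_source}/features?{query}"


if __name__ == "__main__":
    print(das_features_url("http://das.example.org", "hypothetical_source",
                           "1", 1000, 2000))
```

A display that combines several DAS sources would issue one such request per source for the same segment and overlay the returned features.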
FUTURE DIRECTIONS
Ensembl remains focused on providing a genome information infrastructure of use to many researchers, principally via the web. As well as providing the baseline annotation for a number of genomes, Ensembl is continuously trying to improve all aspects of its work, from software engineering through to data analysis. 2004 promises a number of new genomes (e.g. chicken, chimp and honey bee) but also continued technology and presentation improvements, such as new views of cross-species data, organized around the putative gene orthologues predicted by the comparative analysis pipeline.
CONTACTING ENSEMBL
Ensembl is a joint project of the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute (WTSI), both of which are located on the Wellcome Trust Genome Campus, Cambridge, UK. To receive announcements about updates, subscribe to the announce mailing list: majordomo{at}ebi.ac.uk, subscribe ensembl-announce. To follow the day-to-day development of Ensembl, subscribe to the development mailing list: majordomo{at}ebi.ac.uk, subscribe ensembl-dev. Requests for information and support can be sent to helpdesk{at}ensembl.org, which is a fully supported helpdesk. Extensive additional documentation, including installation guides and tutorials on using both the software system and the web interface, can be found on the Ensembl website.
ACKNOWLEDGEMENTS
We are grateful to users of our website and the developers on our mailing lists for much useful feedback and discussion. The Ensembl project is funded principally by the Wellcome Trust with additional funding from EMBL and NIH-NIAID.
REFERENCES
1. Waterston,R.H., Lindblad-Toh,K., Birney,E., Rogers,J., Abril,J.F., Agarwal,P., Agarwala,R., Ainscough,R., Alexandersson,M., An,P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
2. Holt,R.A., Subramanian,G.M., Halpern,A., Sutton,G.G., Charlab,R., Nusskern,D.R., Wincker,P., Clark,A.G., Ribeiro,J.M., Wides,R. et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298, 129–149.
3. Birney,E., Clamp,M.E. and Hubbard,T.J. (2002) Databases and tools for browsing genomes. Annu. Rev. Genom. Hum. Genet., 3, 293–310.
4. Lewis,S.E., Searle,S.M., Harris,N., Gibson,M., Lyer,V., Richter,J., Wiel,C., Bayraktaroglir,L., Birney,E., Crosby,M.A. et al. (2002) Apollo: a sequence annotation editor. Genome Biol., 3, RESEARCH0082.
5. Hoon,S., Ratnapu,K.K., Chia,J.M., Kumarasamy,B., Juguang,X., Clamp,M., Stabenau,A., Potter,S., Clarke,L. and Stupka,E. (2003) Biopipe: a flexible framework for protocol-based bioinformatics analysis. Genome Res., 13, 1904–1915.
6. Clamp,M. (2003) The Jalview Java Alignment Editor. Bioinformatics, in press.
7. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.
8. Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001) The Distributed Annotation System. BMC Bioinformatics, 2, 7.