Nucleic Acids Research, 2005, Vol. 33, Database issue D54-D58
© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved
Entrez Gene: gene-centered information at NCBI
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Room 5AS.13B, 45 Center Drive, Bethesda, MD 20892-6510, USA
* To whom correspondence should be addressed. Tel: +1 301 435 5950; Fax: +1 301 480 2918; Email: maglott{at}ncbi.nlm.nih.gov
Received September 15, 2004; Accepted September 22, 2004
| ABSTRACT |
|---|
Entrez Gene (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) is NCBI's database for gene-specific information. It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available. Entrez Gene is a step forward from NCBI's LocusLink, with both a major increase in taxonomic scope and improved access through the many tools associated with NCBI Entrez.
| INTRODUCTION |
|---|
Entrez Gene is the gene-specific database at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. Entrez Gene provides unique integer identifiers for genes and other loci for a subset of model organisms. It tracks those identifiers, and is integrated with the Entrez system for interactive query, LinkOuts, and access by E-utilities (1). The information that is maintained includes nomenclature, chromosomal localization, gene products and their attributes (e.g. protein interactions), associated markers, phenotypes, interactions, and a wealth of links to citations, sequences, variation details, maps, expression reports, homologs, protein domain content and external databases.
Data in Entrez Gene result from a mixture of curation and automated analyses. Annotation in sequences from NCBI's Reference Sequence project (2) or the International Nucleotide Sequence Database Collaboration (DDBJ, EMBL, GenBank) (3) is integrated with information from collaborating model organism databases, literature review (especially the Gene References into Function, or GeneRIFs) (1), and public users, with curation by RefSeq staff as required.
Entrez Gene is an integral part of the representation of gene-specific information at NCBI. The information conveyed by establishing a gene-to-sequence relationship is used by other NCBI resources (1) such as BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS and NCBI's genome annotation pipeline. For example, the names associated with GeneIDs are used in HomoloGene, Map Viewer, UniGene and the Mammalian Gene Collection (4). Inconsistencies in the representation of genes and their sequences are investigated and resolved by NCBI staff in collaboration with external nomenclature authorities.
The content, display and bulk reporting from Entrez Gene continue to be developed. Users may be interested in subscribing to gene-announce{at}ncbi.nlm.nih.gov to receive information about modifications.
| FUNCTION OF THE DATABASE |
|---|
The primary goals of Entrez Gene are to provide tracked, unique identifiers for genes and to report information associated with those identifiers for unrestricted public use. The identifier that is assigned (GeneID) is an integer, and is species specific. In other words, the integer assigned to dystrophin in human is different from that in any other species. For genomes that had been represented in LocusLink, the GeneID is the same as the LocusID. The GeneID is reported in RefSeq records as a db_xref (e.g. /db_xref="GeneID:856646", in GenBank format).
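Because the GeneID appears in RefSeq records as a db_xref qualifier, it can be recovered from GenBank-format text with a simple pattern match. The following is a minimal sketch, not an NCBI tool: the function name `extract_gene_ids` and the sample feature-table fragment are hypothetical, and only the qualifier value itself is taken from the example above.

```python
import re

# Matches a GeneID cross-reference as written in GenBank-format qualifiers,
# e.g. /db_xref="GeneID:856646"
GENEID_XREF = re.compile(r'/db_xref="GeneID:(\d+)"')


def extract_gene_ids(flatfile_text):
    """Return the distinct GeneIDs in the text, in order of first appearance."""
    seen = []
    for match in GENEID_XREF.finditer(flatfile_text):
        gene_id = int(match.group(1))
        if gene_id not in seen:
            seen.append(gene_id)
    return seen


# Hypothetical fragment of a feature table; the coordinates are invented.
sample = '''
     gene            1..1000
                     /db_xref="GeneID:856646"
     CDS             1..1000
                     /db_xref="GeneID:856646"
'''

gene_ids = extract_gene_ids(sample)
print(gene_ids)
```

The same GeneID is usually repeated on several features of one gene, which is why the sketch reports each identifier only once.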
Entrez Gene provides multiple reports. For the interactive user, the defaults are the HTML summary display resulting from an Entrez query (Figure 1) or a gene-specific report reached by clicking on the symbol in the summary page (Figure 2). The Gene Table display option is useful to obtain a report of the intron/exon organization of the gene as annotated on a RefSeq genomic sequence, and to navigate quickly to the sequence of any of those gene features. In addition to the multiple views from Entrez, Gene provides a complete database extraction in ASN.1 format as well as several tab-delimited reports for FTP transfer (ftp://ftp.ncbi.nlm.nih.gov/gene/). The data are also available from the programmatic interface to Entrez, namely E-utilities (1).
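A tab-delimited report can be loaded into a lookup keyed on GeneID with a few lines of code. The sketch below assumes a report with a header row and a `GeneID` column. The column names and the two placeholder rows are invented for illustration and do not describe the actual layout of any file on the FTP site.

```python
import csv
import io


def index_by_gene_id(tab_text):
    """Parse tab-delimited text with a header row into {GeneID: row dict}."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    return {int(row["GeneID"]): row for row in reader}


# Made-up report: column names and values are placeholders, not real data.
sample_report = (
    "tax_id\tGeneID\tSymbol\n"
    "0\t1\tPLACEHOLDER_A\n"
    "0\t2\tPLACEHOLDER_B\n"
)

records = index_by_gene_id(sample_report)
print(records[2]["Symbol"])
```

Keying on the integer GeneID rather than on the symbol follows the design of the database, where the identifier is stable and the nomenclature may change.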
| SCOPE OF THE DATABASE |
|---|
When are GeneIDs assigned?
Identifiers are always assigned to what is annotated as a Gene on a RefSeq record. Records may also be created when an authoritative source for a genome assigns an identifier to a gene, mapped locus or trait, even though that entity is not yet defined by explicit sequence. Although this means that Entrez Gene is not restricted to what might be considered a gene biologically, the expectation is that some of these records will become more gene-like as the molecular basis of traits or other loci is defined. Each Gene record is assigned a type from an enumerated list in the ASN.1 specification for Entrez Gene. (See the Gene Help document for more information.) This type, which is indexed in Entrez as a named property (e.g. genetype protein coding), can be changed without changing the GeneID.
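The separation between a stable identifier and a revisable type can be pictured with a small data structure. This is an illustrative sketch only: the class names, the subset of type values and the record itself are assumptions, and the authoritative enumerated list is the one in the ASN.1 specification.

```python
from dataclasses import dataclass
from enum import Enum


class GeneType(Enum):
    # Illustrative subset only; not the list from the ASN.1 specification.
    UNKNOWN = "unknown"
    PROTEIN_CODING = "protein coding"
    OTHER = "other"


@dataclass
class GeneRecord:
    gene_id: int         # stable identifier, never reassigned
    gene_type: GeneType  # may be revised as the locus is characterized

    def reclassify(self, new_type):
        """Change the type in place; the identifier is untouched."""
        self.gene_type = new_type

    def property_term(self):
        """Text of the named property, e.g. 'genetype protein coding'."""
        return f"genetype {self.gene_type.value}"


record = GeneRecord(gene_id=1, gene_type=GeneType.UNKNOWN)  # hypothetical record
record.reclassify(GeneType.PROTEIN_CODING)
print(record.gene_id, record.property_term())
```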
Some current statistics
As of September 2004, there were more than 2400 taxa represented in Entrez Gene, with a total of approximately 958 000 current records. Not all the taxa are completely represented in Entrez Gene; most of the eukaryotes (~600 total), for example, have Gene records only for their mitochondrial genomes. More than half of the taxa represented are viruses (~1350). Next, among taxa whose genomes have comprehensive gene annotation, are eubacteria and Archaea (~200 and 20, respectively). About 95 per cent of all records are for protein-coding genes.
Record content
Table 1 summarizes the gene-specific information that can be retrieved through Entrez Gene, how the data are shown, and some aspects of how those data are processed. For example, GeneRIFs, contributed primarily by the public and the Index Section of the National Library of Medicine, provide an annotated bibliography of the function, discovery and mapping of genes from the current literature and are seen in the default report. Information about Clusters of Orthologous Groups (COGs) (5) is available via Links menus. This combination of text and connections is designed to provide sufficient descriptions, keywords and links to make Entrez Gene an effective starting place to retrieve information of interest.
| ACCESS TO ENTREZ GENE |
|---|
The information in Entrez Gene can be accessed in multiple ways at NCBI (Table 2). The most direct is to submit a query to Entrez from the NCBI home page and display the results in Gene, or enter a query in any Entrez query bar and restrict the database search to Gene. Another way is to take advantage of the Links computed by the Entrez system. For example, you might find a PubMed record of interest, and from PubMed's Links menu discover that there is a record in Gene connected to the publication.
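The same kind of query, restricted to Gene, can also be expressed as an E-utilities request. The sketch below only builds the request URL and does not contact NCBI. The ESearch endpoint address and the `db`, `term` and `retmax` parameters are assumptions drawn from the public E-utilities interface rather than from this article, and `gene_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed address of the E-utilities ESearch endpoint; not given in this article.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term, retmax=20):
    """Build an ESearch URL that restricts the search to the Gene database."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = gene_search_url("dystrophin")
print(url)
```

Fetching the URL would be expected to return the matching GeneIDs, which could then be passed to other E-utilities calls. Any real use should follow NCBI's current usage guidelines.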
Many databases within NCBI take advantage of the GeneID<->sequence relationship maintained by Gene to make connections from a sequence of interest to the Entrez Gene record. For example, BLAST queries matching protein or mRNA accessions associated with an Entrez Gene record are identified by the blue G icon. Map Viewer provides links from annotated genes to Entrez Gene. And RefSeq records include the GeneID as a db_xref in the gene feature. Thus you can obtain gene-specific information not only by text queries but also by genomic position (Map Viewer), RefSeq annotation and related sequences (BLAST, Entrez Nucleotide, Entrez Protein).
| LINKS TO EXTERNAL DATABASES FROM ENTREZ GENE |
|---|
Entrez Gene can serve as a directory to gene-specific information for databases outside of NCBI. External databases can register with the LinkOut service (1) and submit information about how their database should be connected to any Gene record. Any user of Entrez Gene retrieving a record with LinkOuts will then be able to connect to the registered database according to the specification of the data provider.
| FEEDBACK |
|---|
We welcome your feedback with respect to the Entrez Gene interface, or any data contained therein. You may use any of the Feedback options on a Gene page (Figure 1).
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions{at}oupjournals.org.
| REFERENCES |
|---|
1. Wheeler,D.L., Benson,D.A., Bryant,S., Canese,K., Church,D.M., Edgar,R., Federhen,S., Helmberg,W., Kenton,D., Khovayko,O. et al. (2005) Database resources of the National Center for Biotechnology Information: Update. Nucleic Acids Res., 33, D39-D45.
2. Pruitt,K.D., Tatusova,T. and Maglott,D. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts, and proteins. Nucleic Acids Res., 33, D501-D504.
3. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Wheeler,D.L. (2005) GenBank. Nucleic Acids Res., 33, D34-D38.
4. Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G., Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D., et al. and Mammalian Gene Collection Program Team. (2002) Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl Acad. Sci. USA, 99, 16899-16903.
5. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L., Nikolskaya,A.N. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.