Nucleic Acids Research
IMGT, the international ImMunoGeneTics database
Introduction
IMGT Rules
Standardization of keywords
Standardization of sequence annotation
Standardization of Ig and TcR gene designation
IMGT/LIGM-DB Organization And Content
Data Collection And Annotation
Collection of data
Automation of the annotation procedure
Innovations In Data Integrity And IMGT Quality
The IMGT unique numbering
Internal cross-references and data integrity
IMGT data distribution
Conclusion
Access And Contact
Acknowledgements
References
ABSTRACT
INTRODUCTION
The immune system has evolved to protect individuals against pathogenic viruses, micro-organisms and parasites, so a normally functioning immune system is vital. Normal immune responses depend on the ability to recognize foreign molecules, or antigens, on potential pathogens in order to eliminate the source of the antigen. In vertebrates, the molecules involved in antigen recognition belong to the immunoglobulin superfamily and include immunoglobulins (Ig), T cell receptors (TcR) and Major Histocompatibility Complex (MHC) molecules. In humans, the MHC is referred to as the Human Leukocyte Antigen (HLA) system. The molecular synthesis and genetics of the Ig and TcR chains (1,2) are particularly complex and unique, since they involve biological mechanisms such as DNA rearrangements in seven loci (three for Ig and four for TcR) located on four different chromosomes in human, nucleotide deletions and insertions at the rearrangement junctions, and hypermutations in the Ig loci. The number of potential protein forms of Ig and TcR is almost unlimited. Owing to this complexity and the high number of published sequences, data control and detailed annotation are very difficult tasks for the generalist databanks: EMBL (3), GenBank (4) and DDBJ (5). The international ImMunoGeneTics (IMGT) database (6) was created in 1992 by Marie-Paule Lefranc (CNRS, Montpellier II University, Montpellier, France, lefranc@ligm.crbm.cnrs-mop.fr) (see Fig. 1 for the IMGT home page, http://imgt.cnusc.fr:8104 ). IMGT comprises alignment tables and expertly annotated sequences and consists of three databases: LIGM-DB for Ig and TcR, MHC/HLA-DB, and PRIMER-DB (an Ig, TcR and MHC-related primer database). The last two are currently in development.
Figure 1

IMGT RULES
Standardization of keywords
IMGT keywords for Ig and TcR comprise the following. (i) General keywords. Indispensable for the sequence assignments, they are described in an exhaustive and non-redundant list and are organized in a tree structure. (ii) Specific keywords. These are associated with particularities of the sequences (orphon, transgene, etc.) or with diseases (leukemia, lymphoma, myeloma, etc.). The list is not definitive, and new specific keywords can easily be added if needed.
Standardization of sequence annotation
Ig and TcR sequences have been analysed at the DNA and protein levels in order to define a list of labels for the structural and functional motifs (6). A total of 177 feature labels were shown to be necessary for accurate annotation. Annotation is the most critical step and a very time-consuming process: an experienced annotator can annotate only ~50 sequences a week. Levels of annotation have been defined, which allow users to query sequences in IMGT/LIGM-DB even when they are not fully annotated.
Standardization of Ig and TcR gene designation
The objective is to provide immunologists and geneticists with a unique nomenclature per locus which will allow extraction and comparison of data for the complex B and T cell antigen receptor molecules, whatever the species. Data concerning the human Ig and TcR genes have been standardized, and maps of loci, tables of germline genes in the IMGT nomenclature, correspondence to other gene designations, and gene functionality are available at the IMGT Marie-Paule page from http://imgt.cnusc.fr:8104 (Fig. 2).
Figure 2

IMGT/LIGM-DB ORGANIZATION AND CONTENT
LIGM-DB development is mainly based on a relational model. The database is maintained with SYBASE as the relational DBMS (database management system) on a Unix IBM RISC6000 at CNUSC (Centre National Universitaire Sud de Calcul) in Montpellier (France). CNUSC is in charge of computing operations. New releases of the relational schema and updates of the database structure, which closely follow the results of biological research, are the joint responsibility of LIGM and CNUSC.
In October 1997, LIGM-DB contained 23 741 nucleic acid sequences of Ig or TcR from 78 species. IMGT sequences are identified by their EMBL accession number. IMGT data comprise core data (sequence data, bibliographical references and taxonomic data retrieved from EMBL entries), completed with annotations, specific analyses and expertise provided by LIGM. IMGT/LIGM-DB standardized keywords have been assigned to all entries, and more than 9300 sequences are now fully annotated. Sequences resulting from a query can be obtained in different formats (FASTA, EMBL, etc.), with three reading frames, and with protein translation. This facility will be extended to set up a protein database for Ig and TcR. Since August 1996, the IMGT/LIGM-DB content has closely followed the Ig and TcR content of EMBL, with the advantage that sequences previously wrongly assigned to Ig and TcR have been removed.
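The three-reading-frame translation mentioned above can be illustrated with a minimal Python sketch. It is not the IMGT implementation: the function names are hypothetical, the standard genetic code is assumed, and only the three forward frames are translated.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string starting at offset `frame` (0, 1 or 2).

    Stop codons are written as '*'; codons with unexpected characters
    (for example 'N') are written as 'X'. Trailing partial codons are ignored.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]
```

For example, `three_frames("ATGTGTTGGTAA")` gives one protein string per frame, the first of which ends with the `*` of the TAA stop codon.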
DATA COLLECTION AND ANNOTATION
Collection of data
The sole source of data is the generalist database EMBL. Once the authors have allowed their sequences to be made public, EMBL automatically sends Ig and TcR sequences to LIGM by e-mail. After control by LIGM curators, sequences are scanned in order to store the information that is not specific to IMGT, such as bibliographical references and taxonomic data. Then, specific LIGM-DB annotations are added to the sequence information. This time-consuming and elaborate step is the limiting factor for the addition of expert data. Moreover, the publication of novel Ig, TcR and MHC sequences continues at an ever-increasing pace, and the development of automated sequencing techniques suggests that this trend will continue for the foreseeable future. To overcome this major problem, we are developing, in collaboration with EMBL-EBI, a WWW interface for direct submission of the IMGT/LIGM-DB keywords and labels by the sequence submitters. This model of sequence submission to primary databases and control, currently developed by IMGT, can later be extended to other specialized and integrated databases.
Automation of the annotation procedure
In order to speed up the annotation procedures, semi-automatic annotation software (LIGMotif, a validation module and DNAPLOT) has been developed in a collaboration between LIGM, CNUSC and IFG. These programs are used as an aid by the IMGT annotators at LIGM. Their output still needs verification by experts, but the aim is to reduce the time required for annotation to a minimum.
LIGMotif: search for conserved motifs in Ig and TcR. As an aid for the annotation procedure, the LIGMotif program, written in portable C, has been developed by G. Mennessier (LPM, Montpellier). The algorithms used are specific for the search of Ig and TcR patterns in DNA sequences and are based on the delimitation rules established after extensive research by LIGM. Motifs of interest for the Leader, Variable, Diversity and Joining regions can be searched for in germline or rearranged DNA and in cDNA. The output is an ASCII file containing the features associated with the sequences, plus additional information that helps the annotator decide which solution(s) might be correct. New rules are defined as new scientific knowledge of the immunogenetic sequences becomes available.
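The LIGMotif algorithms and delimitation rules are not described here, so the following Python sketch only illustrates the general idea of searching Ig and TcR DNA for a conserved pattern. As an assumption, it looks for a recombination signal sequence downstream of a germline V gene: the consensus heptamer CACAGTG, a spacer of 12 or 23 bp, and a nonamer close to the consensus ACAAAAACC. The function names and the mismatch tolerance are hypothetical.

```python
import re

# Consensus recombination signal sequence elements (assumed for this sketch).
HEPTAMER = "CACAGTG"
NONAMER = "ACAAAAACC"
SPACER_LENGTHS = (12, 23)


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_rss(dna, max_nonamer_mismatches=2):
    """Return (heptamer_start, spacer_length, nonamer) for each candidate signal.

    The heptamer must match the consensus exactly; the nonamer may differ
    from the consensus at up to `max_nonamer_mismatches` positions.
    Positions are 0-based offsets into `dna`.
    """
    dna = dna.upper()
    hits = []
    # A lookahead is used so that overlapping heptamer matches are not missed.
    for match in re.finditer(f"(?={HEPTAMER})", dna):
        start = match.start()
        for spacer in SPACER_LENGTHS:
            nonamer_start = start + len(HEPTAMER) + spacer
            nonamer = dna[nonamer_start:nonamer_start + len(NONAMER)]
            if (len(nonamer) == len(NONAMER)
                    and count_mismatches(nonamer, NONAMER) <= max_nonamer_mismatches):
                hits.append((start, spacer, nonamer))
    return hits
```

Candidate hits from such a search would still need to be checked by an annotator, in line with the semi-automatic procedure described above.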
DNAPLOT: software development for nucleotide and protein sequence alignments. DNAPLOT is an alignment tool, part of IMGT, which uses sets of appropriate sequences to build, display, maintain and search nucleotide sequence alignment tables. This programme, developed by H. H. Althaus and W. Müller (IFG, Köln), can be downloaded from http://www.genetik.unikoeln.de/dnaplot.html . A version adapted to Ig and TcR sequences, IMGT/DNAPLOT, is available from the IMGT home page (Fig. 4). The aim is to provide a fast and easy-to-use tool for research. These alignments are very important for the verification of new sequences, for their annotation and for the design of experimental procedures in sequence analysis. IMGT/DNAPLOT currently performs two tasks: (i) the analysis of functional germline variable sequences to identify the gene and/or the subgroup and (ii) the analysis of rearranged variable sequences to identify the V-GENE, the D-SEGMENT (for IgH, and TcR β and δ chains) and the J-SEGMENT involved in the rearrangements. Searches can currently be run against Ig gene alignments, and soon against TcR and MHC gene alignments, using IMGT reference sequence data.
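The identification of the closest reference gene can be sketched in Python as follows. This is not the DNAPLOT algorithm, which is not described here. The sketch assumes that the query and the reference sequences are already aligned position by position, it scores them by simple percent identity, and the reference names and sequences in the usage example are invented toy data.

```python
def identity(a, b):
    """Fraction of identical positions over the shared length of two sequences."""
    shared = min(len(a), len(b))
    if shared == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / shared


def closest_reference(query, references):
    """Return (name, identity) of the reference most similar to `query`.

    `references` maps a reference name to its sequence. Ties are resolved
    in favour of the reference that appears first in the mapping.
    """
    best_name = max(references, key=lambda name: identity(query, references[name]))
    return best_name, identity(query, references[best_name])


# Toy data, invented for illustration only.
toy_references = {"V-A": "ACGTACGTAC", "V-B": "ACGTTCGTAC"}
toy_result = closest_reference("ACGTTCGTAA", toy_references)
```

A real tool would first align the sequences and would handle gaps and partial sequences, which this sketch leaves out.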
INNOVATIONS IN DATA INTEGRITY AND IMGT QUALITY
The IMGT unique numbering
A uniform numbering system for Ig and TcR sequences of all species has been established by Marie-Paule Lefranc to facilitate sequence comparison and cross-referencing between experiments from different laboratories, whatever the antigen receptor (Ig or TcR), the chain type or the species (7) (Fig. 3). This numbering results from the analysis of >3000 Ig and TcR variable region sequences of vertebrate species from fish to human. It takes into account and combines the definition of the framework (FR) and complementarity determining regions (CDR) (8), structural data from X-ray diffraction studies (9), and the characterization of the hypervariable loops (10). In the IMGT numbering, conserved amino acids from the frameworks always have the same number, whatever the Ig or TcR variable sequence and whatever the species they come from. Examples are cysteine 23 (in FR1), tryptophan 41 (in FR2), and leucine 89 and cysteine 104 (in FR3). Tables and graphs are available on the WWW interface at the IMGT Marie-Paule page from http://imgt.cnusc.fr:8104
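Because the conserved framework residues keep the same number in every sequence, they can be checked by position alone. The Python sketch below uses the four positions given as examples above. The input representation is an assumption made for illustration: a string in which the character at index i is the amino acid at IMGT position i + 1, with `.` marking unoccupied positions. The function name is hypothetical.

```python
# Conserved positions quoted in the text: IMGT position -> expected amino acid.
CONSERVED_POSITIONS = {23: "C", 41: "W", 89: "L", 104: "C"}


def check_conserved(numbered_sequence):
    """Report, for each conserved position, whether the expected residue is present.

    `numbered_sequence` is assumed to be laid out by IMGT position:
    index 0 holds position 1, index 1 holds position 2, and so on.
    Positions beyond the end of the sequence are reported as False.
    """
    report = {}
    for position, expected in CONSERVED_POSITIONS.items():
        if position <= len(numbered_sequence):
            found = numbered_sequence[position - 1]
        else:
            found = None
        report[position] = (found == expected)
    return report
```

A sequence that fails one of these checks is not necessarily wrong, but it is a natural candidate for closer inspection by an annotator.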
Figure 3
This IMGT unique numbering has several advantages. (i) It has allowed the redefinition of the limits of the FR and CDR. The FR-IMGT and CDR-IMGT lengths become in themselves crucial information which characterizes variable regions belonging to a group, a subgroup and/or a gene. (ii) Framework amino acids (and codons) located at the same position in different sequences can be compared without requiring sequence alignments. This also holds for amino acids belonging to CDR-IMGT of the same length. (iii) The unique numbering is used as the output of the IMGT/DNAPLOT alignment tool. The aligned sequences are displayed according to the IMGT numbering and with the FR-IMGT and CDR-IMGT delimitations. (iv) The unique numbering has allowed a standardization of the description of the mutations and allelic polymorphisms of the variable regions. These mutations and allelic polymorphisms are described by comparison to the 'reference sequences' defined in IMGT (Fig. 4).
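Points (ii) and (iv) can be illustrated with a short Python sketch that compares a sequence with a reference sequence position by position and lists the differences. Both inputs are assumed to be laid out by IMGT position, with `.` for unoccupied positions. The `Q3>K` style of the output is a notation chosen for this sketch, not necessarily the notation used by IMGT, and the function name is hypothetical.

```python
def describe_differences(reference, sequence, gap="."):
    """List substitutions between two sequences laid out by IMGT position.

    Each difference is written as <reference residue><position>><observed residue>,
    with positions counted from 1. Positions where either sequence has a gap
    are skipped, and positions beyond the shorter sequence are ignored.
    """
    differences = []
    for position, (ref, obs) in enumerate(zip(reference, sequence), start=1):
        if ref != obs and gap not in (ref, obs):
            differences.append(f"{ref}{position}>{obs}")
    return differences
```

For example, `describe_differences("QVQLC", "QVKLC")` returns `["Q3>K"]`. No alignment step is needed because both strings already share the same numbering.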
Figure 4
By facilitating the comparison between sequences and by allowing the description of alleles and mutations, the IMGT unique numbering represents a big step forward in the analysis of the Ig and TcR sequences of all species. Moreover, it gives insight into the structural configuration of the variable domain and opens up interesting views on the evolution of these sequences, since this numbering has been applied with success to all the sequences belonging to the V-set of the immunoglobulin superfamily, including non-rearranging sequences in vertebrates (CD4, CTX, etc.) and invertebrates (Drosophila amalgam, Drosophila fasciclin II, etc.) (7) (graphical representations available from http://imgt.cnusc.fr:8104/textes/MPpage.html ).


Internal cross-references and data integrity
IMGT contains sequences in different biological forms (e.g., alleles, germline, rearranged or cDNA). These sequences of different biological states are cross-referenced to each other.
Control of data coherence, or data integrity, has been introduced step by step in LIGM-DB according to the semantic evolution of the data model. Data integrity is important for database management, especially when data input comes from more than one site (the core data come from EMBL, the annotations from the authors or from LIGM). It is checked, for example, that each entry is associated with IMGT standardized keywords, and that the keywords of an annotated entry are coherent with the labels of its subregions. This step is essential to maintain the quality of the database, since biological knowledge is always improving. When new rules for describing the data appear, the existing rules are updated accordingly. This allows new entries to be correctly annotated. For the backlog, we implement automatic procedures that regularly verify the coherence of the data and select pools of sequences to be updated. This avoids the introduction of data discrepancies.
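The two checks given as examples above can be sketched in Python as follows. The actual LIGM-DB rules are not listed here, so the keyword names, the label names and the mapping between them are hypothetical placeholders chosen only to show the shape of such a check.

```python
# Hypothetical rule set: each keyword requires certain labels on the entry.
REQUIRED_LABELS = {
    "V-gene": {"V-REGION"},
    "J-gene": {"J-REGION"},
}
# Hypothetical list of standardized keywords.
STANDARD_KEYWORDS = set(REQUIRED_LABELS) | {"germline", "rearranged"}


def check_entry(keywords, labels):
    """Return a list of coherence problems for one entry (empty if none).

    Check 1: the entry carries at least one standardized keyword.
    Check 2: every label required by a keyword is present on the entry.
    """
    problems = []
    known = set(keywords) & STANDARD_KEYWORDS
    if not known:
        problems.append("no standardized keyword")
    for keyword in sorted(known):
        missing = REQUIRED_LABELS.get(keyword, set()) - set(labels)
        for label in sorted(missing):
            problems.append(f"keyword {keyword!r} requires label {label!r}")
    return problems
```

Running such a function over the whole backlog and collecting the entries with a non-empty result is one way to select the pools of sequences to be updated.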
IMGT data distribution
One of the major objectives of IMGT was to provide immunologists with a user-friendly interface. The interface allows searches according to criteria specific to immunogenetics and can be used without knowledge of any programming language. With this in view, the current interface has been developed with a WWW client-server architecture (development of the WWW-SYBASE interaction) that allows users to connect easily from any type of platform (PC, Macintosh, workstation) using freeware such as Netscape. All LIGM-DB information is available through the following search criteria: taxonomy, receptor type, functionality, specificity; sources such as genes, clones, etc. (currently in development); keywords; LIGM-DB labels; accession number, mnemonic, definition, length, etc.; bibliographical references.
Since July 1995, IMGT/LIGM-DB has been available on the WWW server of CNUSC Montpellier at: http://imgt.cnusc.fr:8104
To facilitate the integration of IMGT data into applications developed by other laboratories, we are currently building an Application Programming Interface (API) to access the database and its software tools. This API includes a set of URL links to access biological knowledge data (keywords, labels, nomenclature, etc.), a set of URL links to access all data related to one given sequence, and a set of Java™ class packages to select and retrieve data from an appropriate IMGT server using an object-oriented approach.
IMGT/LIGM-DB is also available from a number of different sites as part of the SRS-WWW servers (EMBL-EBI UK, INFOBIOGEN France, The Norwegian EMBnet Node, BEN Belgian EMBnet Node, CAOS-CAMM Centre The Netherlands, etc.). The data from IMGT are also distributed to the end user by other established methods already used at the EMBL-EBI: distribution of CD-ROM; the EMBL-EBI file server, NETSERV@ebi.ac.uk; the EMBL-EBI WWW server, http://www.ebi.ac.uk/imgt ; the EMBL-EBI anonymous FTP server, ftp.ebi.ac.uk; and the Internet biogopher server, gopher.ebi.ac.uk.
From January 1996 to November 1997, the IMGT WWW server at Montpellier was accessed by >14 500 sites, with an average of 2000-2500 requests a week.
CONCLUSION
IMGT is developed by LIGM (Montpellier, France) in collaboration with CNUSC (Montpellier, France), EMBL-EBI (Hinxton, UK), ICRF (Oxford, UK), IFG (Köln, Germany), BPRC (Rijswijk, The Netherlands) and EUROGENTEC S.A. (Seraing, Belgium). The information provided by IMGT is of great value to clinicians and biological scientists in general. The main objectives for the next three years include the development of a WWW interface for direct submission of the data by the authors, the development of MHC/HLA-DB and the extension to all species. New specific databases will be developed and integrated into IMGT: a protein database for Ig and TcR, which will contain translations of potentially functional and ORF sequences from LIGM-DB and protein data from Kabat (8) and SWISS-PROT (11), and IMGT/PRIMER-DB, an oligonucleotide primer database for Ig, TcR and MHC. In the future, IMGT will include analysis of genetic data and displays of physical maps. IMGT is designed to allow common access to all immunogenetics data. Particular attention will be given to the establishment of cross-referencing links to other databases pertinent to the users of IMGT.
ACCESS AND CONTACT
CNUSC WWW server at http://imgt.cnusc.fr:8104 . Contact Denys.Chaume@cnusc.fr
EBI servers at ftp.ebi.ac.uk (folder /pub/databases/imgt). Contact malik@ebi.ac.uk
For comments and suggestions contact giudi@ligm.crbm.cnrs-mop.fr
IMGT Initiator and Coordinator: Marie-Paule Lefranc, Laboratoire d'ImmunoGénétique Moléculaire, LIGM, UMR CNRS 5535, 1919 route de Mende, 34293 Montpellier Cedex 5, France. Tel: +33 4 67 61 36 34; Fax: +33 4 67 04 02 31; Email: lefranc@ligm.crbm.cnrs-mop.fr.
ACKNOWLEDGEMENTS
We are deeply grateful to Gérard Mennessier (LIGMotif), Hans-Helmar Althaus (DNAPLOT), Valérie Barbié (PRIMER-DB), Géraldine Folch, Nathalie Pallarés, Manuel Ruiz and Dominique Scaviner (LIGM-DB), Steven Marsh and Natasja de Groot (MHC/HLA-DB). IMGT is funded by the European Union's BIOTECH programme, the CNRS (Centre National de la Recherche Scientifique), and the MENRT (Ministère de l'Education Nationale, de la Recherche et de la Technologie). Grants have been received from ARC (Association pour la Recherche sur le Cancer), ARP (Association de Recherche sur la Polyarthrite), FRM (Fondation pour la Recherche Médicale) and the Région Languedoc-Roussillon.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.


