ABSTRACT
The PROSITE database consists of biologically significant patterns and profiles
formulated in such a way that, with appropriate computational tools, it can help
to determine to which known family of proteins (if any) a new sequence belongs,
or which known domain(s) it contains.
PROSITE (1,2) is a method of determining the function of uncharacterized proteins
translated from genomic or cDNA sequences. It consists of a database of
biologically significant patterns and profiles formulated in such a way that,
with appropriate computational tools, it can rapidly and reliably determine to
which known family of proteins (if any) a new sequence belongs, or which known
domain(s) it contains.
In some cases the sequence of an unknown protein is too distantly related to any
protein of known structure to detect its resemblance by overall sequence
alignment, but relationships can be revealed by the occurrence in its sequence
of a particular cluster of residue types, variously known as a pattern,
motif, signature or fingerprint. These motifs arise because specific region(s)
of a protein, which may be important, for example, for its binding properties
or for its enzymatic activity, are conserved in both structure and sequence.
These structural requirements impose very tight constraints on the evolution of
these small but important portion(s) of a protein sequence. The use of protein
sequence patterns or profiles to determine the function of proteins is rapidly
becoming one of the essential tools of sequence analysis. This reality has
been recognized by many authors (3,4). Based on these observations, we decided in
1988 to actively pursue the development of a database of regular expression-like
patterns which would be used to search against sequences of unknown function.
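To make the idea of a regular expression-like pattern concrete, the following minimal sketch translates a pattern into a Python regular expression and scans a sequence with it. It assumes the usual PROSITE pattern conventions, which are not spelled out in this text: elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for N- and C-terminal anchors. The function names and the example pattern are illustrative, not part of any PROSITE tool.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: assumes the conventions listed in the lead-in and
    ignores rarer syntax (for example anchors placed inside brackets).
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            body = "."                       # any residue
        elif core.startswith("{"):
            body = "[^" + core[1:-1] + "]"   # excluded residues
        else:
            body = core                      # single residue or [allowed set]
        if low and high:
            body += "{" + low + "," + high + "}"
        elif low:
            body += "{" + low + "}"
        parts.append(prefix + body + suffix)
    return "".join(parts)


def scan(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.end(), m.group()) for m in regex.finditer(sequence)]
```

For example, `scan("MKCAACAAALAAAAAAAAHAAAH", "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")` reports a single match starting at position 2 (0-based). Note that `re.finditer` only reports non-overlapping matches, which may differ from how a dedicated pattern scanner counts occurrences.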
But, while sequence patterns are very useful, there are a number of protein
families as well as functional or structural domains that cannot be detected
using patterns due to their extreme sequence divergence. Typical examples of
important functional domains which are weakly conserved are the globin,
immunoglobulin, SH2 and SH3 domains. In such domains there are only a few
sequence positions which are well conserved. Any attempt to build a consensus
pattern for such regions will either fail to pick up a significant proportion
of the protein sequences that contain such a region (false negatives) or will
pick up too many proteins that do not contain the region (false positives).
The use of techniques based on profiles or weight matrices (the two terms are
used synonymously here) allows the detection of such proteins or domains. A
profile is a table of position-specific amino acid weights and gap costs. These numbers (also referred to
as scores) are used to calculate a similarity score for any alignment between a
profile and a sequence, or parts of a profile and a sequence. An alignment with
a similarity score higher than or equal to a given cut-off value constitutes a
motif occurrence. As with patterns, there may be several matches to a profile in
one sequence, but multiple occurrences in the same sequence must be disjoint
(non-overlapping) according to a specific definition included in the profile.
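The scoring idea described above can be sketched in a deliberately simplified form. The example below scans a sequence with an ungapped position-specific weight matrix, keeps windows whose score reaches the cut-off, and then greedily retains the best-scoring non-overlapping hits. It is only an illustration: real PROSITE profiles also carry gap costs and are aligned by dynamic programming, and the disjointness rule used here (no shared positions) is an assumption, as are the function name and the default weight for unlisted residues.

```python
from typing import Dict, List, Tuple

# One dictionary of residue -> weight per profile position (toy representation).
Profile = List[Dict[str, int]]


def scan_profile(seq: str, profile: Profile, cutoff: int,
                 default: int = -2) -> List[Tuple[int, int]]:
    """Return (start, score) for non-overlapping windows scoring >= cutoff.

    Ungapped simplification: each window is scored by summing the weight of
    the residue found at every profile position.
    """
    width = len(profile)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(profile, window))
        if score >= cutoff:
            hits.append((score, start))

    # Greedy selection: best score first, ties broken by leftmost position.
    hits.sort(key=lambda h: (-h[0], h[1]))
    chosen: List[Tuple[int, int]] = []
    for score, start in hits:
        disjoint = all(start + width <= s or s + width <= start
                       for _, s in chosen)
        if disjoint:
            chosen.append((score, start))
    return sorted((start, score) for score, start in chosen)
```

With the toy profile `[{"C": 5}, {"A": 2, "G": 2}, {"H": 5}]` and a cut-off of 10, the sequence `"CAHCGHKKK"` yields two disjoint occurrences, at positions 0 and 3, each scoring 12.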
Another feature that distinguishes patterns from profiles is that the latter are
usually not confined to small regions with high sequence similarity. Rather
they attempt to characterize a protein family or domain over its entire length.
We therefore started in 1994 to complement the approach based on patterns by
gradually adding profile entries to PROSITE. The profile structure (5,6) used in
PROSITE is similar to but slightly more general than the one introduced by
Gribskov and co-workers (7); additional parameters allow representation of other
motif descriptors, including the currently popular hidden Markov models (8).
Profiles can be constructed by a large variety of different techniques. The
classical method developed by Gribskov and co-workers (9) requires a multiple
sequence alignment as input and uses a symbol comparison table to convert
residue frequency distributions into weights. Most profiles included in PROSITE
are generated by this procedure, applying recently described modifications
(10,11). In some cases we also applied alternative profile construction methods,
including structure-based approaches and methods involving hidden Markov modelling.
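The conversion of residue frequencies into weights can be illustrated with a minimal sketch. It follows one common reading of the classical averaging scheme: the weight of residue a at a position is the sum, over the residues b observed in that alignment column, of the frequency of b times the comparison-table score for the pair (a, b). The three-letter alphabet and the scores in `toy_table` are hypothetical placeholders rather than a real substitution matrix, and the modifications cited as (10,11) are not modelled.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence

Table = Dict[FrozenSet[str], float]


def toy_table(alphabet: str) -> Table:
    """Hypothetical symmetric comparison table: 4 on the diagonal, -1 elsewhere,
    except the pair C/D, which is given a score of 1."""
    table = {frozenset((a, b)): (4.0 if a == b else -1.0)
             for a in alphabet for b in alphabet}
    table[frozenset(("C", "D"))] = 1.0
    return table


def build_profile(alignment: Sequence[str], table: Table,
                  alphabet: str) -> List[Dict[str, float]]:
    """Turn a gapped multiple alignment into position-specific weights."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        # Observed residues in this column; gap characters are ignored.
        column = [row[pos] for row in alignment if row[pos] != "-"]
        counts = Counter(column)
        total = len(column)
        weights = {
            a: sum(counts[b] / total * table[frozenset((a, b))] for b in counts)
            for a in alphabet
        }
        profile.append(weights)
    return profile
```

For the two-sequence alignment `["AC", "AD"]`, the second column is half C and half D, so both C and D receive a weight of 2.5 there while A receives -1.0.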
The design of PROSITE follows five leading concepts:
Completeness. For such a compilation to be helpful in the determination of
protein function, it is important that it contains as many biologically
meaningful patterns and profiles as possible.
High specificity. In the majority of cases we have chosen patterns or profiles
that are specific enough that they do not detect too many unrelated sequences,
yet they will detect most, if not all, sequences that clearly belong to the set
under consideration.
The core of the PROSITE database is composed of two ASCII (text) files. The
first file (PROSITE.DAT) is a computer-readable file that contains all the information necessary for programs
that make use of PROSITE to scan sequence(s) for the occurrence of the patterns
and/or profiles. This file also includes, for each entry described, statistics
on the number of hits obtained while scanning for that pattern or profile in
SWISS-PROT. Cross-references to the corresponding SWISS-PROT entries are also present in the file. The second file
(PROSITE.DOC), which we call the textbook, contains textual information that
documents each pattern.
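As a rough illustration of how a program might read such a computer-readable file, the sketch below splits a text into entries and groups lines by their leading code. It assumes a SWISS-PROT-like layout (a two-letter line code, three spaces, then content, with `//` terminating each entry) and line codes such as ID, AC, DE and PA; this layout is an assumption here, since the format is not described in this text. The entry used in the usage note is hypothetical.

```python
from typing import Dict, List

Entry = Dict[str, List[str]]


def parse_prosite_dat(text: str) -> List[Entry]:
    """Split PROSITE.DAT-style text into entries keyed by two-letter line code.

    Assumed layout: 'XX   content' lines, entries terminated by '//'.
    """
    entries: List[Entry] = []
    current: Entry = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = {}
            continue
        code, content = line[:2], line[5:].strip()
        if not code.strip() or not content:
            continue  # skip blank or malformed lines
        current.setdefault(code, []).append(content)
    if current:
        entries.append(current)
    return entries


def pattern_of(entry: Entry) -> str:
    """Join the (possibly multi-line) PA field and drop the final period."""
    return "".join(entry.get("PA", [])).rstrip(".")
```

Given a hypothetical entry whose lines are `ID   EXAMPLE_SITE; PATTERN.`, `AC   PS99999;`, `PA   C-x-[DN]-x(4)-`, `PA   [FY]-x-C-x-C.` and `//`, `pattern_of` returns `C-x-[DN]-x(4)-[FY]-x-C-x-C`.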
A sample textbook entry is shown (Fig. 1a); this particular entry is linked to
two entries in the PROSITE.DAT file: a pattern and a profile (Fig. 1b).
Several document files are also distributed with the database:
PROSUSER.TXT   The database user's manual
PROFILE.TXT    A detailed description of the syntax for the profiles
PROSITE.LIS    A list of PROSITE documentation entries
PROSITE.GET    A document on how to obtain a local copy of PROSITE
PROSITE.PRG    A description of programs and electronic mail servers that make use of PROSITE
PAUTINDX.TXT   An index of authors cited in the PROSITE.DOC file
Release 13.2 of PROSITE (October 1996) contains 936 documentation entries
describing 1225 different patterns, rules and profiles. The list of the entries
which have been added since the publication of the previous article (2)
describing PROSITE is provided in Table 1. The database requires ~5 Mb of disk
storage space. The present distribution frequency is two
releases per year. No restrictions are placed on use or redistribution of the
data.
Table 1
PROSITE is distributed on CD-ROM by the EMBL Outstation - the European
Bioinformatics Institute (EBI) (13). For all enquiries regarding the
subscription and distribution of PROSITE one should contact: The EMBL
Outstation - The European Bioinformatics Institute, Wellcome Trust Genome
Campus, Hinxton, Cambridge CB10 1SD, UK. Telephone: (+44 1223) 494 400;
Telefax: (+44 1223) 494 468; Email: datalib@ebi.ac.uk
If you have access to a computer system linked to the Internet you can obtain
PROSITE using FTP (File Transfer Protocol), from the following file servers:
EBI anonymous FTP server
Internet address: ftp.ebi.ac.uk (or 192.54.41.33)
NCBI Repository (National Library of Medicine, NIH, Washington, DC, USA)
Internet address: ncbi.nlm.nih.gov (or 130.14.20.1)
ExPASy (Expert Protein Analysis System) server, University of Geneva,
Switzerland
Internet address: expasy.hcuge.ch (or 129.195.254.61)
National Institute of Genetics (Japan) FTP server
Internet address: ftp2.ddbj.nig.ac.jp (or 133.39.3.6)
PROSITE can be obtained from the EBI network fileserver. Detailed instructions
on how to make the best use of this service, and in particular on how to obtain
PROSITE, can be obtained by sending to the network address netserv@ebi.ac.uk
the following message:
HELP
HELP PROSITE
Many academic groups and commercial companies have developed computer programs
that make use of the pattern entries in PROSITE. The `PROSITE.PRG' file
contains a full list of these programs, their operating system specificity,
characteristics as well as information on how to obtain them.
To make use of profile entries, we are distributing, under the name `pftools',
the source code (in FORTRAN77) of two programs that should help software
developers to implement profile-specific routines in their application(s):
pfscan
Loads a sequence from a file and scans it with all (or one) of the PROSITE profiles.
pfsearch
Loads a profile from a file and scans for it in a SWISS-PROT database file.
These tools are available by anonymous FTP from the server ulrec3.unil.ch in the
directory /pub/pftools.
There are many email servers that are available to molecular biologists (14).
At least three of these servers can be used in conjunction with PROSITE:
Name: EBI Mail-PROSITE Server
Organization: European Bioinformatics Institute, Hinxton, UK
Description: Allows rapid comparison of a new protein sequence against all
patterns stored in PROSITE.
Server email address: prosite@ebi.ac.uk
Address to report problems: nethelp@ebi.ac.uk
Name: BLOCKS e-mail searcher
Organization: Fred Hutchinson Center, Seattle, WA, USA
Description: Compares a protein or DNA sequence to the database of protein
blocks. Blocks are short multiply aligned ungapped segments corresponding to the
most highly conserved regions of proteins. The BLOCKS database (15) has been
derived from PROSITE. This server can also be used to retrieve specific blocks
and PROSITE entries.
Server email address: blocks@howard.fhcrc.org
Address to report problems: henikoff@howard.fhcrc.org
Name: MOTIF E-Mail Server on GenomeNet
Organization: Supercomputer Laboratory, Kyoto Inst. for Chemical Research, Japan
Description: Allows rapid comparison of a new protein sequence against all
patterns stored in PROSITE as well as in the MotifDic library (16).
Server email address: motif@genome.ad.jp
Address to report problems: motif-manager@genome.ad.jp
The most efficient and user-friendly way to browse PROSITE interactively, as
well as to analyze a sequence for the occurrence of a pattern or a profile, is
to use the World Wide Web (WWW) molecular biology server ExPASy (17). WWW is a
global information retrieval system merging the power of world-wide networks,
hypertext and multimedia. Through hypertext links, it gives access to documents
and information available on thousands of servers around the world. To access a
WWW server one needs a WWW browser [such as Mosaic(TM), Netscape Navigator(TM)
or Microsoft Internet Explorer(TM)]. Using a WWW browser, one has access to all
the hypertext documents stored on the ExPASy server (as well as many other WWW
servers) and can also make use of many sequence analysis software tools.
The ExPASy server may be accessed through its Uniform Resource Locator (URL, the addressing system defined in WWW), which is:
http://expasy.hcuge.ch/
You can directly access the `top' page of the section of ExPASy that allows
you to browse through the PROSITE documentation and data entries by opening the
URL:
http://expasy.hcuge.ch/sprot/prosite.html
To use the PROSITE patterns and profiles, you can make use of the following
software tools:
ScanProsite, which allows one either to scan a protein sequence (from SWISS-PROT
or provided by the user) for the occurrence of patterns stored in PROSITE, or to
scan the SWISS-PROT database (including weekly releases) for the occurrence of a
pattern that can originate from PROSITE or be provided by the user. The URL for
ScanProsite is:
http://expasy.hcuge.ch/sprot/scnpsite.html
ProfileScan, which allows one to scan a protein sequence (from SWISS-PROT or
provided by the user) for the occurrence of profiles stored in PROSITE. The URL
for ProfileScan is:
http://ulrec3.unil.ch/software/profilescan.html
Table 1. Entries added to PROSITE since the publication of the previous article (2)
Anaphylatoxin domain signature and profile
C-terminal cystine knot signature and profile
CUB domain profile
Calcium-binding EGF-like domain signature
LDL-receptor class A (LDLRA) domain signature
Phosphotyrosine interaction domain (PID) profile
VWFC domain signature
NF-kappa-B/Rel/dorsal family signature
Ribosomal protein L1 signature
Ribosomal protein L17 signature
Ribosomal protein L21 signature
Ribosomal protein L6e signature
Ribosomal protein L15e signature
Ribosomal protein L21e signature
Ribosomal protein L36e signature
Ribosomal protein L44e signature
Ribosomal protein S21 signature
Ribosomal protein S3Ae signature
Ribosomal protein S8e signature
Ribosomal protein S12e signature
Ribosomal protein S27e signature
Short-chain dehydrogenases/reductases family signature
N-acetyl-gamma-glutamyl-phosphate reductase active site
Gamma-glutamyl phosphate reductase signature
Copper amine oxidase signatures
RNA polymerases beta chain signature
Lipolytic enzymes `G-D-X-G' family, putative active sites signatures
Peptidyl-tRNA hydrolase signatures
Ribonuclease II family signature
Glycosyl hydrolases family 35 putative active site
Iodothyronine deiodinases active site
Protozoan/cyanobacterial globins signature
Phosphatidylethanolamine-binding protein family signature
Amiloride-sensitive sodium channels signature
Ammonium transporters signature
GNS1/SUR4 family signature
Caveolins signature
ATP P2X receptors signature
Initiation factor 2 signature
PMP-22/EMP/MP20 family signatures
Glypicans signature
Tub family signatures
Mrp family signature
Hypothetical YBL036c/yggS family signature
Hypothetical YBR187w/SLL0615 family signature
Hypothetical YML110c/yigO family signatures
Hypothetical yigU/ycbT/ycf43 family signature
