ABSTRACT
The PROSITE database consists of biologically significant patterns and profiles
formulated in such a way that with appropriate computational tools it can help
to determine to which known family of proteins (if any) a new sequence belongs
or which known domain(s) it contains.
PROSITE (1,2) is a method of determining the function of uncharacterized proteins translated
from genomic or cDNA sequences. It consists of a database of biologically
significant patterns and profiles formulated in such a way that with
appropriate computational tools it can rapidly and reliably determine to which
known family of proteins (if any) the new sequence belongs or which known
domain(s) it contains.
In some cases the sequence of an unknown protein is too distantly related to any
protein of known structure to detect its resemblance by overall sequence
alignment, but relationships can be revealed by the occurrence in its sequence
of a particular cluster of residue types, which is variously known as a
pattern, motif, signature or fingerprint. These motifs arise because specific
region(s) of a protein which may be important, for example for their binding
properties or for their enzymatic activity, are conserved in both structure and
sequence. These structural requirements impose very tight constraints on the
evolution of these small but important portions of a protein sequence. The use
of protein sequence patterns or profiles to determine the function of proteins
is rapidly becoming one of the essential tools of sequence analysis, as many
authors have recognized (3,4). Based on these observations, we decided in 1988
to actively pursue the development of a database of regular expression-like
patterns that could be used to search against sequences of unknown function.
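To make the idea of a regular expression-like pattern concrete, the following minimal sketch translates a pattern written in the commonly documented PROSITE notation into a Python regular expression and searches a sequence with it. It assumes the usual conventions (elements separated by `-`, `x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence ends). The function names `prosite_to_regex` and `find_matches` are hypothetical and are not part of any PROSITE software; anchors placed inside brackets are not handled.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # Bracketed sets and single letters are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group(0)) for m in regex.finditer(sequence)]


# Example: an N-glycosylation-style pattern against a short made-up sequence.
print(find_matches("N-{P}-[ST]-{P}.", "AANGSAK"))
```

Note that `re.finditer` reports only non-overlapping matches; a scanner that needs overlapping occurrences would have to restart the search one residue after each hit.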
However, while sequence patterns are very useful, there are a number of protein
families, as well as functional or structural domains, that cannot be detected
using patterns, due to their extreme sequence divergence. Typical examples of
important functional domains which are weakly conserved are the globin, the
immunoglobulin, the SH2 and the SH3 domains. In such domains there are only a
few sequence positions which are well conserved. Any attempt to build a
consensus pattern for such regions will either fail to pick up a significant
proportion of the protein sequences that contain such a region (false
negatives) or will pick up too many proteins that do not contain the region
(false positives).
The use of techniques based on profiles or weight matrices (the two terms are
used synonymously here) allows detection of such proteins or domains. A profile
is a table of position-specific amino acid weights and gap costs. These numbers
(also referred to as scores) are used to calculate a similarity score for any
alignment between a profile and a sequence or parts of a profile and a
sequence. An alignment with a similarity score higher than or equal to a given
cut-off value constitutes a motif occurrence. As with patterns, there may be
several matches to a profile in one sequence, but multiple occurrences in the
same sequence must be disjoint (non-overlapping) according to a specific
definition included in the profile.
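The scoring rule described above can be illustrated with a deliberately simplified sketch. It treats a profile as a list of per-position weight tables, scores every ungapped window of the sequence, keeps windows whose score reaches the cut-off, and then retains only disjoint hits, preferring higher scores. This is an assumption-laden toy: gap costs are omitted (so no dynamic programming is needed), the `default` weight for unlisted residues and the greedy overlap rule are choices made for the example, and `scan_profile` is a hypothetical name rather than a PROSITE tool.

```python
from typing import Dict, List, Tuple

Profile = List[Dict[str, float]]  # one residue -> weight table per profile position


def scan_profile(profile: Profile, sequence: str, cutoff: float,
                 default: float = -1.0) -> List[Tuple[int, float]]:
    """Return disjoint (0-based start, score) hits with score >= cutoff."""
    width = len(profile)
    candidates = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Ungapped alignment score: sum of position-specific weights.
        score = sum(col.get(res, default) for col, res in zip(profile, window))
        if score >= cutoff:
            candidates.append((score, start))

    # Greedy selection: best score first, skipping anything that overlaps
    # a hit that has already been accepted.
    accepted: List[Tuple[int, float]] = []
    for score, start in sorted(candidates, key=lambda c: (-c[0], c[1])):
        if all(start + width <= s or s + width <= start for s, _ in accepted):
            accepted.append((start, score))
    return sorted(accepted)


# Toy three-position profile with made-up weights.
toy_profile = [
    {"G": 2.0},
    {"K": 2.0, "R": 1.0},
    {"S": 2.0, "T": 2.0},
]
print(scan_profile(toy_profile, "AGKSGRTGKT", cutoff=5.0))
```

With gap costs included, the window sum would be replaced by an alignment algorithm that also charges for insertions and deletions, but the cut-off test and the disjointness requirement would play the same role.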
Another feature that distinguishes patterns from profiles is that the latter
are usually not confined to small regions with high sequence similarity.
Rather, they attempt to characterize a protein family or domain over its entire
length.
We therefore started in 1994 to complement the approach based on patterns by
gradually adding profile entries to PROSITE. The profile structure (5,6) used in
PROSITE is similar to, but slightly more general than, that introduced by
Gribskov and co-workers (7); additional parameters allow representation of
other motif descriptors, including the currently popular hidden Markov models.
Profiles can be constructed by a large variety of different techniques. The
classical method developed by Gribskov and co-workers (8) requires a multiple
sequence alignment as input and uses a symbol comparison table to convert
residue frequency distributions into weights. The profiles included in PROSITE
are generated by this procedure, applying recently described modifications
(9,10). In the future we intend to apply additional profile construction tools,
including structure-based approaches and methods involving hidden Markov
modelling.
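As a rough illustration of how a symbol comparison table can convert residue frequency distributions into weights, the sketch below computes, for each alignment column, the weight of a residue as the frequency-weighted average of its comparison scores against the residues observed in that column. This follows the spirit of the classical approach only; the three-letter alphabet, the comparison values and the function names are invented for the example, and neither gap costs nor the modifications applied to PROSITE profiles are modelled.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical symmetric comparison table for a toy alphabet (not PAM or BLOSUM).
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "I": {"I": 4.0, "L": 2.0, "K": -3.0},
    "L": {"I": 2.0, "L": 4.0, "K": -2.0},
    "K": {"I": -3.0, "L": -2.0, "K": 5.0},
}


def column_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Residue frequencies per alignment column, ignoring gap characters."""
    freqs = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts.values())
        freqs.append({r: n / total for r, n in counts.items()} if total else {})
    return freqs


def build_profile(alignment: List[str],
                  table: Dict[str, Dict[str, float]]) -> List[Dict[str, float]]:
    """Weight of residue a in a column = sum over b of freq(b) * table[a][b]."""
    profile = []
    for freq in column_frequencies(alignment):
        profile.append({
            a: sum(f * table[a][b] for b, f in freq.items())
            for a in table
        })
    return profile


toy_alignment = ["IK", "LK", "IK", "IK"]
for position, weights in enumerate(build_profile(toy_alignment, TOY_TABLE), start=1):
    print(position, weights)
```

Because the comparison table gives partial credit to similar residues, a residue that never appears in a column (for example L in a column of I) can still receive a positive weight, which is what lets a profile recognize divergent family members.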
The design of PROSITE follows four leading concepts.
For such a compilation to be helpful in the determination of protein function
it is important that it contains as many biologically meaningful patterns and
profiles as possible.
Table 1
In the majority of cases we have chosen patterns or profiles that are specific
enough that they do not detect too many unrelated sequences, yet will detect
most, if not all, sequences that clearly belong to the set under
consideration.
Each of the entries in PROSITE is fully documented. The documentation includes a
concise description of the protein family that it is designed to detect, as
well as a summary of the reasons leading to the development of the pattern or
profile.
It is important that each entry be periodically reviewed to ensure that it is
still valid.
The core of the PROSITE database is composed of two ASCII (text) files. The
first file (prosite.dat) is a computer readable file that contains all the
information necessary for programs that make use of PROSITE to scan sequences
for the occurrence of the patterns and/or profiles. This file also includes,
for each of the entries described, statistics on the number of hits obtained
while scanning for that pattern or profile in the SWISS-PROT protein sequence
database (9). Cross-references to the corresponding SWISS-PROT entries are also
present in the file. The second file (prosite.doc), which we call the textbook,
contains textual information that documents each pattern.
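For readers who want to process the computer readable file, the sketch below shows one way to split it into entries. It assumes the line-oriented layout used in distributed copies of prosite.dat: each line begins with a two-character line code (for instance ID, AC, DE, PA) followed by data starting at the sixth column, and each entry ends with a `//` line. The embedded sample is abridged for illustration, and `parse_entries` and `summarize` are hypothetical helper names, not part of any PROSITE distribution.

```python
from typing import Dict, List

# Abridged sample entry in the assumed layout (illustration only).
SAMPLE = """\
ID   ASN_GLYCOSYLATION; PATTERN.
AC   PS00001;
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
//
"""


def parse_entries(text: str) -> List[Dict[str, List[str]]]:
    """Group lines by their two-character code; '//' closes an entry."""
    entries: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if "ID" in current:  # skip header or comment-only blocks
                entries.append(current)
            current = {}
            continue
        code, data = line[:2].strip(), line[5:].strip()
        if code:
            current.setdefault(code, []).append(data)
    return entries


def summarize(entry: Dict[str, List[str]]) -> Dict[str, str]:
    """Pull out the entry name, entry type, accession and pattern text."""
    name, kind = [part.strip(" .") for part in entry["ID"][0].split(";", 1)]
    return {
        "name": name,
        "type": kind,
        "accession": entry["AC"][0].rstrip(";"),
        "pattern": "".join(entry.get("PA", [])),  # patterns may span several PA lines
    }


for entry in parse_entries(SAMPLE):
    print(summarize(entry))
```

Keeping every line code as a list of strings, rather than interpreting each field up front, leaves the hit statistics and SWISS-PROT cross-reference lines available to callers without committing this sketch to a particular reading of their internal syntax.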
C1q domain signature
Death domain profile
Forkhead-associated (FHA) domain profile
Src homology 2 (SH2) domains profile
Src homology 3 (SH3) domains profile
S-layer homology domain signature
TPR repeat profile
WW domain signature and profile
Prokaryotic dksA/traR C4-type zinc finger
PHD finger profile
Copper-fist domain
Bacterial regulatory proteins, iclR family signature
Bacterial regulatory proteins, marR family signature
Sigma-70 factors ECF subfamily signature
Ribosomal protein L10 signature
Ribosomal protein L24 signature
Ribosomal protein L31 signature
Ribosomal protein L7Ae signature
Ribosomal protein L13e signature
Ribosomal protein L18e signature
Ribosomal protein L24e signature
Ribosomal protein L27e signature
Ribosomal protein L31e signature
Ribosomal protein L34e signatures
Ribosomal protein L35Ae signature
Ribosomal protein L37e signature
Ribosomal protein S6 signature
Homoserine dehydrogenase signature
Aspartate-semialdehyde dehydrogenase signature
Pyridoxamine 5'-phosphate oxidase signature
Respiratory chain NADH dehydrogenase 20 kDa subunit signature
Respiratory chain NADH dehydrogenase 24 kDa subunit signature
NNMT/PNMT/TEMT family of methyltransferases signature
Ribosomal RNA adenine dimethylases signature
Squalene and phytoene synthases signatures
ROK family signature
Casein kinase II regulatory subunit signature
Shikimate kinase signature
Prokaryotic diacylglycerol kinase signature
Acetate and butyrate kinases family signatures
RNA polymerases H 23 kDa subunits signature
RNA polymerases N 8 kDa subunits signature
RNA polymerases L 13-16 kDa subunits signature
RNA polymerases RPB6 6 kDa subunits signature
Lipolytic enzymes `G-D-S-L' family, serine active site
DNA/RNA non-specific endonucleases active site
Thermonuclease family signature
Chitinases family 18 signature
Glycosyl hydrolases family 45 active site
ATP-dependent serine proteases, lon family, serine active site
Interleukin-1[beta] converting enzyme family active sites
Hydroxymethylglutaryl-coenzyme A lyase active site
DNA photolyases class 2 signatures
Adenylate cyclases class-I signatures
Ribulose-phosphate 3-epimerase family signatures
PpiC-type peptidyl-prolyl cis-trans isomerase signature
Terpene synthases signature
SAICAR synthetase signatures
NAD-dependent DNA ligase signatures
Transposases, IS30 family, signature
Molybdenum cofactor biosynthesis proteins signatures
Radical activating enzymes signature
Electron transfer flavoprotein beta-subunit signature
Heavy metal-associated domain
Sulfate transporters signature
Xanthine/uracil permeases family signature
OmpA-like domain
GPR1/FUN34/yaaH family signature
FtsZ protein signatures
Bacterial microcompartiments proteins signature
Flagella transport protein fliP family signatures
Scorpion short toxins signature
grpE protein signature
Bacterial type II secretion system protein C signature
Bacterial type II secretion system protein N signature
Protein secE/sec61-gamma signature
Fimbrial biogenesis outer membrane usher protein signature
Apoptosis regulator proteins, Bcl-2 family signature
GTP-binding nuclear protein ran signature
Elongation factor Ts signatures
Translation initiation factor SUI1 signature
Calponin family repeat
CAP protein signatures
Hydrogenases expression/synthesis hupF/hypC family signature
NOL1/NOP2/fmu family signature
Hypothetical SUA5/yciO/yrdC family signature
Hypothetical YBL055c/yjjV family signatures
Hypothetical YBR002c family signature
Hypothetical YBR177c/yheT family signature
Hypothetical YER057c/yjgF family signature
Hypothetical YKL151c/yjeF family signatures
Hypothetical hesB/yadR/yfhF family signature
Hypothetical yabO/yceC/yfiI family signature
Hypothetical yciL/yejD/yjbC family signature
Hypothetical yedF/yeeD/yhhP family signature
Hypothetical yhdG/yjbN/yohI family signature
REFERENCES
