Liisa Holm and Chris Sander, Protein Design Group, European Molecular Biology
Laboratory, D-69012 Heidelberg, Europe
FSSP (families of structurally similar proteins)
is a database of structural alignments of proteins in the Protein Data Bank
(PDB) [1]. The database currently contains an extended structural family for
each of 330 representative protein chains. Each data set contains structural
alignments of one search structure with all other structurally significantly
similar proteins in the representative set (remote homologs, < 30 %
sequence identity), as well as all structures in the Protein Data Bank with
70-30 % sequence identity relative to the search structure (medium homologs).
Very close homologs (above 70 % sequence identity) are excluded as they rarely
have marked structural differences. The alignments of remote homologs are the
result of pairwise all-against-all structural comparisons in the set of 330
representative protein chains. All such comparisons are based purely on the 3D
co-ordinates of the proteins and are derived by automatic (objective) structure
comparison programs. The significance of structural similarity is estimated
based on statistical criteria. The FSSP database is available electronically
from the EMBL file server and by anonymous ftp (file transfer protocol).
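The three sequence-identity classes used above can be written as a small decision rule. The following Python fragment is a minimal sketch only; the function name `homolog_class` is hypothetical, and assigning the boundary values of exactly 30 % and 70 % to the medium class is an assumption, since the text does not specify how ties are treated.

```python
def homolog_class(percent_identity: float) -> str:
    """Classify a homolog by its sequence identity to the search structure.

    Thresholds follow the text: below 30 % is remote, 30-70 % is medium,
    above 70 % is very close. Boundary handling is an assumption.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity < 30.0:
        return "remote"      # reported when structurally significant
    if percent_identity <= 70.0:
        return "medium"      # reported from the whole PDB
    return "very close"      # excluded from the data sets


for pid in (12.0, 33.0, 85.0):
    print(pid, homolog_class(pid))
```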
It has been estimated that the biochemistry of all living organisms involves
no more than 1,000 divergently related protein families [2]. A majority of
newly determined protein sequences can be classified into families by
detectable sequence homology. The HSSP database of sequence alignments [3]
shows that at least 26% of known sequences deposited in public databases (not
counting cDNA fragments) have a relative of known 3D structure. However,
protein families are known to retain the shape of the fold even when sequences
have diverged below the limit of detection of significant similarities at the
sequence level. Structural comparisons merge protein families of known 3D
structure into structural classes, the members of which may or may not be
evolutionarily related [4-7]. The FSSP database of structural alignments
provides a rich source of information for the study of both divergent and
convergent aspects of the evolution of protein folds.
The FSSP data sets have a wide field of applications. These include
studies to discover remote evolutionary connections in the twilight zone of
sequence similarity; to build a multiple alignment of remotely related families
for the generation of sequence profiles or sequence patterns that may identify
additional remote relatives in sequence databases [8-9]; to classify folds,
such as TIM barrels, in order to study their structural principles [10]; to
define structural cores for sequence-structure alignment (T. Smith, pers.
comm.), for modular construction of novel proteins, or for model building
by homology [11]; to test the accuracy of sequence alignment methods (B. Rost
and R. Schneider, pers. comm.); or, to use test sets of remotely
homologous pairs for fold recognition (M. Sippl, pers. comm.) and to
extract representative data sets for statistical structural analyses [12].
Other uses are limited only by the user's imagination.
For a protein chain in the representative set, with PDB identifier Nxxx (e.g.,
1PPT, 5PCY) and chain identifier Y (omitted if blank), there is an ASCII (text)
file Nxxx.FSSP or NxxxY.FSSP, which lists from a few to tens of proteins
structurally similar to the search structure (Z-score above 2 in the pairwise
structural comparison, see below), alongside the secondary structure and
solvent accessibility extracted from the 3D coordinates of the search structure
[13]. The structurally equivalent residues are reported in the form of a
multiple alignment and as a list of matching fragments and can be inspected
using three-dimensional graphics. The co-ordinates must be retrieved separately
from the corresponding PDB data sets, e.g. Nxxx.PDB. Details about the methods
used to derive the database are given in [14,15].
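The file naming rule described above (Nxxx.FSSP, or NxxxY.FSSP when a chain identifier is present) can be sketched as follows. The helper name `fssp_filename` is hypothetical, and converting the PDB identifier to upper case is an assumption made to match the spelling used in the text; the names actually stored on a server may differ in case.

```python
def fssp_filename(pdb_id: str, chain: str = "") -> str:
    """Build an FSSP file name from a PDB identifier and optional chain."""
    pdb_id = pdb_id.strip().upper()   # upper case is an assumption
    chain = chain.strip()             # a blank chain identifier is omitted
    if len(pdb_id) != 4 or not pdb_id[0].isdigit():
        raise ValueError("expected a four-character PDB identifier such as 1PPT")
    if len(chain) > 1:
        raise ValueError("chain identifier must be a single character")
    return f"{pdb_id}{chain}.FSSP"


print(fssp_filename("1PPT"))
print(fssp_filename("1cob", "B"))
```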
Figure 1 shows an example dataset from FSSP, that for the SH3 domain of
chicken brain alpha-spectrin (1SHG.FSSP). General information about the
structure and notation are given at the top of the dataset. The dataset
contains 5 (NALIGN) structurally aligned proteins which are listed in the '##
PROTEINS' section. 1SHF-A is the homologous SH3 domain from fyn
(PROTEIN column) and is aligned with a positional root mean square deviation of
1.6 Å (RMSD column) over 57 residues (LALI column) and has 33 % sequence
identity after structural alignment (%IDE column). The other structural
homologs are two more SH3 domains (1HSP is misannotated in the PDB), actinidin,
and biotin repressor. Some structural details are given in the '## ALIGNMENTS'
section. Residue W42 (Trp) of 1SHG is in a beta-strand (E) and has a solvent
accessibility (ACC column) of 39 Å². W42 has a structurally
equivalent residue in 5 (NOCC) of the aligned structures, of which three are
tryptophans (W), two are leucines (L), and all five are in beta-strands (b or e
in We, Wb, Le, We, Le). Finally, the '## FRAGMENTS' section says that to
superimpose the 3D coordinates of 1SHG with those of 1SHF, residues 6-46 and
47-62 of 1SHG should be equivalenced with residues A84-A124 and A127-A142 of
1SHF.
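The fragment list quoted above can be expanded into explicit residue-to-residue equivalences, which also serves as a check that the two fragments add up to the 57 aligned residues reported in the LALI column (41 + 16 = 57). The sketch below hard-codes the ranges given in the text; the data layout and the function name `equivalences` are illustrative assumptions and do not reflect the on-disk FSSP format.

```python
# Ranges as quoted in the text for 1SHG (search structure) versus 1SHF chain A.
FRAGMENTS_1SHG_1SHF = [
    ((6, 46), ("A", 84, 124)),
    ((47, 62), ("A", 127, 142)),
]


def equivalences(fragments):
    """Expand matching fragments into (query residue, target residue) pairs."""
    pairs = []
    for (q_start, q_end), (chain, t_start, t_end) in fragments:
        query = range(q_start, q_end + 1)
        target = range(t_start, t_end + 1)
        if len(query) != len(target):
            raise ValueError("matching fragments must have equal length")
        pairs.extend((q, f"{chain}{t}") for q, t in zip(query, target))
    return pairs


pairs = equivalences(FRAGMENTS_1SHG_1SHF)
print(len(pairs))           # 57, the LALI value quoted in the text
print(pairs[0], pairs[-1])  # (6, 'A84') (62, 'A142')
```

Such a pair list is the input needed by any least-squares superposition routine to superimpose the two coordinate sets.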
The default files (Nxxx.FSSP) contain structural alignments generated by the
program Dali [15] and are constrained to preserve sequential ordering of the
aligned segments. Alignments optimized allowing topological permutations (loop
reconnections and chain reversals) are available in files Nxxx_dali.FSSP.
Alignments using other methods are available in datasets Nxxx_suppos.FSSP and
Nxxx_comp3D.FSSP [14].
To aid navigation in the database, the 330 protein chains contained in the
representative set have been clustered into fold families (Table I). A
dendrogram of the families was produced by average linkage clustering based on
structural similarity scores [15]. Chain length effects were corrected for by
transforming the pairwise similarities into statistical significance scores
(Z-scores). Families and subfamilies result from truncating the tree at
different cut levels of Z-score. The higher the cut, the larger the resulting
number of distinct fold families (Figure 2). 142 families resulting from the
cut at an average Z-score of 2 are numbered in the first column of Table I.
Second and further members of a family are indicated by indentation relative to
the first member at the given level of significance. For example, if one
decided to derive a more refined selection of fold families using a Z-score
cutoff of 3 instead of 2, then the set of families should be expanded by all
subfamilies that are indented by one letter space in Table I, yielding a total
of 168 families. The most refined selection possible in the representative set
would place each of the 330 chains in a distinct family, but even a cut as high
as a Z-score of 10 yields only 255 families (Figure 2).
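The effect of the cut level on the number of families can be illustrated with a toy example. The sketch below is a plain average linkage clustering on pairwise Z-scores that stops merging once the best average score falls below the cut; it is not the program used to build Table I, and the five chain labels and their Z-scores are invented for illustration only.

```python
from itertools import combinations


def average_z(a, b, z, background):
    """Average pairwise Z-score between the members of clusters a and b."""
    total = sum(z.get(frozenset((x, y)), background) for x in a for y in b)
    return total / (len(a) * len(b))


def families_at_cut(names, z, cut, background=0.0):
    """Merge the most similar clusters until no average Z-score reaches cut."""
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_z(clusters[ij[0]], clusters[ij[1]], z, background),
        )
        if average_z(clusters[i], clusters[j], z, background) < cut:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Invented toy data: pairs not listed get a low background score of 0.5.
toy_names = ["a", "b", "c", "d", "e"]
toy_z = {
    frozenset(("a", "b")): 12.0,
    frozenset(("a", "c")): 4.0,
    frozenset(("b", "c")): 4.0,
    frozenset(("d", "e")): 2.5,
}

for cut in (2, 3, 5, 20):
    print(cut, len(families_at_cut(toy_names, toy_z, cut, background=0.5)))
```

In this toy set the family count rises from 2 to 3 to 4 to 5 as the cut is raised, mirroring the trend described for the 330 representative chains.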
In comparing proteins with very low sequence identity, there is no direct
relationship between the structural Z-score and evolutionary relatedness. To
assert descent by common ancestry, the biological function, sequence signatures
and architectural detail should be considered. For example, the very distantly
related animal/plant lysozymes and T4 lysozyme are classified into two
neighbouring families (21 and 22) using the structural Z-score, although they
share some structural and biochemical features. As an example of common
folding motifs, family 57 in Table I contains six structures with the
βαββαβ fold typified by
muconolactone isomerase (1MLI).
The FSSP data sets can be obtained from the EMBL file server [16]. To get
detailed instructions on how to use the service send the messages 'HELP' and
'HELP proteindata' to the network address Netserv@embl-heidelberg.de. If you
have access to the Internet, you can obtain FSSP files by anonymous ftp (file
transfer protocol) from ftp.embl-heidelberg.de, directory:
/pub/databases/protein_extras/fssp. Access to the database is also possible
over the World Wide Web (WWW), e.g. using the XMosaic interface; the URL
address is http://www.embl-heidelberg.de/databases/protein_extras/fssp.
Distribution by the Protein Data Bank (pdb.pdb.bnl.gov) is planned for late
1994.
The SUPPOS program is available as part of the WHAT IF package (available from
G. Vriend, email: vriend@embl-heidelberg.de). The program Dali is currently
not available for distribution. Requests for alignments of newly solved
crystallographic or solution NMR structures (Cα co-ordinates
required) may be sent to L. Holm by email (holm@embl-heidelberg.de).
Academic redistribution of single files or of the entire database is
permitted. No inclusion in other databases or database services, academic or
other, without explicit permission of the authors. All rights reserved. Not
to be used for classified research. Users are asked to refer to this paper and
ref. 14 in reporting results on use of the database.
The content and size of the FSSP database are of course tightly coupled to the
development of the Protein Data Bank, which is currently increasing at the rate
of hundreds of datasets every year. The size of the sequence-representative
set of PDB files [17], which is used here as a point of departure, has
increased from 154 in December 1992 to 204 in October 1993 to 330 in June 1994.
The complete set of data files (June 1994) requires about 11 MB of disk
storage. Regular and frequent updates of the database are planned.
The structure comparison program Dali [15] defines the extent of the common
structural core by maximizing the agreement of intramolecular Cα-Cα
distances. The scoring function was deliberately designed to allow
inter-domain conformational flexibility; hence, positional root mean square
deviations for the corresponding rigid-body superimpositions are often higher
than for comparison methods that put an absolute upper limit on
intermolecular positional deviations. This, however, is only an apparent
disadvantage.
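Comparing intramolecular distances rather than superimposed positions has the practical consequence that no rigid-body superposition is needed to measure agreement. The sketch below illustrates this with a deliberately simplified measure, the mean absolute difference between two Cα-Cα distance matrices; it is not the Dali scoring function of [15], and the coordinates and function names are invented for illustration.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def mean_distance_difference(coords_a, coords_b):
    """Mean absolute difference of equivalent intramolecular distances."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two equally long lists of at least two points")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    total = sum(abs(da[i][j] - db[i][j]) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def rotate_z_and_shift(coords, angle, shift):
    """Apply a rigid-body motion: rotation about the z axis plus translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Invented toy trace of four Calpha positions (in Angstrom).
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 1.0)]
moved = rotate_z_and_shift(trace, angle=1.1, shift=(10.0, -4.0, 2.5))
stretched = trace[:-1] + [(9.6, 3.8, 1.0)]

print(mean_distance_difference(trace, moved))      # numerically zero
print(mean_distance_difference(trace, stretched))  # greater than zero
```

The rigidly moved copy scores as identical without any superposition step, whereas a genuine change in internal geometry is detected.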
The current database contains at most one alignment per pair of full length
proteins. In future releases, the significance of alignments will be evaluated
at the level of structural domains [18], i.e., parts of structures, and
significant suboptimal alignments will be included. PDB data sets are referred
to by the PDB code; no provision can be made for asynchronous revisions of the
PDB data sets relative to the derived database.
It is often useful to complement the compilation of structure
alignments with sequence and variability information by direct
reference to the latest version of the HSSP
database of sequence-aligned protein families [3]. Users interested
in detailed local structural properties of each protein, such as
hydrogen bonding patterns, may refer to the DSSP
database of secondary structures, derived from PDB files. The HSSP
and DSSP databases are available by the same mechanism of network
access as FSSP, see above. An X-windows based protein query and 3D
inspection system, ProtQuiz 0.7 (Sander & Scharf, unpubl.; test
version available via anonymous ftp from ftp.embl-heidelberg.de), can
be used for interactive evaluation of pairwise alignments. The FSSP
database is cross-referenced with several sequence and other databases
in the information retrieval system SRS [19] with access
provided on www.embl-heidelberg.de. Kindly report any problems to
the authors by electronic mail.
1. Bernstein F.C., Koetzle T.F., Williams G.J.B., Meyer E.F., Brice M.D.,
Rodgers J.R., Kennard O., Shimanouchi T., Tasumi M., J. Mol. Biol. 112:535-542
(1977).
2. Chothia C. Nature 357:543-544 (1992).
3. Sander C., Schneider R., Proteins 9:56-68 (1991).
4. Overington J., Johnson M.S., Sali A., Blundell T.L., Proc. R. Soc. Lond.
B241:132-145 (1990).
5. Pascarella S., Argos P., Prot. Eng. 5:121-137 (1992).
6. Orengo C.A., Flores T.P., Taylor W.R., Thornton J.M., Prot. Eng. 6:485-500
(1993).
7. Holm L., Sander C., Proteins 19:165-173 (1994).
8. Bashford D., Chothia C., Lesk A.M., J. Mol. Biol. 196:199-216 (1987).
9. Taylor W.R., Prot. Eng. 2:77-86 (1988).
10. Wilmanns M., Hyde C.C., Davies D.R., Kirschner K., Jansonius J.N.,
Biochemistry 30:9161-9169 (1991).
11. Sutcliffe M.J., Haneef I., Carney D., Blundell T.L., Prot. Eng. 1:377-384
(1987).
12. Maiorov V.N., Crippen G.M., J. Mol. Biol. 235:625-634 (1994).
13. Kabsch W., Sander C., Biopolymers 22:2577-2637 (1983).
14. Holm L., Ouzounis C., Sander C., Tuparev G., Vriend G., Protein Science
1:1691-1698 (1992).
15. Holm L., Sander C., J. Mol. Biol. 233:123-138 (1993).
16. Stoehr P.J., Omond R.A., Nucleic Acids Res. 17:6763-6764 (1989).
17. Hobohm U., Scharf M., Schneider R., Sander C., Protein Science 1:409-417
(1992).
18. Holm L., Sander C., Proteins 19:256-268 (1994).
19. Etzold T., Argos P., CABIOS 9:49-57 (1993).
Table I: Protein fold families
Structural classification of protein chains in the database of
three-dimensional structures (PDB). The sequential index of the fold family is
followed by PDB and chain identifiers and protein names of a family member.
Family 1 has 2 members (1acx, 1cobB), family 2 has 11 members (1ten, 2hhrB,
...) and so on. Indentation in the "PDB code" column means that a protein
belongs to the same family / subfamily as the protein above. The families are
defined by cutting an average linkage clustering tree at a similarity level of
2 standard deviations above expected (Z=2). Subfamilies are defined by cuts
at similarity levels of Z=3, 4, 5, 6 and 10; more refined family divisions can
be made at each level of similarity. For example, 3dpa and 4ait of family 32
are split in two separate families if the cut is made at Z=3 rather than at
Z=2; 1acx and 1cobB (family 1) end up in different families if a cut is made at
Z=5; 2hhmA and 3fbpB (family 10) stay together even at Z=10. Only chains in
the sequence-representative set (maximally 30 % sequence identity) are reported
here; higher than 30 % sequence identity between homologous proteins implies,
in general, structural similarity that would be far off the scale to the right.
One FSSP file contains a structural protein family: the search structure and
structurally homologous proteins from the PDB. File organization is
line-oriented and strictly formatted. Lines have a maximum length of 132
bytes. The file is divided into four sections, HEADER, PROTEINS, ALIGNMENTS
and FRAGMENTS. The sections are separated by double hashes (##). The HEADER
section is mandatory. The HEADER, PROTEINS and ALIGNMENTS sections are
similar to those in the HSSP database [3], with obvious modifications of
notation that are explained in the HEADER block. The FRAGMENTS section reports
the beginning and ending residue numbers of structurally equivalent segments.
The residue ranges are given both according to sequential numbering starting
from 1 and, in parentheses, according to the numbering in the PDB files.
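Because the file is line-oriented with sections marked by double hashes, a reader can be sketched in a few lines. The fragment below is a minimal illustration under stated assumptions: lines before the first '##' marker are taken to form the HEADER section, the section name is taken to be the first word after the marker, and the sample text is a placeholder, not real FSSP content. The column layout inside each section is not parsed.

```python
MAX_LINE_BYTES = 132  # maximum line length stated in the format description


def split_fssp_sections(text):
    """Group the lines of an FSSP-style file by section name."""
    sections = {"HEADER": []}
    current = "HEADER"  # assumption: the header precedes the first '##' line
    for line in text.splitlines():
        if len(line.encode("utf-8")) > MAX_LINE_BYTES:
            raise ValueError("line exceeds 132 bytes")
        if line.startswith("##"):
            tokens = line[2:].split()
            if not tokens:
                raise ValueError("section marker without a name")
            current = tokens[0].rstrip(":").upper()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


# Placeholder text that only mimics the section structure.
sample = """\
placeholder header line
## PROTEINS
placeholder protein row
## ALIGNMENTS
placeholder alignment row
## FRAGMENTS
placeholder fragment row
"""

sections = split_fssp_sections(sample)
print(list(sections))
```

Using `setdefault` lets a section name that occurs more than once accumulate its lines in a single list rather than overwrite earlier ones.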
The June 1994 release of the FSSP database is based on a
sequence-representative set of 330 protein chains (less than 30 % sequence
identity). Average linkage clustering using the similarity scores from an
all-against-all structural comparison yielded a tree representation of
structural relations in the set (cf. Table 1). Truncating the tree at
different levels of structural similarity (horizontal axis, Z-score) defines
distinct families, i.e., separated branches of the tree. Cutting at a very low
level (Z<<2) leads to a collapse into very few general classes
(all-alpha, all-beta). Cutting at a high level increases the number of
distinct families, with a gradual approach to one family per protein chain.
Liisa Holm, Last modified Mon Oct 3 16:28:26 MET 1994