The FSSP database of structurally aligned protein fold families

Liisa Holm and Chris Sander, Protein Design Group, European Molecular Biology Laboratory, D-69012 Heidelberg, Germany

Abstract

FSSP (families of structurally similar proteins) is a database of structural alignments of proteins in the Protein Data Bank (PDB) [1]. The database currently contains an extended structural family for each of 330 representative protein chains. Each data set contains structural alignments of one search structure with all other structurally significantly similar proteins in the representative set (remote homologs, < 30 % sequence identity), as well as all structures in the Protein Data Bank with 70-30 % sequence identity relative to the search structure (medium homologs). Very close homologs (above 70 % sequence identity) are excluded as they rarely have marked structural differences. The alignments of remote homologs are the result of pairwise all-against-all structural comparisons in the set of 330 representative protein chains. All such comparisons are based purely on the 3D co-ordinates of the proteins and are derived by automatic (objective) structure comparison programs. The significance of structural similarity is estimated based on statistical criteria. The FSSP database is available electronically from the EMBL file server and by anonymous ftp (file transfer protocol).

Introduction

It has been estimated that the biochemistry of all living organisms involves no more than 1,000 divergently related protein families [2]. A majority of newly determined protein sequences can be classified into families by detectable sequence homology. The HSSP database of sequence alignments [3] shows that at least 26% of known sequences deposited in public databases (not counting cDNA fragments) have a relative of known 3D structure. However, protein families are known to retain the shape of the fold even when sequences have diverged below the limit of detection of significant similarities at the sequence level. Structural comparisons merge protein families of known 3D structure into structural classes, the members of which may or may not be evolutionarily related [4-7]. The FSSP database of structural alignments provides a rich source of information for the study of both divergent and convergent aspects of the evolution of protein folds.

The FSSP data sets have a wide field of applications. These include studies to discover remote evolutionary connections in the twilight zone of sequence similarity; to build a multiple alignment of remotely related families for the generation of sequence profiles or sequence patterns that may identify additional remote relatives in sequence databases [8-9]; to classify folds, such as TIM barrels, in order to study their structural principles [10]; to define structural cores for sequence-structure alignment (T. Smith, pers. comm.), for modular construction of novel proteins, or for model building by homology [11]; to test the accuracy of sequence alignment methods (B. Rost and R. Schneider, pers. comm.); to use test sets of remotely homologous pairs for fold recognition (M. Sippl, pers. comm.); and to extract representative data sets for statistical structural analyses [12]. Other uses are limited only by the user's imagination.

Form and content of the database

Structural alignments

For a protein chain in the representative set, with PDB identifier Nxxx (e.g. 1PPT, 5PCY) and chain identifier Y (omitted if blank), there is an ASCII (text) file Nxxx.FSSP or NxxxY.FSSP which lists from a few to several tens of proteins structurally similar to the search structure (Z-score above 2 in the pairwise structural comparison, see below), alongside the secondary structure and solvent accessibility extracted from the 3D coordinates of the search structure [13]. The structurally equivalent residues are reported both as a multiple alignment and as a list of matching fragments, and can be inspected using three-dimensional graphics. The co-ordinates must be retrieved separately from the corresponding PDB data sets, e.g. Nxxx.PDB. Details about the methods used to derive the database are given in [14,15].
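The naming rule above can be sketched in a few lines. The function name `fssp_filename` is hypothetical, and the upper-case normalisation is an assumption based on the examples quoted in the text (1PPT, 5PCY); the casing actually used on a given server may differ.

```python
def fssp_filename(pdb_id: str, chain: str = "") -> str:
    """Build a file name of the form Nxxx.FSSP or NxxxY.FSSP.

    Assumption: identifiers are upper-cased, as in the examples in the text.
    """
    pdb_id = pdb_id.strip().upper()
    chain = chain.strip().upper()
    # A PDB identifier is a digit followed by three alphanumeric characters.
    if len(pdb_id) != 4 or not pdb_id[0].isdigit() or not pdb_id.isalnum():
        raise ValueError(f"not a PDB identifier: {pdb_id!r}")
    if len(chain) > 1:
        raise ValueError(f"chain identifier must be a single character: {chain!r}")
    # A blank chain identifier is simply omitted from the name.
    return f"{pdb_id}{chain}.FSSP"


print(fssp_filename("1ppt"))       # 1PPT.FSSP
print(fssp_filename("1cob", "B"))  # 1COBB.FSSP
```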

Figure 1 shows an example dataset from FSSP, that for the SH3 domain of chicken brain alpha-spectrin (1SHG.FSSP). General information about the structure and notation are given at the top of the dataset. The dataset contains 5 (NALIGN) structurally aligned proteins which are listed in the '## PROTEINS' section. 1SHF-A is the homologous SH3 domain from fyn (PROTEIN column) and is aligned with a positional root mean square deviation of 1.6 Å (RMSD column) over 57 residues (LALI column) and has 33 % sequence identity after structural alignment (%IDE column). The other structural homologs are two more SH3 domains (1HSP is misannotated in the PDB), actinidin, and biotin repressor. Some structural details are given in the '## ALIGNMENTS' section. Residue W42 (Trp) of 1SHG is in a beta-strand (E) and has a solvent accessibility (ACC column) of 39 Å². W42 has a structurally equivalent residue in 5 (NOCC) of the aligned structures, of which three are tryptophans (W), two are leucines (L), and all five are in beta-strands (b or e in We, Wb, Le, We, Le). Finally, the '## FRAGMENTS' section says that to superimpose the 3D coordinates of 1SHG with those of 1SHF, residues 6-46 and 47-62 of 1SHG should be equivalenced with residues A84-A124 and A127-A142 of 1SHF.
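As a small consistency check on the 1SHG/1SHF example, the fragment ranges quoted above can be summed to recover the aligned length. The residue ranges are taken from the text; the tuple representation and the function name `aligned_length` are illustrative assumptions, not the on-disk FSSP format.

```python
def aligned_length(fragments):
    """Sum the lengths of equivalenced segments.

    Each fragment is ((start_a, end_a), (start_b, end_b)) with inclusive
    residue numbers; paired ranges must have the same length.
    """
    total = 0
    for (a_start, a_end), (b_start, b_end) in fragments:
        len_a = a_end - a_start + 1
        len_b = b_end - b_start + 1
        if len_a != len_b:
            raise ValueError("equivalenced segments differ in length")
        total += len_a
    return total


# Ranges quoted in the text: 1SHG residues versus 1SHF chain A residues.
shg_vs_shf = [((6, 46), (84, 124)), ((47, 62), (127, 142))]
print(aligned_length(shg_vs_shf))  # 57, the value reported in the LALI column
```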

The default files (Nxxx.FSSP) contain structural alignments generated by the program Dali [15] and are constrained to preserve sequential ordering of the aligned segments. Alignments optimized allowing topological permutations (loop reconnections and chain reversals) are available in files Nxxx_dali.FSSP. Alignments using other methods are available in datasets Nxxx_suppos.FSSP and Nxxx_comp3D.FSSP [14].

Index of protein fold families

To aid navigation in the database, the 330 protein chains contained in the representative set have been clustered into fold families (Table I). A dendrogram of the families was produced by average linkage clustering based on structural similarity scores [15]. Chain length effects were corrected for by transforming the pairwise similarities into statistical significance scores (Z-scores). Families and subfamilies result from truncating the tree at different cut levels of Z-score. The higher the cut, the larger the resulting number of distinct fold families (Figure 2). The 142 families resulting from the cut at an average Z-score of 2 are numbered in the first column of Table I. Second and further members of a family are indicated by indentation relative to the first member at the given level of significance. For example, if one decided to derive a more refined selection of fold families using a Z-score cutoff of 3 instead of 2, then the set of families should be expanded by all subfamilies that are indented by one letter space in Table I, yielding a total of 168 families. The most refined selection possible in the representative set would place each of the 330 chains in a distinct family, but even a cut as high as a Z-score of 10 yields only 255 families (Figure 2).
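The effect of the cut level can be illustrated with a minimal average-linkage sketch. This is not the program used to build Table I: the chain labels A to D and their pairwise Z-scores are invented toy values, and the function name `average_linkage_families` is hypothetical. Clusters are merged as long as the best average pairwise Z-score reaches the cut, which is equivalent to truncating the tree at that level.

```python
from itertools import combinations


def average_linkage_families(names, z, cut):
    """Merge clusters while the best average pairwise Z-score is >= cut."""
    clusters = [[name] for name in names]

    def average(c1, c2):
        # Mean Z-score over all pairs with one member in each cluster.
        return sum(z[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: average(clusters[p[0]], clusters[p[1]]))
        if average(clusters[i], clusters[j]) < cut:
            break  # no remaining pair is similar enough: the tree is cut here
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Toy Z-scores for four hypothetical chains (not taken from the database).
names = ["A", "B", "C", "D"]
z = {frozenset(pair): score for pair, score in {
    ("A", "B"): 12.0, ("A", "C"): 4.0, ("B", "C"): 3.5,
    ("A", "D"): 0.5, ("B", "D"): 0.8, ("C", "D"): 1.0,
}.items()}

print(average_linkage_families(names, z, cut=2.0))  # [['A', 'B', 'C'], ['D']]
print(average_linkage_families(names, z, cut=5.0))  # [['A', 'B'], ['C'], ['D']]
```

Raising the cut from 2 to 5 splits C off from the A/B family, mirroring how the number of families in the database grows as the cut level is raised.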

In comparing proteins with very low sequence identity, there is no direct relationship between the structural Z-score and evolutionary relatedness. To assert descent by common ancestry, the biological function, sequence signatures and architectural detail should be considered. For example, the very distantly related animal/plant lysozymes and T4 lysozyme are classified into two neighbouring families (21 and 22) using the structural Z-score, although they share some structural and biochemical features. As an example of common folding motifs, family 57 in Table I contains six structures with the βαββαβ fold typified by muconolactone isomerase (1MLI).

Distribution

Network access

The FSSP data sets can be obtained from the EMBL file server [16]. For detailed instructions on how to use the service, send the messages 'HELP' and 'HELP proteindata' to the network address Netserv@embl-heidelberg.de. If you have access to the Internet, you can obtain FSSP files by anonymous ftp (file transfer protocol) from ftp.embl-heidelberg.de, directory: /pub/databases/protein_extras/fssp. Access to the database is also possible over the World Wide Web (WWW), e.g. using the XMosaic interface; the URL address is http://www.embl-heidelberg.de/databases/protein_extras/fssp. Distribution by the Protein Data Bank (pdb.pdb.bnl.gov) is planned for late 1994.
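For scripted retrieval, an anonymous ftp session along the lines described above might look as follows. The host and directory are those given in the text; the function name `fetch_fssp` is hypothetical, the remote file name (including its casing) must be supplied by the caller, and the sketch does not check that the server is reachable.

```python
from ftplib import FTP

HOST = "ftp.embl-heidelberg.de"                   # host named in the text
DIRECTORY = "/pub/databases/protein_extras/fssp"  # directory named in the text


def fetch_fssp(filename, host=HOST, directory=DIRECTORY):
    """Download one file by anonymous ftp into the current directory."""
    with FTP(host) as ftp:
        ftp.login()        # no arguments: anonymous login
        ftp.cwd(directory)
        with open(filename, "wb") as out:
            # RETR streams the remote file in binary mode to the local file.
            ftp.retrbinary(f"RETR {filename}", out.write)
    return filename
```

A call such as `fetch_fssp("1SHG.FSSP")` would then write the file locally, assuming that name exists on the server.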

The SUPPOS program is available as part of the WHAT IF package (available from G. Vriend, email: vriend@embl-heidelberg.de). The program Dali is currently not available for distribution. Requests for alignments of newly solved crystallographic or solution NMR structures (Cα co-ordinates required) may be sent to L. Holm by email (holm@embl-heidelberg.de).

Conditions

Academic redistribution of single files or of the entire database is permitted. No inclusion in other databases or database services, academic or other, without explicit permission of the authors. All rights reserved. Not to be used for classified research. Users are asked to refer to this paper and ref. 14 in reporting results on use of the database.

Size of the current release

The content and size of the FSSP database are of course tightly coupled to the development of the Protein Data Bank, which is currently growing at a rate of hundreds of datasets every year. The size of the sequence-representative set of PDB files [17], which is used here as a point of departure, has increased from 154 in December 1992 to 204 in October 1993 to 330 in June 1994. The complete set of data files (June 1994) requires about 11 Mb of disk storage. Regular and frequent updates of the database are planned.

Limitations

The structure comparison program Dali [15] defines the extent of the common structural core by maximizing the agreement of intramolecular CA-CA distances. The scoring function was deliberately designed to allow inter-domain conformational flexibility; hence, positional root mean square deviations for the corresponding rigid-body superimpositions are often higher than for comparison methods that put an absolute upper limit on intermolecular positional deviations. This, however, is only an apparent disadvantage: the higher deviations reflect the domain movements that the method tolerates by design, not poorer residue equivalences.
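The following toy sketch illustrates why comparing intramolecular distances tolerates domain movements. It is not Dali's scoring function: the coordinates are invented, the tolerance of 0.5 Å is an arbitrary assumption, and `agreeing_pairs` is a hypothetical helper that only counts residue pairs whose distances agree in two already equivalenced structures.

```python
from math import dist


def distance_matrix(coords):
    """All pairwise distances within one structure."""
    return [[dist(p, q) for q in coords] for p in coords]


def agreeing_pairs(coords_a, coords_b, tolerance=0.5):
    """Index pairs (i, j), i < j, whose intramolecular distances differ by
    at most `tolerance` between two equally long coordinate lists."""
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(da[i][j] - db[i][j]) <= tolerance]


# Two three-residue "domains"; in structure B the second domain is displaced
# by 8 units along z relative to the first, as in a domain movement.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
               (0.0, 10.0, 0.0), (3.8, 10.0, 0.0), (7.6, 10.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
               (0.0, 10.0, 8.0), (3.8, 10.0, 8.0), (7.6, 10.0, 8.0)]

print(agreeing_pairs(structure_a, structure_b))
# [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
```

All within-domain pairs agree although no single rigid-body superimposition can fit both domains at once, which is the situation in which a distance-based alignment reports a larger positional deviation.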

The current database contains at most one alignment per pair of full length proteins. In future releases, the significance of alignments will be evaluated at the level of structural domains [18], i.e., parts of structures, and significant suboptimal alignments will be included. PDB data sets are referred to by the PDB code; no provision can be made for asynchronous revisions of the PDB data sets relative to the derived database.

Related data banks and programs

It is often useful to complement the compilation of structure alignments with sequence and variability information by direct reference to the latest version of the HSSP database of sequence-aligned protein families [3]. Users interested in detailed local structural properties of each protein, such as hydrogen bonding patterns, may refer to the DSSP database of secondary structures, derived from PDB files. The HSSP and DSSP databases are available by the same mechanism of network access as FSSP, see above. An X-windows based protein query and 3D inspection system, ProtQuiz 0.7 (Sander & Scharf, unpubl.; test version available via anonymous ftp from ftp.embl-heidelberg.de), can be used for interactive evaluation of pairwise alignments. The FSSP database is cross-referenced with several sequence and other databases in the information retrieval system SRS [19] with access provided on www.embl-heidelberg.de.

Kindly report any problems to the authors by electronic mail.

References

1. Bernstein F.C., Koetzle T.F., Williams G.J.B., Meyer E.F., Brice M.D., Rodgers J.R., Kennard O., Shimanouchi T., Tasumi M., J. Mol. Biol. 112:535-542 (1977).

2. Chothia C. Nature 357:543-544 (1992).

3. Sander C., Schneider R., Proteins 9:56-68 (1991).

4. Overington J., Johnson M.S., Sali A., Blundell T.L., Proc. R. Soc. Lond. B241:132-145 (1990).

5. Pascarella S., Argos P., Prot. Eng. 5:121-137 (1992).

6. Orengo C.A., Flores T.P., Taylor W.R., Thornton J.M., Prot. Eng. 6:485-500 (1993).

7. Holm L., Sander C., Proteins 19:165-173 (1994).

8. Bashford D., Chothia C., Lesk A.M., J. Mol. Biol. 196:199-216 (1987).

9. Taylor W.R., Prot. Eng. 2:77-86 (1988).

10. Wilmanns M., Hyde C.C., Davies D.R., Kirschner K., Jansonius J.N., Biochemistry 30:9161-9169 (1991).

11. Sutcliffe M.J., Haneef I., Carney D., Blundell T.L., Prot. Eng. 1:377-384 (1987).

12. Maiorov V.N., Crippen G.M., J. Mol. Biol. 235:625-634 (1994).

13. Kabsch W., Sander C., Biopolymers 22:2577-2637 (1983).

14. Holm L., Ouzounis C., Sander C., Tuparev G., Vriend G., Protein Science 1:1691-1698 (1992).

15. Holm L., Sander C., J. Mol. Biol. 233:123-138 (1993).

16. Stoehr P.J., Omond R.A., Nucleic Acids Res. 17:6763-6764 (1989).

17. Hobohm U., Scharf M., Schneider R., Sander C., Protein Science 1:409-417 (1992).

18. Holm L., Sander C., Proteins 19:256-268 (1994).

19. Etzold T., Argos P., CABIOS 9:49-57 (1993).

Table I: Protein fold families

Structural classification of protein chains in the database of three-dimensional structures (PDB). The sequential index of the fold family is followed by PDB and chain identifiers and protein names of a family member. Family 1 has 2 members (1acx, 1cobB), family 2 has 11 members (1ten, 2hhrB, ...) and so on. Indentation in the "PDB code" column means that a protein belongs to the same family / subfamily as the protein above. The families are defined by cutting an average linkage clustering tree at a similarity level of 2 standard deviations above expected (Z=2). Subfamilies are defined by cuts at similarity levels of Z=3, 4, 5, 6 and 10; more refined family divisions can be made at each level of similarity. For example, 3dpa and 4ait of family 32 are split in two separate families if the cut is made at Z=3 rather than at Z=2; 1acx and 1cobB (family 1) end up in different families if a cut is made at Z=5; 2hhmA and 3fbpB (family 10) stay together even at Z=10. Only chains in the sequence-representative set (maximally 30 % sequence identity) are reported here; higher than 30 % sequence identity between homologous proteins implies, in general, structural similarity that would be far off the scale to the right.

Figure 1: Format of an FSSP file

One FSSP file contains a structural protein family: the search structure and structurally homologous proteins from the PDB. File organization is line-oriented and strictly formatted. Lines have a maximum length of 132 bytes. The file is divided into four sections, HEADER, PROTEINS, ALIGNMENTS and FRAGMENTS. The sections are separated by double hashes (##). The HEADER section is mandatory. The HEADER, PROTEINS and ALIGNMENTS sections are similar to those in the HSSP database [3], with obvious modifications of notation that are explained in the HEADER block. The FRAGMENTS section reports the beginning and ending residue numbers of structurally equivalent segments. The residue ranges are given both according to sequential numbering starting from 1 and, in parentheses, according to the numbering in the PDB files.
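A reader for this layout can start by grouping lines into sections. The sketch below assumes that everything before the first '##' line belongs to the HEADER and that the first word after '##' names the section; the placeholder lines in the toy input are invented and do not reproduce real FSSP records. The function name `split_sections` is hypothetical.

```python
def split_sections(text):
    """Group the lines of a file into sections keyed by the '##' titles.

    Assumption: lines before the first '##' marker form the HEADER section.
    """
    sections = {"HEADER": []}
    current = "HEADER"
    for line in text.splitlines():
        if line.startswith("##"):
            words = line[2:].split()
            if words:
                # Accept '## PROTEINS' as well as '##PROTEINS'.
                current = words[0].rstrip(":")
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


toy = "\n".join([
    "placeholder header line",
    "## PROTEINS",
    "placeholder protein row",
    "## ALIGNMENTS",
    "placeholder alignment row",
    "## FRAGMENTS",
    "placeholder fragment row",
])

sections = split_sections(toy)
print(list(sections))  # ['HEADER', 'PROTEINS', 'ALIGNMENTS', 'FRAGMENTS']
```

Because the format is strictly line-oriented, column-level parsing of each section can then be layered on top of these line lists using the notation explained in the HEADER block.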

Figure 2: Definition of structural classes

The June 1994 release of the FSSP database is based on a sequence-representative set of 330 protein chains (less than 30 % sequence identity). Average linkage clustering using the similarity scores from an all-against-all structural comparison yielded a tree representation of structural relations in the set (cf. Table I). Truncating the tree at different levels of structural similarity (horizontal axis, Z-score) defines distinct families, i.e., separated branches of the tree. Cutting at a very low level (Z<<2) leads to a collapse into very few general classes (all-alpha, all-beta). Cutting at a high level increases the number of distinct families, with a gradual approach to one family per protein chain.

Liisa Holm, Last modified Mon Oct 3 16:28:26 MET 1994
