Systematic and automated discovery of patterns in PROSITE families
Abstract
PROSITE is a method for protein classification that relies on a database of biologically significant sites and patterns in protein sequences. Most patterns in PROSITE have been gathered by a labor-intensive combination of experimental characterization of functional residues and sequence alignment. In this paper we present a new and efficient supervised learning procedure, based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automatic discovery of patterns in 974 PROSITE families. First, compared with the corresponding PROSITE patterns, Splash generates patterns with better specificity and/or sensitivity in 28% of the families, identical statistics in 48%, and worse statistics in 15%; for the remaining families, the patterns exhibit mixed behavior. Second, we have characterized the amount of overlap, on the sequences, between the newly discovered patterns and those in PROSITE. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with those reported in PROSITE. Of the 272 patterns that perform strictly better than the corresponding PROSITE pattern, 178 show more than 70% overlap with the PROSITE pattern. Third, our results suggest that the statistical significance of discovered patterns correlates well with their biological significance. Finally, we use the trypsin subfamily of serine proteases to illustrate the use of this method to exhaustively discover all motifs in a family that are statistically and biologically significant. The complete analysis is sufficiently rapid, taking less than a day for all PROSITE families, to enable the use of this methodology for routine curation of existing motif and profile databases.
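The quantities compared above (sensitivity, specificity, and site overlap) can be illustrated with a minimal sketch. It is not the Splash procedure or the evaluation code used in this work. The function names, the toy sequences, and the exact definitions are assumptions made for illustration: specificity is taken here as the fraction of non-member sequences left unmatched, and overlap as the fraction of the reference site covered by the other site. The pattern translation handles only the common elements of the PROSITE pattern syntax.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common elements only: 'x' wildcards, [allowed] and
    {forbidden} residue sets, (n) / (n,m) repeats, '<' and '>' anchors.
    """
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        elem = elem.replace("x", ".")                      # any residue
        elem = elem.replace("{", "[^").replace("}", "]")   # forbidden set
        elem = elem.replace("(", "{").replace(")", "}")    # repeat counts
        elem = elem.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(elem)
    return "".join(parts)

def sensitivity_specificity(regex, members, non_members):
    """Assumed definitions: sensitivity = matched members / all members;
    specificity = unmatched non-members / all non-members."""
    tp = sum(1 for seq in members if re.search(regex, seq))
    fp = sum(1 for seq in non_members if re.search(regex, seq))
    return tp / len(members), 1 - fp / len(non_members)

def overlap_fraction(site, reference_site):
    """Fraction of reference_site covered by site.

    Sites are half-open (start, end) residue intervals on one sequence.
    """
    shared = min(site[1], reference_site[1]) - max(site[0], reference_site[0])
    return max(0, shared) / (reference_site[1] - reference_site[0])

# Toy data, invented for illustration only (not real family members).
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC].")
members = ["AACGGCAAAL", "CAACTTTW"]
non_members = ["AAAAAAAA", "CGGCAAAL"]
sens, spec = sensitivity_specificity(regex, members, non_members)
```

Under these assumed definitions, one pattern performs strictly better than another when neither statistic is lower and at least one is higher, while `overlap_fraction` gives the kind of per-site percentage that the 50% and 70% thresholds refer to.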