Shoshana Brown, Ph.D.
A Gold Standard Set of Enzyme Superfamilies
Functional characterization of the proteins encoded in the many completely sequenced genomes remains a major challenge. Superfamily analysis is a powerful method for such characterization, but it must be automated for efficient use on large datasets. To facilitate the development of automated superfamily analysis methods, we are creating a validation data set using a subset of enzyme superfamilies that are functionally related by common aspects of chemistry. Superfamilies within our validation set are chosen such that each consists of a group of evolutionarily related proteins that share a set of conserved residues known, in characterized members, to be involved in catalysis of a common chemical step. Superfamilies are further divided into families, within which each enzyme catalyzes the same overall reaction.

One of the major issues that must be addressed in the creation of this gold standard set is the accurate classification of sequences into superfamilies and families. This classification may require the use of information from several sources, including sequence similarity measures, sequence length, the presence/absence of catalytic residues, and the presence/absence of homologs for additional enzymes involved in the same biochemical pathway as a given family.

This gold standard set will form the core data set for the Structure-Function Linkage Database (SFLD), developed by our lab in conjunction with the UCSF Computer Graphics Laboratory. The SFLD has been designed to explicitly link enzyme sequence, structure and function in order to facilitate sophisticated computational analysis, such as the prediction of function for uncharacterized proteins.
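
As an illustration only, the superfamily/family hierarchy and the catalytic-residue criterion described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the SFLD schema or the classification procedure actually used: the class names, the `catalytic_residues` mapping (alignment column to allowed amino acids), and the toy sequences are all hypothetical, and the other evidence sources mentioned above (similarity scores, sequence length, pathway homologs) are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Family:
    """Hypothetical record: enzymes that catalyze the same overall reaction."""
    name: str
    reaction: str
    members: list = field(default_factory=list)


@dataclass
class Superfamily:
    """Hypothetical record: related proteins sharing conserved catalytic residues."""
    name: str
    # Assumption: 0-based alignment column -> set of residues allowed there.
    catalytic_residues: dict
    families: dict = field(default_factory=dict)

    def has_catalytic_residues(self, aligned_seq: str) -> bool:
        # True only if every conserved column holds an allowed residue.
        return all(
            col < len(aligned_seq) and aligned_seq[col] in allowed
            for col, allowed in self.catalytic_residues.items()
        )

    def assign(self, seq_id: str, aligned_seq: str, family_name: str) -> bool:
        # Accept the sequence into an existing family only if it passes the
        # catalytic-residue check; otherwise leave it unclassified.
        if not self.has_catalytic_residues(aligned_seq):
            return False
        self.families[family_name].members.append(seq_id)
        return True


# Toy data, invented purely for illustration.
sf = Superfamily(
    name="toy_superfamily",
    catalytic_residues={2: {"K"}, 5: {"D", "E"}},
)
sf.families["toy_family"] = Family(name="toy_family", reaction="A -> B")

accepted = sf.assign("seq1", "ACKGHDLM", "toy_family")  # K at column 2, D at column 5
rejected = sf.assign("seq2", "ACRGHDLM", "toy_family")  # R at column 2 fails the check
```

In practice a single criterion like this would likely be combined with the other sources of information listed above before a sequence is placed in a family.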