Protein Structure Initiative Update

October 2008

The Protein Structure Initiative (PSI) was established in 2000 by the National Institute of General Medical Sciences (NIGMS) after a two-year study by the Institute staff and Council.  Following the success of the genome sequencing projects and manifold technical advances in structural biology, a number of scientists proposed establishment of a large-scale structural biology project to significantly extend structural coverage of sequenced genes.  These scientists pointed out that the number and complexity of structures in the Protein Data Bank (PDB) had grown impressively over the past decades and that these structures had led to many scientific successes and to a greater understanding of structure-function relationships, but that the number of novel structures had not kept pace with the overall growth.

Following several national and international workshops, a U.S. "structural genomics" project was proposed.  Structural genomics involves high-throughput experimental determination of a large number of representative structures, with the goal of achieving systematic sampling of sequence families.  Utilization of computational modeling of sequence family homologs then extends the structural information to a much larger fraction of sequenced genes.

While considering this project, the NIGMS held three workshops to examine the feasibility, goals, scale, and target selection strategy for a structural genomics effort.  Meeting summaries can be found on the NIGMS Web site.  Following these workshops and discussions with advisors, the NIGMS Council concluded that the Institute should undertake this effort and asked the NIGMS staff to organize a "pilot" phase of the PSI as a 5-year project.  PSI-1 consisted of a centers program and an investigator-initiated grants program for methodology and technology development. With the guidance of Council, the Institute published a Request for Applications (RFA) for PSI pilot centers.

PSI Pilot Phase (PSI-1)

In response to this announcement, nine pilot research centers were established, seven in 2000 and two in 2001, to test strategies for high-throughput structural determination.  Two of these pilot centers were co-funded by the NIH National Institute of Allergy and Infectious Diseases (NIAID).  The goals of PSI-1 were to:

1.       Develop methodology and technology to increase success rates and lower costs of structure determination,

2.       Construct and automate the protein production and structure determination pipeline, and

3.       Determine novel protein structures. In this context, the term "novel" was defined to mean structures for proteins that were less than 30% identical in sequence to proteins for which structures had already been determined.
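
The 30% threshold in the third goal can be illustrated with a minimal sketch. It computes percent identity over a pair of pre-aligned sequences and flags a target as novel only if it falls below the threshold against every known structure. The function names, the gap character, and the choice of denominator (columns where neither sequence has a gap) are assumptions made for this example; the text does not describe how the PSI centers actually computed sequence identity.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumes the two sequences are already aligned (equal length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_novel(alignments, threshold: float = 30.0) -> bool:
    """True if the target is below `threshold` percent identity to every
    protein of known structure.

    `alignments` is an iterable of (aligned_target, aligned_known) pairs.
    """
    return all(percent_identity(t, k) < threshold for t, k in alignments)
```

A target that is 20% identical to its closest known structure would count as novel under this definition, while one that is 75% identical would not.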

During the first year, the Institute appointed the Protein Structure Initiative Advisory Committee (PSIAC), a working group of the NIGMS Council composed of independent scientists (i.e., not connected to the PSI) to provide strategic advice to the NIGMS Council and staff on the management and planning of the project.  One important product of their first meeting was a project mission statement:  "to make the three-dimensional atomic level structures of most proteins easily available from knowledge of their corresponding DNA sequences."   

PSI Pilot Phase Results

The nine pilot centers and related research project grants produced a variety of results, including the development of numerous important new methods, automated and parallel procedures, robotic instruments, and salvage (or rescue) procedures for the structure determination pipeline.  These new methods and tools have been rapidly incorporated into the pilot centers' structural genomics pipelines, and many components have subsequently been adopted by structural biology labs throughout the world.

During PSI-1, target selection was left to each center and was not centralized, but all centers were required to aim for novel protein structures and to list their targets on the PSI centralized database in order to minimize overlap and duplication of effort.  The pilot centers were required to disseminate their results into the newly established TargetDB, including rapid deposition and release of atomic coordinates and the data used for structure determination.  NIGMS also supported technology development for high-throughput structural biology data collection, including both the enhancement and new construction of synchrotron beamlines.  As the centers ramped up and the two new centers were begun, the aggregate budget for all the pilot centers increased from $31 million total costs in the first year to $71 million total costs in the final year of the pilot phase.

Over the five years of PSI-1, the nine pilot centers determined about 1,300 structures, with about 65% novel.  Structures contributed by PSI are comparable in quality and size to structures deposited into the PDB from other structural biology laboratories.  Since these centers took several years to reach high-throughput operation, it was not surprising that 40% of the PSI-1 structures were determined in a single year -- the fifth and final year of the project.  By the fifth year of PSI-1, the cost per structure had fallen more than two-fold -- to $138,000. (This estimated cost per structure includes funds for ongoing technology development.)
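
As a rough consistency check on these figures, the final-year cost per structure can be approximated by dividing the final-year pilot center budget by the number of structures determined that year. This is a back-of-envelope sketch using the rounded numbers quoted in the text. It is not the method NIGMS used to arrive at $138,000, and the function name is hypothetical.

```python
def final_year_cost_per_structure(final_year_budget: float,
                                  total_structures: float,
                                  final_year_share: float) -> float:
    """Approximate cost per structure in the final year of PSI-1."""
    final_year_structures = total_structures * final_year_share
    return final_year_budget / final_year_structures


# Rounded inputs taken from the text above.
estimate = final_year_cost_per_structure(
    final_year_budget=71e6,   # $71 million total costs in the final year
    total_structures=1300,    # about 1,300 structures over five years
    final_year_share=0.40,    # 40% determined in the fifth year
)
print(f"${estimate:,.0f} per structure")
```

With these inputs the estimate comes to roughly $137,000, close to the reported $138,000. The small gap is expected given that both the structure count and the percentage are approximate.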

Lessons Learned

From this first phase of the PSI, NIGMS staff and the PSIAC concluded that several lessons had been learned:

  • Structural genomics pipelines can be constructed and scaled-up,
  • High-throughput operation works for many proteins,
  • NMR can make a significant contribution to structural genomics pipelines,
  • Bottlenecks remain for some proteins, especially integral membrane proteins,
  • A coordinated, 5-year target selection policy is critical for future PSI efforts,
  • Centralized archiving of materials is essential,
  • Homology modeling methods need improvement, and
  • Outreach to and involvement of the broad scientific community must be fostered.

PSI Production Phase (PSI-2)

Following consideration of PSI-1 progress and PSIAC recommendations, the NIGMS Council recommended that the Institute staff prepare announcements for the second phase of the PSI, PSI-2, to begin in July 2005.  Building on the experience and progress of the first phase, the PSI-2 Network undertook several goals:

1.       Structural coverage of sequence families, including those of known high biological importance;

2.       Continued methodology and technology development, especially for challenging classes of proteins such as integral membrane proteins; and

3.       Increased promotion of the use of structures by the broader biological community.

To achieve these goals, the PSI-2 Network included five separate components:

1.       Four large-scale high-throughput research centers focused on production of a large number of novel protein structures that, with application of computational modeling methods, broaden structural coverage of protein sequences,

2.       Six specialized centers focused on technical problems associated with pipeline bottlenecks and challenging proteins,

3.       Two homology modeling centers and a research grants program focused on improving the accuracy of comparative protein structure modeling,

4.       A materials repository (PSI Materials Repository (PSI-MR)) to store and distribute expression clones, and

5.       A knowledgebase (PSI Structural Genomics Knowledgebase (SGKB)) to serve as an information analysis and dissemination center.

The individual center Web sites provide a great deal of information on the accomplishments and productivity of these centers.  The large-scale centers and specialized centers were funded in 2005.  The homology modeling centers and materials repository were funded in 2006.  In addition, an investigator-initiated research grants program was added in 2007 to enhance homology modeling methods and increase the chance of producing breakthroughs.  The knowledgebase was funded in mid-2007.  A supplemental grants program for the study of PSI structures of unassigned function was initiated in 2003 and is continuing.  This activity provides funds to enable investigators interested in protein function to undertake short-term research projects that capitalize on the information and reagents produced by the PSI.  Investigators outside the PSI are eligible for these awards.  The budget for all 14 PSI-2 centers and the two small grants programs is about $66 million total costs per year.  One of the specialized centers is co-funded by the NIH National Center for Research Resources (NCRR).

Large-Scale Centers

There are four large-scale production centers in PSI-2.

The four large-scale PSI centers have developed high-throughput methodologies for expression, purification, and structure determination of proteins by means of X-ray crystallography and NMR. By coordinating their efforts in applying these methodologies within structure determination pipelines, the large-scale centers work to generate increased coverage of novel structural families of sequenced genes for the benefit of the biomedical community. The large-scale centers are required to spend 70% of their effort on the joint PSI-2 Network activity of structural coverage.  In addition, these centers must devote 15% of their effort to community-nominated targets and collaborations and another 15% to their own individual biomedical theme project.

PSI-2 Target Selection

The overall PSI-2 goal of providing broad structural coverage and the determination of novel protein structures from large protein families was built into the PSI-2 project, but implementation of target selection is worked out by the PSI-2 researchers.  This task is undertaken by the directors and bioinformatics staff of the large-scale centers.  Targets for the large-scale center joint activity are chosen in order to maximize structural coverage, enhance biological impact, and make the structures useful to the broad scientific community (perhaps the most important aspect of PSI-2). 

At many levels within the PSI-2 network, the issue of target selection has received intense scrutiny and frequent rounds of bioinformatics analysis.  Two groups have borne most of the responsibility for target selection and coordination of this project.  The Operations and Management Group (OMG) consists of the four large-scale center directors and the NIH PSI Network director.  The Bioinformatics Group (BIG) is composed of the four informatics directors of these centers.  These two groups, separately and together, have communicated weekly to forge a common plan for target selection and operation.  Following extensive communications, the large-scale centers agreed on a total of 3,000 structures as a 5-year goal for PSI-2 and worked out agreements on the rules of operation and target selection.  Several thousand target families have already been chosen, and targets have been assigned to each large-scale center by a "match" process. Two databases were established to coordinate target assignments and to provide publicly accessible information about the activities of the PSI Network. The two databases are TargetDB, which contains lists of targets that have been adopted by each of the centers, and PepcDB, which contains information about the progress on those targets and the experimental procedures that have been used in pursuit of them.

In summary, the target selection strategy includes:

  • Coarse sampling of large families (initially Pfam with other large families added) with no structural representatives in PDB to achieve broad structural coverage (joint network activity);
  • Moderate sampling of very large families with limited structural representatives in PDB (joint network activity) for:
    • Increased structural coverage to explore evolution of structure and function and to aid in computational modeling, and
    • Structural coverage of selected families with high biomedical relevance;
  • Exploration of single organisms, metagenomes, and microbiomes (joint network activity);
  • Community targets nominated by non-PSI investigators and centers (joint network/individual center activity); and
  • Biomedical theme targets (individual center activity).

The target selection strategy for the PSI is summarized on the PSI SGKB Web site.

Specialized Centers

Six specialized centers are focused on the development of innovative methods and technologies for structure studies of proteins that continue to be challenging and for which existing and newly developed structure methods are not yet suitable for routine application in a pipeline format. These specialized centers are smaller efforts centered on specific production bottlenecks and especially on structure determination for more difficult classes, including membrane proteins, small protein complexes, and proteins from higher organisms, including humans.  Two specialized centers, the Center for Structures of Membrane Proteins (Robert Stroud, PI) and the New York Consortium for Membrane Protein Structure (Wayne Hendrickson, PI), are focused on membrane proteins, working toward pipelines for the expression, solubilization, purification, and crystallization of this important and challenging class.  Another specialized center, the Center for Eukaryotic Structural Genomics (John Markley, PI), is dedicated to structure determination by NMR spectroscopy and X-ray crystallography of eukaryotic proteins, employing unique cloning into multiple vectors and complementary cell-based and wheat germ cell-free expression.  The other three specialized centers, the Center for High-Throughput Structural Biology (George DeTitta, PI), the Integrated Center for Structure and Function Innovation (Tom Terwilliger, PI), and the Accelerated Technologies Center for Gene to 3D Structure (Lance Stewart, PI), are engaged in wide-ranging efforts to develop methods and instruments for improving protein production, crystallization in miniaturized scalable formats, and structural determination for difficult proteins and protein complexes.  The specialized centers are expected to determine structures and contribute to the Network goal of structural coverage, but at much lower rates than the large-scale centers.

Homology Modeling Centers

Two homology modeling centers have been established to develop innovative methods of comparative homology modeling and methods to assess model accuracy.  This effort will help attain the goal of improving the ability to generate useful molecular models from sequences of proteins whose structures have not been experimentally determined. Roland Dunbrack is PI of the Center for New Methods for High-Resolution Comparative Modeling, whose major goal is to improve the quality of comparative models both in the >30% sequence identity regime and in the 10-30% sequence identity regime. Together with David Baker, Dunbrack and co-investigators are developing methods for refinement of comparative models within the Rosetta program. The Baker group has developed a new approach to refining protein models that combines the targeting of aggressive sampling to regions most likely in error with powerful global optimization techniques. Adam Godzik, director of the Joint Center for Molecular Modeling, is developing a pipeline to build models for PSI target proteins to evaluate applications of modeling to speed up structure refinement.

Materials Repository

The PSI Materials Repository (PSI-MR), directed by Dr. Joshua LaBaer at the Harvard Institute of Proteomics, was established to provide centralized storage and distribution of the plasmid clones created by PSI-1 and PSI-2 centers.  These plasmids are a valuable resource that allows the research community to identify the biological function of proteins whose structures have been determined by the PSI.  To facilitate cross-referencing of a specific plasmid to its protein annotation and experimental information from TargetDB and PepcDB, each plasmid in the PSI-MR website is linked to the PSI Structural Genomics Knowledgebase (PSI SGKB).  Currently, over 27,000 plasmid clones are being processed and almost 9,000 clones are available for distribution.

The PSI-MR has also simplified the material transfer agreement (MTA) process and decreased the time for institutions to deposit or receive plasmids.  After much negotiation, it modified and developed two documents: the depositor agreement (DA), which sets the terms with the depositor's institution enabling the PSI-MR to distribute the deposited PSI plasmids, and the expedited process material transfer agreement (epMTA), which institutions sign to allow their researchers to receive any plasmid without having to sign an MTA for each request.  So far, 40 institutions have signed the expedited process MTA.

Knowledgebase

The PSI Structural Genomics Knowledgebase (SGKB) is directed by Dr. Helen Berman of Rutgers University.  The SGKB is a resource to explore the output of the PSI for information on advances in structural biology and structural genomics that improve our understanding of living systems and disease. The Knowledgebase provides a platform for scientific community involvement in target selection and functional annotation and will play an important role in increasing the impact of protein structures on biological and biomedical research.  The PSI SGKB includes support for:

1.       A homology modeling portal that  provides the scientific community with facile access to computational models of proteins,

2.       A functional annotation module to facilitate community participation in assigning function to structures,

3.       A metrics module for the analysis of PSI progress,

4.       A technology portal to provide information and access to technologies developed by the PSI,

5.       A database module for tracking PSI targets (TargetDB) and PSI experimental methods (PepcDB),

6.       Integration with other data resources, such as the PDB, NCBI, and model organism databases,

7.       Integration with the materials repository, and

8.       Information regarding meetings, workshops, recent literature and news of important developments in structural biology and structural genomics.

PSI Production Phase Results

During the first three years of PSI-2 (July 2005-June 2008), the four large-scale centers developed additional new methods and jointly devised a target selection process to maximize structural coverage and the biomedical relevance of the structures.  They have determined about 1,900 protein structures.  Over 70% of these are novel, and the cost per structure has been reduced to $57,000.  (Again, ongoing technology development activities within the centers make this figure an overestimate of the current cost per structure.)  These structures represent about 40% of the novel structures deposited into the PDB from all sources, worldwide, during this period.  During the eight years of the PSI, over 3,250 structures were determined.
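
The figures in this section can be related to one another with some simple arithmetic. The sketch below uses only the rounded numbers quoted in this document (the 3,000-structure, 5-year goal comes from the target selection discussion, and the $138,000 figure from the pilot phase results). The function and key names are hypothetical. Because the text says "over 70%" and "about 40%", the derived values are rough lower bounds or estimates and are not reported results.

```python
def psi2_progress_summary(structures: float = 1900,
                          novel_fraction: float = 0.70,
                          world_share: float = 0.40,
                          cost_psi1: float = 138_000,
                          cost_psi2: float = 57_000,
                          goal: float = 3000,
                          years_elapsed: float = 3,
                          years_total: float = 5) -> dict:
    """Derive rough progress indicators from the rounded figures in the text."""
    novel = structures * novel_fraction            # at least ~1,330 novel structures
    return {
        "novel_structures": novel,
        # If these are ~40% of worldwide novel deposits, the implied total is:
        "implied_worldwide_novel": novel / world_share,
        # Cost reduction from the end of PSI-1 to PSI-2:
        "cost_fold_reduction": cost_psi1 / cost_psi2,
        # Progress toward the 5-year goal versus time elapsed:
        "goal_fraction_reached": structures / goal,
        "time_fraction_elapsed": years_elapsed / years_total,
    }


for name, value in psi2_progress_summary().items():
    print(f"{name}: {value:,.2f}")
```

On these inputs, the large-scale centers had reached about 63% of the 3,000-structure goal with 60% of the period elapsed, and the cost per structure had fallen about 2.4-fold relative to the final year of PSI-1.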

Methodology and Technology Development

Continuing methodology and technology development provides advances valuable to the PSI-supported centers and also to the structural biology community as a whole.  These efforts are ongoing within the large-scale centers, are the specific objective of the specialized centers, and are also pursued through a portfolio of individual investigator research project grants and projects supported by the NIH SBIR/STTR program.  Short informal reports on technical developments and problems are exchanged quarterly among all fourteen PSI centers.  Additionally, at the annual PSI-sponsored "Bottlenecks" meeting, scientists from PSI centers and other laboratories discuss technical hurdles in protein production, crystallization, and structure determination.  This exchange has led to significant enhancements in methods and techniques incorporated by structural genomics pipelines and used by the structural biology community as a whole.  In addition, various specialized workshops are convened by the individual centers.

PSI Policies

As a public resource, PSI-2 has special regulations and policies.  From its inception, the PSI required rapid release of all results, including the deposition and release of coordinates and related information into the PDB.  The PSI-2 centers are not funded by the usual research grants mechanism, but via cooperative agreements.  The Principal Investigator is responsible for directing his/her center, but all centers are required to work together, and NIH staff and outside advisors share important roles in determining program goals and actions. The PSI-2 centers are also responsible for outreach activities -- to the scientific community, to minority scientists and students, and for research training.  As a network with joint activities and goals, the centers are continually discussing and fine-tuning issues such as target selection, management, operation, and cooperation.

PSI-2 Organization

The PSI Steering Committee (PSISC) is the internal governing body of PSI-2, providing direction and revising goals and plans within the framework established by the PSI-2 RFAs, the PSIAC, and the NIGMS Advisory Council.  The PSISC is also responsible for the implementation of plans and overall project operation.  It is composed of the PSI center directors, four NIH staff, and five outside scientists.  The PSISC chair interacts on a regular basis with the OMG and BIG and oversees work of four subcommittees:  Goals and Milestones, Target Selection, Center Interactions, and Communication with the Scientific Community.  Each subcommittee has produced a report that is available on the PSI Web site.   The Goals and Milestones report was developed with input from a large group of center and Institute staff.  It enumerates the expected deliverables for PSI-2.  An analysis of PSI-2 progress on structural coverage and other goals is available at the SGKB.  The SGKB also includes a list of publications, workshops, etc.  The PSI knowledgebase is responsible for providing periodic evaluations of progress and goals.

In addition to regular communication, the PSI center directors, the PSIAC, the PSISC, and NIH staff attend the PSI annual meeting to discuss progress, plans, and strategies. 

Future Directions

Because the PSI is a major NIH project intended to serve as a scientific resource, those involved with it regularly discuss questions of policy and goals.  These deliberations focus on target selection, goals and milestones, operation, and management.  Longer-range planning is also being addressed.  Over the past two years, the NIGMS/PSI Network has sought input, especially on the following issues:

  • What is the appropriate role of the specialized centers within the PSI Network?
  • How can the PSI interact with other structural genomics efforts and with structural biology projects?
  • How can the PSI Knowledgebase increase the impact and value of PSI results for biological and biomedical research?
  • How can PSI further involve the scientific community in target selection and structural annotation?
  • How should the PSI coordinate activities and target selection, in particular, with international structural genomics projects?

The PSI and its network will continue to consult the scientific community.  More information on the history and background of the initiative, summaries of PSI workshops, and program announcements, goals, requirements, and progress is available on the NIGMS/PSI Web site.