Temperature-Sensitive Mutation Prediction

KEYWORDS: DESIGN STABILITY_IMPROVEMENT ANALYSIS

Author

Chris Poultney

Brief Description

Uses Rosetta relax and score protocols, in combination with software listed below, to predict which mutations to a structure will result in a protein with a temperature-sensitive phenotype.

Related RosettaCon Talk

Title, Authors & Lab, Year, Session and Day of talk

Prediction of temperature sensitive mutations, Chris Poultney, Bonneau Lab, Design session, 8/4/2010

Abstract

The function of a gene product is often interrogated by means of mutagenesis or knockout studies. However, in the case of essential genes such perturbations would simply result in the uninformative embryonic lethal phenotype. A solution is to design temperature-sensitive (ts) mutants, which exhibit a mutant phenotype only at high (or low) temperatures, providing a simple means to "switch on" a mutation at any stage. Finding ts mutations typically relies on generating and screening many thousands of mutations, which is an expensive and labor-intensive process. Here we describe an in-silico method that uses the Rosetta relax protocol and machine learning techniques to predict a highly accurate "top 5" list of ts mutations given a protein of interest.

Running

Flags

scripts/generate-scripts.sh flags (all required):

-protein: name of input protein (must be PDB file name without .pdb extension)
-species: species name abbreviation (used for display only, but required)
-cutoff: % surface area accessibility cutoff for residues to mutate in input structure
-mini_bin: path of Rosetta bin directory on computer where runs will be executed
-mini_db: path of Rosetta database directory on computer where runs will be executed

scripts/predict-ts.sh flags (required):

-protein: name of input protein (must be PDB file name without .pdb extension)

Command Line

These command lines are not called directly, but from scripts generated by the protocol (see "Other Comments"). All files are written to and read from the ts_mutant directory; the input_files and output_files directories are ignored.

~/rosetta-3.0/bin/relax.linuxgccrelease -database ~/rosetta-3.0/rosetta3_database -s YBR109C-F140A.pdb -native YBR109C.pdb -nstruct 50 -relax:fast -out:file:scorefile YBR109C-F140A.sc -out:pdb_gz

~/rosetta-3.0/bin/score.linuxgccrelease -database ~/rosetta-3.0/rosetta3_database -s YBR109C-F140A_????.pdb.gz -in:file:native YBR109C.pdb -in:file:fullatom -out:file:scorefile YBR109C-F140Arescore.sc

Example Overall Command Line

There are three steps (see "Other Comments" for details): generating scripts to run the protocols, running the scripts (usually on a cluster), and making ts predictions. All must be executed from the ts_mutant directory:

scripts/generate-scripts.sh -protein YBR109C -species Scer -cutoff 10 -mini_bin ~/rosetta-3.0/rosetta3_source/bin -mini_db ~/rosetta-3.0/rosetta3_database

for a in *.sh; do qsub -d $(pwd) $a; done

scripts/predict-ts.sh -protein YBR109C

Versions

Rosetta 3.0 release: https://svn.rosettacommons.org/source/branches/releases/rosetta-3.0

In order to compile Rosetta 3.0 on recent Linux distributions, the compilation settings need to be changed to use gcc 4.3. Switch into the rosetta-3.0/rosetta3_source directory and execute the following:

patch -p0 < [path_to_ts_protocol_dir]/patch/r30gcc43.patch

Then compile the relax and score executables, replacing the -j parameter (number of concurrent compilation threads) as appropriate:

scons -j 6 bin/relax.linuxgccrelease bin/score.linuxgccrelease mode=release cxx_ver=4.3 extras=static

Version for Other Codes Used

The following must all be installed and available on your PATH:

sed and awk

Probe 2.12 or better: http://kinemage.biochem.duke.edu/software/probe.php

PyMOL 1.2 or better: http://www.pymol.org/ (some Linux distros make this available as a package)

NCBI BLAST+ tools 2.2.22 or better: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ Good Linux install instructions are here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html The "nr" database is also required, and is most easily installed using update_blastdb.pl as described in those instructions.

Weka 3.6 or better: weka.jar is included with the protocol capture in the "svm" directory. See COPYING-weka for copyright/redistribution details.

LibSVM 2.8.9: libsvm.jar is included with the protocol capture in the "svm" directory. See COPYRIGHT-libsvm for copyright/redistribution details.
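
A quick way to confirm that the required tools are on your PATH is sketched below. The executable names probe, pymol, psiblast and java are assumptions (the list above names the packages, not the binaries, and java is assumed only because Weka and LibSVM ship as .jar files); adjust them to match your installation.

```shell
# Report every tool from the argument list that is not found on PATH.
# Returns 0 if all tools are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}

# Executable names other than sed and awk are assumptions, not taken from the protocol.
check_tools sed awk probe pymol psiblast java || echo "install the missing tools before running the protocol"
```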

Other Comments

Running the temperature-sensitive allele prediction protocol

All the protocol needs to run is a starting structure. The structure must be available as a .pdb file, and consist of exactly one chain. Protocol scripts take as an argument the name of the protein, which must be the file name without the .pdb extension. For example, if the input file is YBR109C.pdb, the protein name is YBR109C. YBR109C.pdb is provided as part of the protocol for testing purposes.
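
The single-chain requirement can be checked before generating any scripts. The following is a minimal sketch, not part of the protocol: count_chains is a hypothetical helper that counts distinct chain identifiers in the ATOM records, relying on the fixed-width PDB format in which column 22 holds the chain ID.

```shell
# Count distinct chain identifiers among the ATOM records of a PDB file.
# Column 22 of an ATOM record holds the chain ID in the fixed-width PDB format.
count_chains() {
    awk '/^ATOM/ { chains[substr($0, 22, 1)] = 1 }
         END { n = 0; for (c in chains) n++; print n }' "$1"
}

# Warn if the input structure does not have exactly one chain.
pdb=YBR109C.pdb
if [ -f "$pdb" ] && [ "$(count_chains "$pdb")" -ne 1 ]; then
    echo "$pdb must contain exactly one chain" >&2
fi
```

HETATM records are ignored here, so ligands or waters stored under another chain ID will not be counted.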

The protocol is split into three steps: creating the Rosetta run scripts, performing the runs, and analyzing/predicting. This is so that the execution stage can be run on a different computer from the generation and prediction stages, as some clusters do not handle manipulating many small files well. All scripts live in the scripts/ directory.

Generating script files

Generating script files for the Rosetta runs is done using scripts/generate-scripts.sh. For details, execute
```
scripts/generate-scripts.sh -usage
```
This script requires five arguments: the protein name, the species abbreviation, the surface area accessibility cutoff, the path to the Rosetta executables, and the path to the Rosetta database. IMPORTANT: the Rosetta paths given must be the paths on the machine where the Rosetta runs will be performed! In other words, if you plan to generate the scripts on one computer and execute them on another, be sure the paths are valid for the execution computer.
For example, to generate scripts for predictions on the provided yeast protein YBR109C at all positions in the native structure with accessible surface area of 10% or less, using the Rosetta executables at ~/rosetta-3.0/rosetta3_source/bin and the Rosetta database at ~/rosetta-3.0/rosetta3_database (Bash derivatives only):
```
scripts/generate-scripts.sh -protein YBR109C -species Scer -cutoff 10 -mini_bin ~/rosetta-3.0/rosetta3_source/bin -mini_db ~/rosetta-3.0/rosetta3_database
```
This will generate shell scripts for each Rosetta run to be performed: one for each mutation at each position in the starting structure with accessible surface area <= 10%, plus one for the starting structure itself. The script for the starting structure will be called YBR109C-WT.sh, and the scripts for the mutations will be YBR109C-aNNNb.sh, where a is the native residue, NNN is the position (which may be any number of digits), and b is the mutated residue.
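
When post-processing the generated files, it can help to split a script name back into its parts. The sketch below assumes one-letter uppercase residue codes, as in the F140A example; parse_mutation is a hypothetical helper, not one of the protocol scripts.

```shell
# Split a mutation script name of the form PROTEIN-aNNNb.sh into its parts.
# Prints "native position mutant"; prints nothing for names that do not
# match the pattern, such as the wild-type script PROTEIN-WT.sh.
parse_mutation() {
    printf '%s\n' "$1" |
        sed -n 's/^.*-\([A-Z]\)\([0-9][0-9]*\)\([A-Z]\)\.sh$/\1 \2 \3/p'
}

parse_mutation YBR109C-F140A.sh   # prints: F 140 A
```
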

Performing Rosetta runs

The scripts have now been generated and are ready to be run. How they are executed is system-dependent. The following command will queue all runs on a cluster running TORQUE:
```
for a in *.sh; do qsub -d $(pwd) $a; done
```
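
Before moving on to prediction, it may be worth checking that every run produced a score file. The sketch below assumes that a run script NAME.sh writes a score file NAME.sc in the same directory; this is inferred from the relax command line shown earlier (YBR109C-F140A.sc) and is not confirmed for every generated script. list_unfinished is a hypothetical helper.

```shell
# List run scripts that have no non-empty matching score file yet.
# Assumption: script NAME.sh writes its score file to NAME.sc alongside it.
list_unfinished() {
    dir=${1:-.}
    for script in "$dir"/*.sh; do
        [ -e "$script" ] || continue        # directory contains no scripts
        base=${script%.sh}
        [ -s "$base.sc" ] || echo "${script##*/}"
    done
}
```

Calling `list_unfinished .` from the ts_mutant directory prints the scripts that may need to be resubmitted; no output means a score file was found for every script.
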

Analysis and prediction

Each of the Rosetta runs in the previous steps generates a score file. These score files are analyzed and used to predict ts mutations by the predict-ts script, which generates two ranked lists of predictions, one for each of the SVM classifiers. This stage includes a PSI-BLAST processing step, which currently takes 5-10 minutes on a reasonably new computer. Running the command below will generate the ranked lists:
```
scripts/predict-ts.sh -protein YBR109C
```
The ranked lists of predictions are now available as YBR109C-svmlin.txt and YBR109C-svmrbf.txt.