The scripts and input files that accompany this demo can be found in the
demos/public directory of the Rosetta weekly releases.
KEYWORDS: DESIGN STABILITY_IMPROVEMENT ANALYSIS
Uses Rosetta relax and score protocols, in combination with software listed below, to predict which mutations to a structure will result in a protein with a temperature-sensitive phenotype.
Prediction of temperature sensitive mutations, Chris Poultney, Bonneau Lab Design session, 8/4/2010
The function of a gene product is often interrogated by means of mutagenesis or knockout studies. However, in the case of essential genes such perturbations would simply result in the uninformative embryonic lethal phenotype. A solution is to design temperature-sensitive (ts) mutants, which exhibit a mutant phenotype only at high (or low) temperatures, providing a simple means to "switch on" a mutation at any stage. Finding ts mutations typically relies on generating and screening many thousands of mutations, which is an expensive and labor-intensive process. Here we describe an in-silico method that uses the Rosetta relax protocol and machine learning techniques to predict a highly accurate "top 5" list of ts mutations given a protein of interest.
scripts/generate-scripts.sh flags (all required):
scripts/predict-ts.sh flags (required):
These command lines are not called directly, but from scripts generated by the protocol (see "Other Comments"). All files are written to and read from the ts_mutant directory; the input_files and output_files directories are ignored.
~/rosetta-3.0/bin/relax.linuxgccrelease -database ~/rosetta-3.0/rosetta3_database -s YBR109C-F140A.pdb -native YBR109C.pdb -nstruct 50 -relax:fast -out:file:scorefile YBR109C-F140A.sc -out:pdb_gz
~/rosetta-3.0/bin/score.linuxgccrelease -database ~/rosetta-3.0/rosetta3_database -s YBR109C-F140A_????.pdb.gz -in:file:native YBR109C.pdb -in:file:fullatom -out:file:scorefile YBR109C-F140Arescore.sc
There are three steps (see "Other Comments" for details): generating scripts to run the protocols, running the scripts (usually on a cluster), and making ts predictions. All must be executed from the ts_mutant directory:
scripts/generate-scripts.sh -protein YBR109C -species Scer -cutoff 10 -mini_bin ~/rosetta-3.0/rosetta3_source/bin -mini_db ~/rosetta-3.0/rosetta3_database
for a in *.sh; do qsub -d $(pwd) $a; done
scripts/predict-ts.sh -protein YBR109C
Rosetta 3.0 release: https://svn.rosettacommons.org/source/branches/releases/rosetta-3.0
In order to compile Rosetta 3.0 on recent Linux distributions, the compilation settings need to be changed to use gcc 4.3. Switch into the rosetta-3.0/rosetta3_source directory and execute the following:
patch -p0 < [path_to_ts_protocol_dir]/patch/r30gcc43.patch
Then compile the relax and score executables, replacing the -j parameter (number of concurrent compilation threads) as appropriate:
scons -j 6 bin/relax.linuxgccrelease bin/score.linuxgccrelease mode=release cxx_ver=4.3 extras=static
The following must all be installed and available on your PATH:
sed and awk
Probe 2.12 or better: http://kinemage.biochem.duke.edu/software/probe.php
PyMOL 1.2 or better: http://www.pymol.org/ (some Linux distros make this available as a package)
NCBI BLAST+ tools 2.2.22 or better: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (good Linux install instructions are here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html ). The "nr" database is also required, and is most easily installed using update_blastdb.pl per the install instructions.
Weka 3.6 or better: weka.jar is included with the protocol capture in the "svm" directory. See COPYING-weka for copyright/redistribution details.
LibSVM 2.8.9: libsvm.jar is included with the protocol capture in the "svm" directory. See COPYRIGHT-libsvm for copyright/redistribution details.
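Before starting, it may be worth confirming that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of the protocol: the helper name missing_tools is hypothetical, and the executable names in the usage comment (probe, pymol, psiblast, java) are assumptions that may differ on your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the protocol).
# Prints each command that cannot be found on PATH, one per line,
# and returns non-zero if at least one is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call; the executable names are assumptions, adjust as needed:
# missing_tools sed awk probe pymol psiblast java
```

An empty result means every named command was found. This checks presence only, not the version requirements listed above.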
Running the temperature-sensitive allele prediction protocol
All the protocol needs to run is a starting structure. The structure must be available as a .pdb file and consist of exactly one chain. Protocol scripts take as an argument the name of the protein, which must be the file name without the .pdb extension. For example, if the input file is YBR109C.pdb, the protein name is YBR109C. YBR109C.pdb is provided as part of the protocol for testing purposes.
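The two input requirements above (protein name derived from the file name, exactly one chain) can be checked with a few lines of shell. This is a sketch under stated assumptions: the function names are hypothetical, and the chain count looks only at ATOM records, reading the chain identifier from column 22 as laid out in the PDB format.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with the protocol).

# Protein name as expected by the protocol scripts: file name minus ".pdb".
protein_name() {
    basename "$1" .pdb
}

# Number of distinct chain identifiers (column 22) among ATOM records.
count_chains() {
    awk '/^ATOM/ { chains[substr($0, 22, 1)] = 1 }
         END { n = 0; for (c in chains) n++; print n }' "$1"
}

# Example (assumes YBR109C.pdb is in the current directory):
# protein_name YBR109C.pdb
# count_chains YBR109C.pdb
```

If count_chains reports anything other than 1, the structure would need to be reduced to a single chain before it is used as input.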
The protocol is split into three steps: creating the Rosetta run scripts, performing the runs, and analyzing/predicting. This is so that the execution stage can be run on a different computer from the generation and prediction stages, as some clusters do not handle manipulating many small files well. All scripts live in the scripts/ directory.
Generating script files
Generating script files for the Rosetta runs is done using scripts/generate-scripts.sh. For details, execute
This script requires five arguments: the protein name, the species abbreviation, the accessible surface area cutoff (in percent), the path to the Rosetta executables, and the path to the Rosetta database. IMPORTANT: the Rosetta paths given must be the paths on the machine where the Rosetta runs will be performed! In other words, if you plan to generate the scripts on one computer and execute them on another, be sure the paths are valid for the execution computer.
For example, to generate scripts for predictions on the provided yeast protein YBR109C at all positions in the native structure with accessible surface area of 10% or less, using the Rosetta executables at ~/rosetta-3.0/rosetta3_source/bin and the Rosetta database at ~/rosetta-3.0/rosetta3_database (Bash derivatives only):
scripts/generate-scripts.sh -protein YBR109C -species Scer -cutoff 10 -mini_bin ~/rosetta-3.0/rosetta3_source/bin -mini_db ~/rosetta-3.0/rosetta3_database
This will generate shell scripts for each Rosetta run to be performed: one for each mutation at each position in the starting structure with accessible surface area <= 10%, plus one for the starting structure itself. The script for the starting structure will be called YBR109C-WT.sh, and the scripts for the mutations will be YBR109C-aNNNb.sh, where a is the native residue, NNN is the position (which may be any number of digits), and b is the mutated residue.
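The naming scheme for the generated scripts (PROTEIN-WT.sh and PROTEIN-aNNNb.sh) can be decoded mechanically, which is handy when tallying runs by position. The sketch below is illustrative only: parse_run_name is a hypothetical helper, and it assumes residues are written as single uppercase letters, as in YBR109C-F140A.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the protocol).
# Prints "PROTEIN WT" for the wild-type script, or
# "PROTEIN native position mutant" for a mutation script.
# Returns non-zero if the name does not follow the naming scheme.
parse_run_name() {
    local name parsed
    name=$(basename "$1" .sh)
    case $name in
        *-WT) printf '%s WT\n' "${name%-WT}"; return 0 ;;
    esac
    # Assumes single uppercase letters for the residues.
    parsed=$(printf '%s\n' "$name" |
        sed -n -E 's/^(.+)-([A-Z])([0-9]+)([A-Z])$/\1 \2 \3 \4/p')
    [ -n "$parsed" ] && printf '%s\n' "$parsed"
}

# Example:
# parse_run_name YBR109C-F140A.sh
```

For YBR109C-F140A.sh this yields the protein name, the native residue F, position 140, and the mutated residue A as four space-separated fields.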
Performing Rosetta runs
The scripts have now been generated and are ready to be run. Executing them is system-dependent. The following command will queue all runs on a cluster running TORQUE:
for a in *.sh; do qsub -d $(pwd) $a; done
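Once the queued jobs have finished, it can be useful to confirm that every run produced output before moving on. The sketch below lists generated scripts that have no non-empty score file next to them. It assumes that a script named NAME.sh writes its relax score file to NAME.sc in the same directory, which matches the relax command line shown earlier (YBR109C-F140A.sc) but is not otherwise confirmed here; runs_without_scores is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the protocol).
# Lists run scripts in a directory (default: current directory) that have
# no non-empty score file. Assumes NAME.sh writes NAME.sc alongside it.
runs_without_scores() {
    local dir=${1:-.} script
    for script in "$dir"/*.sh; do
        [ -e "$script" ] || continue   # glob matched nothing
        [ -s "${script%.sh}.sc" ] || basename "$script"
    done
}

# Example, run from the ts_mutant directory:
# runs_without_scores .
```

Any script names printed are candidates for resubmission; empty output suggests all runs wrote a score file.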
Analysis and prediction
Each of the Rosetta runs in the previous steps generates a score file. These score files are analyzed and used to predict ts mutations by the predict-ts script, which generates two ranked lists of predictions, one for each of the SVM classifiers. This stage includes a PSI-BLAST processing step, which currently takes 5-10 minutes on a reasonably new computer. Running the command below will generate the ranked lists:
scripts/predict-ts.sh -protein YBR109C
The ranked lists of predictions are now available as YBR109C-svmlin.txt and YBR109C-svmrbf.txt.