Trouble running MPI docking protocol, please help!
- This topic has 2 replies, 2 voices, and was last updated 4 years, 3 months ago by Anonymous.
August 18, 2020 at 1:40 pm #3544 Anonymous
Hi all,
I’m quite new to Rosetta and to computational approaches in general (I’ve only been using Linux and bash-based interfaces for about 6 months). I’ve spent a few weeks trying to understand the docking process, which I think I now understand fairly well, but I’ve run into a problem that I hope you can help me with. Apologies for any naivety, confusion or incorrect terms; I’m still very much an amateur, so I might not explain myself in the best way.
What I want to do
I’ve set up my docking flags file and relaxed my input structure, and I’m now at the stage where I want to do a global docking production run of ~100,000 models or more over 100 CPU cores. For this I’m using my university’s High Performance Computing hub. The hub uses SLURM as the resource manager and supports MPI, and the IT team have compiled and installed Rosetta with MPI. I’m attempting to run my simple global docking script over a number of CPUs and nodes. I’ve read the Rosetta MPI information and, as far as I understand it, it should be as straightforward as executing the MPI version of the docking program and giving SLURM the relevant information to allocate CPU/node resources.
The problem I’m having
After I do this I can see that the resources have been allocated, but I don’t think the CPUs are being utilised. I’ve done a few trial runs to produce 10 models on 1 node (which has 28 CPU cores), with each run assigning more tasks to the node (1, 2, 4, 6, 8 and 10 tasks per node in 6 different runs, with 1 CPU core allocated to each task). I’m not seeing a linear reduction in processing time as the number of tasks goes up; I would expect, for example, that a node running 10 tasks using MPI would take ~1/10th of the time of a node running 1 task. Since processing time barely improves as the task count increases, I suspect there is an issue with my code. I’m fairly sure the SLURM resource request is fine, because I can see (for example, in the 10-task test run) that 10 CPUs were allocated to the job, which makes me suspect that the Rosetta command isn’t working as intended. If anyone could have a look and suggest what I might be doing wrong, I’d be forever grateful! The table of tasks per node vs processing time and the script information are below.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Processing time relative to tasks allocated per job
Tasks per node    Process time
1                 541
2                 334
4                 387
6                 336
8                 358
10                293
Rosetta script (saved as test_script.sh)
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<either 1, 2, 4, 6, 8 or 10>
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1000M
module load apps/rosetta/2018.33
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun docking_protocol.mpi.linuxgccrelease @test_flag
Docking flag (saved as test_flag)
-in:file:s F1_didomain_global.pdb
-nstruct 10
-partners A_B
-dock_pert 3 24
-spin
-randomize1
-randomize2
-ex1
-ex2aro
-out:suffix _test
-score:docking_interface_score 1
Files in my working directory
[n00baccount@topsecretHPC tester]$ ls -lct
total 512
-rw-r--r-- 1 n00baccount bioc 292 Aug 17 10:22 test_script.sh
-rw-r--r-- 1 n00baccount bioc 175 Aug 17 09:34 test_flag
-rw-r--r-- 1 n00baccount bioc 359148 Aug 16 22:05 F1_didomain_global.pdb
Example of submission
[n00baccount@topsecretHPC tester]$ sbatch test_script.sh
Submitted batch job 3961469
Example of SLURM showing resources being allocated for two different jobs (1 task per node vs 10 tasks per node with 1 CPU per task)
[n00baccount@topsecretHPC tester]$ sacct -u n00baccount
JobID        JobName     Partition  Account  AllocCPUS  State    ExitCode
3961468      test_job    test       default  1          RUNNING  0:0
3961468.0    docking_p+             default  1          RUNNING  0:0
3961469      test_job    test       default  10         RUNNING  0:0
3961469.0    docking_p+             default  10         RUNNING  0:0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Any help would be very very much appreciated.
Thanks very much for reading!
Rob
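For anyone reading along, the scaling in the table above can be put into numbers. This is only a minimal sketch: it takes the six timings exactly as posted (units as reported there) and prints, for each run, the speedup relative to the 1-task run and the parallel efficiency (speedup divided by task count, as a percentage). The function name `speedup_table` is made up for illustration.

```shell
#!/bin/bash
# Timings copied from the table in the post above: "tasks time" per line.
timings="1 541
2 334
4 387
6 336
8 358
10 293"

# Print: tasks, observed time, speedup vs the first row, efficiency in percent.
speedup_table() {
    printf '%s\n' "$1" | awk '
        NR == 1 { base = $2 }   # first row is the 1-task baseline
        { printf "%d %d %.2f %.1f\n", $1, $2, base / $2, 100 * base / ($2 * $1) }'
}

speedup_table "$timings"
```

With ideal scaling the third column would track the task count and the last column would stay near 100. On the posted numbers the 10-task run is only about 1.85 times faster than the 1-task run, which is consistent with the suspicion that the extra tasks are not sharing the work.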
-
September 20, 2020 at 11:06 pm #15508 Anonymous
Hi Rob,
This is more a question of how to use MPI correctly with Rosetta, and that depends on how Rosetta was installed with MPI. I can share what my SLURM scripts look like. The one attached will run 59 independent docking processes (not 60, because one of the 60 MPI processes acts as the master).
Also, whenever running any Rosetta process with mpi, be sure to include the following flags:
-jd2:failed_job_exception false
-mpi_tracer_to_file logs/docking_tracer.out
Hope this helps,
Shourya
#!/bin/sh
#SBATCH --job-name=docking_job
#SBATCH --partition=mpi
#SBATCH --time=60:0:0
#SBATCH --ntasks=60
#SBATCH --mem=200GB
#SBATCH --output logs/docking.%j.out
#SBATCH --error logs/docking.%j.err
# loading and unloading modules
module load gcc/6.2.0
module load openmpi/3.1.0
# job description
ROSETTABIN=$HOME/Rosetta/main/source/bin
ROSETTAEXE=docking_protocol
COMPILER=mpi.linuxgccrelease
EXE=$ROSETTABIN/$ROSETTAEXE.$COMPILER
# running with a date and time stamp
echo Starting MPI job running $EXE
date
ulimit -l unlimited
time mpirun -np 60 $EXE @docking_flags
date
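One practical note on the script above: it writes its SLURM output and error files (and the tracer file from the flags) into a `logs/` directory. SLURM does not create a missing directory for `--output` or `--error`, and it seems safest to assume the tracer path needs it too, so the directory should exist before submitting. A minimal sketch, where `docking_job.sh` is a hypothetical file name for the script above:

```shell
#!/bin/bash
# Hypothetical name for the SLURM script shown above; adjust to your own file.
job_script="docking_job.sh"

# Create the directory used by "#SBATCH --output logs/..." and
# "-mpi_tracer_to_file logs/..." before the job starts.
mkdir -p logs

# Only submit when sbatch is available and the script is present.
if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    sbatch "$job_script"
else
    echo "Not submitting: sbatch or $job_script not found" >&2
fi
```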
-
September 25, 2020 at 12:45 pm #15526 Anonymous
Hi Shourya,
Thanks a lot for commenting, I fear that this was a learning curve in basic script-writing!
After asking around my lab, it appears that I did not need to use the MPI version of the software at all! I was under the impression I had to, but apparently, when submitting a script on SLURM, the normal version of the docking protocol in Rosetta is fine to use and the work is distributed across the nodes correctly, so long as you add the appropriate prefix, suffix and silent file outputs.
For posterity, in case anybody comes across this post and has the same issue, this is the script I used:
#!/bin/bash
#SBATCH --job-name=N00b_j0b
#SBATCH --partition=serial
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=72:00:00
#SBATCH --mem-per-cpu=4500M
#SBATCH --array=1-48
module load apps/rosetta/2018.33
srun docking_protocol.linuxgccrelease @Flag_48_arrays -out:suffix $SLURM_ARRAY_TASK_ID -out:prefix $SLURM_JOBID -out:file:silent 48_array_docks_$SLURM_ARRAY_TASK_ID
Thanks again for your help, I’ll be sure to refer to this comment again if I ever need MPI help!
Rob
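With the array approach each array task is an independent run, so the total number of models is the per-task `-nstruct` multiplied by the number of array tasks. Here is a rough sketch of that arithmetic for the ~100,000-model target from the first post and the 48 array tasks used above. It assumes `-nstruct` is set in the flags file (its contents were not posted), and the helper name `per_task_nstruct` is made up for illustration.

```shell
#!/bin/bash
# Ceiling division: models each array task must produce to reach the total.
per_task_nstruct() {
    local total=$1 tasks=$2
    echo $(( (total + tasks - 1) / tasks ))
}

total_models=100000   # target mentioned in the first post
array_tasks=48        # matches "#SBATCH --array=1-48"

nstruct=$(per_task_nstruct "$total_models" "$array_tasks")
printf 'nstruct per array task: %d (total models: %d)\n' \
    "$nstruct" "$(( nstruct * array_tasks ))"
```

Rounding up means the run slightly overshoots the target rather than falling short; the per-task silent files named via `$SLURM_ARRAY_TASK_ID` then each hold one task's share of the models.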