Trouble running MPI docking protocol, please help!

    • #3544
      Anonymous

         Hi all,

        I’m quite new to Rosetta (and to computational approaches in general; I’ve only been using Linux and bash-based interfaces for about 6 months), and I’ve spent a few weeks getting to grips with the docking process, which I think I now understand fairly well. I’ve run into a problem that I hope you can help me with. Apologies in advance for any naive, confused or incorrect terms; I’m still very much an amateur, so I might not explain myself in the best way.

        What I want to do

        I’ve organised my docking flags file and relaxed my input structure, and I’m now at the stage where I want to do a global docking production run of ~100,000 models or more over 100 CPU cores. Naturally, I’m using my university’s High Performance Computing hub for this. The hub uses SLURM as the resource manager and supports MPI, and the IT team have compiled and installed Rosetta with MPI. I’m attempting to run my simple global docking job over a number of CPUs and nodes. I’ve read the Rosetta MPI documentation, and as far as I understand it, it should be as straightforward as executing the MPI version of the docking program and giving SLURM the relevant information to allocate CPU/node resources.

        The problem I’m having

        The problem is that after I do this I can see the resources have been allocated, but I don’t think the CPUs are actually being utilised. I’ve done a few trial runs producing 10 models on 1 node (which has 28 CPU cores), with each run assigning more tasks to the node (1, 2, 4, 6, 8 and 10 tasks per node across 6 different runs, with 1 CPU core allocated to each task). The issue is that I’m not seeing anything like a linear reduction in processing time with the number of tasks; I would expect, for example, a node running 10 tasks under MPI to take roughly a tenth of the time of 1 node running 1 task. In fact I’m not seeing much improvement in processing time at all as the task count increases, so I suspect there is an issue with my setup. I’m fairly sure the SLURM submission script is fine, because I can see that (for example, in the 10-task test run) 10 CPUs were allocated to the job, which makes me suspect the Rosetta side isn’t working as intended. If anyone could have a look and suggest what I might be doing wrong, I’d be forever grateful! I’ve put the table of tasks per node vs processing time and the script information below.

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        Processing time relative to tasks allocated per job

        Tasks per node    Processing time
              1                541
              2                334
              4                387
              6                336
              8                358
             10                293
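
        For reference, here is a rough best-case sketch of what I would even expect in theory (my own back-of-the-envelope numbers, assuming the MPI build distributes the 10 models from one master rank to the remaining ranks, which is how I understand the Rosetta MPI job distributor works, and that every model takes about the same time):


        #!/bin/bash
        # Hypothetical best-case scaling for 10 models, using the observed
        # single-task time (541) purely for scale. With one rank acting as
        # master, N tasks give N-1 workers, and the wall time is set by
        # ceil(nstruct / workers) rounds of model building.
        NSTRUCT=10
        T_SERIAL=541
        for NTASKS in 1 2 4 6 8 10; do
            if [ "$NTASKS" -gt 1 ]; then WORKERS=$((NTASKS - 1)); else WORKERS=1; fi
            BATCHES=$(( (NSTRUCT + WORKERS - 1) / WORKERS ))   # ceil(NSTRUCT / WORKERS)
            echo "${NTASKS} tasks -> ideal time ~ $(( T_SERIAL * BATCHES / NSTRUCT ))"
        done

        Even in that ideal case the curve flattens out quickly for only 10 models, although the measured times above are still a long way from it.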

         

        Submission script (saved as test_script.sh)


        #!/bin/bash

        #SBATCH --job-name=test_job
        #SBATCH --partition=test
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=<either 1, 2, 4, 6, 8 or 10>
        #SBATCH --cpus-per-task=1
        #SBATCH --time=00:10:00
        #SBATCH --mem-per-cpu=1000M

        module load apps/rosetta/2018.33

        export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
        srun docking_protocol.mpi.linuxgccrelease @test_flag
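
        As a side note on the <either 1, 2, 4, 6, 8 or 10> placeholder above: since sbatch command-line options take precedence over the matching #SBATCH lines, the six trial runs can be submitted without editing the script each time, along these lines:


        # Submit the same script with different task counts; the command-line
        # value overrides the --ntasks-per-node line inside test_script.sh.
        for N in 1 2 4 6 8 10; do
            sbatch --ntasks-per-node=$N test_script.sh
        done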

         

        Docking flags file (saved as test_flag)


        -in:file:s F1_didomain_global.pdb

        -nstruct 10

        -partners A_B
        -dock_pert 3 24
        -spin
        -randomize1
        -randomize2

        -ex1
        -ex2aro

        -out:suffix _test

        -score:docking_interface_score 1

         

        Files in my working directory


        [n00baccount@topsecretHPC tester]$ ls -lct
        total 512
        -rw-r--r-- 1 n00baccount bioc 292 Aug 17 10:22 test_script.sh
        -rw-r--r-- 1 n00baccount bioc 175 Aug 17 09:34 test_flag
        -rw-r--r-- 1 n00baccount bioc 359148 Aug 16 22:05 F1_didomain_global.pdb

         

        Example of submission


        [n00baccount@topsecretHPC tester]$ sbatch test_script.sh
        Submitted batch job 3961469

         

        Example of sacct output showing resources allocated for two different jobs (1 task per node vs 10 tasks per node, with 1 CPU per task)


        [n00baccount@topsecretHPC tester]$ sacct -u n00baccount
               JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
        3961468         test_job      test     default          1    RUNNING      0:0
        3961468.0     docking_p+                default          1    RUNNING      0:0
        3961469         test_job      test     default         10    RUNNING      0:0
        3961469.0     docking_p+                default         10    RUNNING      0:0
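
        Allocation on its own presumably doesn’t tell me whether the cores are actually busy; if it helps, this is the kind of check I can run (assuming SLURM accounting is enabled on our cluster):


        # Once a test run has finished, a 10-CPU step that genuinely used its
        # cores should report TotalCPU well above Elapsed; 3961469 is the job
        # ID from the submission example above.
        sacct -j 3961469 --format=JobID,AllocCPUS,Elapsed,TotalCPU,State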

         

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        Any help would be very very much appreciated.

        Thanks very much for reading!

        Rob

      • #15508
        Anonymous

          Hi Rob,

          This is more a question of how to use MPI correctly with Rosetta, and that depends on how Rosetta was compiled with MPI on your cluster. I can share what my SLURM scripts look like. The one below will run 59 (not 60; one rank acts as the master) independent docking processes.

          Also, whenever running any Rosetta protocol with MPI, be sure to include the following flags:


          -jd2:failed_job_exception false
          -mpi_tracer_to_file logs/docking_tracer.out

           

          Hope this helps,

          Shourya


          #!/bin/sh

          #SBATCH --job-name=docking_job
          #SBATCH --partition=mpi
          #SBATCH --time=60:0:0
          #SBATCH --ntasks=60
          #SBATCH --mem=200GB
          #SBATCH --output logs/docking.%j.out
          #SBATCH --error logs/docking.%j.err

          # loading and unloading modules
          module load gcc/6.2.0
          module load openmpi/3.1.0

          # job description
          ROSETTABIN=$HOME/Rosetta/main/source/bin
          ROSETTAEXE=docking_protocol
          COMPILER=mpi.linuxgccrelease
          EXE=$ROSETTABIN/$ROSETTAEXE.$COMPILER

          # running with a date and time stamp
          echo Starting MPI job running $EXE
          date
          ulimit -l unlimited
          time mpirun -np 60 $EXE @docking_flags
          date
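
          One practical note on the paths above: as far as I know, neither SLURM (for the --output/--error files) nor Rosetta (for the -mpi_tracer_to_file target) will create the logs/ directory for you, so make sure it exists before submitting:


          # Create the log directory once, then submit.
          # (The submission script name here is just a placeholder.)
          mkdir -p logs
          sbatch my_docking_script.sh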

           

          • #15526
            Anonymous

              Hi Shourya,

              Thanks a lot for commenting; I fear this was a learning curve in basic script-writing!

              After asking around my lab, it turns out I did not need to use the MPI version of the software at all! I was under the impression I had to, but apparently when submitting through SLURM as a job array, the normal (non-MPI) version of the docking protocol is fine to use and the jobs are distributed across the nodes correctly, so long as you add the appropriate prefix, suffix and silent-file outputs so the array tasks don’t overwrite each other.

              For posterity, in case anybody comes across this post and has the same issue, this is the script I used:


              #!/bin/bash

              #SBATCH --job-name=N00b_j0b
              #SBATCH --partition=serial
              #SBATCH --nodes=2
              #SBATCH --ntasks-per-node=1
              #SBATCH --cpus-per-task=1
              #SBATCH --time=72:00:00
              #SBATCH --mem-per-cpu=4500M
              #SBATCH --array=1-48

              module load apps/rosetta/2018.33

              srun docking_protocol.linuxgccrelease @Flag_48_arrays -out:suffix $SLURM_ARRAY_TASK_ID -out:prefix $SLURM_JOBID -out:file:silent 48_array_docks_$SLURM_ARRAY_TASK_ID
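
              As a rough throughput check (my own arithmetic, assuming each array task ends up running a single docking process with the -nstruct value from Flag_48_arrays): with 48 array tasks the total output is 48 × nstruct models, so reaching the ~100,000-model target from my original post needs a per-task -nstruct of about 2084:


              # Hypothetical arithmetic: models needed per array task for ~100,000 total
              TOTAL=100000
              TASKS=48
              echo $(( (TOTAL + TASKS - 1) / TASKS ))   # prints 2084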

              Thanks again for your help; I’ll be sure to refer back to this comment if I ever need MPI help!

              Rob
