running in MPI mode and multiple scores per output PDB file?
- This topic has 7 replies, 3 voices, and was last updated 5 years, 2 months ago by Anonymous.
October 30, 2019 at 4:42 pm #3294 Anonymous
Hi Forum
I recently did a Rosetta fixbb run with MPI and found that the score file had a lot more lines of output than there were actual PDB files. Specifically, I’ve got 353 scores in score.sc but only 12 PDB files. Is it possible that the parallel processors are simply overwriting the PDBs? Is there a flag I should be including to avoid this?
Thanks!
-
October 30, 2019 at 6:47 pm #15030 Anonymous
353/12 is not a whole number, but otherwise this is 100% the symptom of “you didn’t actually run in MPI”. This is what happens if you run non-MPI-compiled Rosetta (with or without mpiexec): each process does the full serial run on its own, writing the same PDB names while appending to the same score file. I assume you used -nstruct 12.
Does your Rosetta binary have `mpi` in its name? It should be rosetta-app-name.mpi.(system)(compiler)(mode)
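One way to check for this symptom from the score file alone: if several processes each ran the whole job serially, the same decoy tag should show up more than once in score.sc. Below is a minimal sketch, assuming the usual score file layout in which data lines start with `SCORE:` and the last column (`description`) holds the decoy tag. The function name `count_decoy_tags` is made up for illustration.

```shell
# Print "<count> <tag>" for every decoy tag found in a Rosetta score file.
# Assumption: data lines start with "SCORE:" and the last column is the tag;
# the header line (last column "description") is skipped.
count_decoy_tags() {
    awk '$1 == "SCORE:" && $NF != "description" { n[$NF]++ }
         END { for (t in n) print n[t], t }' "$1" | sort -k2
}
```

Calling `count_decoy_tags score.sc` and seeing counts well above 1 for each tag would be consistent with independent serial runs overwriting each other's PDBs.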
-
November 1, 2019 at 5:16 pm #15032 Anonymous
Is it possible that, even though the binary has ‘mpi’ in the name, it wasn’t compiled correctly? Is there a unit test or something for MPI-compiled Rosetta?
-
November 4, 2019 at 11:41 pm #15044 Anonymous
No particularly useful tests I know of. Rocco’s comment about the tracer tags with MPI rank below might be diagnostic. The log files themselves should say something too; I haven’t done a run in a while, but probably the job distributor choice is announced and you’ll see it in a log line near the top.
-
November 1, 2019 at 5:17 pm #15033 Anonymous
Yup, the binary does have mpi in the name:
mpiexec $HOME/rosetta_src_2019.22.60749_bundle/main/source/bin/fixbb.mpi.linuxgccrelease -s filename.pdb -ex1 -ex2 -resfile resfile.txt -nstruct 15 -overwrite -linmem_ig 10
The numbers probably don’t work out just right because I hit the walltime and the machine killed the job before it finished.
-
November 1, 2019 at 7:27 pm #15031 Anonymous
(comment removed and resubmitted as direct reply to previous poster)
-
November 4, 2019 at 7:56 pm #15035 Anonymous
I’m wondering if it might be an MPI version mismatch. That is, if you compile with OpenMPI libraries, say, but your mpiexec is the MPICH2 version, then the MPICH2 launcher won’t necessarily set things up properly for OpenMPI, and you might end up having each process think it’s running serially, despite being under an MPI launcher.
Double check your compilation settings and where your mpiexec is coming from (e.g. `which mpiexec`). Sometimes with clusters you get a mixed environment where mpiexec goes to MPICH2 (for example), but mpirun goes to OpenMPI (or vice versa, etc.).
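A rough sketch of that check is below. The function names `describe_launcher` and `linked_mpi_libs` are made up for illustration. It assumes the launcher accepts `--version` and that the Rosetta binary is dynamically linked, so `ldd` can list its MPI libraries.

```shell
# Report where an MPI launcher on PATH resolves to and what version it reports.
describe_launcher() {
    if resolved=$(command -v "$1"); then
        echo "$1 -> $resolved"
        # Assumption: the launcher understands --version.
        "$1" --version 2>&1 | head -n 3
    else
        echo "$1 -> not found on PATH"
    fi
}

# List the MPI shared libraries a binary is dynamically linked against.
linked_mpi_libs() {
    ldd "$1" 2>/dev/null | grep -i mpi \
        || echo "no MPI library found in ldd output for $1"
}

# Example calls (binary path taken from the command earlier in the thread):
# describe_launcher mpiexec
# describe_launcher mpirun
# linked_mpi_libs "$HOME/rosetta_src_2019.22.60749_bundle/main/source/bin/fixbb.mpi.linuxgccrelease"
```

If the launcher reports one MPI implementation or version and the linked libraries belong to another, that points to a mismatch. It may be worth running the same calls inside a batch job on a compute node as well as on the login node, since the two environments can differ.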
The other thing to take a look at is the tracer output. If MPI is properly set up, there should be an annotation about the MPI process in parentheses for each line. If that’s missing, or if it’s all ‘(0)’ (with no other numbers, despite launching multiple processes in MPI), then it could be that the MPI environment is not set up correctly for Rosetta to realize it’s running under MPI, and it may be running serially. There may be other information in the tracer about how things are running under MPI as well.
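The tracer check can be scripted roughly as follows. This sketch assumes the run's output was saved to a log file and that each tracer line looks like `channel.name: (rank) message`. Both the line format and the function name `list_mpi_ranks` are assumptions, so adjust the pattern to match your actual log.

```shell
# Summarize how many tracer lines each MPI rank produced in a log file.
# Assumption: lines look like "some.channel: (3) message text".
list_mpi_ranks() {
    sed -nE 's/^[^ ]+: \(([0-9]+)\).*/\1/p' "$1" \
        | sort -n | uniq -c \
        | awk '{ print "rank " $2 ": " $1 " lines" }'
}
```

With a hypothetical log name, `list_mpi_ranks rosetta.log` printing nothing, or only `rank 0`, after launching several processes would match the "running serially" case described above. Several distinct ranks would suggest that Rosetta sees the MPI environment.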
-
November 6, 2019 at 5:51 pm #15049 Anonymous
Yes!!! That seems to have been the problem! The version of Open MPI on the head node was different from that on the compute node. All fixed now!
Thank you all for your help!!