cluster.mpi.linuxgccrelease failed


  • Author
    • #1838

        Hi there,
        I was clustering a silent file containing my 10% lowest-energy decoys, and cluster.mpi.linuxgccrelease just stopped, printing the following on screen:

        mpirun noticed that process rank 4 with PID 19956 on node compute-1-5 exited on signal 9 (Killed).

        It seems to be a problem with mpirun rather than with the cluster.mpi.linuxgccrelease binary.
        I’m running cluster.mpi.linuxgccrelease with the following command line:
        mpirun -x LD_LIBRARY_PATH=$LIB --mca btl_tcp_if_include eth0 -np 20 --host compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20 $BIN/cluster.mpi.linuxgccrelease -in:file:fullatom -in:file:silent_struct_type binary -in:file:silent ecut_10.out -cluster:radius -1

        Did I miss some special MPIRUN option?
        Thanks in advance.

      • #9836

          The clustering code was never multi-processor-ized to my knowledge. I don’t think it should actually fail in MPI, but it certainly won’t work better than the non-MPI.

        • #9837

            Hi smlewis,
            Thanks for your reply. Judging by the output on screen, the MPI version seems to work reasonably well, but it doesn't write the expected clusters before dying. So I thought I had missed some mpirun option. Well, if you don't use the MPI cluster binary, who am I to use it? Thanks for sharing.

            EDIT: the information below might be useful to another user and/or the author.
            Feb 25 14:59:59 compute-1-20 kernel: Out of memory: Kill process 25129 (cluster.mpi.lin) score 445 or sacrifice child
            Feb 25 14:59:59 compute-1-20 kernel: Killed process 25129, UID 1006, (cluster.mpi.lin) total-vm:7986656kB, anon-rss:7777940kB, file-rss:2620kB
            For some reason the process was killed by the kernel's out-of-memory (OOM) killer. The same job completed with the non-MPI version, cluster.default.linuxgccrelease.

          • #9929

              This problem has been solved by decreasing the number of processes per worker node.
              Hope it helps.
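
              For example, with Open MPI you can cap the number of ranks launched on each node so that each rank has more memory headroom. This is only a sketch based on the original command; the `--npernode` flag is an Open MPI option, so adjust the syntax if you use a different MPI implementation:

              ```shell
              # Spread 20 ranks over 10 nodes, at most 2 ranks per node,
              # so each rank gets a larger share of that node's RAM.
              # (Open MPI syntax assumed; hosts and flags from the original command.)
              mpirun -x LD_LIBRARY_PATH=$LIB --mca btl_tcp_if_include eth0 \
                     -np 20 --npernode 2 \
                     --host compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20 \
                     $BIN/cluster.mpi.linuxgccrelease \
                     -in:file:fullatom \
                     -in:file:silent_struct_type binary \
                     -in:file:silent ecut_10.out \
                     -cluster:radius -1
              ```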
