- This topic has 3 replies, 2 voices, and was last updated 9 years, 8 months ago by Anonymous.
February 24, 2014 at 8:11 pm #1838 Anonymous
I was clustering a silent file with my 10% lowest-energy decoys and cluster.mpi.linuxgccrelease just stopped, printing the following on screen:
mpirun noticed that process rank 4 with PID 19956 on node compute-1-5 exited on signal 9 (Killed).
So it seems to be a problem with mpirun rather than with the cluster.mpi.linuxgccrelease binary itself.
I’m running cluster.mpi.linuxgccrelease with the following command line:
mpirun -x LD_LIBRARY_PATH=$LIB --mca btl_tcp_if_include eth0 -np 20 --host compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20 $BIN/cluster.mpi.linuxgccrelease -in:file:fullatom -in:file:silent_struct_type binary -in:file:silent ecut_10.out -cluster:radius -1
Did I miss some special MPIRUN option?
Thanks in advance.
February 24, 2014 at 8:27 pm #9836 Anonymous
The clustering code was never multi-processor-ized, to my knowledge. I don’t think it should actually fail under MPI, but it certainly won’t work any better than the non-MPI version.
February 25, 2014 at 7:04 pm #9837 Anonymous
Thanks for your reply. Judging by the on-screen output, the MPI version seems to work reasonably well, but it doesn’t write the expected clusters before dying. So I thought I had missed some mpirun option. Well, if you don’t use the MPI cluster binary, who am I to use it? Thanks for sharing.
EDIT: the information below might be useful to other users and/or the authors.
Feb 25 14:59:59 compute-1-20 kernel: Out of memory: Kill process 25129 (cluster.mpi.lin) score 445 or sacrifice child
Feb 25 14:59:59 compute-1-20 kernel: Killed process 25129, UID 1006, (cluster.mpi.lin) total-vm:7986656kB, anon-rss:7777940kB, file-rss:2620kB
For some reason the process was killed by the kernel’s OOM (out-of-memory) killer. The same job completed fine with the non-MPI version, cluster.default.linuxgccrelease.
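If anyone else hits a silent signal-9 kill under mpirun, a quick way to confirm it was the OOM killer is to grep the kernel log on the affected node. A small sketch against the two log lines quoted above (on a live node, `dmesg | grep -iE 'killed process|out of memory'` or grepping /var/log/messages does the same; the exact log path depends on the distro):

```shell
# Count kernel OOM-killer lines; both quoted lines above should match.
grep -icE 'killed process|out of memory' <<'EOF'
Feb 25 14:59:59 compute-1-20 kernel: Out of memory: Kill process 25129 (cluster.mpi.lin) score 445 or sacrifice child
Feb 25 14:59:59 compute-1-20 kernel: Killed process 25129, UID 1006, (cluster.mpi.lin) total-vm:7986656kB, anon-rss:7777940kB, file-rss:2620kB
EOF
```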
March 27, 2014 at 6:29 pm #9929 Anonymous
This problem was solved by decreasing the number of processes per work node.
Hope it helps.
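For reference, with Open MPI the per-node rank count can be capped with `-npernode` (or `--map-by ppr:N:node` in newer releases), so each rank gets a larger share of the node’s RAM. A sketch based on the command line from the first post, assuming one rank per node is enough to stay under the memory limit (the rank count and nodes are illustrative; tune them to your cluster):

```shell
# Cap at 1 process per node (-npernode 1) so each rank has the node's
# full RAM; total ranks reduced from 20 to 10 to match the 10 hosts.
mpirun -x LD_LIBRARY_PATH=$LIB --mca btl_tcp_if_include eth0 \
       -np 10 -npernode 1 \
       --host compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20 \
       $BIN/cluster.mpi.linuxgccrelease \
       -in:file:fullatom -in:file:silent_struct_type binary \
       -in:file:silent ecut_10.out -cluster:radius -1
```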