MPI stall – Rosetta Commons

This topic has 9 replies, 2 voices, and was last updated 12 years, 4 months ago by Anonymous.

Author

Posts
- October 11, 2013 at 1:59 am #1723
  Anonymous
  Hi all,
  
  Has anyone ever had any problems with the master MPI process stalling as it’s cleaning up / giving more work to a slave or a little afterward?
  
  I am running more tests tomorrow, but I have had it stall (while still showing 100% CPU) with its output garbled like this just before it stops communicating with the rest of the slaves. Once it stops communicating, all the slaves continue to run, but nothing gets output and the job needs to be cancelled. Also, no error message is reported from MPI or Rosetta. It just silently stops:
  
  protocols.jd2.MPIWorkPoolJobDistributor: (81) Slave Node 81: Finished job successfully! Sendiprotocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from ng output reque 81 with tag 30
  protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Nst to master.
  ode: Receivedprotocols.jd2.MPIWorkPoolJobDistributor: (81) Slave job Node 81: Receivesd output confirmuccess message for jation from master.ob id 65 from node 81 blocking till output is d Writing oone
  utput.
  protocols.jd2.MPIWorkPoolJobDistributor: (81) Slave Node 81: Finished writing output in 0.3 seconds. Sendiprotocols.jd2.MPIWorkPoolJobDistributor: (0) Masterng messag Node: Received job oute to mastput finish messageer
  for job id protocols.jd2.MPIWorkPoolJobDistributor: (81) Slave0 f Node 81: Requesting rom node new job id from mas81
  protocols.jd2.MPIWorkPoolJobDistributor: (0) Mter
  aster Node: Waiting for 56 slaves to finish jobs
  protocols.jd2.MPIWorkPoolJobDistributor: (81) Slavprotocols.jd2.MPIWorkPoolJobDistributor: (0) Mastee Node 81: Receir Node: Received message from 81 ved job id 0 frwith tag 10
  om master
  protocols.jd2.MPIWorkPoolJobDistributor: (0) Master protocols.jd2.JobDistributor: (81) no more bNode: Sendinatches to procg spin down ess…
  signal to node 81
  protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: protocols.jd2.JobDistributor: (81) 134 joWaiting bs considfor 5ered, 1 jo5 slaves to bs attempted finish jobs
  in 191005 seconds
  
  And this:
  
  ode: Received jobprotocols.jd2.MPIWorkPoolJobDistributor: (21) Slave success messa Node 2ge for job id1: Received out 119 from nodput confirmation from master. Write 21 blocking ing output.
  till output is done
  protocols.jd2.MPIWorkPoolJobDistributor: (21) Slave Node 21: Finished writing output in 0.28 seconds. Sending message to master
  protocols.jd2.MPIWorkPoolJobDistributor: (0) Masterprotocols.jd2.MPIWorkPoolJobDistributor: (21) Slav Node: Received job e Node 21: Requesting output finish mesnew job id from mastersage for job
  id 133 from node 21
  protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Nprotocols.jd2.MPIWorkPoolJobDistributor: (21) Slave ode: WaitinNode 21: Received job ig for job reqd 133 from master
  uests…
  protocols.jd2.PDBJobInputter: (21) PDBJprotocols.jd2.MPIWorkPoolJobDistributor: (0) Master obInputter::pose_frNode: Received messagom_job
  protocols.jd2.PDBJobInputter: (21) fe fr
  
  We are running mpiexec (OpenRTE) 1.6.3
  
  Thanks for any help!
  
  -Jared
- October 11, 2013 at 2:57 pm #9394
  Anonymous
  Are you using -mpi_tracer_to_file (filestem)? That will preclude garbling by putting different nodes in different output files. I haven’t seen the behavior but it makes me wonder if you somehow have two head nodes…do the tracers’ node tags add up correctly?
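One way to check whether the tracers' node tags add up is to scan the log for every rank that reports itself as the master. The following is a minimal Python sketch, not part of Rosetta. It assumes the tracer prefix format seen in the logs above (`(rank) Master Node:` and `(rank) Slave Node N:`), and `summarize_node_tags` is a hypothetical helper name. On interleaved output some prefixes will be split and missed, so treat the result as a hint rather than proof.

```python
import re

# Matches tracer prefixes such as
#   "... (81) Slave Node 81:"  and  "... (0) Master Node:"
TAG = re.compile(r"\((\d+)\) (Master|Slave) Node(?: (\d+))?:")


def summarize_node_tags(log_text):
    """Return (master_ranks, mismatches) found in a jd2 MPI log.

    master_ranks: set of ranks whose tracer lines say "Master Node".
    mismatches:   set of (tracer_rank, claimed_rank) pairs where a slave's
                  tracer rank differs from the node number in its message.
    """
    master_ranks = set()
    mismatches = set()
    for rank, role, claimed in TAG.findall(log_text):
        rank = int(rank)
        if role == "Master":
            master_ranks.add(rank)
        elif claimed and int(claimed) != rank:
            mismatches.add((rank, int(claimed)))
    return master_ranks, mismatches
```

If the first set contains anything other than rank 0 alone, that would be consistent with two head nodes; a non-empty second set would suggest the tags do not line up.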
- October 14, 2013 at 3:38 pm #9414
  Anonymous
  A follow up to the discussion from our cluster admin:
  
  #########
  J,
  The mpd daemons on nodes 63 and 64 failed for some reason leaving the mpd ring broken. To list the state of the mpd ring use /apps/mpich2/bin/mpdtrace -l. I have restarted the ring and will start a search for more robust code.
  #########
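The ring check the admin describes can be scripted so that a broken ring is noticed before a long run is submitted. This is a rough Python sketch under stated assumptions: it assumes `mpdtrace -l` prints one `hostname_port (ip)` entry per line, and the expected host names are whatever your cluster uses (the ones in the usage note are made up). `hosts_in_ring`, `missing_from_ring`, and `query_ring` are hypothetical helper names.

```python
import subprocess

MPDTRACE = "/apps/mpich2/bin/mpdtrace"  # path quoted in the admin's message


def hosts_in_ring(trace_output):
    """Extract host names from mpdtrace -l output.

    Assumes each non-empty line starts with "hostname_port"; a token
    without a numeric "_port" suffix is taken as the host name itself.
    """
    hosts = set()
    for line in trace_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, sep, port = fields[0].rpartition("_")
        hosts.add(name if sep and port.isdigit() else fields[0])
    return hosts


def missing_from_ring(trace_output, expected_hosts):
    """Return the expected hosts that do not appear in the ring, sorted."""
    return sorted(set(expected_hosts) - hosts_in_ring(trace_output))


def query_ring():
    """Run mpdtrace -l and return its stdout (requires a running mpd ring)."""
    result = subprocess.run(
        [MPDTRACE, "-l"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `missing_from_ring(query_ring(), ["node63", "node64"])` would list whichever of those hosts has dropped out of the ring.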
- October 11, 2013 at 3:37 pm #9397
  Anonymous
  I’m not using any extra MPI options, though the option you suggested may be handy (Is it generally recommended?).
  
  I don’t know how I would have two head nodes… Looking through the log file, there seems to be only one. I emailed you the log file if you can take a quick look. I have read that Open MPI can stall on different occasions, and looking at its bug tracker, it’s a bit overwhelming to try to determine how or why it’s happening via some bug in the MPI libraries that just happens to dislike our cluster. Do you use Open MPI or MPICH2 on your cluster?
- October 11, 2013 at 4:06 pm #9399
  Anonymous
  That flag is pretty much strictly necessary for MPI debugging, and strongly encouraged with MPI if you are not using -mute all instead…because otherwise you get garbage data. We’ve never discussed defaulting it to true; we could, but we’d need to put a system in to make sure the new log files won’t overwrite any existing files.
  
  I’ve used Rosetta with both OpenMPI and MPICH2.
  
  The log file you sent me does not seem to indicate an error… the last report from the head node is that it’s waiting for a request.
- October 11, 2013 at 4:31 pm #9400
  Anonymous
  Thanks Steven.
  
  I’ll add that to my flags and try to debug it further. Perhaps a slave node failed and something went wrong with the master-slave communication. Yeah, no error, just a stall. It stayed like that for almost a day; meanwhile, on the cluster page, all my processes were running at 100%. Other large runs came out fine, and the structures that were complete came out fine as well. Do you know which version of OpenMPI you have run it on?
- October 11, 2013 at 5:09 pm #9401
  Anonymous
  I’ve seen issues like this on the killdevil cluster at UNC that I no longer have access to, but I generally just wrote it off as “hardware” and it recurred so rarely/irreproducibly that it was never worth worrying about further. (That’s MPICH2). (Given your situation, unless it reproduces reliably, I’d ignore it too).
  
  Contador in Brian’s lab uses OpenMPI, apparently version 1.5. We don’t use it as a “cluster” and I don’t think anyone’s seen this there.
  
  Can you log into slave nodes directly to run “top” and see what they are actually doing? The clusters I have used have allowed you to directly ssh into slave nodes…they’ll get angry if you use it to run jobs, but it’s cool for debugging.
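Checking several slave nodes by hand gets tedious, so the `top` check can be wrapped in a small script. This is a Python sketch only: it assumes passwordless ssh to the compute nodes is permitted and that the nodes have a `top` that supports batch mode (`-b -n 1`). The host names and the process-name substring are placeholders, and `top_command`, `matching_lines`, and `snapshot` are hypothetical helper names.

```python
import subprocess


def top_command(host):
    """Build an ssh command that takes one batch-mode top snapshot on host."""
    return ["ssh", "-o", "BatchMode=yes", host, "top", "-b", "-n", "1"]


def matching_lines(top_output, needle):
    """Return the lines of top output that mention needle (case-insensitive)."""
    needle = needle.lower()
    return [ln for ln in top_output.splitlines() if needle in ln.lower()]


def snapshot(host, needle, timeout=30):
    """Run top on host over ssh and keep only lines mentioning needle."""
    result = subprocess.run(
        top_command(host), capture_output=True, text=True, timeout=timeout
    )
    return matching_lines(result.stdout, needle)
```

Calling `snapshot("node21", "rosetta")` for each node in the job would show whether the Rosetta processes there are still consuming CPU; both arguments are illustrative and depend on your cluster and executable name.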
- October 11, 2013 at 6:56 pm #9402
  Anonymous
  It happened twice in a row, out of 3 total runs. The first may have been because people were oversubscribing nodes and our grid engine has no MPI communication, so the master could have just given up. I just started using MPI, so I’m not sure how common it will be for us, but I’ll run a few more and see what happens. There is a top-like webpage for our cluster that lists all the processes and their speed, memory, etc. I’ll have our cluster admin update OpenMPI and see if it keeps happening. It didn’t happen on the numerous test runs, which were short but used the same number of nodes/processors/structures.
  
  If all else fails, I guess it’s back to batch jobs…
- October 11, 2013 at 7:26 pm #9403
  Anonymous
  This is “bad practice”, but if they’re failing at the _end_ of the run, just let them fail. Unless you need precisely 10000 structures for some reason (like statistics-sensitive thermodynamic ensemble work), 9998 is good enough, so let it produce what it will, then kill the job.
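Since the stall is silent, deciding when to kill the job comes down to noticing that no new output has appeared for a while. The following Python sketch treats a run as stalled when the newest file in the output directory is older than a chosen idle limit. The directory layout and the limit are assumptions about your setup, and `newest_mtime` and `is_stalled` are hypothetical helper names, not anything provided by Rosetta or MPI.

```python
import time
from pathlib import Path


def newest_mtime(output_dir):
    """Return the most recent modification time among files in output_dir,
    or None if the directory contains no files yet."""
    mtimes = [p.stat().st_mtime for p in Path(output_dir).iterdir() if p.is_file()]
    return max(mtimes) if mtimes else None


def is_stalled(output_dir, max_idle_seconds, now=None):
    """True if no file in output_dir has been modified within max_idle_seconds.

    An empty directory is not reported as stalled, since the first
    structures may simply not be finished yet.
    """
    now = time.time() if now is None else now
    newest = newest_mtime(output_dir)
    if newest is None:
        return False
    return (now - newest) > max_idle_seconds
```

The idle limit should be comfortably longer than the slowest single job; a check like `is_stalled("output", 6 * 3600)` run periodically from the login node would flag the run, and cancelling it is then left to the scheduler's own command.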
- October 14, 2013 at 6:54 pm #9418
  Anonymous
  Will definitely consider this if the slave nodes keep going…