AbInitioRelax.mpi Hangs – Waiting for Job Request
- This topic has 2 replies, 2 voices, and was last updated 5 years ago by Anonymous.
October 31, 2019 at 2:30 pm #3286 Anonymous
Hi guys,
I recently downloaded and compiled Rosetta with MPI support to take advantage of the 32-core processor on our workstation. Compilation went well, and I can call protocols, but they all seem to hang.
To help narrow things down, I am working out of the De Novo Structure Prediction tutorial's demo directory. I can call the protocol, and it seems to start running normally:
mpirun -n 32 $ROSETTA_MPI/main/source/bin/AbinitioRelax.mpi.linuxgccrelease @input_files/options
Everything starts up normally, but it always ends up hanging on this output:
~$: protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1
I dug around these forums, and it seems that the code is still trying to run on only one core; I'm not sure why. Is there a way to specify that I want to run on many cores? I thought this was the purpose of running binaries compiled with extras=mpi.
I looked into the code where it gets stuck, and it seems to be waiting forever for MPI_Recv() to return. I could be wrong, since I can't read C++ all that well:
(From protocols.jobdist.JobDistributors)
    while ( true ) {
        int node_requesting_job( 0 );

        JobDistributorTracer << "Master Node -- Waiting for job request; tag_ = " << tag_ << std::endl;
        MPI_Recv( & node_requesting_job, 1, MPI_INT, MPI_ANY_SOURCE, tag_, MPI_COMM_WORLD, & stat_ );
        bool const available_job_found = find_available_job();

        JobDistributorTracer << "Master Node --available job? " << available_job_found << std::endl;

        Size job_index = ( available_job_found ? current_job_ : 0 );
        int struct_n = ( available_job_found ? current_nstruct_ : 0 );
        if ( ! available_job_found ) {
            JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << std::endl;
            MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
            break;
        } else {
            JobDistributorTracer << "Master Node -- Assigning job " << job_index << " " << struct_n << " to node " << node_requesting_job << std::endl;
            MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
            MPI_Send( & struct_n, 1, MPI_INT, node_requesting_job, tag_, MPI_COMM_WORLD );
            // ++current_nstruct_; handled now by find_available_job
        }
    }

    // we've just told one node to spin down, and
    // we don't have to spin ourselves down.
    Size nodes_left_to_spin_down( mpi_nprocs() - 1 - 1);

    while ( nodes_left_to_spin_down > 0 ) {
        int node_requesting_job( 0 );
        int recieve_from_any( MPI_ANY_SOURCE );
        MPI_Recv( & node_requesting_job, 1, MPI_INT, recieve_from_any, tag_, MPI_COMM_WORLD, & stat_ );
        Size job_index( 0 ); // No job left.
        MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
        JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << " with " << nodes_left_to_spin_down << " remaining nodes." << std::endl;
        --nodes_left_to_spin_down;
    }

}
Any help is appreciated!
Thanks!
Nathan
-
November 4, 2019 at 9:02 pm #15040 Anonymous
In your output, are you getting any ‘(1)’ or other such (non-zero) labels?
The other thing I would double-check is that the MPI libraries you compiled with are the proper "flavor" and version to go with the mpirun command you're using. If you have a "flavor" mismatch (e.g. running a Rosetta compiled with OpenMPI with an MPICH2 mpirun), you might have issues getting Rosetta to recognize that it's running under MPI.
-
November 8, 2019 at 8:44 pm #15053 Anonymous
I just ran it again, and it appears that all outputs have '(0)' as a label; there are no non-zero labels.
I need to double-check the MPI libraries. Do you have a suggestion for how I can check that? I am running the protocols with mpirun. I have OpenMPI installed, and when I compiled Rosetta, the build called mpicc to compile the source. I also had to comment out all of the header-file environment variables in the site.settings file to get the code to compile with extras=mpi. I am not sure whether this is relevant, but both the INCLUDE and LD_LIBRARY_PATH environment variables seemed to be empty when I compiled, and the build succeeded once I told it to ignore them.
I am not sure if this is sufficient information! Let me know… Thank you!