AbInitioRelax.mpi Hangs – Waiting for Job Request


    • #3286
      Anonymous

        Hi guys,

        I recently downloaded and compiled Rosetta with MPI support to take advantage of the 32-core processor on our workstation. Compilation went well, and I can call protocols, but they all seem to hang.

        To help narrow things down, I am working out of the DeNovo Structure Prediction tutorial demo directory. I can call the protocol, and it seems to start running as normal:

        mpirun -n 32 $ROSETTA_MPI/main/source/bin/AbinitioRelax.mpi.linuxgccrelease @input_files/options

        Everything starts up as normal, but it always ends up hanging on this output:

        ~$: protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1

        I dug around these forums, and it seems that the code is still trying to run on only one core, though I am not sure why. Is there a way to specify that I want to run on many cores? I thought that was the whole point of compiling the binaries with extras=mpi.

        I looked into the code where it gets stuck, and it seems like it is waiting forever on a return from the MPI_Recv() function. I could be wrong, though; I can't read C++ all that well:

        (From protocols.jobdist.JobDistributors)

            while ( true ) {
                int node_requesting_job( 0 );

                JobDistributorTracer << "Master Node -- Waiting for job request; tag_ = " << tag_ << std::endl;
                MPI_Recv( & node_requesting_job, 1, MPI_INT, MPI_ANY_SOURCE, tag_, MPI_COMM_WORLD, & stat_ );
                bool const available_job_found = find_available_job();

                JobDistributorTracer << "Master Node --available job? " << available_job_found << std::endl;

                Size job_index = ( available_job_found ? current_job_ : 0 );
                int struct_n  = ( available_job_found ? current_nstruct_ : 0 );
                if ( ! available_job_found ) {
                    JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << std::endl;
                    MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
                    break;
                } else {
                    JobDistributorTracer << "Master Node -- Assigning job " << job_index << " " << struct_n << " to node " << node_requesting_job << std::endl;
                    MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
                    MPI_Send( & struct_n,  1, MPI_INT, node_requesting_job, tag_, MPI_COMM_WORLD );
                    // ++current_nstruct_; handled now by find_available_job
                }
            }

            // we've just told one node to spin down, and
            // we don't have to spin ourselves down.
            Size nodes_left_to_spin_down( mpi_nprocs() - 1 - 1);

            while ( nodes_left_to_spin_down > 0 ) {
                int node_requesting_job( 0 );
                int recieve_from_any( MPI_ANY_SOURCE );
                MPI_Recv( & node_requesting_job, 1, MPI_INT, recieve_from_any, tag_, MPI_COMM_WORLD, & stat_ );
                Size job_index( 0 ); // No job left.
                MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
                JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << " with " << nodes_left_to_spin_down << " remaining nodes." << std::endl;
                --nodes_left_to_spin_down;
            }
        }
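        To check my understanding of the handshake, I put together a toy version of what I think the master/worker exchange looks like (this is only my simplified sketch, not the actual Rosetta code, and the file name is my own). As far as I can tell, if the master never receives a request from a worker rank, that first MPI_Recv() simply never returns:

            // toy_jobdist.cpp -- my simplified sketch of the handshake above
            // (not the actual Rosetta code): rank 0 plays the master loop,
            // every other rank plays a worker asking for jobs.
            // Build with mpicxx and run with e.g. mpirun -n 4 ./toy_jobdist
            #include <mpi.h>
            #include <cstdio>

            int main( int argc, char ** argv ) {
                MPI_Init( &argc, &argv );
                int rank( 0 ), nprocs( 0 );
                MPI_Comm_rank( MPI_COMM_WORLD, &rank );
                MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
                int const tag( 1 );

                if ( rank == 0 ) {
                    unsigned long next_job( 1 );
                    unsigned long const n_jobs( 10 );
                    // Like the quoted loop: block until *some* worker sends a
                    // request.  With no worker ranks (or if every process
                    // thinks it is rank 0), this MPI_Recv never returns.
                    while ( true ) {
                        int node_requesting_job( 0 );
                        MPI_Recv( &node_requesting_job, 1, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
                        unsigned long job_index = ( next_job <= n_jobs ? next_job++ : 0 );
                        MPI_Send( &job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag, MPI_COMM_WORLD );
                        if ( job_index == 0 ) break;  // told one worker to spin down
                    }
                    // Spin down the remaining workers, as in the second loop above.
                    for ( int left( nprocs - 2 ); left > 0; --left ) {
                        int node_requesting_job( 0 );
                        MPI_Recv( &node_requesting_job, 1, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
                        unsigned long job_index( 0 );  // 0 == no job left
                        MPI_Send( &job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag, MPI_COMM_WORLD );
                    }
                } else {
                    // Worker side: announce ourselves, then wait for a job index back.
                    while ( true ) {
                        MPI_Send( &rank, 1, MPI_INT, 0, tag, MPI_COMM_WORLD );
                        unsigned long job_index( 0 );
                        MPI_Recv( &job_index, 1, MPI_UNSIGNED_LONG, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
                        if ( job_index == 0 ) break;  // master has no work left for us
                        std::printf( "worker %d got job %lu\n", rank, job_index );
                    }
                }

                MPI_Finalize();
                return 0;
            }

        If that reading is right, and every process believes it is the lone master rank, they would all just sit in MPI_Recv() waiting for a request that never comes.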

         

        Any help is appreciated!

        Thanks!

        Nathan

      • #15040
        Anonymous

          In your output, are you getting any ‘(1)’ or other such (non-zero) labels?

          The other thing I would double-check is that the MPI libraries you compiled with are the proper “flavor” and version to go with the mpirun command you’re using. If you have a “flavor” mismatch (e.g. running a Rosetta compiled against OpenMPI with an MPICH2 mpirun), you might have trouble getting Rosetta to recognize that it’s running under MPI.
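          If you want a quick way to test that, a small standalone MPI program will show whether your mpirun and the library you compiled against agree (this is just a sketch of my own, not anything from Rosetta; build it with the same mpicc/mpicxx used for the Rosetta build and launch it with the same mpirun):

              // mpi_check.cpp -- minimal launcher/library sanity check
              #include <mpi.h>
              #include <cstdio>

              int main( int argc, char ** argv ) {
                  MPI_Init( &argc, &argv );
                  int rank( 0 ), size( 0 );
                  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
                  MPI_Comm_size( MPI_COMM_WORLD, &size );
                  // A matched pair launched with "mpirun -n 32" should print
                  // ranks 0 through 31 of 32; a flavor mismatch typically
                  // prints "rank 0 of 1" from every process.
                  std::printf( "rank %d of %d\n", rank, size );
                  MPI_Finalize();
                  return 0;
              }

          If every process reports rank 0 of 1, that would line up with Rosetta tracer output where every line is labelled ‘(0)’.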

          • #15053
            Anonymous

              I just ran it again, and it appears that all outputs have ‘(0)’ as a label; there are no non-zero labels.

              I need to double-check the MPI libraries. Do you have a suggestion as to how I can check that? I am attempting to run the protocols using mpirun. I have OpenMPI installed, and when I compiled Rosetta, it was calling mpicc to compile the source. I also had to comment out all the header-file environment variables in the site.settings file to get the code to compile with extras=mpi. I am not sure if this is necessary information, but it seems that both the INCLUDE and LD_LIBRARY_PATH environment variables were empty when I compiled, and it was able to compile after I told it to ignore those.
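              One thing I can try on my end is a tiny test program that prints which MPI implementation the compiler wrapper actually links against (again, just a sketch of mine; I believe MPI_Get_library_version needs an MPI-3 library):

                  // which_mpi.cpp -- print the MPI implementation this binary
                  // was built against (e.g. "Open MPI ..." vs. "MPICH ...").
                  #include <mpi.h>
                  #include <cstdio>

                  int main() {
                      char version[ MPI_MAX_LIBRARY_VERSION_STRING ];
                      int len( 0 );
                      // Callable before MPI_Init(), so no mpirun is needed.
                      MPI_Get_library_version( version, &len );
                      std::printf( "%s\n", version );
                      return 0;
                  }

              If that reports a different implementation or version than the mpirun I am launching with, that would point to the mismatch you described.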

               

              I am not sure if this is sufficient information! Let me know… Thank you!
