Mpirun launches processes but does not run

    • #3917
      Anonymous

        Dear Rosetta Community,

        I have been frequently and successfully using the mpi installed version of rosetta_scripts on my institute high performance computing cluster for the past couple of years. However, I have been running into parallelisation issues of late. The command I use to submit jobs on our cluster using PBS is as follows:


        qsub -P chemical -N rosetta -l select=200:ncpus=1 -l walltime=168:00:00 design.sh

        The above command requests 200 cores (not necessarily on the same node) for 168 hours and runs a script called design.sh. The script is as follows:


        ## design
        cd $PBS_O_WORKDIR # cd to current working directory
        module load apps/Rosetta/2020.03/intel2019 # Load rosetta module

        folder=outputs
        nstruct=9999

        rm tracer*
        rm -rf $folder
        mkdir $folder

        mpirun -np $PBS_NTASKS $ROSETTA_BIN/rosetta_scripts.mpi.linuxiccrelease -parser:protocol design.xml \
            -s complex.pdb \
            -nstruct $nstruct \
            -overwrite -write_all_connect_info \
            -jd2:failed_job_exception false \
            -out:path:pdb ./$folder/ \
            -out:file:scorefile design.fasc \
            -mpi_tracer_to_file tracer.log

        This script instructs mpirun to launch as many processes as the number of cores requested (here 200). These scripts worked fine for many months, but over the past few weeks the tracer.log_* files have often not been produced, even though the “rosetta_scripts” processes can be seen on the compute nodes, showing zero % memory and CPU utilisation. I have attached a screenshot to demonstrate this. The screenshot shows that five “rosetta_scripts” processes were launched on a node named csky111 but without actually utilising the resources. Most importantly, the issue is not persistent: the same script works perfectly sometimes but fails quite often too. I am unable to find any pattern in why the same script fails or succeeds.
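
For reference, the symptom described above (processes present but no tracer.log_* files) could be checked from the job's working directory with something like the following. This is only a minimal sketch: the function name count_tracer_logs is hypothetical, and it assumes the log files appear in the given directory with the prefix passed to -mpi_tracer_to_file followed by an underscore.

```shell
#!/bin/sh
# Count the per-process tracer files present in a directory.
# count_tracer_logs is a hypothetical helper name, not part of Rosetta.
#   $1: directory to look in
#   $2: log prefix, e.g. tracer.log (files are matched as <prefix>_*)
count_tracer_logs() {
    n=0
    for f in "$1"/"$2"_*; do
        # If nothing matches, the pattern stays literal and fails the -f test.
        [ -f "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Possible use after (or during) a run, from the working directory:
#   count_tracer_logs . tracer.log
```

Comparing the printed count with the number of processes requested would show whether none, some, or all of them got as far as opening their log file.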

        Some observations that might help with resolving the issue:

        1. Scripts work fine when running on < 50 cores.
        2. Memory allocated to each process was not a limiting factor.
        3. -lselect=1:ncpus=96 (all 96 cores on the same node) is more likely to succeed than -lselect=96:ncpus=1 (possibly scattered).
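
To make the third observation easier to compare between runs, the placement PBS actually granted could be recorded at the start of each job. The sketch below is an assumption-laden illustration: slots_per_host is a hypothetical helper name, and it assumes the scheduler provides $PBS_NODEFILE with one hostname per line (how many lines each host gets may depend on settings such as mpiprocs).

```shell
#!/bin/sh
# Print "<hostname> <count>" for each distinct host in a nodefile.
# slots_per_host is a hypothetical helper name.
#   $1: path to a file with one hostname per line
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Possible use inside design.sh, before the mpirun line:
#   slots_per_host "$PBS_NODEFILE" > placement.txt
```

Keeping placement.txt alongside the outputs of failed and successful runs might show whether failures correlate with particular nodes or with how widely the processes are scattered.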

         

        What could be the likely issue and resolution? Would you recommend re-installing the application? Any thoughts on this would be really helpful.

        Thank you,

        Akshay
