Mpirun launches processes but does not run

This topic has 0 replies, 1 voice, and was last updated 3 years, 11 months ago by Anonymous.

Author

Posts
- February 16, 2022 at 11:09 am #3917
  Anonymous
  Dear Rosetta Community,
  
  I have been using the MPI build of rosetta_scripts successfully on my institute's high-performance computing cluster for the past couple of years. Lately, however, I have been running into parallelisation issues. I submit jobs on our cluster through PBS with the following command:
```
qsub -P chemical -N rosetta -l select=200:ncpus=1 -l walltime=168:00:00 design.sh
```
  The above command requests 200 cores (not necessarily on the same node) for 168 hours and runs a script called design.sh. The script is as follows:
```
## design

cd "$PBS_O_WORKDIR"                             # cd to the directory the job was submitted from
module load apps/Rosetta/2020.03/intel2019      # Load rosetta module

folder=outputs
nstruct=9999

# Clear results from any previous run (-f: no error if no tracer files exist yet)
rm -f tracer*
rm -rf "$folder"
mkdir "$folder"

# Trailing backslashes keep all options on the single mpirun command line
mpirun -np "$PBS_NTASKS" "$ROSETTA_BIN/rosetta_scripts.mpi.linuxiccrelease" \
    -parser:protocol design.xml \
    -s complex.pdb \
    -nstruct "$nstruct" \
    -overwrite -write_all_connect_info \
    -jd2:failed_job_exception false \
    -out:path:pdb "./$folder/" \
    -out:file:scorefile design.fasc \
    -mpi_tracer_to_file tracer.log
```
  This script instructs mpirun to launch as many processes as cores were requested (here 200). These scripts worked fine for many months, but for the past few weeks the tracer.log_* files have often not been produced, even though the “rosetta_scripts” processes can be seen on the compute nodes, showing 0% memory and CPU utilisation. I have attached a screenshot to demonstrate this: it shows five “rosetta_scripts” processes launched on a node named csky111 without actually using any resources. Most importantly, the issue is intermittent: the same script sometimes works perfectly but fails quite often. I have been unable to find any pattern in when it fails or succeeds.
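  To tell a stalled start apart from a slow one, a check along the following lines could be added to design.sh. This is only a sketch: count_tracers and wait_for_tracers are hypothetical helper names, and it assumes each MPI rank writes its own file named tracer.log_<rank>, as the tracer.log_* files mentioned above suggest.

```shell
# Hypothetical helpers, not part of Rosetta or PBS.

# Print how many files match <prefix>_* (assumed: one tracer file per MPI rank).
count_tracers() {
    local prefix=$1
    local n=0
    local f
    for f in "${prefix}"_*; do
        # If nothing matches, the glob stays literal and -e is false.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Poll once per second until <expected> tracer files exist or <timeout>
# seconds have passed. Returns 0 if they all appeared, non-zero otherwise.
wait_for_tracers() {
    local prefix=$1
    local expected=$2
    local timeout=$3
    local waited=0
    while [ "$waited" -lt "$timeout" ]; do
        [ "$(count_tracers "$prefix")" -ge "$expected" ] && return 0
        sleep 1
        waited=$((waited + 1))
    done
    [ "$(count_tracers "$prefix")" -ge "$expected" ]
}

# Possible use in the job script (the 600 s limit is an arbitrary choice):
#   mpirun ... -mpi_tracer_to_file tracer.log &
#   wait_for_tracers tracer.log "$PBS_NTASKS" 600 \
#       || echo "only $(count_tracers tracer.log) ranks started" >&2
#   wait
```

  Logging the number of ranks that did start might show whether the stall affects every rank or only those on particular nodes.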
  
  Some observations that might help with resolving the issue:
  1. Scripts work fine when running on fewer than 50 cores.
  2. Memory allocated to each process was not a limiting factor.
  3. -lselect=1:ncpus=96 (all 96 cores on the same node) is more likely to succeed than -lselect=96:ncpus=1 (cores possibly scattered across nodes).
  What could be the likely cause, and how might it be resolved? Would you recommend re-installing the application? Any thoughts on this would be really helpful.
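  Since observation 3 points towards node placement, it may help to record in each job how the cores were spread over hosts, so that failed and successful runs can be compared. A minimal sketch, assuming the PBS node file ($PBS_NODEFILE) lists one hostname per allocated slot; summarise_nodes is a hypothetical helper name:

```shell
# Hypothetical helper: print "<host> <slots>" for each host in a node file
# that lists one hostname per line (one line per allocated slot).
summarise_nodes() {
    sort "$1" | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Inside a PBS job, log the layout next to the other outputs.
# The guard keeps this harmless when run outside the batch system.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarise_nodes "$PBS_NODEFILE" > node_layout.txt
fi
```

  Comparing node_layout.txt (an arbitrary file name) between runs that start and runs that hang might reveal whether particular nodes, or a particular number of nodes, are involved.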
  
  Thank you,
  
  Akshay