UnfoldedStateEnergyCalculator MPI Error
November 13, 2017 at 5:43 pm · #2781 · Anonymous
Hi all,
I’m trying to use UnfoldedStateEnergyCalculator for some nonstandard amino acids, and I’m hitting a persistent MPI_Recv error when running in parallel mode. I’m using Rosetta 3.8 compiled with icc 17.0.4 and impi 17.0.3, running on TACC Stampede2. The error happens for every input I’ve tried (different amino acids, stock and modified databases, smaller sets of input PDBs), and I have successfully run the Rosetta 3.4 version of UFSEC with the same inputs on an older cluster (Stampede). UFSEC 3.8 does run in serial mode, but it seems to take much longer than it should (5-10 min per PDB). The sysadmins are stumped, and I’ve tried every easily googleable solution without luck. As far as I can tell, one of the MPI_Recv() calls in UnfoldedStateEnergyCalculatorMPIWorkPoolJobDistributor is choking (see the error output below).
Is this a bug, or can someone suggest a way to get this working?
Running with command:
ibrun $TACC_ROSETTA_BIN/UnfoldedStateEnergyCalculator.cxx11mpi.linuxiccrelease -database /work/02984/cwbrown/stampede2/Data/Rosetta/rosetta3.8_database_nsAAmod -ignore_unrecognized_res -ex1 -ex2 -extrachi_cutoff 0 -l /work/02984/cwbrown/stampede2/Data/Rosetta/ncAA_rotamer_libs/scripts/cullpdb_list.txt -mute all -unmute devel.UnfoldedStateEnergyCalculator -unmute protocols.jd2.PDBJobInputer -residue_name NBY -no_optH true -detect_disulf false > ufsec_log_NBY.txt&
===========
Error:
===========
…
protocols.jd2.MPIWorkPoolJobDistributor: (2) Slave Node 2: Requesting new job id from master
protocols.jd2.MPIWorkPoolJobDistributor: (3) Slave Node 3: Requesting new job id from master
protocols.jd2.MPIWorkPoolJobDistributor: (4) Slave Node 4: Requesting new job id from master
protocols.jd2.MPIWorkPoolJobDistributor: (5) Slave Node 5: Requesting new job id from master
protocols.jd2.MPIWorkPoolJobDistributor: (1) Slave Node 1: Requesting new job id from master
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Getting next job to assign from list id 1 of 5
protocols.UnfoldedStateEnergyCalculator.UnfoldedStateEnergyCalculatorMPIWorkPoolJobDistributor: (0) Master Node: Waiting for job requests…
TACC: MPI job exited with code: 14
TACC: Shutdown complete. Exiting.
Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(224)………………………: MPI_Recv(buf=0x7ffd7a5fb534, count=1, MPI_INT, src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0x7ffd7a5fb520) failed
MPIDI_CH3_PktHandler_EagerShortSend(455): Message from rank 1 and tag 10 truncated; 8 bytes received but buffer size is 4
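For reference, the last two lines say the master posted a 4-byte receive buffer (count=1, MPI_INT) but rank 1 sent an 8-byte message on tag 10. Here is a minimal, self-contained sketch (plain C + MPI, nothing Rosetta-specific; the payload and variable names are made up) that reproduces this class of failure:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* The sender transmits two ints (8 bytes)... */
        int payload[2] = {42, 43};
        MPI_Send(payload, 2, MPI_INT, 0, 10, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* ...but the receiver posts a one-int (4-byte) buffer, matching the
           MPI_Recv(count=1, MPI_INT, ...) call in the error stack above.
           MPI raises MPI_ERR_TRUNCATE, and the default error handler
           aborts with "Message truncated". */
        int job_id = 0;
        MPI_Status status;
        MPI_Recv(&job_id, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("received %d\n", job_id);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run on two ranks (e.g. mpirun -np 2 ./a.out), this aborts with the same "Message truncated" error. If the 3.8 job distributor sends a wider payload than its master-side MPI_Recv expects, that would be consistent with serial mode working while MPI runs die.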
November 13, 2017 at 6:43 pm · #13881 · Anonymous
There was a bugfix for this code (although not necessarily for this issue) in mid-May, after 3.8. I would suggest trying the most recent weekly release to see if the problem goes away. I’ll also tag Doug (the author of this code).