multi-node mpiexec issue

From: Ryuzo Azuma (azuma.r.ac_at_m.titech.ac.jp)
Date: Fri Aug 31 2018 - 11:44:26 CDT

Dear Namd-l members:

We have started to explore a multi-node-capable build of NAMD compiled
from source.
Compilation and installation of both Charm++ and namd2 completed
without errors.
However, our execution tests have not succeeded so far, and neither
have our attempts to debug the problem.
We would therefore like to ask for help from this mailing list.

First, the options in Make.config after running the config command are
as follows:

CHARMBASE = ${HOME}/apps/NAMD_Git-2018-08-23_Source/charm-6.8.2
include .rootdir/arch/Linux-x86_64-icc.arch
CHARMARCH = ofi-linux-x86_64-smp-icc
CHARMOPTS = -verbose
CHARM = $(CHARMBASE)/$(CHARMARCH)
NAMD_PLATFORM = $(NAMD_ARCH)-ofi-smp-CUDA-memopt
include .rootdir/arch/$(NAMD_ARCH).base
include .rootdir/arch/$(NAMD_ARCH).tcl
include .rootdir/arch/$(NAMD_ARCH).fftw3
MEMOPT=-DMEM_OPT_VERSION
TCLDIR = ${HOME}/apps/namd/tcl
FFTDIR = ${HOME}/apps
include .rootdir/arch/$(NAMD_ARCH).cuda
CUDADIR = ${CUDA_HOME}/8.0.61
CUDASODIR = ${CUDA_HOME}/8.0.61/lib64
LIBCUDARTSO = libcudart.so.8.0
LIBCUFFTSO = libcufft.so.8.0
CXXOPTS = -I${HOME}/apps/include
COPTS = -I${HOME}/apps/include
CXXOPTS = -g
CXXTHREADOPTS = -g
CXXSIMPARAMOPTS = -g
CXXNOALIASOPTS = -g
COPTS = -g
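
A note on the listing above: CXXOPTS and COPTS each appear twice with a
plain "=". In GNU make a later "=" assignment replaces the earlier value,
so within this listing the -I${HOME}/apps/include setting would be
overridden by -g unless an included file restores it. The following is a
minimal Python sketch for spotting such repeats; the function name
find_reassigned and the SAMPLE excerpt are illustrative assumptions, not
part of NAMD or Charm++.

```python
import re
from collections import defaultdict

# Excerpt of the Make.config listing above (illustrative subset).
SAMPLE = """\
CHARMARCH = ofi-linux-x86_64-smp-icc
MEMOPT=-DMEM_OPT_VERSION
CXXOPTS = -I${HOME}/apps/include
COPTS = -I${HOME}/apps/include
CXXOPTS = -g
CXXTHREADOPTS = -g
COPTS = -g
"""

# Matches only plain "NAME = value"; "+=", ":=" and "?=" are deliberately skipped.
ASSIGN = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$")


def find_reassigned(text):
    """Return {name: [values in order]} for variables assigned more than once."""
    seen = defaultdict(list)
    for line in text.splitlines():
        m = ASSIGN.match(line)
        if m:
            seen[m.group(1)].append(m.group(2).strip())
    return {name: vals for name, vals in seen.items() if len(vals) > 1}


if __name__ == "__main__":
    for name, vals in find_reassigned(SAMPLE).items():
        print(f"{name}: last value wins -> {vals[-1]!r} (earlier: {vals[:-1]})")
```

If the include path is still needed, writing the second assignment as
"CXXOPTS += -g" (and likewise for COPTS) would keep both values.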

Next, the command line used to launch namd is as follows:

$ mpiexec.hydra -gwdir ${PWD} -gpath ${binpath} -genvall -v \
    -print-rank-map -ordered-output -rmk qrsh -binding 1 -OFI -PSM2 -RDMA \
    -perhost 1 -print-all-exitcodes -trace-pt2pt -np 2 namd2 ++ppn 6 \
    +setcpuaffinity +pemap 0-55:7.6 +commap 6-55:7 +devices 0,1,2,3 \
    minimize-equilibrate.namd &> output/minimize-equilibrate.log
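
For reference, the +pemap and +commap arguments use Charm++'s range
syntax, in which lower-upper:stride.run selects "run" consecutive cores
starting at every "stride"-th core. The sketch below expands such a
string in Python. expand_map is a hypothetical helper, it covers only the
comma, range, stride and run forms (not every form Charm++ accepts), and
it is not Charm++'s own parser.

```python
def expand_map(spec):
    """Expand a Charm++-style core map such as "0-55:7.6" into a list of core IDs.

    Supported forms (a simplification): "a", "a-b", "a-b:stride",
    "a-b:stride.run", joined by commas.
    """
    cores = []
    for part in spec.split(","):
        rng, _, step = part.partition(":")
        lo, _, hi = rng.partition("-")
        lo = int(lo)
        hi = int(hi) if hi else lo
        stride, run = 1, 1
        if step:
            s, _, r = step.partition(".")
            stride = int(s)
            run = int(r) if r else 1
        for start in range(lo, hi + 1, stride):
            # Take `run` consecutive cores from each block. Capping at the
            # upper bound is an assumption of this sketch.
            cores.extend(c for c in range(start, start + run) if c <= hi)
    return cores


if __name__ == "__main__":
    pemap = expand_map("0-55:7.6")
    commap = expand_map("6-55:7")
    print(len(pemap), pemap[:8])
    print(commap)
```

Under these assumptions, the two maps in the command above cover 48
worker cores (0-5, 7-12, ..., 49-54) and 8 communication cores
(6, 13, ..., 55) with no overlap.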

The output from the above command is as follows:

$ tail  output/minimize-equilibrate.log

Info: SUMMARY OF PARAMETERS:
Info: 2336 BONDS
Info: 9466 ANGLES
Info: 10722 DIHEDRAL
Info: 391 IMPROPER
Info: 12 CROSSTERM
Info: 620 VDW
Info: 14 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PSF FILE: 0.00443578
Info:
Info: Entering startup at 3.52539 s, 1449.64 MB of memory in use
Info: Startup phase 0 took 0.000144958 s, 1449.67 MB of memory in use

namd2:3418 terminated with signal 11 at PC=0 SP=2aaabb0cc628. Backtrace:
/usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaaaf7a25a8]
/lib64/libpthread.so.0(+0x10b20)[0x2aaaaacdeb20]
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3418 RUNNING AT r7i5n6
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[mpiexec@r7i5n6] Exit codes: [r7i5n6] 1
[r6i0n8] 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3418 RUNNING AT r7i5n6
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
    Intel(R) MPI Library troubleshooting guide:
       https://software.intel.com/node/561764
===================================================================================

We also tested the same launch command with the alanin sample files.
In this case, we obtained the following output:

Info: Entering startup at 2.93095 s, 1445.21 MB of memory in use
Info: Startup phase 0 took 0.000135899 s, 1445.27 MB of memory in use
Info: Startup phase 1 took 0.000266075 s, 1445.41 MB of memory in use
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 705 POINTS
Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609
Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 6.88002e-15 AT 7.96477
Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609
Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 6.65646e-16 AT 7.96477
Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000290023 AT 0.251946
Info: ABSOLUTE IMPRECISION IN VDWA TABLE ENERGY: 1.26218e-29 AT 7.93332
Info: RELATIVE IMPRECISION IN VDWA TABLE ENERGY: 1.03763e-15 AT 7.96477
Info: ABSOLUTE IMPRECISION IN VDWA TABLE FORCE: 3.15544e-30 AT 7.96477
Info: RELATIVE IMPRECISION IN VDWA TABLE FORCE: 1.29505e-16 AT 7.96477
Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
Info: ABSOLUTE IMPRECISION IN VDWB TABLE ENERGY: 3.30872e-24 AT 7.93332
Info: RELATIVE IMPRECISION IN VDWB TABLE ENERGY: 1.17076e-15 AT 7.96477
Info: ABSOLUTE IMPRECISION IN VDWB TABLE FORCE: 8.27181e-25 AT 7.96477
Info: RELATIVE IMPRECISION IN VDWB TABLE FORCE: 1.30075e-16 AT 7.96477
Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00563612 AT 7.01338
Info: Running with 1 input processors.
Info: Running with 1 output processors (1 of them will output simultaneously).
Info: INPUT PROC LOCATIONS: 4
Info: OUTPUT PROC LOCATIONS: 6
[4] Assertion "numAtomsPar > 0" failed in file src/Molecule.C line 4778.
------------- Processor 4 Exiting: Called CmiAbort ------------
Reason: Assertion "numAtomsPar > 0" failed in file src/Molecule.C line 4778.
Info: Startup phase 2 took 0.0141211 s, 1445.71 MB of memory in use
[4] Stack Traceback:
   [4:0] CmiAbortHelper+0xe9  [0x144b3d7]
   [4:1] CmiAbort+0x43  [0x144b426]
   [4:2] __cmi_assert+0x42  [0x145eaba]
   [4:3] _ZN8Molecule21read_binary_atom_infoEiiR11ResizeArrayI9InputAtomE+0x56  [0xe1322e]
   [4:4] _ZN13ParallelIOMgr15readPerAtomInfoEv+0x68  [0xfd9c28]
   [4:5] _ZN4Node7startupEv+0x4c1  [0xe57d61]
   [4:6] _ZN12CkIndex_Node18_call_startup_voidEPvS0_+0x30 [0xe46de8]
   [4:7] CkDeliverMessageFree+0x5f  [0x12faa19]
   [4:8]   [0x12fabcf]
   [4:9]   [0x12fad34]
   [4:10]   [0x12fca94]
   [4:11]   [0x12fcbcd]
   [4:12] _Z15_processHandlerPvP11CkCoreState+0x1e3  [0x12fd28c]
   [4:13] CmiHandleMessage+0xa3  [0x1456c6e]
   [4:14] CsdScheduleForever+0xdb  [0x14570fe]
   [4:15] CsdScheduler+0x17  [0x1456ffa]
   [4:16] _Z10slave_initiPPc+0x83  [0x916f01]
   [4:17]   [0x144b08c]
   [4:18]   [0x1447b08]
   [4:19] +0x8724  [0x2aaaaacd6724]
   [4:20] clone+0x6d  [0x2aaaae550c9d]
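
The frames in the traceback above carry Itanium-ABI mangled C++ names.
As a reading aid, the sketch below recovers only the qualified name from
simple _ZN...E symbols in Python. qualified_name is a hypothetical
helper; it ignores argument types and does not handle templates or
substitutions inside the qualifier, so a full demangler such as c++filt
remains the proper tool.

```python
import re


def qualified_name(symbol):
    """Return "Class::method" for a simple "_ZN<len><name>...E" symbol, else None."""
    if not symbol.startswith("_ZN"):
        return None
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        m = re.match(r"\d+", symbol[pos:])
        length = int(m.group())
        pos += len(m.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    # A well-formed nested name ends with "E" after the last component.
    if not parts or pos >= len(symbol) or symbol[pos] != "E":
        return None
    return "::".join(parts)


if __name__ == "__main__":
    for sym in (
        "_ZN8Molecule21read_binary_atom_infoEiiR11ResizeArrayI9InputAtomE",
        "_ZN13ParallelIOMgr15readPerAtomInfoEv",
        "_ZN4Node7startupEv",
    ):
        print(qualified_name(sym))
```

Read this way, frames 3 to 5 are Molecule::read_binary_atom_info, called
from ParallelIOMgr::readPerAtomInfo, called from Node::startup.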

We also checked whether the Charm++ sample programs run, using
startupTest and megatest under
charm-6.8.2/ofi-linux-x86_64-smp-icc/tests/charm++.

They run normally with the same launch method as above. For
instance:

mpiexec.hydra -gwdir ${PWD} -gpath ${PWD} -genvall -v -print-rank-map \
    -ordered-output -rmk qrsh -binding 1 -OFI -PSM2 -RDMA -perhost 1 \
    -print-all-exitcodes -trace-pt2pt -np 2 pgm &> pgm.log

Lastly, we can provide information about the devices we are currently
using, if necessary.

We look forward to hearing from anyone on the mailing list.

Best wishes,

--
Ryuzo Azuma
Researcher
Department of Computer Science, School of Computing
Tokyo Institute of Technology
J3-25, 4259, Nagatsuda, Midori-ku, Yokohama, Kanagawa
226-8502 JAPAN
