Scaling problem: 4 nodes OK, 5 nodes fail

From: Stober, Spencer T (spencer.t.stober_at_exxonmobil.com)
Date: Wed Jan 16 2013 - 13:46:07 CST

Hello,

Thanks in advance for the assistance. I have compiled and successfully run NAMD 2.9 on a cluster, but I cannot run on more than 4 nodes. The cluster has an InfiniBand interconnect; each node has dual 6-core Xeon processors and runs CentOS 5.x, with mvapich-1.2.0-gcc-x86_64 for MPI and a Torque queuing system.

If I run on 4 nodes, 12 ppn, 48 cores, everything works and the simulations are all OK. Using the EXACT same input files, with the only change being the number of nodes and cores in the Torque submission script, the run fails. I have no idea why this occurs; I am certain that I have access to the resources (I can run other MPI programs on any number of nodes). The problem also occurs with NAMD 2.6 and with the CUDA build of NAMD 2.9.
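For clarity, the only lines that differ between the working and failing jobs are the node request and the process count in the script shown below (the 4-node values here are inferred from the working run):

# 4 nodes (works)
#PBS -l nodes=4:ppn=12
mpirun_rsh -rsh -np 48 -hostfile $HOSTFILE $NAMD_EXEC $NAMD_CONF

# 5 nodes (fails)
#PBS -l nodes=5:ppn=12
mpirun_rsh -rsh -np 60 -hostfile $HOSTFILE $NAMD_EXEC $NAMD_CONF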

Any ideas are greatly appreciated. Details of the problem follow:

Thanks, Spencer Stober, Ph.D.

MPI version: mvapich-1.2.0-gcc-x86_64

Compiled NAMD 2.9 and Charm++ with the following commands:

Charm:
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --no-build-shared --with-production

NAMD:
./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
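
(For completeness, the surrounding build steps were the usual ones for a NAMD source build; the directory and tarball names below are a sketch from memory and may not match the exact paths on this cluster.)

cd NAMD_2.9_Source
tar xf charm-6.4.0.tar && cd charm-6.4.0
env MPICXX=mpicxx ./build charm++ mpi-linux-x86_64 --no-build-shared --with-production
cd ..
./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
cd Linux-x86_64-g++ && make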

-------- Torque script to launch NAMD --------
#!/bin/bash
#PBS -N namd2
#PBS -l nodes=5:ppn=12
#PBS -q short
#PBS -V

NAMD_CONF="$PBS_O_WORKDIR/namd.conf"
NAMD_EXEC="/home/ststobe/NAMD_exe_NOCUDA/namd2"

HOSTFILE=$PBS_NODEFILE
cd $PBS_O_WORKDIR
export LD_LIBRARY_PATH=/home/ststobe/NAMD_exe_NOCUDA:$LD_LIBRARY_PATH
mpirun_rsh -rsh -np 60 -hostfile $HOSTFILE $NAMD_EXEC $NAMD_CONF > $PBS_O_WORKDIR/namd.$PBS_JOBID
--------------------------------------
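
(The hard-coded -np 60 matches the 5 x 12 entries in the hostfile; an equivalent launch line that derives the count from the hostfile instead, just as a sketch, would be:)

NPROCS=$(wc -l < $PBS_NODEFILE)
mpirun_rsh -rsh -np $NPROCS -hostfile $PBS_NODEFILE $NAMD_EXEC $NAMD_CONF > $PBS_O_WORKDIR/namd.$PBS_JOBID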

NAMD output file for run on 4 nodes, 12 ppn, 48 cores:
----------------------------------------------
Charm++> Running on MPI version: 1.2
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 4 unique compute nodes (12-way SMP).
Charm++> cpu topology info is gathered in 0.089 seconds.
Info: NAMD 2.9 for Linux-x86_64-MPI

... then the rest of the startup output follows and everything runs fine ...
------------------------------------------------

NAMD output file for run on 5 nodes, 12 ppn, 60 cores:
------------------------------------------------
Charm++> Running on MPI version: 1.2
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.

... and I get the following in the stderr output from the Torque system:

MPI process terminated unexpectedly
Exit code -5 signaled from i18
Killing remote processes...MPI process terminated unexpectedly
MPI process terminated unexpectedly
MPI process terminated unexpectedly
MPI process terminated unexpectedly
DONE
------------------------------------------------
