IBVerbs NAMD build segfaults

From: Jason Russler (jrussler_at_helix.nih.gov)
Date: Wed Jan 21 2009 - 08:47:28 CST


Among other things, I support builds of NAMD at a computing site at the
National Institutes of Health, and I'm having a problem with builds
intended for our InfiniBand cluster. We've been running MVAPICH NAMD
trouble-free for quite a while, but I've recently been trying to get a
stable ibverbs build working for our users. The problem is that every
build I make shows excellent performance and scaling at first, then
invariably dies with some variation of this:

Stack Traceback:
  [0] /lib64/libc.so.6 [0x2b54c39871b0]
  [1] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x92e1fe]
  [2] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x92d01a]
  [3] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x937f90]
  [4] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x9351a5]
  [5] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x93533f]
  [6] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x934fca]
  [7] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x42e300]
  [8] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x425e9f]
  [9] __libc_start_main+0xf4  [0x2b54c39748b4]

I've tried charm-6.0/namd-2.6 using icc with a charm++ build target of
net-linux-x86_64-ibverbs-icc10, and charm-6.0/namd-cvs (1-14-09), with
the same or similar results (with and without "-memory os" or
"-memory paranoid"). I've not been able to find much information about
ibverbs builds of NAMD; only references to the same or a similar
problem (with no solution), and indications that people do run it
successfully.
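For reference, this is roughly the build sequence I'm using. This is a sketch only: the charm-6.0 ./build invocation is the usual way to get the net-linux-x86_64-ibverbs-icc10 target named above, but the exact option spellings may differ by charm version, and the "-memory os"/"-memory paranoid" variants are link-time charmc options rather than runtime flags.

```
# Sketch: build the charm-6.0 ibverbs target mentioned above
# (adjust paths and compiler-version suffixes to your site).
cd charm-6.0
./build charm++ net-linux-x86_64 ibverbs icc10

# Then configure NAMD against the resulting charm arch directory
# (net-linux-x86_64-ibverbs-icc10) and build Linux-amd64-icc as usual.
```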
I test builds with the standard apoa1 and stmv benchmarks, both of
which pass, but when I offer the build to users, they experience random
segfaults like the one above. Users report no apparent instability in
their systems when the crash occurs. Knowing nothing about MD myself, I
extended the default number of steps for the stmv benchmark, and sure
enough it faults well after 1000 steps (last time at step 22280, though
for all I know the system isn't supposed to go that long).
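For anyone wanting to reproduce this, the change to the stmv benchmark was just a longer run; something like the following in stmv.namd (the step count here is illustrative, not the exact value I used):

```
# stmv.namd: the stock benchmark only runs a few hundred steps;
# raising numsteps is what exposes the crash (illustrative value).
numsteps    25000
```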
Given the profound scaling improvement with the ibverbs version, I'd
really like to get this working. With larger systems our users can run
at 1024+ procs at >75% efficiency, which we can't get close to with MPI
(or at least that's what it looks like before the job dies). Any advice
would be greatly appreciated.
Jason Russler
Linux Systems Engineer
Helix Systems, CIT, NIH

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:52:16 CST