Update to CUDA error in NAMD 2.7: Increase MAX_EXCLUSIONS: problem persists and CPU-only MD scales poorly

From: Pietro Amodeo (pamodeo_at_icmib.na.cnr.it)
Date: Mon Dec 27 2010 - 14:01:04 CST


One month ago I posted a message (title: CUDA error in NAMD 2.7: Increase
MAX_EXCLUSIONS) about a systematic error I get when trying to simulate
a relatively large (120978 atoms) system on a workstation with two
six-core Xeon 5650 CPUs (seen as a 24-core machine) and two CUDA boards.
For details about the errors, including further hardware information and
typical output from a failed run, please see my previous post.

In the meantime, I have run further tests under more radically different
simulation conditions (e.g. switching to NVT runs), but neither the number
of exclusions nor the error message changed. So I'd like to know whether
this limitation can be circumvented, or whether the system is simply too
large (or its composition triggers this pathological behaviour).

Pure CPU simulations run flawlessly. However, when I submit test
simulations of 10,000 MD steps each on a varying number of cores, the
results obtained using "new load balancers -- ASB" show scaling that is
far from ideal, considering that the runs were on a single machine, and
in any case worse than what I observed with NAMD 2.6 on a 112-core
cluster (dual-Opteron 8-core nodes with InfiniBand interconnect).
Unfortunately, although the simulation setup was very similar, the
systems tested on the cluster were different (smaller), so at present I
cannot directly compare the two sets of benchmarks.

Here is a table of the relative speedup vs. the number of cores used,
obtained with NAMD 2.7 on the 24-core workstation:

 N   Time(1)/Time(N)
 1    1
 2    1.8950753798
 4    3.4533628378
 6    4.9593143628
 8    6.277425646
10    7.7307594158
12    8.7555253315
16    8.4592641261
20    9.2227793696
22    9.7023360965
24   10.261008169
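
To make the loss of scaling easier to read, the figures above can be
turned into parallel efficiency (speedup divided by the number of cores).
The following is only a minimal sketch in Python, not NAMD output; the
names speedup, efficiency, report and PHYSICAL_CORES are illustrative, and
the value of 12 physical cores is an assumption based on the two six-core
CPUs described earlier (the remaining "cores" presumably being hardware
threads).

```python
# Speedup values copied verbatim from the table above
# (NAMD 2.7 on the 24-core workstation).
speedup = {
    1: 1.0,
    2: 1.8950753798,
    4: 3.4533628378,
    6: 4.9593143628,
    8: 6.277425646,
    10: 7.7307594158,
    12: 8.7555253315,
    16: 8.4592641261,
    20: 9.2227793696,
    22: 9.7023360965,
    24: 10.261008169,
}

# Assumption: 2 x six-core Xeon 5650 = 12 physical cores; anything beyond
# that would be running on hardware threads rather than real cores.
PHYSICAL_CORES = 12


def efficiency(n_cores, s):
    """Parallel efficiency: measured speedup divided by the core count."""
    return s / n_cores


def report(table):
    """Return (cores, speedup, efficiency) rows sorted by core count."""
    return [(n, s, efficiency(n, s)) for n, s in sorted(table.items())]


if __name__ == "__main__":
    print(" N  speedup  efficiency")
    for n, s, e in report(speedup):
        note = "" if n <= PHYSICAL_CORES else "  (beyond physical cores)"
        print(f"{n:>2} {s:8.3f} {e:10.1%}{note}")
```

With these numbers the efficiency is about 73% at 12 cores and about 43%
at 24 cores, and the speedup at 16 cores is lower than at 12.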

Comments and suggestions on both the GPU errors and the CPU scaling of
NAMD 2.7 are welcome.
Again, I can provide any further information or run any tests that may
help resolve the problem.

Thanks in advance,

