NAMD Wiki: NamdOnCrayXT3
NAMD is available on the new Cray XT3 at the Pittsburgh Supercomputing Center.
PSC's directions are at http://www.psc.edu/general/software/packages/namd/namd.html
PSC recommends the following environment variables be set in your job script:
setenv MPICH_PTL_SEND_CREDITS -1
setenv MPICH_MAX_SHORT_MSG_SIZE 8000
setenv MPICH_PTL_UNEX_EVENTS 80000
setenv MPICH_UNEX_BUFFER_SIZE 100M
The NAMD developers also recommend the following option for dual-core systems:
setenv MPICH_RANK_REORDER_METHOD 1
According to the MPI man page, this specifies SMP-style placement (ranks 0 and 1 on node 0, and so on). NAMD selects the processes for the communication-intensive PME reciprocal sum using bit-reversal ordering. With the default round-robin rank ordering, this concentrates PME communication on a small number of nodes for many job sizes. SMP-style placement (the default on most machines) spreads the PME processes out properly.
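The interaction between bit-reversal ordering and rank placement can be illustrated with a small Python sketch. This is not NAMD's actual selection code: the function names, the way the PME count is chosen, and the node-mapping formulas (round-robin as rank modulo node count, SMP-style as rank divided by cores per node) are assumptions made purely for illustration on a dual-core machine.

```python
def bit_reverse(value, nbits):
    """Reverse the lowest nbits bits of value."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def pme_ranks(nranks, npme):
    """Pick npme ranks by walking the ranks in bit-reversed order (illustrative)."""
    nbits = max(1, (nranks - 1).bit_length())
    ordered = [bit_reverse(i, nbits) for i in range(1 << nbits)]
    return [r for r in ordered if r < nranks][:npme]


def node_round_robin(rank, nnodes):
    """Assumed round-robin placement: consecutive ranks go to consecutive nodes."""
    return rank % nnodes


def node_smp(rank, cores_per_node=2):
    """Assumed SMP-style placement: ranks 0 and 1 on node 0, 2 and 3 on node 1, ..."""
    return rank // cores_per_node


def nodes_used(nranks, npme, cores_per_node=2):
    """Return (nodes hosting PME ranks under round-robin, same under SMP-style)."""
    nnodes = nranks // cores_per_node
    chosen = pme_ranks(nranks, npme)
    round_robin = {node_round_robin(r, nnodes) for r in chosen}
    smp = {node_smp(r, cores_per_node) for r in chosen}
    return len(round_robin), len(smp)


if __name__ == "__main__":
    rr, smp = nodes_used(nranks=16, npme=4)
    print("PME ranks:", pme_ranks(16, 4))
    print("nodes used with round-robin placement:", rr)
    print("nodes used with SMP-style placement:", smp)
```

In this toy case with 16 ranks on 8 dual-core nodes, the four selected ranks land on only two nodes under round-robin placement but on four separate nodes under SMP-style placement.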
You can use setenv PMI_DEBUG 1 to display MPI rank placement information.
There is a ~jphillip/NAMD_scripts/runbatch script which will point at the best working binary available.
If you are seeing hangs on the XT3, please try an updated Charm 5.9 to get workarounds for broken implementations of calloc (which doesn't always clear memory) and MPI_Wtime (which sometimes runs backwards).
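To show why a timer that occasionally runs backwards can be guarded against, here is a minimal Python sketch of one generic approach: never return a reading smaller than the previous one. It is not the workaround used in Charm; the class name MonotonicClock and the raw_clock parameter are hypothetical.

```python
class MonotonicClock:
    """Wrap an unreliable time source so reported time never decreases."""

    def __init__(self, raw_clock):
        self._raw_clock = raw_clock  # callable returning the raw time reading
        self._last = None            # largest reading seen so far

    def now(self):
        reading = self._raw_clock()
        # Ignore any reading that is earlier than one already reported.
        if self._last is None or reading > self._last:
            self._last = reading
        return self._last


if __name__ == "__main__":
    # Simulated raw readings, one of which goes backwards.
    readings = iter([1.0, 2.0, 1.5, 3.0])
    clock = MonotonicClock(lambda: next(readings))
    print([clock.now() for _ in range(4)])
```

Code that computes elapsed intervals from such a wrapper never sees a negative duration, which is the kind of failure that can otherwise leave a timeout loop waiting indefinitely.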
A stack overflow in alloca() might be fixed by the compiler option -mcmodel=medium.
The -small_pages option to yod provides a significant performance increase for NAMD.
The Opteron processor TLB provides 512 entries for 4kB pages, or 8 entries for 2MB pages. By default, Catamount (the OS for the XT3 slave nodes) uses 2MB pages since this allows a total of 16MB to be mapped in the TLB (vs 2MB for 4kB pages). That's fine unless your code jumps around to more than 8 places in memory, in which case you end up with a ton of TLB misses and reduced performance.
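The arithmetic above, and the effect of touching more than 8 large pages, can be checked with a short Python sketch. The entry counts and page sizes come from the text; the fully associative LRU replacement policy in the toy simulation is an assumption for illustration and is not a description of the actual Opteron TLB.

```python
from collections import OrderedDict

KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_bytes):
    """Total memory that can be mapped by the TLB at one time."""
    return entries * page_bytes


def tlb_misses(page_ids, entries):
    """Count misses for a fully associative LRU TLB (assumed policy)."""
    tlb = OrderedDict()
    misses = 0
    for page in page_ids:
        if page in tlb:
            tlb.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            tlb[page] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses


if __name__ == "__main__":
    print("4kB pages:", tlb_reach(512, 4 * KIB) // MIB, "MB mapped")
    print("2MB pages:", tlb_reach(8, 2 * MIB) // MIB, "MB mapped")

    # Cycle repeatedly over 8 and then 9 distinct large pages with 8 entries.
    fits = [p for _ in range(100) for p in range(8)]
    thrashes = [p for _ in range(100) for p in range(9)]
    print("misses touching 8 pages:", tlb_misses(fits, 8))
    print("misses touching 9 pages:", tlb_misses(thrashes, 8))
```

Under these assumptions, a loop over 8 pages misses only on first touch, while a loop over 9 pages misses on every access, which is the cliff described above.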
This effect isn't as bad as the default Catamount stack-based malloc (100 times slower than with -lgmalloc), but is another example of Cray making design decisions based on assumptions about old FORTRAN codes. It would be nice if the default behavior provided typical performance for typical applications, with unusual features available as options for the applications that might see a small performance boost.
Be sure to specify -small_pages if you write your own job script.
Apparently the new Barcelona quad-core Opterons will have 1GB pages, in which case the entire memory of a node can be mapped in the TLB and large pages will again make sense.
Using fat nodes
For large simulations you may need to make sure that process 0 uses a large-memory node (a fat node). That can be done at PSC by submitting the job to the 'phat' queue.
We will not be releasing NAMD binaries for this platform, since they would not be portable between OS releases. Single-precision FFTW libraries (stolen from PSC) and a hacked version of Tcl (needed because of all the missing system calls, and provided as both source and a library) are at http://www.ks.uiuc.edu/Research/namd/libraries/ so you just need to build charm (mpi-crayxt3) and NAMD (CRAY-XT3). Note that the charm-5.9 source distributed with NAMD is missing the workarounds for broken calloc() and MPI_Wtime() on the XT3, so you'll need to get a newer release or the developmental build.