Re: performance question

From: Thomas C. Bishop
Date: Tue Apr 28 2015 - 15:53:42 CDT

see comments inline

On 04/28/2015 03:06 PM, Bennion, Brian wrote:
Hello Tom

Interesting data.  I have never been able to make namd2.10 run any faster than namd2.9 on a non-cuda cluster at LLNL.

I thought namd2.10 did better for me, but that may be system specific. I'm using amber prm/tops and have pro/dna + water and ions in my system.

I am a little confused by the nomenclature in the text below the graph.  You mention having 20 cores/node. 
Two 10-core 2.8 GHz E5-2680v2 Xeon processors per node (or 20cores/node)
Two NVIDIA Tesla K20x GPUs
56 Gb/sec (FDR) InfiniBand (2:1 oversubscribed mesh)

After that you wrote that you left one core per CPU inactive.  Do you mean 1 core per node?  Is there one process per core?
Yes, I leave 1 core on each CPU inactive, so 2 cores per node.

Finally, can you explain what 432SU/ns means?

This is the number of Service Units (SU = cpu hrs) I must request on this shared resource to run 1 ns of my simulation.

It costs me 432 SU per ns, where SU/ns = processors * run time per ns.
This is not obvious from the plot because 180 cores can mean 9 nodes (full utilization at 20 cores/node) or 10 nodes (9/10 utilization of the 20 cores/node).
So sin0-20 uses only 9 nodes at full utilization and takes about 3.5 hrs per ns of simulation: 3.5 hrs * 9 nodes * 20 cores = 630 SU.
sin0-18 uses 10 nodes (9/10 utilization) and takes about 2.25 hrs, so 2.25 hrs * 10 nodes * 20 cores = 450 SU.
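
For anyone who wants to redo this bookkeeping for their own runs, here is a minimal Python sketch of the arithmetic above. It assumes, as on this machine, that every requested node is charged at its full 20 cores whether or not all of them run NAMD. The function and argument names are made up for illustration.

```python
def su_per_ns(hours_per_ns, nodes, charged_cores_per_node=20):
    """Service units (CPU hours) charged per ns of simulation.

    Assumes whole nodes are billed, regardless of how many cores are used.
    """
    return hours_per_ns * nodes * charged_cores_per_node


# Approximate wall-clock figures quoted in this message.
print(su_per_ns(3.5, 9))    # sin0-20: 9 nodes, all 20 cores in use
print(su_per_ns(2.25, 10))  # sin0-18: 10 nodes, 18 of 20 cores in use
```

The hours per ns are the rough values read off the plot, so the results are only as good as those readings.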

This is a shared machine, so I have to request a certain amount of service units (SU), measured as CPU hrs, regardless of whether or not I use all the cores I ask for.
sin0-18 w/ 180 cores is clearly a better use of my SU than sin0-20: it's more efficient (450SU/ns rather than 630SU/ns) AND it runs faster :-)
But there are plenty of configurations that are more efficient, e.g. sin0-18 at 36 cores -> 5*2*18 = 180SU/ns, a very efficient use of my SU,
but it will take me 2x the real time to get my simulations done...

How long am I willing to wait vs. how much time can I get on the machine: that is the optimization problem.
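
One way to make that trade-off explicit, offered only as a sketch: fix the longest wall-clock time per ns you are willing to accept, then take the cheapest configuration that fits. The three entries below are the approximate figures quoted in this message (the 5 hrs for the 36-core run is read from the 5*2*18 product, which is an assumption). The function name is hypothetical.

```python
# (label, wall-clock hours per ns, SU per ns); approximate values from this message
runs = [
    ("sin0-20, 180 core", 3.5, 630),
    ("sin0-18, 180 core", 2.25, 450),
    ("sin0-18, 36 core", 5.0, 180),
]


def cheapest_within(runs, max_hours_per_ns):
    """Return the lowest-SU run whose wall time fits the limit, or None."""
    ok = [r for r in runs if r[1] <= max_hours_per_ns]
    return min(ok, key=lambda r: r[2]) if ok else None


print(cheapest_within(runs, 3.0))  # willing to wait up to 3 hrs per ns
print(cheapest_within(runs, 6.0))  # willing to wait up to 6 hrs per ns
```

This only ranks the configurations that were actually benchmarked; it says nothing about core counts in between.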


Thanks again for posting the data.

From: Thomas C. Bishop
Sent: Tuesday, April 28, 2015 12:32 PM
Subject: Re: namd-l: performance question

Thanks Norman,
The extra options were admittedly just there as I tried to tweak things and convince myself namd was really doing what I thought it was and performing optimally.

I always conduct a set of benchmark runs of my particular system before starting my studies, something Jim Phillips suggested long ago, and it is worth doing even when you are only considering a potential cluster purchase.
CUDA does in fact give me significant speedups, but as noted in my original message the CPU & GPU utilization seem rather low, consistently < 30%.

I thought I'd post this pdf and get some feedback or discussion about how best to decide
 "have I tweaked the performance enough?"

 Of course it's hard to compare MY simulations to YOURs but hopefully others can offer general comments that may be of use to all namd users.

For my production runs I get more simulation for my allotted SU when using ~ 180 core.
This splits the difference between efficiency and throughput, but I don't have a real metric or objective criteria for making this decision.


On 04/28/2015 02:51 AM, Norman Geist wrote:

No, usually the utilisation is higher, but this doesn’t matter if the speedup is satisfying.

So you should benchmark for various node counts and look at the speedup relative to one node.
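
As a rough illustration of that comparison, the Python sketch below turns one timing per node count into speedup and parallel efficiency relative to the one-node run. It assumes the timings are in days/ns. The function name is made up, and the example timings are placeholders, NOT measurements from this thread.

```python
def speedup_table(days_per_ns):
    """days_per_ns maps node count -> days/ns.

    Returns (nodes, speedup, efficiency) rows relative to the 1-node entry.
    """
    base = days_per_ns[1]
    rows = []
    for n in sorted(days_per_ns):
        s = base / days_per_ns[n]
        rows.append((n, s, s / n))
    return rows


# Placeholder timings for illustration only.
example = {1: 2.0, 2: 1.0, 4: 0.625, 8: 0.5}
for n, s, e in speedup_table(example):
    print(f"{n:2d} nodes  speedup {s:5.2f}  efficiency {e:4.2f}")
```

An efficiency that falls well below 1 as nodes are added is the sign that extra nodes are no longer paying for themselves.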


I’ll give some hints on what to try:

 (forget about +ppn +pemap  +commap for now)


1. Do not pass +devices; pass +ignoresharing instead.

2. Try adding “twoawayx yes” in your namd script.

If that improves things, try adding twoawayy as well.

If that improves things again, try adding twoawayz.


Norman Geist.


From: Thomas C. Bishop
Sent: Monday, April 27, 2015 10:54 PM
Subject: namd-l: performance question


Dear NAMD,

Is it typical to have ~30% CPU usage (reported by say uptime/top) and ~20% GPU usage
reported by nvidia-smi for  NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/ runs ?

I'm used to seeing the CPUs pegged at 100% for non-gpu runs.

Any suggestions/feedback greatly appreciated.

I have a system w/ 266038 atoms
and I'm trying to optimize the run time performance on 200 cores (10 nodes) of a machine where each node has

Two 10-core 2.8 GHz E5-2680v2 Xeon processors
Two NVIDIA Tesla K20x GPUs
56 Gb/sec (FDR) InfiniBand (2:1 oversubscribed mesh)

I get the best performance when I leave one or two cores per node for communication

 ~/bin/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 180 ++ppn 18 ++nodelist $nodefile ~/bin/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/namd2 +pemap 0-8,10-18 +commap 9,19 +devices 0,1,0,1 dyn10.conf

OR  more simply

 ~/bin/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 180 ++ppn 18 ++nodelist $nodefile ~/bin/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/namd2 dyn10.conf

BUT the utilization of the cores is only 30% (+/-10)
and nvidia-smi reports < 20% utilization

see below

typical node usage
Tasks: 707 total,   1 running, 706 sleeping,   0 stopped,   0 zombie
Cpu0  : 28.9%us, 24.2%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  : 24.7%us, 23.1%sy,  0.0%ni, 52.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 31.2%us, 22.4%sy,  0.0%ni, 46.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 13.8%us, 12.4%sy,  0.0%ni, 73.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 28.2%us, 23.8%sy,  0.0%ni, 48.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 33.4%us, 22.9%sy,  0.0%ni, 43.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 30.6%us, 19.9%sy,  0.0%ni, 49.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 21.4%us, 11.4%sy,  0.0%ni, 67.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 31.5%us, 21.9%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 33.1%us, 21.2%sy,  0.0%ni, 45.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 27.2%us, 24.1%sy,  0.0%ni, 48.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 28.1%us, 24.1%sy,  0.0%ni, 47.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu12 : 28.6%us, 22.6%sy,  0.0%ni, 48.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 26.5%us, 24.1%sy,  0.0%ni, 49.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 20.3%us, 28.4%sy,  0.0%ni, 51.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 32.0%us, 22.6%sy,  0.0%ni, 45.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 : 30.5%us, 21.9%sy,  0.0%ni, 47.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 : 28.8%us, 22.4%sy,  0.0%ni, 48.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 : 33.0%us, 21.1%sy,  0.0%ni, 45.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 : 30.6%us, 20.4%sy,  0.0%ni, 49.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65877348k total,  3202044k used, 62675304k free,   155228k buffers
Swap: 134217720k total,     8680k used, 134209040k free,   701920k cached
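
A quick way to put a number on the per-core figures is to average them. The Python sketch below parses lines in the format shown above and reports user, system, and idle time separately, since %sy is a sizeable share of the non-idle time in this listing. The regular expression and function name are illustrative, and the sample holds just two lines copied from the listing.

```python
import re

# Captures core index, %us, %sy and %id from one per-core line of top output.
CPU_LINE = re.compile(r"^Cpu(\d+)\s*:\s*([\d.]+)%us,\s*([\d.]+)%sy,.*?([\d.]+)%id")


def summarize(top_text):
    """Average user, system and idle percentages over the per-core lines."""
    rows = [tuple(map(float, m.groups()[1:]))
            for m in map(CPU_LINE.match, top_text.splitlines()) if m]
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


sample = """\
Cpu0  : 28.9%us, 24.2%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 13.8%us, 12.4%sy,  0.0%ni, 73.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
"""
us, sy, idle = summarize(sample)
print(f"user {us:.1f}%  system {sy:.1f}%  idle {idle:.1f}%")
```

Feeding it the full 20-line listing gives the node-wide averages for that snapshot; a single snapshot is of course only a rough indication.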

typical GPU usage

Mon Apr 27 15:47:19 2015      
| NVIDIA-SMI 340.32     Driver Version: 340.32         |                      
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K20Xm         On   | 0000:03:00.0     Off |                    0 |
| N/A   27C    P0    65W / 235W |    114MiB /  5759MiB |     17%      Default |
|   1  Tesla K20Xm         On   | 0000:83:00.0     Off |                    0 |
| N/A   27C    P0    70W / 235W |    113MiB /  5759MiB |     21%      Default |
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|    0    104307  ...n/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/namd2    97MiB |
|    1    104307  ...n/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA/namd2    97MiB |
[bishop@qb091 ~]$

   Thomas C. Bishop
    Tel: 318-257-5209
    Fax: 318-257-3823
