From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Dec 23 2014 - 02:34:58 CST
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of 周文昌
Sent: Monday, 22 December 2014 19:11
To: namd-l_at_ks.uiuc.edu; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark
From my test results, +idlepoll and PME offload do not improve the
performance. However, when I create more patches, I get 20% better
performance on 8 nodes, though there is no change when I run on 16 nodes.
You are probably right, I need to ask our staff to check the network. What else needs to be checked, other than bandwidth?
The really important thing is latency, but as it is usually inversely proportional to the bandwidth, checking the bandwidth should point out what’s wrong. Could you also please describe the topology of your fat tree, i.e. how many leaf switches and how many nodes per leaf? As a quick check of the network you could try, for instance, 4 nodes on the same leaf vs. 4 nodes split across different leaves. In practice this should give the same performance; if it does not, something is not properly set up or cabled.
You might also want to enable IPoIB and use a standard network build of NAMD to exclude problems with ibverbs and RDMA. You should also make sure that only the HPC network is used. That means monitoring the transferred data on the other networks (eth0, eth1) during your benchmark to check that there’s no computational traffic on them. (The easiest way is to run ifconfig repeatedly and look at the transferred byte counts.)
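The counter check described above can also be scripted instead of reading ifconfig by hand. The following is only a rough sketch, assuming a Linux node where the per-interface byte counters are exposed in /proc/net/dev; the function names (parse_proc_net_dev, traffic_delta, read_counters) are made up for this illustration and are not part of NAMD or any cluster tool.

```python
from typing import Dict, Tuple

Counters = Dict[str, Tuple[int, int]]


def parse_proc_net_dev(text: str) -> Counters:
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters: Counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue  # unexpected layout, skip rather than guess
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def traffic_delta(before: Counters, after: Counters) -> Counters:
    """Bytes received/sent per interface between two snapshots."""
    return {
        name: (after[name][0] - before[name][0],
               after[name][1] - before[name][1])
        for name in after
        if name in before
    }


def read_counters(path: str = "/proc/net/dev") -> Counters:
    """Take a snapshot of the counters on the local node."""
    with open(path) as handle:
        return parse_proc_net_dev(handle.read())


if __name__ == "__main__":
    start = read_counters()
    input("Run the benchmark now, then press Enter... ")
    for iface, (rx, tx) in sorted(traffic_delta(start, read_counters()).items()):
        print(f"{iface}: received {rx} bytes, sent {tx} bytes")
```

If eth0 or eth1 shows a large delta on a compute node during the run, part of the computational traffic is probably not going over InfiniBand.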
Norman Geist
Thanks,
Wenchang
2014-12-22 2:16 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of 周文昌
Sent: Friday, 19 December 2014 23:07
To: namd-l_at_ks.uiuc.edu; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark
Hi Norman,
Thanks for your suggestions. Among the things you suggested, twoaway[xyz] works for my system with 116K atoms: I got 20% better performance (but no difference with CPU only).
Sure, this is only supposed to bring an improvement when using GPUs.
I also doubled the number on a system with 315K atoms. Could you explain how NAMD assigns patches to GPU cores, and why there is no difference when using CPU only? So far I only have tests on 8 nodes; I will continue to test on 16 and 32 nodes.
I’m not sure but I think that each patch uses the GPU to compute its non-bonded stuff individually.
I really think that you need to look for the problem in your network. NAMD is known to scale very well, and on your network topology it should be able to do so. Use the ib_* tools that are usually present to measure your bandwidth. What did +idlepoll do?
Thanks,
Wenchang
2014-12-19 1:30 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
From: 周文昌 [mailto:wenchangyu2006_at_gmail.com]
Sent: Thursday, 18 December 2014 20:25
To: namd-l_at_ks.uiuc.edu; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark
Hi Norman,
Thanks for your time. We use ibverbs directly (I did mention ibverbs in the 4th paragraph).
Ok, now I’ve seen it ^^
If I run /sbin/ifconfig -a, the output is as follows:
eth0 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
inet addr:10.1.3.1 Bcast:10.1.255.255 Mask:255.255.0.0
inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6 GiB)
Memory:dfa20000-dfa3ffff
eth1 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:dfa00000-dfa1ffff
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)
Our numbers are below. I took the WallClock time at the end of each run (100,000 steps) instead of the "Benchmark time" from the NAMD output.
Configuration   Number of nodes   WallClock time (s)   ns/day
CPU only         1                4372.5                0.4
CPU+GPU          1                1220.4                1.4
CPU+GPU          4                 332.6                5.2
CPU+GPU          8                 208.2                8.3
CPU+GPU         16                 135.2               12.8
CPU+GPU         32                 106.3               16.3
CPU+GPU         48                  97.5               17.7
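To make the scaling in these numbers easier to judge, speedup and parallel efficiency can be computed relative to the single-node CPU+GPU run. The sketch below is only an illustration using the WallClock values reported above; the names scaling_table and GPU_WALLCLOCK are invented for this example.

```python
from typing import Dict, List, Tuple

# WallClock times for the CPU+GPU runs, keyed by node count
# (values copied from the benchmark numbers in this message).
GPU_WALLCLOCK: Dict[int, float] = {
    1: 1220.4, 4: 332.6, 8: 208.2, 16: 135.2, 32: 106.3, 48: 97.5,
}


def scaling_table(wallclock_by_nodes: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (nodes, speedup, efficiency) rows relative to the smallest run."""
    base_nodes = min(wallclock_by_nodes)
    base_time = wallclock_by_nodes[base_nodes]
    rows = []
    for nodes in sorted(wallclock_by_nodes):
        speedup = base_time / wallclock_by_nodes[nodes]
        # Ideal speedup equals the ratio of node counts.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for nodes, speedup, efficiency in scaling_table(GPU_WALLCLOCK):
        print(f"{nodes:3d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:4.0%}")
```

With these values the efficiency drops from roughly 92% on 4 nodes to roughly 56% on 16 nodes and roughly 26% on 48 nodes.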
Doesn’t look that bad, although it feels like it should be better. Have you benchmarked your InfiniBand bandwidth already? FDR should do better here, especially in a fat-tree topology. Some things you can try to generally improve the scaling of NAMD:
1. Generally add “+idlepoll” to namd2.
2. When using GPUs, try adding “twoawayx yes” to the script; if that helps, additionally try “twoawayy yes”; if that helps, additionally try “twoawayz yes”. (This creates more patches and so might improve the scalability of your system.)
3. Try turning the new PME reciprocal sum offload off/on with “pmeoffload no/yes” in the script.
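The stepwise procedure in point 2 can be written down as a small loop. This is only a sketch of the idea: the ns_per_day callable is hypothetical and stands for "run the benchmark with these extra options in the NAMD script and return the measured ns/day"; nothing here is provided by NAMD itself.

```python
from typing import Callable, Dict, List


def tune_twoaway(ns_per_day: Callable[[Dict[str, str]], float]) -> Dict[str, str]:
    """Enable twoawayx, then twoawayy, then twoawayz, keeping each only if it helps."""
    options: Dict[str, str] = {}
    best = ns_per_day(options)  # baseline without any twoaway option
    for key in ("twoawayx", "twoawayy", "twoawayz"):
        trial = dict(options)
        trial[key] = "yes"
        rate = ns_per_day(trial)
        if rate <= best:
            break  # no gain, so do not try the next direction either
        options, best = trial, rate
    return options


def config_lines(options: Dict[str, str]) -> List[str]:
    """Format the chosen options as lines for the NAMD script."""
    return [f"{name} {value}" for name, value in options.items()]
```

The returned lines (for example "twoawayx yes") would then be appended to the configuration used for the production runs.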
Sometimes, depending on your MPI, it might be necessary to exclude slow networks from the computation. As you can see, eth0 has a lot of traffic, so make sure that you do not use mixed networks during your tests. I only know how this is done with Open MPI:
mpirun ... --mca btl ^tcp … # this excludes all TCP networks
mpirun … --mca btl openib # this includes only ibverbs
Also, your CPUs support HT; do you have it enabled? (It is better to disable it, to prevent processes from sharing the same physical core.)
Please report back on what the above changes do for you.
Norman Geist.
Wenchang
2014-12-18 3:39 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
Hi,
given the fact that you didn’t use the word “ibverbs” in your post, I suppose that you run your network traffic across IPoIB (ib0), is that right?
If so, could you please give me the output of:
cat /sys/class/net/ib0/m*
I suppose it will output something like:
datagram
2044
But it should be:
connected
65520
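The same check can be scripted for many nodes. The sketch below assumes the two files matched by the glob above are /sys/class/net/ib0/mode and /sys/class/net/ib0/mtu; the function name check_ipoib and its parameters are made up for this illustration.

```python
from pathlib import Path
from typing import Tuple


def check_ipoib(iface: str = "ib0", sysfs_root: str = "/sys/class/net") -> Tuple[str, int, bool]:
    """Return (mode, mtu, ok) for an IPoIB interface.

    ok is True only for connected mode with an MTU of 65520,
    the values named as the desired setting in this message.
    """
    base = Path(sysfs_root) / iface
    mode = (base / "mode").read_text().strip()
    mtu = int((base / "mtu").read_text().strip())
    ok = mode == "connected" and mtu == 65520
    return mode, mtu, ok


if __name__ == "__main__":
    mode, mtu, ok = check_ipoib()
    print(f"ib0: mode={mode}, mtu={mtu}, {'OK' if ok else 'check the IPoIB settings'}")
```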
Also please give the output of:
/sbin/ifconfig -a
Additionally, could we please see your benchmark data (time/step or days/ns) for the 1, 2, 4, 8 and 16 node cases?
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of 周文昌
Sent: Wednesday, 17 December 2014 22:13
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Asking help on results of our GPU benchmark
Dear all,
We are asking for help with our GPU benchmark results. If you have experience using GPUs, we would greatly appreciate your reading this (sorry for such a long letter).
We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2 processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU per node). The nodes are interconnected by a non-blocking FDR InfiniBand fat-tree topology. We are testing the scalability of NAMD, and are running into some issues.
It seems that for a system of ~370K atoms, we are unable to scale beyond 16 nodes. We've tried both custom-compiling NAMD and using pre-built binaries (running version 2.10 in both cases). We get the best performance when custom-compiling Charm++ and NAMD using Intel MPI version 5 (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per node (-np X -ppn 1, where X is the number of nodes) and 12 threads (++ppn 12). However, as mentioned, we are unable to scale beyond 16 nodes.
We've also tried building Charm++ without an underlying MPI library (charm architectures net-linux-x86_64-icc-ibverbs and net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance is slower than with the mpi-linux-x86_64 builds. When we run with "+p X ++ppn 12", the CPU time seems to be considerably less than the wall time, indicating that a lot of time is spent waiting for communication. We understand that the SMP version funnels everything through a single communication thread, but it is odd that this limits the scalability of the non-MPI builds of Charm++ so dramatically. We get somewhat better results from the non-SMP versions (+p 12*X), but they are still not as fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.
We should note that for non-CUDA (CPU only) NAMD, running with net-linux-x86_64-icc-ibverbs builds is substantially faster than the mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for the CUDA case the situation is reversed so dramatically. We feel that we may not understand the optimal way to run on our new cluster. Does anyone have experience running on a distributed cluster where each node has a single GPU (as opposed to multiple GPUs per node)? Are there any performance tuning and optimization hints that you can share?
We've tried several different sizes of systems (with 370K atoms being the biggest, down to 70K atoms) and we are just not seeing scalability like we see from the CPU-only version.
Thanks!
Wenchang
This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:23:09 CST