From: 周文昌 (wenchangyu2006_at_gmail.com)
Date: Fri Dec 19 2014 - 16:06:33 CST
Hi Norman,
Thanks for your suggestions. Among the things you suggested, twoAway[XYZ]
works for my system with 116K atoms: I got about 20% better performance (but
no difference with CPU only). I also doubled the numbers on a system with
315K atoms. Could you explain how NAMD distributes patches to GPU cores, and
why there is no difference when using CPU only? I have only tested on 8 nodes
so far and will continue to test on 16 and 32 nodes.
Thanks,
Wenchang
2014-12-19 1:30 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> *From:* 周文昌 [mailto:wenchangyu2006_at_gmail.com]
> *Sent:* Thursday, December 18, 2014 20:25
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> Hi Norman,
>
> Thanks for your time. We use ibverbs directly (I did mention ibverbs in
> the 4th paragraph).
>
> Ok, now I’ve seen it ^^
>
> If I do /sbin/ifconfig -a, the output is the following:
>
>
> eth0 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
> inet addr:10.1.3.1 Bcast:10.1.255.255 Mask:255.255.0.0
> inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
> TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6
> GiB)
> Memory:dfa20000-dfa3ffff
>
> eth1 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
> BROADCAST MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
> Memory:dfa00000-dfa1ffff
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:65536 Metric:1
> RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
> TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)
>
> Our numbers are below. I took the WallClock time at the end of each run
> (100,000 steps) instead of the "Benchmark time" lines in the NAMD output.
>
>
>
> Setup       Nodes   WallClock time (s)   ns/day
> CPU only        1               4372.5      0.4
> CPU+GPU         1               1220.4      1.4
> CPU+GPU         4                332.6      5.2
> CPU+GPU         8                208.2      8.3
> CPU+GPU        16                135.2     12.8
> CPU+GPU        32                106.3     16.3
> CPU+GPU        48                 97.5     17.7
>
>
>
> Doesn’t look that bad, although it feels like it should be better. Have
> you benchmarked your InfiniBand bandwidth already? FDR should do better
> here, especially in a fat-tree topology (a quick check is sketched after
> the list below). Some things you can try to generally improve the scaling
> of NAMD:
>
>
>
> 1. Generally add “+idlepoll” to the namd2 command line.
>
> 2. When using GPUs, try adding “twoAwayX yes” to the config file; if that
> helps, also add “twoAwayY yes”, and if that helps too, also add
> “twoAwayZ yes”. (This creates more patches and so might improve the
> scalability of your system; see the sketch after this list.)
>
> 3. Try turning the new PME reciprocal sum offload off/on with
> “PMEOffload no/yes” in the config file.
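>
> For reference, a minimal sketch of what 2. and 3. could look like in the
> NAMD config file (the twoAwayY/Z lines are commented out until twoAwayX
> alone has been confirmed to help; try PMEOffload both ways):
>
> twoAwayX   yes
> #twoAwayY  yes    ;# enable only if twoAwayX alone helped
> #twoAwayZ  yes    ;# enable only if X and Y both helped
> PMEOffload no     ;# compare against “yes”
>
> Item 1. is just an extra argument to the namd2 command, e.g.
> “namd2 +idlepoll <your config>”.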
>
>
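> A quick way to measure the raw point-to-point InfiniBand bandwidth, assuming
> the OFED perftest tools are installed (“nodeA” is a placeholder hostname):
>
> ib_write_bw          # on nodeA (server side)
> ib_write_bw nodeA    # on a second node (client side); prints the bandwidth
>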
>
> Sometimes, depending on your MPI, it might be necessary to exclude slow
> networks from the computation. As you can see, eth0 has a lot of traffic,
> so make sure that you are not using mixed networks during your tests. I
> only know how to do this with Open MPI:
>
>
>
> mpirun ... --mca btl ^tcp …       # this excludes all TCP networks
>
> mpirun … --mca btl self,openib    # this includes only ibverbs (plus the required “self” loopback BTL)
>
>
>
> Also, your CPUs support HT (Hyper-Threading); do you have it enabled? (It
> is better to disable it, to prevent processes from sharing the same
> physical core.)
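>
> One quick way to check, assuming util-linux’s lscpu is available:
>
> lscpu | grep -i thread    # “Thread(s) per core: 2” means HT is enabled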
>
>
>
> Please report back on what the above changes do for you.
>
>
>
> Norman Geist.
>
>
>
>
>
> Wenchang
>
>
>
> 2014-12-18 3:39 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> Hi,
>
>
>
> Given that you didn’t use the word “ibverbs” in your post, I suppose that
> you run your network traffic over IPoIB (ib0); is that right?
>
> If so, could you please give me the output of:
>
>
>
> cat /sys/class/net/ib0/m*
>
>
>
> I suppose it will output something like:
>
>
>
> datagram
>
> 2044
>
>
>
> But it should be:
>
>
>
> connected
>
> 65520
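>
> If it does report datagram mode, switching an interface to connected mode
> with the large MTU is typically done per node as root, roughly like this
> (the interface name may differ on your nodes):
>
> echo connected > /sys/class/net/ib0/mode
> ifconfig ib0 mtu 65520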
>
>
>
> Also please give the output of:
>
>
>
> /sbin/ifconfig -a
>
>
>
> Additionally, could we please see your benchmark data (time/step or
> days/ns) for the 1, 2, 4, 8, and 16 node cases?
>
>
>
> Norman Geist.
>
>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> behalf of *周文昌
> *Sent:* Wednesday, December 17, 2014 22:13
> *To:* namd-l_at_ks.uiuc.edu
> *Subject:* namd-l: Asking help on results of our GPU benchmark
>
>
>
> Dear all,
>
>
>
> We are asking for help here concerning our GPU benchmark results; if you
> have experience using GPUs, we would greatly appreciate your reading this
> (sorry for such a long letter).
>
>
> We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2
> processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU
> per node). The nodes are interconnected by a non-blocking FDR InfiniBand
> fat-tree topology. We are testing the scalability of NAMD, and are running
> into some issues.
>
>
>
> It seems that for a system of ~370K atoms, we are unable to scale beyond
> 16 nodes. We've tried both custom-compiling NAMD and using pre-built
> binaries (running version 2.10 in both cases). We get the best performance
> when custom-compiling Charm++ and NAMD with Intel MPI version 5
> (Charm++ arch mpi-linux-x86_64-smp). We then run with one MPI process per
> node (-np X -ppn 1, where X is the number of nodes) and 12 threads per
> process (++ppn 12). However, as mentioned, we are unable to scale beyond
> 16 nodes.
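>
> Concretely, an 8-node run with this setup looks roughly like the following
> (the config file name is a placeholder):
>
> mpirun -np 8 -ppn 1 ./namd2 ++ppn 12 benchmark.namd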
>
>
>
> We've also tried building Charm++ without an underlying MPI library (charm
> architectures net-linux-x86_64-icc-ibverbs and
> net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance
> is slower than with the mpi-linux-x86_64 builds. When we run with "+p X
> ++ppn 12" it seems like the CPU time is considerably less than wall time,
> indicating that a lot of time is spent waiting for communication. We
> understand that the SMP version funnels everything through a single
> communication thread, but it is weird that this so dramatically limits the
> scalability of the non-MPI builds of Charm++. We get somewhat better
> results from the non-SMP versions (+p 12*X), but they are still not as
> fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.
>
>
>
> We should note that for non-CUDA (CPU only) NAMD, running with
> net-linux-x86_64-icc-ibverbs builds is substantially faster than the
> mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for
> the CUDA case the situation is reversed so dramatically. We feel that we
> may not understand the optimal way to run on our new cluster. Does anyone
> have experience running on a distributed cluster where each node has a
> single GPU (as opposed to multiple GPUs per node)? Are there any
> performance tuning and optimization hints that you can share?
>
>
>
> We've tried several different system sizes (from 370K atoms, the biggest,
> down to 70K atoms), and we are just not seeing the scalability that we see
> from the CPU-only version.
>
>
>
> Thanks!
>
> Wenchang
>