From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Feb 27 2019 - 04:57:12 CST
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Homeo Morphism
Sent: Tuesday, 26 February 2019 18:38
Subject: namd-l: strange benchmark results -- how would you explain them and what would you advice?
There's a rig with a 6-physical (12-virtual) core CPU (supports up to 40 PCIe lanes) and 2 x Nvidia GPUs (both on PCIe 16x).
Version of NAMD is 2.13, GPU drivers are from September, 2018, CUDA is 10.0 -- everything is current, that is.
There's also a certain, in-house simulation (call it S1) modelling ligand-receptor interaction which I use as a benchmark.
I next run S1 on a single GPU for different values of +p-option, ranging from 1 to all 12 CPU cores, that is:
namd2 ... +devices 0 +p1
namd2 ... +devices 0 +p2
namd2 ... +devices 0 +p11
namd2 ... +devices 0 +p12
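A sweep like the one above can be scripted so that every core count is timed the same way. The following is only a minimal sketch: the configuration file name "s1.namd" used in the usage note is a placeholder (the actual arguments are elided in the post), and placing the +devices and +p options before the configuration file is an assumption about how the command is laid out.

```python
import subprocess
import time


def build_command(namd_binary, config, devices, cores):
    """Return the argv list for one NAMD run on the given GPU ids and core count."""
    device_list = ",".join(str(d) for d in devices)
    return [namd_binary, "+devices", device_list, f"+p{cores}", config]


def time_run(command):
    """Run one command to completion and return its wallclock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(namd_binary, config, devices, max_cores, repeats=3):
    """Time every core count from 1 to max_cores, keeping the best of `repeats` runs."""
    results = {}
    for cores in range(1, max_cores + 1):
        command = build_command(namd_binary, config, devices, cores)
        results[cores] = min(time_run(command) for _ in range(repeats))
    return results
```

Calling sweep("namd2", "s1.namd", [0], 12) would return a dictionary mapping each core count to its best wallclock time. Taking the minimum over a few repeats is one way to damp the slight run-to-run variation; a median would work as well.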
These are the timings (wallclock time, in seconds) that I get:
I've re-run this series several times and ignoring slight variations the numbers are stable.
This is because of hardware bottlenecks such as memory bandwidth, and because operating system processes usually occupy at least one core. It isn't at all surprising; you would see the same trend without GPUs.
I'm curious as to why the best results are attained at the number of CPU cores equaling 9 and 10, not 12.
If I run these simulations on both GPUs (+devices 0,1), the effects are the same -- p9 and p10 stand out as the best.
Question #1: Is there an explanation, based on the inner workings of NAMD, for why p9 and p10 give the best times rather than the maximum number of cores?
My second question pertains to running two instances of S1 in parallel, each on one of the GPUs. The execution is trivial:
namd2 ... +devices 0 +p6
namd2 ... +devices 1 +p6
Same thing here. Whatever stops one simulation from scaling will also stop two simulations from scaling. Try +p4 or +p5 per instance; PCIe bandwidth aside, performance might increase.
The wallclock time for each of them is about 33.5 seconds, which differs markedly from the 28.34 seconds in the p6 row of the table above, where only a single instance of S1 runs on a single GPU.
Question #2: How so, if every instance runs on its own dedicated GPU?
I realize that in a two-instance scenario the bus is more work-loaded, there's slightly more writing on the disk, and so on, but a 20%-loss...?
Is NAMD so dependent on bus's throughput, bandwidth, and whatnot?
This has nothing to do with NAMD, just with parallel computing. However, each program might utilize the hardware differently and so suffer more or less from individual bottlenecks.
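For reference, the two timings quoted in the question (28.34 s for one instance, about 33.5 s for each of two concurrent instances) can be put into numbers. This is plain arithmetic on the figures from the post; the function names are made up for illustration.

```python
def slowdown(single_seconds, concurrent_seconds):
    """Fractional increase in wallclock time relative to the single-instance run."""
    return concurrent_seconds / single_seconds - 1.0


def aggregate_speedup(single_seconds, concurrent_seconds, instances):
    """Speedup of running `instances` jobs at once versus one after another."""
    return instances * single_seconds / concurrent_seconds


single = 28.34     # one instance on one GPU with +p6 (from the post)
concurrent = 33.5  # each of two instances, one GPU each, +p6 (from the post)

print(f"per-instance slowdown: {slowdown(single, concurrent):.1%}")
print(f"aggregate speedup with 2 GPUs: {aggregate_speedup(single, concurrent, 2):.2f}x")
```

This prints a per-instance slowdown of 18.2% and an aggregate speedup of 1.69x. So each run takes roughly a fifth longer, but two simulations still finish about 1.7 times faster than they would if run one after the other on a single GPU.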
Question #3 (most important): With all these results in mind, does it even make sense to try to assemble a single 10-GPU node?
We are building a new rig, and one option we are considering is buying 10 Nvidia 2080 and hooking them up to a single very very multi-core CPU (or probably two CPUs on a single motherboard).
But will even the more powerful CPUs and motherboard be able to efficiently service ten modern GPUs?
I'm especially worried about that 20%-drop described in Question #2...
What if we wanted to run 10 instances of one simulation, with each running on its own dedicated GPU...
If even adding the second sim on the second GPU puts so much strain on the bus and CPU that there's a 20%-loss, how much of a loss should we expect if we tried to employ 10 GPUs in this fashion?
I don't expect concrete numbers so much as a confirmation that NAMD does indeed depend on the bus to a great extent and that a 10-GPU node can't possibly be run efficiently.
Or if it can, what CPU(s) and motherboard would you recommend?
You would have to carefully benchmark the impact on memory bandwidth due to increased number of used cores vs. the impact on PCIe bandwidth due to increased number of GPUs.
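One way to organise such a benchmark is to enumerate every combination of concurrent instances and cores per instance that fits on the machine, then time each combination. The sketch below only builds the plan and the command lines. It assumes one dedicated GPU per instance, GPU ids starting at 0, and a placeholder configuration file name; none of these names come from the post.

```python
def benchmark_plan(total_cores, total_gpus):
    """List (instances, cores_per_instance) pairs that fit on the machine.

    Each instance is assumed to get one dedicated GPU, so the number of
    instances never exceeds total_gpus, and cores are never oversubscribed.
    """
    plan = []
    for instances in range(1, total_gpus + 1):
        for cores in range(1, total_cores // instances + 1):
            plan.append((instances, cores))
    return plan


def commands_for(namd_binary, config, instances, cores):
    """One argv list per concurrent instance, each assigned its own GPU id."""
    return [
        [namd_binary, "+devices", str(gpu), f"+p{cores}", config]
        for gpu in range(instances)
    ]
```

For the 12-core, 2-GPU machine described above, benchmark_plan(12, 2) yields 18 combinations, from one instance on one core up to two instances on six cores each. Timing all of them should show whether the core count or the GPU count is what stops the scaling.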
Thank you very much,