Re: strange benchmark results -- how would you explain them and what would you advise?

From: Homeo Morphism (homeo.morphizm_at_gmail.com)
Date: Thu Feb 28 2019 - 11:43:55 CST

Thanks for replying.

What you said in answers to questions #1 and #2 makes total sense.
I re-ran the simulations described in question #2 with smaller values
of +p, and now I get the optimal results at +p3 or +p4 with the best
wallclock time being about 32 seconds.
Compared to the p3 and p4 values in the single-S1-on-a-single-GPU scenario,
the loss is 5 to 10%.
Of course, the best timings in that scenario are obtained at +p9, and
compared to them the loss is over 20%.
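For anyone who wants to reproduce this kind of sweep, here is a minimal sketch of how I run it. The config file name "s1.namd" is a placeholder, and it assumes NAMD's usual end-of-log summary line starting with "WallClock:":

```shell
#!/bin/sh
# Sweep the +p option from 1 to 12 threads on GPU 0 and report the
# wallclock time NAMD prints in its final summary line.
for p in $(seq 1 12); do
    namd2 s1.namd +devices 0 +p"$p" > "run_p${p}.log"
    # NAMD ends its log with a line like "WallClock: 28.34  CPUTime: ..."
    wall=$(awk '/^WallClock:/ {print $2; exit}' "run_p${p}.log")
    echo "p${p} ${wall}"
done
```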

What I take from all this is that NAMD strongly depends on CPU
performance...
It might be obvious to everyone here, but I'm very new to the whole field.

So... if we wanted to build a rig with 8 to 10 GPUs, I can't see how we
could avoid a dual-socket system, with each CPU having somewhere around 20
physical cores.
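On such a dual-socket box, I imagine each instance would need to be pinned to the socket closest to its GPU. A hypothetical launch sketch -- the core maps (+pemap ranges), the GPU-to-socket pairing, and the config name "s1.namd" are all assumptions to be checked against `nvidia-smi topo -m` and `lscpu` on the real machine:

```shell
#!/bin/sh
# One NAMD instance per GPU, with Charm++ thread affinity pinning each
# instance to (assumed) cores on the socket nearest its GPU.
# NAMD_BIN allows a dry run with a stub in place of the real binary.
NAMD_BIN=${NAMD_BIN:-namd2}
CONFIG=${CONFIG:-s1.namd}   # placeholder config name
"$NAMD_BIN" "$CONFIG" +devices 0 +p4 +setcpuaffinity +pemap 0-3 > gpu0.log 2>&1 &
"$NAMD_BIN" "$CONFIG" +devices 1 +p4 +setcpuaffinity +pemap 6-9 > gpu1.log 2>&1 &
wait
```

The point of +setcpuaffinity/+pemap is to keep each instance's threads from wandering across sockets and paying remote-memory penalties.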

But there still remains a question of PCIe bandwidth...
There are motherboards that support 64 lanes, which means 8 GPUs could be
run at 8x each.
But wouldn't that be overkill?
Perhaps running the GPUs at 4x, 2x, or even 1x PCIe wouldn't impair the
efficiency of the NAMD compute?

I'll try to run some tests and measure PCIe traffic on our current 2-GPU
rig.
In the meantime, I'd be grateful if the NAMD community could share what
they know on this subject.
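For the measurement itself, this is the sort of thing I have in mind -- a sketch assuming nvidia-smi's `dmon -s t` output format (rxpci/txpci columns in MB/s, comment lines starting with '#') and a placeholder config name:

```shell
#!/bin/sh
# Sample per-GPU PCIe throughput once per second while a benchmark runs,
# then average the receive (rxpci) column over the run.
nvidia-smi dmon -s t -d 1 > pcie_trace.txt &
MON=$!
namd2 s1.namd +devices 0 +p4
kill "$MON"
# dmon data rows are: <gpu> <rxpci> <txpci>; header rows start with '#'.
awk '!/^#/ {sum += $2; n++} END {if (n) printf "%.1f\n", sum/n}' pcie_trace.txt
```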

Oleg

On Wed, Feb 27, 2019 at 1:57 PM Norman Geist <norman.geist_at_uni-greifswald.de>
wrote:

>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> behalf of *Homeo Morphism
> *Sent:* Tuesday, February 26, 2019 18:38
> *To:* namd-l_at_ks.uiuc.edu
> *Subject:* namd-l: strange benchmark results -- how would you explain
> them and what would you advise?
>
>
>
> Greetings, everyone!
>
>
>
> There's a rig with a 6-physical (12-virtual) core CPU (supports up to 40
> PCIe lanes) and 2 x Nvidia GPUs (both on PCIe 16x).
>
> NAMD is version 2.13, the GPU drivers are from September 2018, and CUDA
> is 10.0 -- everything is current, that is.
>
>
>
> There's also a certain in-house simulation (call it S1) modelling
> ligand-receptor interaction, which I use as a benchmark.
>
> I then run S1 on a single GPU with different values of the +p option,
> ranging from 1 to all 12 CPU cores, that is:
>
>
>
> namd2 ... +devices 0 +p1
>
> namd2 ... +devices 0 +p2
>
> ...
>
> namd2 ... +devices 0 +p11
>
> namd2 ... +devices 0 +p12
>
>
>
> These are the timings (wallclock time, in seconds) that I get:
>
> p1 45.69
>
> p2 31.79
>
> p3 30.99
>
> p4 29.28
>
> p5 29.96
>
> p6 28.34
>
> p7 29.61
>
> p8 30.67
>
> p9 26.22
>
> p10 26.65
>
> p11 27.69
>
> p12 27.67
>
>
>
> I've re-run this series several times, and ignoring slight variations,
> the numbers are stable.
>
>
>
> Because of hardware bottlenecks such as memory bandwidth, and because
> operating system processes usually eat up at least one core. This isn't
> at all surprising; you would see the same trend without GPUs as well.
>
>
>
> I'm curious why the best results are attained with 9 and 10 CPU cores,
> not 12.
>
> If I run these simulations on both GPUs (+devices 0,1), the effects are
> the same -- p9 and p10 stand out as the best.
>
>
>
> Question #1: Could there be any explanation based on the understanding of
> inner-workings of NAMD why it's p9 and p10 that lead to optimal results,
> time-wise, and not the maximum number of cores?
>
>
>
> My second question pertains to running two instances of S1 in parallel,
> each on one of the GPUs. The execution is trivial:
>
> namd2 ... +devices 0 +p6
>
> namd2 ... +devices 1 +p6
>
>
>
> Same thing here. Whatever causes one simulation to stop scaling will also
> stop two simulations from scaling. Try +p4 or +p5; with less pressure on
> PCIe bandwidth, performance might increase.
>
>
>
> The wallclock time for both of them is about 33.5 seconds, which, if you
> look at the p6 row in the table above, differs markedly from the 28.34
> seconds when only a single instance of S1 runs on a single GPU.
>
>
>
> Question #2: How so, if every instance runs on its own dedicated GPU?
>
> I realize that in a two-instance scenario the bus carries more load,
> there's slightly more writing to disk, and so on, but a 20% loss...?
>
> Is NAMD so dependent on the bus's throughput, bandwidth, and whatnot?
>
>
>
> This has nothing to do with NAMD, just with parallel computing. However,
> each program might utilize the hardware differently and so suffer more or
> less from individual bottlenecks.
>
>
>
> Question #3 (most important): With all these results in mind, does it even
> make sense to try to assemble a single 10-GPU node?
>
> We are building a new rig, and one option we are considering is buying
> ten Nvidia 2080s and hooking them up to a single, very multi-core CPU (or
> possibly two CPUs on a single motherboard).
>
> But will even the more powerful CPUs and motherboard be able to
> efficiently service ten modern GPUs?
>
> I'm especially worried about that 20%-drop described in Question #2...
>
> What if we wanted to run 10 instances of one simulation, with each running
> on its own dedicated GPU...
>
> If even adding the second sim on the second GPU puts so much strain on the
> bus and CPU that there's a 20%-loss, how much of a loss should we expect if
> we tried to employ 10 GPUs in this fashion?
>
> It's not concrete numbers I'm after so much as confirmation that NAMD
> does indeed depend on the bus to a great extent and that a 10-GPU node
> can't possibly be run efficiently.
>
> Or if it can, what CPU(s) and motherboard would you recommend?
>
>
>
> You would have to carefully benchmark the impact on memory bandwidth from
> the increased number of cores in use against the impact on PCIe bandwidth
> from the increased number of GPUs.
>
>
>
> Thank you very much,
>
> Oleg
>
>
>

This archive was generated by hypermail 2.1.6 : Tue Sep 17 2019 - 23:20:35 CDT