strange benchmark results -- how would you explain them and what would you advise?

From: Homeo Morphism (homeo.morphizm_at_gmail.com)
Date: Tue Feb 26 2019 - 11:38:13 CST

Greetings, everyone!

I have a rig with a 6-core (12-thread) CPU that supports up to 40 PCIe
lanes, and two Nvidia GPUs, both on PCIe x16.
NAMD is version 2.13, the GPU drivers are from September 2018, and CUDA is
10.0, so everything is current.

There's also an in-house simulation (call it S1) modelling
ligand-receptor interaction, which I use as a benchmark.
I ran S1 on a single GPU for different values of the +p option, ranging
from 1 to all 12 logical cores, that is:

namd2 ... +devices 0 +p1
namd2 ... +devices 0 +p2
..
namd2 ... +devices 0 +p11
namd2 ... +devices 0 +p12

These are the timings (wallclock time, in seconds) that I get:
p1 45.69
p2 31.79
p3 30.99
p4 29.28
p5 29.96
p6 28.34
p7 29.61
p8 30.67
p9 26.22
p10 26.65
p11 27.69
p12 27.67

I've re-run this series several times and, ignoring slight variations, the
numbers are stable.
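
To make the shape of the table easier to see, here is a minimal Python
sketch that turns the timings into speedup over p1 and per-core efficiency.
The names (timings, summarize, best_p) are purely illustrative and have
nothing to do with NAMD itself; the numbers are copied from the table above.

```python
# Wallclock times in seconds from the table above, keyed by the +p value.
timings = {
    1: 45.69, 2: 31.79, 3: 30.99, 4: 29.28, 5: 29.96, 6: 28.34,
    7: 29.61, 8: 30.67, 9: 26.22, 10: 26.65, 11: 27.69, 12: 27.67,
}


def summarize(times):
    """Return (p, seconds, speedup vs p1, speedup per core) for each run."""
    base = times[1]
    rows = []
    for p in sorted(times):
        speedup = base / times[p]
        rows.append((p, times[p], speedup, speedup / p))
    return rows


rows = summarize(timings)
best_p = min(timings, key=timings.get)  # +p value with the lowest wallclock time

for p, secs, speedup, eff in rows:
    print(f"p{p:<3} {secs:6.2f} s  speedup {speedup:4.2f}x  efficiency {eff:6.1%}")
print(f"fastest: p{best_p} at {timings[best_p]:.2f} s")
```

Read this way, nearly all of the gain comes from going from p1 to p2; from
p2 onward every run falls between roughly 26 and 32 seconds.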

I'm curious as to why the best results are attained with 9 and 10 CPU
cores, not 12.
If I run these simulations on both GPUs (+devices 0,1), the effect is the
same -- p9 and p10 stand out as the best.

Question #1: Is there an explanation, based on an understanding of the
inner workings of NAMD, for why p9 and p10 give the best times rather than
the maximum number of cores?

My second question pertains to running two instances of S1 in parallel,
each on one of the GPUs. The execution is trivial:
namd2 ... +devices 0 +p6
namd2 ... +devices 1 +p6

The wallclock time for each of them is about 33.5 seconds, which differs
markedly from the 28.34 seconds in the p6 row of the table above, where
only a single instance of S1 runs on a single GPU.

Question #2: How so, if every instance runs on its own dedicated GPU?
I realize that in a two-instance scenario the bus is more heavily loaded,
there's slightly more writing to disk, and so on, but a loss of roughly
20%...?
Is NAMD really that dependent on the bus's throughput and bandwidth?
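
For the record, the arithmetic behind that figure can be sketched in a few
lines of Python. The variable and function names are illustrative only, the
33.5 s value is approximate, and the throughput comparison assumes the
alternative would be running the two jobs back to back with +p6 on one GPU.

```python
single_p6 = 28.34  # seconds, one instance alone with +p6 (from the table)
dual_p6 = 33.5     # seconds, each of two concurrent +p6 instances (approximate)


def slowdown(alone, concurrent):
    """Fractional increase in wallclock time when running concurrently."""
    return concurrent / alone - 1.0


loss = slowdown(single_p6, dual_p6)

# Two jobs run back to back take 2 * single_p6 seconds; run side by side
# they both finish after dual_p6 seconds.
throughput_gain = (2 * single_p6) / dual_p6

print(f"per-instance slowdown: {loss:.1%}")
print(f"aggregate throughput vs. sequential runs: {throughput_gain:.2f}x")
```

So each instance is a little over 18% slower, while the pair still gets
through about 1.7 times as much work as one GPU running the jobs in turn.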

Question #3 (most important): With all these results in mind, does it even
make sense to try to assemble a single 10-GPU node?
We are building a new rig, and one option we are considering is buying 10
Nvidia 2080s and hooking them up to a single CPU with a very high core
count (or probably two CPUs on a single motherboard).
But will even a more powerful CPU and motherboard be able to service ten
modern GPUs efficiently?
I'm especially worried about the roughly 20% drop described in Question #2...
What if we wanted to run 10 instances of one simulation, with each running
on its own dedicated GPU...
If merely adding a second sim on the second GPU puts so much strain on the
bus and CPU that there's a loss of roughly 20%, how much of a loss should
we expect if we tried to employ 10 GPUs in this fashion?
I'm not expecting concrete numbers so much as a confirmation that NAMD does
indeed depend on the bus to a great extent and that a 10-GPU node can't
possibly be run efficiently.
Or if it can, what CPU(s) and motherboard would you recommend?

Thank you very much,
Oleg
