From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed May 20 2015 - 12:28:37 CDT
It is not clear if you are using an ibverbs or ibverbs-smp binary, but in
either case you are oversubscribing your cores. When you specify +p64
that means 64 *worker* threads, but +pemap 0-63:16.15 is only 60 cores, so
you have 4 cores doubly occupied. You also have to specify ++ppn 15 to
break your 60 cores into 4 processes (assuming this is an smp binary);
without ++ppn you get one process per worker thread, and all of the
communication threads are lumped onto your 4 +commap cores.
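
To make the core counting above concrete, here is a minimal Python sketch that expands a map string into an explicit core list. The helper name expand_pemap is hypothetical (it is not part of NAMD or Charm++), and the grammar it assumes, lower[-upper[:stride[.run]]] per comma-separated item, is inferred from the maps used in this thread. It does not handle the "+offset" suffix.

```python
def expand_pemap(spec):
    """Expand a map such as '0-63:16.15' into a list of core IDs.

    Assumed grammar per comma-separated item: lower[-upper[:stride[.run]]].
    The '+offset' suffix is deliberately not handled in this sketch.
    """
    cores = []
    for item in spec.split(","):
        bounds, _, step = item.partition(":")
        lower, _, upper = bounds.partition("-")
        lower = int(lower)
        upper = int(upper) if upper else lower
        stride, _, run = step.partition(".")
        stride = int(stride) if stride else 1
        run = int(run) if run else 1
        for start in range(lower, upper + 1, stride):
            # Each stride step contributes `run` consecutive cores.
            cores.extend(c for c in range(start, start + run) if c <= upper)
    return cores


workers = expand_pemap("0-63:16.15")   # worker cores from the original command
comm = expand_pemap("15-63:16")        # communication cores
print(len(workers), comm)              # 60 [15, 31, 47, 63]
print(64 - len(workers))               # 4 worker threads left without a core of their own
```

With +p64 and only 60 cores in the map, the remaining 4 worker threads have to share cores that are already occupied.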
Also, Intel systems map the hyperthreads to the upper half of the core
numbering, so to avoid hyperthreads you want cores 0-31. NAMD can get
some benefit from hyperthreads on small node counts.
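
One way to check the hyperthread numbering on a particular machine is to read the sibling lists that Linux exposes under /sys/devices/system/cpu. The sketch below is only an illustration: the function names are made up for this example, and it assumes a Linux system where the topology files are present.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,32' or '0-1' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct hyperthread sibling groups, one per physical core."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    # On a system numbered as described above, each group would pair
    # a core in the lower half with one in the upper half, e.g. (0, 32).
    for group in sibling_groups():
        print(group)
```

If every group looks like (n, n+32), then cores 0-31 are the physical cores and 32-63 are their hyperthread siblings.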
You will probably get your best performance from an ibverbs-smp build
running something like this (for 4 nodes):
charmrun namd2 ++ppn 7 +p112 +pemap 0-31:8.7 +commap 7-31:8
If you want to try hyperthreads you need this:
charmrun namd2 ++ppn 14 +p224 +pemap 0-6,32-38,8-14,40-46,16-22,48-54,24-30,56-62 +commap 7-31:8
(In NAMD 2.10 you could do +pemap 0-31:8.7+32 instead.)
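
The numbers in the two commands above can be reproduced with a short Python sketch. The constants and function names are illustrative only. It assumes 4 nodes, 4 sockets of 8 cores per node, one process per socket with the last core of each socket reserved for the communication thread, and hyperthread siblings at core + 32.

```python
NODES = 4              # node count used in the example commands
SOCKETS = 4            # sockets per node (4x Xeon E7-4830)
CORES_PER_SOCKET = 8
HT_OFFSET = 32         # assumed: hyperthread sibling of core c is c + 32


def worker_cores(use_hyperthreads):
    """Worker cores on one node, skipping the last core of each socket."""
    cores = []
    for socket in range(SOCKETS):
        base = socket * CORES_PER_SOCKET
        run = list(range(base, base + CORES_PER_SOCKET - 1))
        cores.extend(run)
        if use_hyperthreads:
            # Follow each run with its hyperthread siblings.
            cores.extend(c + HT_OFFSET for c in run)
    return cores


def comm_cores():
    """One communication core per socket: the last core of each."""
    return [s * CORES_PER_SOCKET + CORES_PER_SOCKET - 1 for s in range(SOCKETS)]


def launch_counts(use_hyperthreads):
    """Return (++ppn, +p) for the assumed layout."""
    per_node = len(worker_cores(use_hyperthreads))
    ppn = per_node // SOCKETS          # worker threads per process
    return ppn, per_node * NODES       # total worker threads across nodes


print(comm_cores())          # [7, 15, 23, 31]
print(launch_counts(False))  # (7, 112)
print(launch_counts(True))   # (14, 224)
```

For a different node count, only NODES changes, and +p scales with it while ++ppn stays the same.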
Run "top" and then hit "1" in a terminal with at least 80 rows to confirm
that the processes are running where you intend them to.
Jim
On Mon, 18 May 2015, Thanassis Silis wrote:
> Hello everyone,
> I am running some relatively small NAMD simulations on a system of 6 blade processing servers. Each has the following specs:
>
> POWEREDGE M910 BLADE SERVER
> (4x) INTEL XEON E7-4830 PROCESSOR (2.13GHZ)
> 256GB MEMORY FOR 2/4CPU
> 900GB, SAS 6GBPS, 2.5-IN, 10K RPM HARD DISK
> MELLANOX CONNECT X3 QDR 40GBPS INFINIBAND
>
> Each of the 4 processors has 8 cores and, due to hyper-threading, 16 threads are available.
> Indeed, cat /proc/cpuinfo returns 64 cpus on each system.
>
> I have created a nodelist file using the InfiniBand interface IP addresses, and I am using the ibverbs NAMD executable. I have run several test simulations to figure out which setting minimizes processing time. Overall, for 64 cpus/system * 6 systems = 384 cpus, the processing time is lowest with "+p128 +setcpuaffinity".
>
> This seems odd as it is 1/3 of the available cpus. It's not half - which would seem sensible (if one of each core's threads works, it utilizes the full resources of the other thread of the core and this maximizes performance).
>
> One of the things I tried was to let the system decide which cpus to use, with
> charmrun namd2 ++nodelist nodelist +setcpuaffinity `numactl --show | awk '/^physcpubind/ {printf "+p%d +pemap %d",(NF-1),$2; for(i=3;i<=NF;++i){printf ",%d",$i}}'` sim.conf > sim.log
>
> and also to manually assign worker threads and communication threads. I may (or may not!) have managed that with the command
> charmrun namd2 ++nodelist nodelist +setcpuaffinity +p64 +pemap 0-63:16.15 +commap 15-63:16
> In the above command, I am not sure how I should "see" the 64 * 6 cpus: as 6 identical systems (so add +p64), or aggregated to 384 cpus (so add +p384 above)? I did try +p384 but it seems to be even slower; way too many threads are spawned.
>
> So I am fuzzy. Why do I get minimized process time when 1/3 of the 384 cpus are used and no manual settings are in place? Are charmrun and namd2 clever enough at this version (2.10) that they assign worker and comm threads automagically?
>
> Is there some other parameter that you suggest I should append, because at the very least using 1/3 of the cpus seems Very odd.
>
> Thank you for your time and input.
>