From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Nov 11 2014 - 07:03:53 CST
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Bin He
Sent: Tuesday, November 11, 2014 13:19
To: namd-l_at_ks.uiuc.edu; Norman Geist
Subject: Re: namd-l: Why CPU Usage is low when I run ibverbs-smp-cuda version NAMD
Hi,
Thanks a lot for your kind reply.
I am sorry that the timing data I provided was confusing.
So I used the default binaries (downloaded from the NAMD website) to test again.
The binaries I used:
NAMD_2.10b1_Linux-x86_64-multicore-CUDA.tar.gz
NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA.tar.gz
Hardware (per node):
CPU: E5-2670 x 2
GPU: K20m x 2
Network: IB
command:
1 node with multicore-CUDA version:
./namd2 +p16 +devices 0,1 ../workload/f1atpase2000/f1atpase.namd
1 node with ibverbs-smp-CUDA:
/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 16 ++ppn 8 ++nodelist nodelist ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86 _64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase2000/f1atpase.namd
With "++local", the application can not start. So I have to run with nodelist.
nodelist content
group main ++shell ssh
host node330
host node330
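(A minimal sketch of what this setup does, assuming standard Charm++ SMP semantics and with paths shortened for readability: charmrun starts ++p/++ppn SMP processes, one per "host" line, and each process runs ++ppn worker threads (PEs) plus one communication thread that is not counted in ++ppn.)
./charmrun ++p 16 ++ppn 8 ++nodelist nodelist ./namd2 +devices 0,1 f1atpase.namd
# -> 16/8 = 2 SMP processes on node330, each with 8 PEs + 1 comm thread
#    (18 threads in total on the 16 cores of the node)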
2 nodes with ibverbs-smp-CUDA:
/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 32 ++ppn 8 ++nodelist nodelist2node ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linu x-x86_64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase2000/f1atpase.namd
nodelist content
group main ++shell ssh
host node330
host node330
host node329
host node329
4 nodes with ibverbs-smp-CUDA:
/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 64 ++ppn 8 ++nodelist nodelist4node ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linu x-x86_64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase2000/f1atpase.namd
nodelist content
group main ++shell ssh
host node330
host node330
host node329
host node329
host node328
host node328
host node332
host node332
Timings (f1atpase; numsteps 2000; outputEnergies 100):

version            CPU/node   GPU/node   nodes   time (s)
multicore-CUDA     16         2          1       90
ibverbs-smp-CUDA   16         2          1       111.24
ibverbs-smp-CUDA   16         2          2       60
ibverbs-smp-CUDA   16         2          4       35
Actually, the ibverbs-smp-CUDA version does not scale badly. BUT the CPU usage:
Cpu(s): 53.1%us, 29.0%sy, 0.0%ni, 17.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
means that not all computing resources are used well.
Hey Binhe,
so far your benchmarking procedure is better now. Please note that the distribution of processes and threads can also influence performance, meaning that different ++ppn values can cause significantly different timings.
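(For illustration, a hedged sketch of alternative splits one could try on the same 16 cores; paths are shortened, and which split is fastest depends on the hardware. Remember that each SMP process adds one communication thread on top of its ++ppn PEs.)
./charmrun ++p 16 ++ppn 8 ++nodelist nodelist ./namd2 +devices 0,1 f1atpase.namd  # 2 processes x 8 PEs (+2 comm threads)
./charmrun ++p 16 ++ppn 4 ++nodelist nodelist ./namd2 +devices 0,1 f1atpase.namd  # 4 processes x 4 PEs (+4 comm threads)
./charmrun ++p 14 ++ppn 7 ++nodelist nodelist ./namd2 +devices 0,1 f1atpase.namd  # 2 processes x 7 PEs, leaving 2 cores for the comm threads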
And we can see that ibverbs-smp-CUDA is slower than multicore-CUDA on a single node. Yes, network bandwidth and latency may cause this, but the ibverbs version without CUDA scales well, and the CPU usage of the ibverbs version is fine when running on several nodes.
What you forget is that the demands on network bandwidth and latency scale with the computing power of the communication endpoints. So without CUDA, the endpoints' computing power is much lower than with CUDA. Generally speaking:
the more computing power per node -> the faster the partial problems are solved -> the more messaging is required -> the more waiting and redistribution of work occurs. ;)
So, I do not think network bandwidth and latency are the key reason. How can I increase the CPU usage and accelerate NAMD?
For any binary, CUDA or not, that runs across the network, add "+idlepoll" to the namd2 command line.
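(For example, a minimal sketch with shortened paths, adding +idlepoll to the single-node ibverbs run from above:)
./charmrun ++p 16 ++ppn 8 ++nodelist nodelist ./namd2 +idlepoll +devices 0,1 f1atpase.namd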
Cheers Norman Geist
Thanks
Binhe
------------------------
Best Regards!
Bin He
Member of IT
Unique Studio
Room 811, Building LiangSheng, 1037 Luoyu Road, Wuhan 430074, P.R. China
☎:(+86) 13163260252
Weibo:何斌_HUST
Email: binhe_at_hustunique.com
Email: binhe22_at_gmail.com
2014-11-11 16:48 GMT+08:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
Ok, you actually DON'T have a problem! You are comparing apples with oranges. To compare the performance of different binaries, you SHOULD use the same hardware. So you would want to test the ibverbs version on the machine with 4 GPUs + 16 cores, or vice versa, the multicore binary on one of the 2-GPU + 12-core nodes.
Apart from that, using multiple nodes introduces a new bottleneck, which is network bandwidth and latency. So you will always have losses due to the additional overhead and your CPUs spending time waiting for communication rather than working. This varies with system size (Amdahl's law). BUT actually your scaling isn't that bad: from 2 to 4 nodes it scales by 46% instead of the ideal 50% (you are missing the 1-node case, by the way).
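(For reference, a minimal statement of Amdahl's law, with s the serial/communication fraction of the work and N the number of nodes; it bounds the achievable speedup:)
S(N) = \frac{1}{s + (1 - s)/N}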
So don't worry about CPU usage, only about the actual timings. Also try adding "+idlepoll" to namd2, which can improve parallel scaling across the network.
Also, for CUDA and small systems, try in the config:
twoawayx yes
only if that brings improvement try
twoawayx yes
twoawayy yes
only if that brings improvement try
twoawayx yes
twoawayy yes
twoawayz yes
In most cases twoawayx is enough, or already too much.
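(A minimal sketch of where these options would go, appended to the existing f1atpase.namd config file; the placement is illustrative, and they should be enabled stepwise as described above:)
# at the end of f1atpase.namd
twoAwayX  yes
#twoAwayY yes   # enable only if twoAwayX alone already helped
#twoAwayZ yes   # enable only if X and Y both helped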
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Bin He
Sent: Monday, November 10, 2014 20:51
To: Norman Geist
Cc: namd-l_at_ks.uiuc.edu
Subject: Re: namd-l: Why CPU Usage is low when I run ibverbs-smp-cuda version NAMD
1. Using the servers mentioned above, I got the result:
multicore-CUDA:

GPU   cores   time (s)
4     16      64

ibverbs-smp-CUDA:

GPU          cores          nodes   time (s)
2 per node   12 per node    2       57
2 per node   12 per node    4       37
When running ibverbs-smp-cuda, the CPU user usage is less than 50% and the system usage is about 30%.
The CPU usage looks bad. What I want to do is find the reason why the CPU usage is so strange.
2. If I want to get the best performance with CUDA, which parameters in the config file can I modify?
------------------------
Best Regards!
Bin He
Member of IT
Unique Studio
Room 811, Building LiangSheng, 1037 Luoyu Road, Wuhan 430074, P.R. China
☎:(+86) 13163260252
Weibo:何斌_HUST
Email: binhe_at_hustunique.com
Email: binhe22_at_gmail.com
2014-11-10 14:53 GMT+08:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
What you observe might be expected, as the CUDA code of NAMD is officially tuned for the multicore version. BUT, do you actually notice any performance difference regarding time/step?
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Bin He
Sent: Saturday, November 8, 2014 08:25
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Why CPU Usage is low when I run ibverbs-smp-cuda version NAMD
Hi everyone,
I am new to NAMD.
The description of our cluster nodes:
CPU: E5-2670 (8 cores)
memory: 32 GB
sockets: 2
network: IB
GPU: K20m x 2
CUDA: 6.5
workload: f1atpase (numsteps 2000)
When I run the multicore NAMD version, the CPU usage is about 100% and the GPU usage is about 50%;
CMD:./namd2 +p16 +devices 0,1 ../workload/f1atpase/f1atpase.namd
CPU time is about 88 s.
When I run the ibverbs-smp-cuda version, the CPU usage is only about 40% us and 30% sy, and the GPU usage is about 50%.
CMD:/home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/charmrun ++p 60 ++ppn 15 ++nodelist nodelist ++scalable-start ++verbose /home/gpuusr/binhe/namd/NAMD_2.10b1_Linux-x86_64-ibverbs-smp-CUDA/namd2 +devices 0,1 /home/gpuusr/binhe/namd/workload/f1atpase/f1atpase.namd
CPU time is about 37 s.
When I try to use +setcpuaffinity, the result is even worse.
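(If affinity is used at all, it is usually combined with an explicit core map so that PEs and communication threads do not share cores. A minimal sketch for one 16-core node; the mapping and the shortened paths are hypothetical, not taken from the run above:)
./charmrun ++p 15 ++ppn 15 ++nodelist nodelist ./namd2 +setcpuaffinity +pemap 1-15 +commap 0 +devices 0,1 f1atpase.namd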
So what is wrong with my operation?
Thanks
------------------------
Best Regards!
Bin He
Member of IT
Unique Studio