From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Dec 03 2013 - 08:47:54 CST
Your switch is too slow at switching. Try something like the Netgear GS748T; it is not that expensive and scales “ok”. You can temporarily improve the situation by trying the TCP congestion control algorithm “highspeed”. Set it via sysctl on all the nodes.
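Something like the following minimal sketch (assuming a Linux kernel where the highspeed module is available; keys and module names may differ on your distribution):

# show which congestion control algorithms the kernel currently offers
sysctl net.ipv4.tcp_available_congestion_control
# load the highspeed module if it is not listed yet
modprobe tcp_highspeed
# activate it (repeat on every compute node)
sysctl -w net.ipv4.tcp_congestion_control=highspeed
# to keep the setting across reboots, add to /etc/sysctl.conf:
#   net.ipv4.tcp_congestion_control = highspeed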
Additionally, are these 16 cores per node physical or logical (HT)? If they are HT, leave them out; there is no speed gain, only more network load.
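A quick way to check this (a generic sketch, not specific to your nodes) is to compare the socket, core and thread counts that lscpu reports:

lscpu | grep -E '^(Socket|Core|Thread)'
# "Thread(s) per core: 2" means Hyper-Threading is enabled, so only half of the listed CPUs are physical cores
# "Thread(s) per core: 1" means all 16 cores per node are physical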
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of ???
Sent: Tuesday, December 3, 2013, 14:43
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: 50% system CPU usage when parallel running NAMD on Rocks cluster
Dear all,
I’m tuning NAMD performance on a 7-compute-node Rocks cluster. The problem is that when running NAMD (100,000 atoms) on 32 cores (2 nodes), the system CPU usage is about 50%. Adding more cores (48 cores on 3 nodes) increases the system CPU usage further and decreases the speed.
The detailed information for one compute node is shown below:
CPU: 2 * Intel Xeon E5-2670 (8 cores / 2.6 GHz)
Mem: 64 GB (1600 MHz)
Hard drive: 300 GB (15,000 RPM)
Network card: Intel Gigabit Ethernet Network Connection
Switch: 3Com Switch 2824 3C16479 (24-port unmanaged gigabit)(a pretty old switch :| )
Compiling & running:
Charm-6.4.0 was built with the “./build charm++ mpi-linux-x86_64 mpicxx -j16 --with-production” options. Some errors were ignored while compiling it, for example:
“Fatal Error by charmc in directory /apps/apps/namd/2.9/charm-6.4.0/mpi-linux-x86_64-mpicxx/tmp
Command mpif90 -auto -fPIC -I../bin/../include -O -c pup_f.f90 -o pup_f.o returned error code 1
charmc exiting...”.
NAMD was compiled with the Linux-x86_64-g++ option. Some warnings were shown while compiling NAMD.
OpenMPI (from the HPC roll of Rocks) was used to run NAMD. The command is:
“mpirun -np {number of cores} -machinefile hosts /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}”
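For reference, the “hosts” machinefile simply lists the compute nodes, one per line with a slot count; it might look like the following (the node names here are placeholders, not our real hostnames):

compute-0-0 slots=16
compute-0-1 slots=16
compute-0-2 slots=16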
SGE (Sun Grid Engine) was also used. The job submission command is:
“qsub -pe orte {number of cores} {job submitting script}”
The “job submitting script” contains the following (a variant that passes the SGE slot count explicitly is sketched after the script):
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
/opt/openmpi/bin/mpirun /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}
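A variant of the script that passes the SGE slot count explicitly might look like the sketch below ($NSLOTS is set by SGE for the orte parallel environment; with tight OpenMPI/SGE integration mpirun normally picks up the slots and hosts automatically, so this is only illustrative):

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
# $NSLOTS holds the number of slots granted by "qsub -pe orte {number of cores}"
/opt/openmpi/bin/mpirun -np $NSLOTS /apps/apps/namd/2.9/Linux-x86_64-g++/namd2 {configuration file} > {output file}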
Performance test:
The test system contains about 100,000 atoms. Running on 1 node with 16 cores (using mpirun), I got the following benchmark data:
1 node, 16 cores:
Info: Benchmark time: 16 CPUs 0.123755 s/step 0.716176 days/ns 230.922 MB memory
CPU usage:
Tasks: 344 total, 17 running, 327 sleeping, 0 stopped, 0 zombie
Cpu0 : 85.0%us, 15.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
....
2 nodes, 32 cores:
Info: Benchmark time: 32 CPUs 0.101423 s/step 0.586941 days/ns 230.512 MB memory
CPU usage:
Tasks: 344 total, 9 running, 335 sleeping, 0 stopped, 0 zombie
Cpu0 : 56.3%us, 43.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
....
3 nodes, 48 cores:
Info: Benchmark time: 48 CPUs 0.125787 s/step 0.727932 days/ns 228.543 MB memory
CPU usage:
Tasks: 344 total, 9 running, 335 sleeping, 0 stopped, 0 zombie
Cpu0 : 39.3%us, 60.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
....
The problem is obvious: when using 48 cores (on 3 nodes), the speed is slower than with 16 cores (on 1 node). Note that the number of running processes varies while NAMD is running; some processes are sleeping. :///
Other information (48 cores, 3 nodes):
vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
17 0 0 64395660 204864 389380 0 0 0 1 7 1 3 2 95 0 0
17 0 0 64399256 204864 389384 0 0 0 0 11367 2175 37 63 0 0 0
17 0 0 64403612 204864 389384 0 0 0 0 11497 2213 38 62 0 0 0
17 0 0 64397588 204864 389384 0 0 0 0 11424 2215 38 62 0 0 0
17 0 0 64396108 204864 389384 0 0 0 0 11475 2262 37 63 0 0 0
17 0 0 64400460 204868 389384 0 0 0 364 11432 2227 37 63 0 0 0
17 0 0 64401452 204868 389384 0 0 0 0 11439 2204 38 62 0 0 0
17 0 0 64405408 204868 389384 0 0 0 0 11400 2230 37 63 0 0 0
17 0 0 64396108 204868 389384 0 0 0 0 11424 2245 39 61 0 0 0
17 0 0 64395276 204868 389384 0 0 0 0 11396 2289 38 62 0 0 0
mpstat -P ALL 1 10
Average: CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
Average: all 37.27 0.00 61.80 0.00 0.03 0.90 0.00 0.00 11131.34
Average: 0 38.32 0.00 61.48 0.00 0.00 0.20 0.00 0.00 999.00
Average: 1 36.60 0.00 63.20 0.00 0.00 0.20 0.00 0.00 0.00
Average: 2 38.26 0.00 61.64 0.00 0.00 0.10 0.00 0.00 0.00
Average: 3 36.03 0.00 63.77 0.00 0.00 0.20 0.00 0.00 0.00
Average: 4 38.16 0.00 61.64 0.00 0.00 0.20 0.00 0.00 0.00
Average: 5 38.00 0.00 61.90 0.00 0.00 0.10 0.00 0.00 0.00
Average: 6 37.06 0.00 62.74 0.00 0.00 0.20 0.00 0.00 0.00
Average: 7 38.26 0.00 61.54 0.00 0.00 0.20 0.00 0.00 0.00
Average: 8 36.36 0.00 63.44 0.00 0.00 0.20 0.00 0.00 8.08
Average: 9 36.26 0.00 63.54 0.00 0.00 0.20 0.00 0.00 0.00
Average: 10 38.36 0.00 61.54 0.00 0.00 0.10 0.00 0.00 0.00
Average: 11 35.56 0.00 61.84 0.00 0.10 2.50 0.00 0.00 1678.64
Average: 12 35.66 0.00 61.34 0.00 0.10 2.90 0.00 0.00 1823.35
Average: 13 37.34 0.00 60.36 0.00 0.00 2.30 0.00 0.00 2115.77
Average: 14 36.90 0.00 60.40 0.00 0.10 2.60 0.00 0.00 2790.02
Average: 15 38.96 0.00 58.44 0.00 0.10 2.50 0.00 0.00 1716.67
iostat 1
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 19.00 0.00 200.00 0 200
sda1 19.00 0.00 200.00 0 200
sda2 0.00 0.00 0.00 0 0
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %system %iowait %steal %idle
39.10 0.00 60.90 0.00 0.00 0.00
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sda1 0.00 0.00 0.00 0 0
sda2 0.00 0.00 0.00 0 0
sda3 0.00 0.00 0.00 0 0
sda4 0.00 0.00 0.00 0 0
sda5 0.00 0.00 0.00 0 0
The speed is better if I use SGE (Sun Grid Engine) to submit the NAMD job.
1 node, 16 cores:
Info: Benchmark time: 16 CPUs 0.125926 s/step 0.728737 days/ns 230.543 MB memory
CPU usage:
Tasks: 346 total, 11 running, 335 sleeping, 0 stopped, 0 zombie
Cpu0 : 87.5%us, 12.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
....
2 nodes, 32 cores:
Info: Benchmark time: 32 CPUs 0.0742307 s/step 0.429576 days/ns 228.188 MB memory
CPU usage:
Tasks: 341 total, 8 running, 333 sleeping, 0 stopped, 0 zombie
Cpu0 : 72.0%us, 27.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
....
3 nodes, 48 cores:
Info: Benchmark time: 48 CPUs 0.0791372 s/step 0.45797 days/ns 174.879 MB memory
CPU usage:
Tasks: 324 total, 12 running, 312 sleeping, 0 stopped, 0 zombie
Cpu0 : 45.8%us, 53.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
....
In general, the benchmark data is:
mpirun:
1 node, 16 cores: 0.716176 days/ns, 15% system CPU usage
2 nodes, 32 cores: 0.586941 days/ns, 45% system CPU usage
3 nodes, 48 cores: 0.727932 days/ns, 60% system CPU usage
SGE:
1 node, 16 cores: 0.728737 days/ns, 15% system CPU usage
2 nodes, 32 cores: 0.429576 days/ns, 35% system CPU usage
3 nodes, 48 cores: 0.45797 days/ns, 50% system CPU usage
The number of running processes varies with both mpirun and SGE. The maximum data transfer rate is about 200 MB/s in these benchmarks.
As you can see, the scaling is bad; system CPU usage increases as more cores are used. I don't know why. Maybe it has something to do with our switch.
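To check whether the switch (or the TCP stack) is the bottleneck, I could measure raw node-to-node throughput and latency independently of NAMD, for example with iperf and ping (the hostnames below are placeholders):

# on one compute node, start an iperf server
iperf -s
# on a second node, measure TCP throughput to the first one
iperf -c compute-0-0 -t 10
# round-trip latency between the two nodes
ping -c 20 compute-0-0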
If you know anything about the problem, please tell me. I really appreciate your help!
Neil Zhou
School of Life Science, Tsinghua University, Beijing
China