Re: 50% system CPU usage when parallel running NAMD on Rocks cluster

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Dec 18 2013 - 01:28:34 CST

guys,

i think you are on a wild goose chase here. please look at the numbers first.

you have a very fast CPU and an interconnect with rather high
latency, yet the test system has "only" 100,000 atoms. to assess
parallel scaling you have to consider two competing effects:
- the more processor cores you use, the fewer work units each
processor gets
- the more processor cores you use, the more communication messages
you need to send

each message will *add* to the overall time based on latency (constant
amount of time per message) and bandwidth (added amount depends on the
chunk of data sent). so the more processors you use, the more overhead
you create. for systems with a very large number of atoms and few
processor cores this will primarily be due to bandwidth, for a smaller
number of atoms and more processors this will primarily be due to
latency.
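
to put a rough handle on this, a back-of-the-envelope model for the
time per step on P cores with N atoms would be (the symbols are mine,
not anything NAMD reports):

  t_{step}(P) \approx t_{comp}(N)/P + n_{msg}(P) \cdot t_{lat} + d(N,P)/BW

the compute term shrinks as you add cores, but the number of messages
n_{msg}(P) grows, and over gigabit ethernet the per-message latency
t_{lat} is on the order of tens of microseconds, so for a 100,000 atom
system the latency term catches up with the compute term after only a
few nodes.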

now, a lot of the good performance of NAMD comes from the fact that it
can "hide" the cost of communication behind computation (which sets it
apart from many other MD codes and is mostly due to using the charm++
library), but that only goes so far, and beyond that point performance
drops off quickly (unlike other MD codes, which degrade more
gradually). so for this kind of setup and a rather small system, i
would say getting decent scaling to two nodes (32 processors) is quite
good, and expecting it to go much further ignores the fundamental
limitations of the hardware and of the parallelization strategy in
NAMD. you can tweak it, but only up to a point.

what might be worth investigating is the impact of using an SMP
executable vs. a regular TCP/UDP/MPI-only binary, but i would not get
my hopes up too high. with < 3,000 atoms per CPU core, you don't have
a lot of work left to hide communication behind. so if you seriously
need to go below that and scale to more nodes, you need to invest a
lot of money into a low-latency interconnect.
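
in case it is useful, the rough build-and-launch sequence for such a
net-smp binary looks something like the sketch below. arch names and
flags are from the charm-6.5 / NAMD 2.9 era and may differ in your
tree, the thread counts assume two 16-core nodes, and ~/nodelist and
run.namd stand for your own nodelist file and input, so treat it as a
sketch to adapt rather than a recipe:

# build a standalone (non-MPI) smp charm++
./build charm++ net-linux-x86_64 smp --with-production

# point the NAMD build at that charm++ arch and compile
./config Linux-x86_64-g++ --charm-arch net-linux-x86_64-smp
cd Linux-x86_64-g++ && make

# one process per node: 15 worker threads plus 1 communication thread
# each (check the NAMD release notes for the exact flags of your build)
charmrun ++nodelist ~/nodelist +p30 ++ppn 15 namd2 +setcpuaffinity run.namd

the point of the smp binary is that all cores of a node funnel their
communication through one thread, so you get fewer network endpoints
per node and intra-node communication stays in shared memory instead
of going through the network stack.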

axel.

On Tue, Dec 17, 2013 at 3:12 PM, 周昵昀 <malrot13_at_gmail.com> wrote:
> [root@c1 ~]# ifconfig
> eth1 Link encap:Ethernet HWaddr F8:0F:41:F8:51:B2
> inet addr:10.1.255.247 Bcast:10.1.255.255 Mask:255.255.0.0
> UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
> RX packets:324575235 errors:0 dropped:0 overruns:0 frame:0
> TX packets:309132521 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:471233795133 (438.8 GiB) TX bytes:472407792256 (439.9
> GiB)
> Memory:dfd20000-dfd40000
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:44327434 errors:0 dropped:0 overruns:0 frame:0
> TX packets:44327434 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:242299288622 (225.6 GiB) TX bytes:242299288622 (225.6
> GiB)
>
> The MTU was changed for some tests, but it makes no difference whether it is
> 1500 or 9000. Normally, the MTU is 1500.
>
> [root@c1 ~]# ethtool eth1
> Settings for eth1:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: pumbg
> Wake-on: g
> Current message level: 0x00000003 (3)
> Link detected: yes
>
> [root@c1 ~]# ethtool -k eth1
> Offload parameters for eth1:
> Cannot get device udp large send offload settings: Operation not supported
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: off
> generic-receive-offload: on
>
> [root@c1 ~]# ethtool -c eth1
> Coalesce parameters for eth1:
> Adaptive RX: off TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
>
> rx-usecs: 3
> rx-frames: 0
> rx-usecs-irq: 0
> rx-frames-irq: 0
>
> tx-usecs: 3
> tx-frames: 0
> tx-usecs-irq: 0
> tx-frames-irq: 0
>
> rx-usecs-low: 0
> rx-frame-low: 0
> tx-usecs-low: 0
> tx-frame-low: 0
>
> rx-usecs-high: 0
> rx-frame-high: 0
> tx-usecs-high: 0
> tx-frame-high: 0
>
>
> [root@c1 ~]# sysctl -a | grep tcp
> sunrpc.tcp_slot_table_entries = 16
> net.ipv4.netfilter.ip_conntrack_tcp_max_retrans = 3
> net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 0
> net.ipv4.netfilter.ip_conntrack_tcp_loose = 1
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
> net.ipv4.tcp_slow_start_after_idle = 1
> net.ipv4.tcp_dma_copybreak = 4096
> net.ipv4.tcp_workaround_signed_windows = 0
> net.ipv4.tcp_base_mss = 512
> net.ipv4.tcp_mtu_probing = 0
> net.ipv4.tcp_abc = 0
> net.ipv4.tcp_congestion_control = highspeed
> net.ipv4.tcp_tso_win_divisor = 3
> net.ipv4.tcp_moderate_rcvbuf = 1
> net.ipv4.tcp_no_metrics_save = 0
> net.ipv4.tcp_low_latency = 0
> net.ipv4.tcp_frto = 0
> net.ipv4.tcp_tw_reuse = 0
> net.ipv4.tcp_adv_win_scale = 2
> net.ipv4.tcp_app_win = 31
> net.ipv4.tcp_rmem = 4096 87380 4194304
> net.ipv4.tcp_wmem = 4096 16384 4194304
> net.ipv4.tcp_mem = 196608 262144 393216
> net.ipv4.tcp_dsack = 1
> net.ipv4.tcp_ecn = 0
> net.ipv4.tcp_reordering = 3
> net.ipv4.tcp_fack = 1
> net.ipv4.tcp_orphan_retries = 0
> net.ipv4.tcp_max_syn_backlog = 1024
> net.ipv4.tcp_rfc1337 = 0
> net.ipv4.tcp_stdurg = 0
> net.ipv4.tcp_abort_on_overflow = 0
> net.ipv4.tcp_tw_recycle = 0
> net.ipv4.tcp_syncookies = 1
> net.ipv4.tcp_fin_timeout = 60
> net.ipv4.tcp_retries2 = 15
> net.ipv4.tcp_retries1 = 3
> net.ipv4.tcp_keepalive_intvl = 75
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_time = 7200
> net.ipv4.tcp_max_tw_buckets = 180000
> net.ipv4.tcp_max_orphans = 65536
> net.ipv4.tcp_synack_retries = 5
> net.ipv4.tcp_syn_retries = 5
> net.ipv4.tcp_retrans_collapse = 1
> net.ipv4.tcp_sack = 1
> net.ipv4.tcp_window_scaling = 1
> net.ipv4.tcp_timestamps = 1
> fs.nfs.nfs_callback_tcpport = 0
> fs.nfs.nlm_tcpport = 0
>
> The TCP congestion control algorithm was changed with the following command:
> "echo highspeed > /proc/sys/net/ipv4/tcp_congestion_control"
> but it brought no obvious improvement.
>
> MPI version is:
> $ mpicxx -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --enable-shared --enable-threads=posix
> --enable-checking=release --with-system-zlib --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-libgcj-multifile
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk
> --disable-dssi --disable-plugin
> --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic
> --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)
>
> [root@c1 bin]# mpirun -V
> mpirun (Open MPI) 1.4.3
>
> I used mpirun to run NAMD through SGE. Both NAMD and charm++ were compiled on
> the frontend node and run on the compute nodes. I think the environment is
> the same everywhere because of how Rocks Cluster works.
>
> Yesterday, I compiled the UDP version of charm++ and NAMD with gcc instead of
> mpicxx. I then used charmrun to run NAMD and got some "interesting"
> benchmark data:
>
> 91% user CPU, 4% system CPU, 5% idle CPU
>
> Info: Benchmark time: 16 CPUs 0.093186 s/step 0.539271 days/ns 65.9622 MB
> memory
>
> Info: Benchmark time: 16 CPUs 0.0918341 s/step 0.531448 days/ns 66.1155 MB
> memory
>
> Info: Benchmark time: 16 CPUs 0.0898816 s/step 0.520148 days/ns 66.023 MB
> memory
>
>
> total network speed = 200 Mb/s, 40% user CPU, 5% system CPU, 54% idle CPU
>
> Info: Benchmark time: 32 CPUs 0.124091 s/step 0.71812 days/ns 57.4477 MB
> memory
>
> Info: Benchmark time: 32 CPUs 0.123746 s/step 0.716121 days/ns 57.4098 MB
> memory
>
> Info: Benchmark time: 32 CPUs 0.125931 s/step 0.728767 days/ns 57.6321 MB
> memory
>
>
> total network speed = 270 Mb/s, 28% user CPU, 5% system CPU, 66% idle CPU
>
> Info: Benchmark time: 48 CPUs 0.133027 s/step 0.769833 days/ns 55.1507 MB
> memory
>
> Info: Benchmark time: 48 CPUs 0.135996 s/step 0.787013 days/ns 55.2202 MB
> memory
>
> Info: Benchmark time: 48 CPUs 0.135308 s/step 0.783031 days/ns 55.2494 MB
> memory
>
>
> total network speed = 340 Mb/s, 24% user CPU, 5% system CPU, 70% idle CPU
>
> Info: Benchmark time: 64 CPUs 0.137098 s/step 0.793394 days/ns 53.4818 MB
> memory
>
> Info: Benchmark time: 64 CPUs 0.138207 s/step 0.799812 days/ns 53.4665 MB
> memory
>
> Info: Benchmark time: 64 CPUs 0.137856 s/step 0.797777 days/ns 53.4743 MB
> memory
>
> There was not much system CPU usage anymore, but the idle CPU went up as
> more cores were used. I guess the "high idle CPU" in the UDP version has
> something to do with the "high system CPU usage" in the MPI version.
>
>
> Neil Zhou
>
>
> 2013/12/16 Norman Geist <norman.geist_at_uni-greifswald.de>
>>
>> Additionally, what MPI are you using, or do you use charm++?
>>
>>
>>
>> Norman Geist.
>>
>>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.
