From: Jan Saam (saam_at_charite.de)
Date: Wed Jun 21 2006 - 09:24:10 CDT
Thanks a lot for your comments!
As you could read in the other email, fixing the clock skew helped to some
extent, but not sufficiently.
Our cluster runs RedHat 9 and the nodes are connected via Gigabit Ethernet.
The network architecture is similar to yours:
The nodes have local addresses:
They are not visible from the internet, but they are connected to a
frontend machine which is connected to the internet (100 Mbit).
So, you think the private addresses 192.168... are the problem?
I guess by official addresses you mean the ones from the nameserver? I
can't set these since the nodes are not visible from outside. :-(
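
To see which address each node name actually resolves to, something like the following could be run on the frontend. It is only a sketch: `classify_address` and `resolve_nodes` are names made up for this example (they are not part of NAMD, Charm++ or MPI), and it assumes Python 3 is available on the machine.

```python
import ipaddress
import socket


def classify_address(ip_string):
    """Return 'private' or 'public' for a dotted IP address string."""
    # is_private is True for 10/8, 172.16/12 and 192.168/16,
    # and also for loopback and link-local ranges.
    return "private" if ipaddress.ip_address(ip_string).is_private else "public"


def resolve_nodes(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to (ip, 'private'|'public'), or (None, error text)."""
    results = {}
    for name in hostnames:
        try:
            ip = resolver(name)
            results[name] = (ip, classify_address(ip))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[name] = (None, str(exc))
    return results
```

Calling `resolve_nodes(["BPU1", "BPU5"])` with the names from the machines file would show whether they map to the 192.168... addresses or to something else, and whether any name fails to resolve at all.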
> Hi Jan and all who can help!
> I have the same or a similar problem not only with NAMD, but also with
> GAMESS, a quantum chemistry program. At least in my case the
> tremendous increase in execution time using 2 or more nodes is not
> connected with clock skew. We use a cluster of 6 PCs and the
> performance does _not_ depend on clock synchronization (exact
> synchronization of one slave with the master and a time skew of 2
> hours for another slave). However the execution time seems to depend
> in a puzzling manner on network configuration: The nodes are connected
> with a Gigabit switch and Gigabit LAN and they have private
> IP-addresses and private names, but with a second 100 Mbit network
> card in each PC they can also be connected with the internet (official
> IP-addresses, official hostnames). Using private names in the nodes
> file, the wallclock time increases dramatically for two nodes compared
> to one node. With official hostnames however, the wall clock time
> decreases slightly with two nodes (too little I suppose).
> Timing for a 1000-step simulation:
> One node: Wall clock 125 s, CPU time 121 s.
> Two nodes (one or both private addresses): Wall clock 250 s, CPU time 94 s.
> Two nodes (both official addresses): Wall clock 104 s, CPU time 73 s.
> Jan, which Linux distribution do you have? We have installed SUSE 9.3
> on the cluster.
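
Putting the quoted timings into numbers: a minimal sketch that computes speedup and parallel efficiency from the wall-clock times above. It assumes the one-node figure of 125 s is wall-clock time, and the function name `speedup_and_efficiency` is just made up for this illustration.

```python
def speedup_and_efficiency(t_one_node, t_n_nodes, n_nodes):
    """Speedup = T1 / Tn; parallel efficiency = speedup / n."""
    speedup = t_one_node / t_n_nodes
    return speedup, speedup / n_nodes


# Wall-clock times (seconds) from the quoted 1000-step run on two nodes.
for label, wall in [("private addresses", 250.0), ("official addresses", 104.0)]:
    s, e = speedup_and_efficiency(125.0, wall, 2)
    print(f"{label}: speedup {s:.2f}, efficiency {e:.0%}")
```

So the private-address run is a 2x slowdown, while the official-address run reaches a speedup of about 1.2, i.e. roughly 60% efficiency on two nodes.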
> Jan Saam wrote:
>> I forgot to say that I checked already that the problem is not ssh
>> taking forever to make a connection.
>> This is at least proven by this simple test:
>> time ssh BPU5 pwd
>> real 0m0.236s
>> user 0m0.050s
>> sys 0m0.000s
>> Jan Saam wrote:
>>> Hi all,
>>> I'm experiencing some weird performance problems with NAMD or the
>>> Charm++ library on a Linux cluster:
>>> When I'm using NAMD or a simple Charm++ demo program on one node
>>> everything is fine, but when I use more than one node each step takes
>>> _very_ much longer!
>>> 2s for the program queens on 1 node, 445s on 2 nodes!!!
>>> on 1 LINUX ch_p4 processors
>>> There are 14200 Solutions to 12 queens. Finish time=1.947209
>>> End of program
>>> [jan_at_BPU1 queens]$ mpirun -v -np 2 -machinefile ~/machines ./pgm 12 6
>>> on 2 LINUX ch_p4 processors
>>> There are 14200 Solutions to 12 queens. Finish time=445.547998
>>> End of program
>>> The same is true when I'm building the net-linux versions instead of
>>> mpi-linux, thus the problem is probably independent of MPI.
>>> One thing I noticed is that there is a several minute clock skew between
>>> the nodes. Could that be part of my problem (unfortunately I don't have
>>> rights to simply synchronize the clocks)?
>>> Does anyone have an idea what the problem could be?
>>> Many thanks,
--
---------------------------
Jan Saam
Institute of Biochemistry
Charite Berlin
Monbijoustr. 2
10117 Berlin
Germany
+49 30 450-528-446
saam_at_charite.de