From: Cesar Luis Avila (cavila_at_fbqf.unt.edu.ar)
Date: Sat Aug 26 2006 - 17:00:08 CDT
I have also noticed that running one thread per node using 6 nodes is
faster than running two threads per node using only 3 nodes.
Cesar Luis Avila wrote:
> There was a problem with the load balancer in a previous version of NAMD
> (2.6.B1), for which there was a workaround on NAMD's wiki. In 2.6.B2 it
> seems to be solved. I have run NAMD_2.6b2_Linux-amd64-TCP on AMD64
> dual-core nodes for a week now and haven't experienced that problem. I
> am still experiencing some problems which I think might be related to
> charm++ or perhaps to the kernel itself. I suspect there is a problem
> with memory management when using both processors of each node. I saw
> these problems even with the APOA1 simulation. Unfortunately, I don't know how
> to track down the problem. For now I am running simulations using only
> one processor on each node to test this hypothesis.
> I am using Debian Cluster Components (DCC) with a custom-compiled kernel
> 18.104.22.168 SMP.
> Leandro Martínez wrote:
>> Just to clarify the problem a little more:
>> I have now run the simulation on a single node (the
>> master machine), which has two processors. It starts
>> out fine, with two processes, one per processor, each
>> using almost all of the CPU, as expected,
>> but eventually it prints the message:
>> Info: Adjusted background load on 1 nodes.
>> After that, the simulation runs on only one processor.
>> Any clue as to what may be going wrong?
>> On 8/25/06, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
>> Hi all,
>> I'm running a simulation with NAMD_2.6b2_Linux-amd64-TCP on
>> a cluster of nine Athlon64 nodes (each processor is dual-core,
>> so there are actually 18 processors). I'm having some
>> strange problems with simulations I have already run on several
>> other machines, and I have not been able to find a solution.
>> Basically, I start the simulation and eventually it either
>> stops without printing any error message or apparently continues
>> on only one processor. The only message I have
>> observed that differs from our previous runs is this one:
>> Info: Adjusted background load on 11 nodes.
>> It is printed the first time load balancing is performed. The
>> error does not necessarily occur right after that,
>> but the message may be part of the problem, since the simulation was
>> set to run on 18 processors (9 nodes).
>> The only time I got an error message, it was the one below, which, as you
>> may note, was printed after quite a long simulation time.
>> The error is not easily reproducible: it always happens,
>> but not at the same point of the simulation each time.
>> Any help or ideas will be appreciated.
>> ENERGY: 644800 804.7671 2363.3700 1332.0255
>> 131.9843 -201929.9812 17508.6136 0.0000
>> 0.0000 32575.8361 -147213.3846 297.3932
>> -147116.7637 -147117.4476 296.8970
>> Stack Traceback:
>>  /lib64/libc.so.6 [0x360b32f7c0]
>>  _ZN11WorkDistrib12enqueueBondsEP12LocalWorkMsg+0x16
>>  CkDeliverMessageFree+0x21 [0x785aab]
>>  _Z15_processHandlerPvP11CkCoreState+0x455 [0x7850b5]
>>  CsdScheduleForever+0xa2 [0x7f1752]
>>  CsdScheduler+0x1c [0x7f1350]
>>  _Z10slave_initiPPc+0x10 [0x4bb034]
>>  _ZN7BackEnd4initEiPPc+0x28f [0x4bb019]
>>  main+0x47 [0x4b697f]
>>  __libc_start_main+0xf4 [0x360b31d084]
>>  _ZNSt8ios_base4InitD1Ev+0x42 [0x4b2c9a]
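
One way to narrow down when the slowdown begins is to scan the NAMD log for the load balancer message and note the last ENERGY step printed before it. The following is only a minimal sketch in Python, assuming the log lines look like the ones quoted in this thread; the function name scan_namd_log and the sample log are made up for illustration and are not part of NAMD.

```python
import re

# Matches the load balancer message quoted in this thread.
BACKGROUND_RE = re.compile(r"^Info: Adjusted background load on (\d+) nodes\.")


def scan_namd_log(lines):
    """Return (last_energy_step, node_count) for each load balancer message.

    last_energy_step is the most recent ENERGY: step seen before the
    message, or None if no ENERGY: line has appeared yet.
    """
    events = []
    last_step = None
    for line in lines:
        line = line.strip()
        if line.startswith("ENERGY:"):
            fields = line.split()
            # The first field after "ENERGY:" is the timestep.
            if len(fields) > 1 and fields[1].isdigit():
                last_step = int(fields[1])
            continue
        match = BACKGROUND_RE.match(line)
        if match:
            events.append((last_step, int(match.group(1))))
    return events


# Made-up sample log for illustration; a real log has many more columns.
sample_log = [
    "ENERGY: 100 804.7671 2363.3700",
    "Info: Adjusted background load on 11 nodes.",
    "ENERGY: 200 804.7671 2363.3700",
    "Info: Adjusted background load on 1 nodes.",
]
events = scan_namd_log(sample_log)
for step, nodes in events:
    print(f"after step {step}: background load adjusted on {nodes} nodes")
```

For a real run, the lines of the log file could be passed in place of sample_log, and the reported steps compared with the point where the per-step timing changes.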