From: Josef Scheiber (Josef.Scheiber_at_mail.uni-wuerzburg.de)
Date: Tue Mar 29 2005 - 01:01:34 CST
Hi Brady & all,
We had the same problem. The cause was a faulty network switch that
appeared to work properly but occasionally delivered data incorrectly,
which caused a node to crash. Replacing the switch solved the problem
immediately. It may be worth checking yours.
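If it helps, here is a rough sketch of one way to look for a flaky link: ping every host in the nodelist and print the packet-loss summary. It assumes the nodelist uses the usual charmrun layout with lines of the form "host <name>", and that the Linux ping is installed. The function names nodelist_hosts and check_nodes are made up for this example. An intermittent switch fault may well not show up in a short ping run, so treat a clean result as inconclusive.

```shell
#!/bin/sh
# Sketch only: look for packet loss to the hosts named in a charmrun nodelist.
# Assumption: hosts appear on lines of the form "host <name> [options]".

# Print the host names found in the nodelist file given as $1.
nodelist_hosts() {
    awk '$1 == "host" { print $2 }' "$1"
}

# Ping each host in nodelist $1, $2 times (default 20), and print the
# packet-loss summary line reported by ping for that host.
check_nodes() {
    count=${2:-20}
    for h in $(nodelist_hosts "$1"); do
        printf '%s: ' "$h"
        ping -c "$count" "$h" | grep 'packet loss' || echo 'no reply'
    done
}
```

After sourcing the file, something like `check_nodes ./.nodelist 100` would run it against the nodelist from the original post. A longer run, repeated while the job is under load, is more likely to catch a fault that only appears occasionally.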
brady chang wrote:
> Hi all, I'm having a very peculiar problem with NAMD.
> I was wondering if anybody has seen this?
> Platform Rocks 3.3:
> dual xeon; ASUS PRDL533 MOBO.
> #!/bin/csh -f
> setenv CONV_RSH ssh
> ~~/apps/NAMD/NAMD_2.5_Linux-i686-TCP/namd2 +p26 ++verbose ++nodelist
> ./.nodelist md_1ns.inp >logmd
> After running for ~12 hours I get
> Charmrun: error on request socket--
> Socket closed before recv.
> and it brought the node down.
> I modified my .nodelist to exclude the downed node;
> then, after running for ~4 hours, I got the same error and it brought
> down another node.
> So I'm running it again, excluding the downed nodes.
> Temperature is normal and load is average. I'm not seeing anything that
> could cause the node to go down.