Input/output error

From: (corarbor_at_163.com)
Date: Tue May 18 2010 - 01:55:20 CDT

Dear NAMD users:
I am running NAMD on the Dawning5000A supercomputer (http://www.ssc.net.cn/en/resources.asp). However, I have found that my NAMD processes are fragile on this platform. They usually die with an input/output error while writing the *.restart.coor, *.restart.vel or *.restart.xsc files. An example of the standard output is below:

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 4331000
WRITING COORDINATES TO DCD FILE AT STEP 4331000
WRITING COORDINATES TO RESTART FILE AT STEP 4331000
FATAL ERROR: Error on write to binary file coord.restart.coor: Input/output error
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Error on write to binary file coord.restart.coor: Input/output error

[0] Stack Traceback:
  [0] CmiAbort+0x2b [0x8b68f1]
  [1] _Z8NAMD_errPKc+0x84 [0x4d1444]
  [2] _ZN6Output17write_binary_fileEPciP6Vector+0xda [0x78d8ea]
  [3] _ZN6Output26output_restart_coordinatesEP6Vectorii+0x1d1 [0x78e001]
  [4] _ZN6Output10coordinateEiiP6VectorP11FloatVectorR7Lattice+0x1b2 [0x78ef92]
  [5] _ZN16CollectionMaster16receivePositionsEP16CollectVectorMsg+0x1f1 [0x4dd971]
  [6] CkDeliverMessageFree+0x38 [0x8583c0]
  [7] _Z15_processHandlerPvP11CkCoreState+0x982 [0x85dbbe]
  [8] CmiHandleMessage+0x27 [0x8b7f28]
  [9] CsdScheduleForever+0x64 [0x8b9a58]
  [10] CsdScheduler+0xd [0x8b9adb]
  [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x156 [0x7ceef6]
  [12] TclInvokeStringCommand+0x69 [0x2b662cc99ed9]
  [13] TclEvalObjvInternal+0xf8 [0x2b662cc9aeb8]
  [14] Tcl_EvalEx+0x166 [0x2b662cc9b366]
  [15] Tcl_FSEvalFile+0xec [0x2b662ccd85ec]
  [16] Tcl_EvalFile+0x27 [0x2b662ccd9792]
  [17] _ZN9ScriptTcl3runEPc+0x14 [0x7ceff4]
  [18] _Z18after_backend_initiPPc+0x223 [0x4d4133]
  [19] main+0x24 [0x4d4214]
  [20] __libc_start_main+0xf4 [0x2b662d6df184]
  [21] __gxx_personality_v0+0x139 [0x4d0b69]
[0] [MPI Abort by user] Aborting Program!
Abort signaled by rank 0: MPI Abort by user Aborting program !
Exit code -3 signaled from d544
Killing remote processes...MPI process terminated unexpectedly
DONE
Signal 15 received.

I contacted the engineers at the supercomputer center. They found that each time this input/output error occurred, there was a temporary Lustre terminal connection break followed by a reconnect. Such events are observed quite often in the communication between the compute nodes and the OSS (object storage server) nodes.
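
To make the failure mode concrete, here is a minimal sketch in Python of the kind of handling I have in mind: retrying a file write when the operating system reports a transient input/output error (EIO). This is only an illustration and is not NAMD code. The function name write_with_retry and its parameters retries and delay are hypothetical, and I am assuming that the error is transient and that a later attempt can succeed once the file system has reconnected.

```python
import errno
import os
import tempfile
import time


def write_with_retry(path, data, retries=5, delay=1.0):
    """Write bytes to path, retrying on EIO. Returns the attempt that succeeded.

    Hypothetical helper for illustration; not part of NAMD.
    """
    directory = os.path.dirname(os.path.abspath(path))
    for attempt in range(1, retries + 1):
        tmp_path = None
        try:
            # Write to a temporary file first so that a failed attempt
            # never leaves a truncated restart file behind.
            fd, tmp_path = tempfile.mkstemp(dir=directory)
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
                handle.flush()
                os.fsync(handle.fileno())
            # Atomically replace the previous file with the new one.
            os.replace(tmp_path, path)
            return attempt
        except OSError as exc:
            # Clean up the partial temporary file, if any.
            if tmp_path is not None and os.path.exists(tmp_path):
                try:
                    os.remove(tmp_path)
                except OSError:
                    pass
            # Only EIO is treated as transient; give up after the last attempt.
            if exc.errno != errno.EIO or attempt == retries:
                raise
            # Wait longer after each failure (assumed reconnect time).
            time.sleep(delay * attempt)
```

Writing to a temporary file and renaming it means the previous restart file stays intact if every attempt fails, so a run could still be restarted from the last good checkpoint.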

Do you have any suggestions for dealing with this problem?
Cheers.
Shen.
