Re: Re: Bug with FEP?

From: Brian Radak (brian.radak_at_gmail.com)
Date: Fri Apr 13 2018 - 10:03:17 CDT

Something tells me this is a filesystem error, although it might be related
to NAMD behavior.

How frequently are you writing to disk? It looks like you are writing a
restart file every 500 steps, which is incredibly frequent and stresses
both NAMD performance and the disk. Even writing energies to output that
frequently can measurably slow down a simulation.
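
As a rough sketch of what I mean (the values below are illustrative
assumptions, not taken from your frwd-09.namd and not tuned for your
system), the relevant output keywords in the NAMD configuration file
could look something like this:

```tcl
# Illustrative output frequencies only. These numbers are assumptions
# chosen for the example, not recommendations for this system.
outputEnergies   5000    ;# print energies to the log every 5000 steps
dcdfreq          5000    ;# write a trajectory frame every 5000 steps
restartfreq      50000   ;# write restart files every 50000 steps
```

If each window really is 500000 steps long, as the log excerpt seems
to suggest, a restartfreq of 500 means about 1000 restart writes per
window, whereas 50000 would mean about 10. Fewer writes also means
fewer chances to hit whatever the filesystem is doing wrong.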

On Fri, Apr 13, 2018 at 3:49 AM, Francesco Pietra <chiendarret_at_gmail.com>
wrote:

> Today, while some other FEPs finished regularly, another FEP (frwd-10) of
> the group illustrated above crashed at window 32
> (i.e., 8 windows before completion) with the same issue
>
> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 500000
>> TCL: Running FEP window 32: Lambda1 0.9774999999999984 Lambda2
>> 0.9799999999999983 [dLambda 0.0024999999999999467]
>> TCL: Setting parameter firsttimestep to 0
>> TCL: Setting parameter alchLambda to 0.9774999999999984
>> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
>> .......................
>> .......................
>> EP: 73500 130.5390 -9397.2284 901.6677
>>
>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 73500
>> colvars: Synchronizing (emptying the buffer of) trajectory file
>> "frwd-09_0.colvars.traj".
>> colvars: Writing the state file "frwd-09.colvars.state".
>> WRITING COORDINATES TO DCD FILE frwd-09.dcd AT STEP 74000
>> WRITING COORDINATES TO RESTART FILE AT STEP 74000
>> FATAL ERROR: Unable to open binary file frwd-09.coor: File exists
>>
>
> I am also forwarding this message to the cluster, since either there is a
> bug in NAMD FEP or there are defective nodes (it was job ID 695703, about
> which "scontrol show 695703" answers "invalid job ID"). Meanwhile I am
> deleting all generated frwd-10 files and restarting.
>
> francesco
>
>
>
>
> On Thu, Apr 12, 2018 at 10:28 PM, Francesco Pietra <chiendarret_at_gmail.com>
> wrote:
>
>> Hello
>> I am running an FEP calculation with NAMD 2.12 for the unbound ligand in
>> water (as a preliminary to the ligand-protein complex), using 10 segments
>> frwd and 10 back, for a total of 400 windows frwd and the same number back.
>>
>> Segment back-02 has already completed. Of the others still running,
>> frwd-09 crashed at step 54,000 of its first window:
>>
>> colvars: The restart output state file will be "frwd-09.colvars.state".
>>> colvars: The final output state file will be "frwd-09_0.colvars.state".
>>> FEP: RESETTING FOR NEW FEP WINDOW LAMBDA SET TO 0.8 LAMBDA2 0.8025
>>> FEP: WINDOW TO HAVE 100000 STEPS OF EQUILIBRATION PRIOR TO FEP DATA
>>> COLLECTION.
>>> FEP: USING CONSTANT TEMPERATURE OF 300 K FOR FEP CALCULATION
>>> PRESSURE: 0 -368.104 569.455 -539.08 569.454 -304.251 -1415.15 -539.08
>>> -1415.15 181.994
>>> GPRESSURE: 0 -260.121 387.556 -718.786 295.641 -123.294 -1219.32
>>> -365.275 -1092.12 269.401
>>> ETITLE: TS BOND ANGLE DIHED
>>> IMPRP :
>>> ...................................
>>> .....................................
>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 53500
>>> WRITING COORDINATES TO DCD FILE frwd-09.dcd AT STEP 53500
>>> WRITING COORDINATES TO RESTART FILE AT STEP 53500
>>> FINISHED WRITING RESTART COORDINATES
>>> WRITING VELOCITIES TO RESTART FILE AT STEP 53500
>>> FINISHED WRITING RESTART VELOCITIES
>>> colvars: Synchronizing (emptying the buffer of) trajectory file
>>> "frwd-09_0.colvars.traj".
>>> colvars: Writing the state file "frwd-09.colvars.state".
>>> WRITING COORDINATES TO DCD FILE frwd-09.dcd AT STEP 54000
>>> WRITING COORDINATES TO RESTART FILE AT STEP 54000
>>> FATAL ERROR: Unable to open binary file frwd-09.coor: File exists
>>> [0] Stack Traceback:
>>>
>>
>>
>> I had never encountered such a problem and wonder whether this stems from
>> the code or the cluster (each FEP on one node, 36 core, NextScale).
>> frwd-09.coor (which is bincoor) has normal size (did not try with
>> psf/vmd), while, curiously, frwd-09.dcd and frwd-dcd.BAK were generated
>> alongside back-09.fepout and back-09.fepout.BAK, as if the dcd and fepout
>> files had been present initially (but they were not). Also, no anomaly can
>> be seen in frwd-09.namd and frwd-09.job.
>>
>> In any event, I am deleting all generated frwd-09 files and restarting
>> from scratch in the hope that it was a non-systematic error.
>>
>> thanks for advice
>> francesco pietra
>>
>
>

This archive was generated by hypermail 2.1.6 : Thu Dec 05 2019 - 23:19:45 CST