From: Götz, Alexander (alexander.goetz_at_mytum.de)
Date: Sat Oct 22 2016 - 10:19:42 CDT
It looks like this simple thing has solved my problem. The jobs are now running properly again. In the future I will just "ignore" the firsttimestep option by always setting it to 0. Incrementing this number is more or less just an old habit, because we had some old codes that used it for checking and output numbering when reading trajectories. However, these codes were retired some time ago. Because we have strict numbering in our file names, there shouldn't be any problem identifying the right order.
As a quick solution, maybe a warning in the stdout output or an explicit error message to stderr would be nice?
Thanks for your help!
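The kind of warning suggested above could look roughly like the following. This is only a Python sketch of the idea, not NAMD code: the function name check_step_range and its parameters are hypothetical, and the limit assumes that the step counter is held in a signed 32-bit integer, as suspected later in this thread.

```python
import sys

INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit value


def check_step_range(first_timestep: int, run_steps: int) -> bool:
    """Return True if the final step still fits in a signed 32-bit integer.

    Hypothetical pre-run check; prints a warning to stderr otherwise.
    """
    last_step = first_timestep + run_steps
    if last_step > INT32_MAX:
        print(
            f"WARNING: final step {last_step} exceeds {INT32_MAX}; "
            "consider restarting with firsttimestep 0",
            file=sys.stderr,
        )
        return False
    return True


# Values from the failing job: firsttimestep 2100000000, run 50000000.
check_step_range(2_100_000_000, 50_000_000)  # warns, returns False
check_step_range(0, 50_000_000)              # returns True
```

With firsttimestep reset to 0, the same 50000000-step run stays far below the limit.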
On 21 October 2016 at 22:20:34, Brian Radak (bradak_at_anl.gov) wrote:
As usual, I have missed something subtle - thanks Axel.
It sounds like the consensus is that the "firstTimestep" solution is the
best and fastest option.
Going forward, is this a problem that needs to be addressed? If we do not
switch to 64-bit ints, is there a more expected behavior that can
replace the current one? Should the step roll over to zero? Wouldn't this
have to be done in such a way that the output frequencies are still
respected? Should there just be a limit on the number of steps a user
is permitted to reach?
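For reference, the negative step number in the quoted log is consistent with a signed 32-bit wraparound. The following Python sketch reproduces the arithmetic; the helper wrap_int32 is hypothetical and only emulates two's-complement behavior, it is not taken from the NAMD source.

```python
def wrap_int32(value: int) -> int:
    """Interpret an integer as a two's-complement signed 32-bit value."""
    return (value + 2**31) % 2**32 - 2**31


first_timestep = 2_100_000_000  # step shown in the ENERGY line of the log
run_steps = 50_000_000          # "TCL: Running for 50000000 steps"

last_step = first_timestep + run_steps  # 2150000000, above 2147483647
wrapped = wrap_int32(last_step)

print(last_step)  # 2150000000 (what a 64-bit counter would hold)
print(wrapped)    # -2144967296 (the step reported in the log)
```

A 64-bit counter would hold 2150000000 without any problem, which is the alternative raised above.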
On 10/21/2016 10:54 AM, Axel Kohlmeyer wrote:
> On Fri, Oct 21, 2016 at 11:41 AM, Brian Radak <bradak_at_anl.gov> wrote:
>> I think you are hitting the limit of 32-bit signed integers somewhere
>> (2147483647). The code does not consistently use unsigned integers
>> where applicable, probably because the step isn't really used for
>> anything other than checking the output frequency.
> signed vs. unsigned only buys you a factor of 2. that is usually not
> helping much and you lose the ability to detect overflows.
> using unsigned integers has a lot of issues regardless. outside of
> counts that have to be able to count bytes for the full address space
> range (e.g. size_t), it is generally better to avoid unsigned integers
> and switch to explicit 64-bit integers instead. generally, blindly
> converting signed integers to unsigned ones is solving the wrong
> problem (and creating new ones in the process).
>> It might not be satisfying, but you can probably solve this by using the
>> "firstTimestep" command to reset the count.
> in the case of the (regular) dcd file format (which is derived from
> fortran unformatted output with signed 32-bit integers) the latter is
> the reasonable option to follow.
>> On 10/21/2016 08:23 AM, Götz, Alexander wrote:
>> Hello everybody,
>> I am currently having some trouble with NAMD 2.10 and my simulation systems. The
>> systems (I have three nearly identical membrane systems) have all run for 2.1 µs
>> in 21 chunks of 100 ns (all-atom CHARMM36, 2 fs integration timestep).
>> Everything worked perfectly fine until step 22. Whenever I want to start
>> step 22 for any of my systems I get the following error in the NAMD output:
>> TCL: Running for 50000000 steps
>> ETITLE: TS BOND ANGLE DIHED IMPRP
>> ELECT VDW BOUNDARY MISC KINETIC
>> TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG
>> PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>> ENERGY: 2100000000 1866.2220 9802.6548 6896.3270
>> 78.1218 -103131.8273 3410.1396 0.0000 0.0000
>> 26331.7058 -54746.6563 301.5845 -81078.3620 -54571.7237
>> 301.5845 -263.9492 -261.2014 396913.4892 -263.9492
>> OPENING EXTENDED SYSTEM TRAJECTORY FILE
>> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP -2144967296
>> CLOSING EXTENDED SYSTEM TRAJECTORY FILE
>> WRITING COORDINATES TO OUTPUT FILE AT STEP -2144967296
>> COORDINATE DCD FILE <path removed by the author> WAS NOT CREATED
>> The last position output (seq=-2) takes 0.006 seconds, 1399.262 MB of memory
>> in use
>> WRITING VELOCITIES TO OUTPUT FILE AT STEP -2144967296
>> The last velocity output (seq=-2) takes 0.004 seconds, 1400.191 MB of memory
>> in use
>> WallClock: 4.380243 CPUTime: 4.380243 Memory: 1400.191406 MB
>> [Partition 0][Node 0] End of program
>> I am quite confused about this, because I changed nothing in my NAMD
>> configuration files except for the file numbering of the restart and output
>> files, and these are fine (checked by 3 different people). To me the
>> problem seems to be related to the generation of the DCD file. On the cluster
>> side, the file system (IBM GPFS) should be fine, because other
>> jobs with equivalent configurations are working and there has not been any
>> maintenance that could be related to the observed problems. In addition,
>> step 21 of one of the systems worked properly while step 22 of the other two
>> systems failed at the same time. It looks a little bit like 22 is a magic
>> number. Furthermore, the negative step number in the output, which is not in
>> line with the run steps, is also quite mysterious to me. I hope somebody has a
>> tip or a solution for me, because I have checked nearly everything that has
>> come to mind so far.
>> Best Regards
>> Alexander Götz, M.Sc.
>> Technische Universität München // Fakultät für Physik
>> Lehrstuhl für Bioelektronik E.14
>> Maximus-von-Imhof Forum 4 (room P059)
>> 85350 Freising, Germany
>> T: +49 8161 71-3540
>> Brian Radak
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>> 9700 South Cass Avenue, Bldg. 240
>> Argonne, IL 60439-4854
>> (630) 252-8643
This archive was generated by hypermail 2.1.6 : Tue Dec 27 2016 - 23:22:32 CST