From: Lei Shi (les2007_at_med.cornell.edu)
Date: Mon Apr 20 2009 - 15:04:00 CDT
Hi, Jeff
Do we need to change the "module load mvapich" line in the submission
script? There does not seem to be an "mvapich-old" module available.
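A minimal sketch of how this could be checked, assuming the older build is
published under a name close to the one Jeff mentions (the exact module name
is a guess here; confirm it with "module avail" on Ranger):

# List the MVAPICH-related modules actually installed:
module avail mvapich 2>&1 | grep -i mvapich
# If an older build is listed, the submission script's "module load mvapich"
# line would be replaced with the name reported there, e.g.:
# module load mvapich-old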
Thanks.
Lei
2009/4/17 Jeff Wereszczynski <jmweresz_at_umich.edu>:
> Hi all,
> Just to follow up on this: I believe I have solved the problem. On a
> recommendation from the people at TACC, I recompiled NAMD with the
> 'mvapich-old' module (instead of 'mvapich'), and it now appears to work.
> Jeff
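> A rough sketch of the environment this describes, with module names taken
> from the messages above (verify them locally before rebuilding):
>
> # Build environment for recompiling charm++/NAMD against the older MVAPICH:
> module unload mvapich2 mvapich
> module swap pgi intel
> module load mvapich-old
> # Rebuild charm++ and then NAMD inside this shell (the usual ./build and
> # ./config steps described in the NAMD release notes) so that the new
> # namd2 binary links against the mvapich-old MPI libraries.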
>
> On Fri, Apr 17, 2009 at 12:12 AM, Haosheng Cui <haosheng_at_hec.utah.edu>
> wrote:
>>
>> Hello all:
>>
>> I do have the same problem. The job usually dies after about 1000 steps. If I
>> restart the job several (3-5) times, it may get through once and run
>> successfully for 24 hours. It seems to happen only with big systems (mine is
>> ~800k atoms). The problem has occurred since the beginning of 2009. I tried
>> the same job on Kraken, and there it works fine most of the time. I have
>> already asked TACC for help, but that has not helped so far.
>>
>> Thanks,
>> Haosheng
>>
>>
>> Quoting Jeff Wereszczynski <jmweresz_at_umich.edu>:
>>
>>> Hi All,
>>> I have a system of ~490k atoms that I am trying to run on Ranger; however,
>>> it runs for only 500-2000 steps before dying. Nothing of interest is printed
>>> in the log file:
>>>
>>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 2000
>>> Signal 15 received.
>>> Signal 15 received.
>>> Signal 15 received.
>>> TACC: MPI job exited with code: 1
>>> TACC: Shutting down parallel environment.
>>> TACC: Shutdown complete. Exiting.
>>>
>>>
>>> Whereas in the job output/error file I get this:
>>>
>>> TACC: Done.
>>> 193 - MPI_IPROBE : Communicator argument is not a valid communicator
>>> Special bit pattern b5000000 in communicator is incorrect. May indicate an out-of-order argument or a freed communicator
>>> [193] [] Aborting Program!
>>> Exit code -3 signaled from i182-206.ranger.tacc.utexas.edu
>>> Killing remote processes...Abort signaled by rank 193: Aborting program !
>>> MPI process terminated unexpectedly
>>> DONE
>>>
>>> Here is my job script:
>>>
>>> #!/bin/bash
>>> #$ -V # Inherit the submission environment
>>> #$ -N namd # Job Name
>>> #$ -j y # combine stderr & stdout into stdout
>>> #$ -o namd # Name of the output file (eg. myMPI.oJobID)
>>> #$ -pe 16way 256 # Requests 16 tasks/node, 256 cores total
>>> #$ -q normal # Queue name
>>> #$ -l h_rt=2:00:00 # Run time (hh:mm:ss) - 2 hours
>>>
>>> module unload mvapich2
>>> module unload mvapich
>>> module swap pgi intel
>>> module load mvapich
>>>
>>> export VIADEV_SMP_EAGERSIZE=64
>>> export VIADEV_SMPI_LENGTH_QUEUE=256
>>> export VIADEV_ADAPTIVE_RDMA_LIMIT=0
>>> export VIADEV_RENDEZVOUS_THRESHOLD=50000
>>>
>>> ibrun tacc_affinity \
>>> /share/home/00288/tg455591/NAMD_2.7b1_Linux-x86_64/namd2 \
>>> namd.inp > namd.log
>>>
>>> Any ideas what I might be doing wrong? I would guess from the error message
>>> that it's some sort of MPI problem. I've tried varying the number of
>>> processors (from 64 to 1104), editing out the "export ..." lines that control
>>> MPI parameters, and taking out the tacc_affinity part, but nothing seems to
>>> help. I've never had these problems with smaller systems. Has anyone else had
>>> these sorts of issues? Any suggestions on how to fix them?
>>>
>>> Thanks,
>>> Jeff
>>>
>>
>>
>
>
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:52:39 CST