From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Dec 21 2009 - 02:12:08 CST
On Mon, Dec 21, 2009 at 12:03 AM, Jyh-Shyong <c00jsh00_at_nchc.org.tw> wrote:
> Hi,
>
> I tried NAMD 2.7b2 on our GPU cluster, but so far I have not been
> successful. Any hints or suggestions are appreciated.
>
> 1. I downloaded the binary NAMD_2.7b2_Linux-x86_64-ibverbs-CUDA and ran
> a test case with command
>
> ./charmrun ++local ++p 4 namd2 +idlepoll ./alanin.namd
> Charmrun> IBVERBS version of charmrun
> Charmrun: Bad initnode data length. Aborting
have you checked the permissions on the infiniband device?
also, it is totally useless to use infiniband communication
on just a single node.
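
a quick way to test the permissions is a small script like the one below. this is only a sketch: the device paths are assumptions (typical names on a linux machine with the infiniband userspace drivers loaded) and may differ on your system, and check_device_access is a made-up helper name, not part of NAMD or charm++.

```python
import os


def check_device_access(paths):
    """Return {path: status}; status is 'ok', 'missing' or 'no read/write access'."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif os.access(path, os.R_OK | os.W_OK):
            # the current user may open the device for reading and writing
            report[path] = "ok"
        else:
            report[path] = "no read/write access"
    return report


if __name__ == "__main__":
    # assumed device paths; adjust to what /dev/infiniband actually contains
    devices = ["/dev/infiniband/uverbs0", "/dev/infiniband/rdma_cm"]
    for path, status in check_device_access(devices).items():
        print(f"{path}: {status}")
```

run it as the same user that launches charmrun; root will usually see "ok" regardless of the file mode.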
> 2. I tried again with command
>
> ./charmrun ++nodelist ./hostlist ++p 4 namd2 +idlepoll ./alanin.namd
>
> The file hostlist contains two lines:
>
> group main
> host gc16
>
> gc16 is the hostname of the computer I was using. Here is the output of
> this command:
>
>
> ..
> Info:
> Info: Entering startup at 0.376303 s, 104.066 MB of memory in use
> Info: Startup phase 0 took 0.00472808 s, 104.066 MB of memory in use
> Info: Startup phase 1 took 0.00161982 s, 104.066 MB of memory in use
> Info: Startup phase 2 took 0.000169039 s, 104.066 MB of memory in use
> FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.
here is the hint that NAMD gives you. obviously you are using a tiny input
example that is too small for a reasonable domain decomposition: it yields
no more patches than the 4 processes you requested with ++p 4.
due to the way GPUs work, there is no speed gain for small
domains (patches in NAMD-speak). if you don't have on the order
of 10000 atoms per domain, the GPU will not be fully occupied.
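
the two conditions above can be written down as a small sanity check. this is a rough sketch, not anything NAMD does internally: check_decomposition is a made-up helper, the 10000 atoms per patch threshold is just the rule of thumb from this mail, and the numbers in the example are invented for illustration.

```python
def check_decomposition(n_atoms, n_patches, n_procs, min_atoms_per_patch=10000):
    """Return a list of problems; an empty list means the setup looks plausible."""
    if n_patches < 1 or n_procs < 1:
        raise ValueError("need at least one patch and one process")

    problems = []

    # the CUDA build aborts unless there are more patches than processes
    if n_patches <= n_procs:
        problems.append(
            f"{n_patches} patch(es) for {n_procs} process(es): "
            "need more patches than processes"
        )

    # rule of thumb: small patches leave the GPU mostly idle
    atoms_per_patch = n_atoms / n_patches
    if atoms_per_patch < min_atoms_per_patch:
        problems.append(
            f"about {atoms_per_patch:.0f} atoms per patch, "
            f"below the rough threshold of {min_atoms_per_patch}"
        )

    return problems


if __name__ == "__main__":
    # invented numbers: a tiny test system on 4 processes
    for problem in check_decomposition(n_atoms=66, n_patches=1, n_procs=4):
        print("warning:", problem)
```

for a tiny test input both checks fail, which is why reducing the process count alone would get rid of the abort but still not give any useful speedup.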
[...]
> There are 4 Tesla C1070s on this node:
> chem_at_gc16:/work/chem/alanin> ls -l /dev/nvi*
> crw-rw-rw- 1 root video 195, 0 2009-10-13 10:56 /dev/nvidia0
> crw-rw-rw- 1 root video 195, 1 2009-10-13 10:56 /dev/nvidia1
> crw-rw-rw- 1 root video 195, 2 2009-10-13 10:56 /dev/nvidia2
> crw-rw-rw- 1 root video 195, 3 2009-10-13 10:56 /dev/nvidia3
> crw-rw-rw- 1 root video 195, 255 2009-10-13 10:56 /dev/nvidiactl
>
> I wonder whether something in my environment settings might be wrong, but I
> don't know what it is.
> I also downloaded the latest version of source code and built the binary
> with ibverbs option
> for charm, and I got the same result.
no surprise there.
cheers,
axel.
> Regards
>
> Jyh-Shyong Ho, Ph.D.
> Research Scientist
> National Center for High Performance Computing
> Hsinchu, Taiwan, ROC
--
Dr. Axel Kohlmeyer    akohlmey_at_gmail.com
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.