Re: bpti example compiled source charmrun++ does not launch

From: Renfro, Michael (Renfro_at_tntech.edu)
Date: Thu Nov 01 2018 - 10:14:24 CDT

I haven’t built my own NAMD; I’ve only used the prebuilt ibverbs, CUDA, or multicore binaries from UIUC.

I suspect your run line might be incorrect, or maybe your situation is just different from ours.

For runs of charmrun and NAMD, I ended up writing helper shell functions at [1] to simplify things. If I’m reading them correctly, a multi-node run should end up running:

  charmrun `which namd2` ++p ${SLURM_NTASKS} ++ppn ${SLURM_CPUS_ON_NODE} ++nodelist nodelist.${SLURM_JOBID} \
              ${INPUT} >& ${OUTPUT}

with a nodelist file with contents of:

  host HOSTNAME.DOMAIN ++cpus SLURM_CPUS_ON_NODE
  …

This assumes charmrun and namd2 are in the user’s PATH. SLURM_NTASKS is the total number of cores requested, and SLURM_CPUS_ON_NODE is the number of cores allocated per node. I run my test cases with a fixed number of cores reserved per node, multiplied across as many nodes as needed.
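
As a rough illustration of how a nodelist file in that format could be generated inside a Slurm job, here is a minimal sketch. The function name make_nodelist is hypothetical (it is not taken from the helper functions at [1]), and it assumes every node in the job gets the same number of cores. It reads hostnames on standard input, one per line, and writes one "host ... ++cpus N" line for each.

```shell
# make_nodelist: hypothetical helper, not part of NAMD, Charm++, or Slurm.
# Reads hostnames (one per line) on stdin and prints one charmrun nodelist
# line per host, assuming the same core count on every node.
make_nodelist() {
    cpus_per_node="$1"
    while IFS= read -r host; do
        # Skip blank lines so they do not produce empty host entries.
        if [ -n "$host" ]; then
            printf 'host %s ++cpus %s\n' "$host" "$cpus_per_node"
        fi
    done
    return 0
}

# Possible use inside a Slurm batch script (assumes scontrol is available):
#   scontrol show hostnames "${SLURM_JOB_NODELIST}" \
#       | make_nodelist "${SLURM_CPUS_ON_NODE}" > "nodelist.${SLURM_JOBID}"
```

If your charmrun setup needs a "group main" header line, as in the nodelist quoted below, it would have to be written to the file before these host lines.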

[1] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+NAMD

> On Nov 1, 2018, at 9:31 AM, Hazard, E. Starr <hazards_at_musc.edu> wrote:
>
> RHEL v6 LSF manager
>
> I compiled NAMD/charm
>
> ~/COMPILE3/NAMD_Git-2018-09-21_Source/charm-6.8.2
>
> here's my smart-build log
> cat ~/COMPILE3/NAMD_Git-2018-09-21_Source/charm-6.8.2/smart-build.log
> Fri Sep 21 12:40:12 EDT 2018
> Using the following build command:
> ./build charm++ mpi-linux-x86_64 -j4 -g -O0
>
> Fri Sep 21 12:47:14 EDT 2018
> Using the following build command:
> ./build charm++ mpi-linux-x86_64 smp -j4 -g -O0
>
> Fri Sep 21 12:48:28 EDT 2018
> Using the following build command:
> ./build charm++ mpi-linux-x86_64 -j4 -g -O0
>
> Fri Sep 21 12:50:01 EDT 2018
> Using the following build command:
> ./build charm++ netlrts-linux-x86_64 gcc gfortran -j4 -g -O0
>
> Wed Oct 3 17:00:28 EDT 2018
> Using the following build command:
> ./build LIBS netlrts-linux-x86_64 gcc -j4 --with-production
>
>
> my LSF file
> #!/bin/bash
> #BSUB -J NAMD2018
> #BSUB -o NAMD2018_OUT%J
> #BSUB -e NAMDERR.e%J
> #BSUB -n 80
> #BSUB -u hazards_at_musc.edu
> export PWD=/home/hazards/NAMD/:$PWD
> export PATH=/home/hazards/NAMD/toppar:$PATH
> /shared/app/NAMD_Git-2018-09-21_Source/charmrun +p80 ++verbose ++remote-shell ssh ++nodelist /home/hazards/NAMD/nodelist \
> /shared/app/NAMD_Git-2018-09-21_Source/namd2 +isomalloc_sync /home/hazards/NAMD/bpti.namd > \
> /home/hazards/NAMD/BPTI-namdcompilecharm_allnodes80.out
>
> The LSF file captures this
>
> cat NAMDERR.e9041
> Charmrun> charmrun started...
> Charmrun> using /home/hazards/NAMD/nodelist as nodesfile
> Charmrun> remote shell (10.200.1.3:0) started
> Charmrun> remote shell (10.200.1.5:7) started
> Charmrun> remote shell (10.200.1.6:14) started
> Charmrun> remote shell (10.200.1.7:21) started
> Charmrun> remote shell (10.200.1.8:28) started
> Charmrun> remote shell (10.200.1.9:35) started
> Charmrun> remote shell (10.200.1.10:42) started
> Charmrun> remote shell (10.200.1.12:49) started
> Charmrun> remote shell (10.200.1.13:56) started
> Charmrun> remote shell (10.200.1.15:62) started
> Charmrun> remote shell (10.200.1.16:68) started
> Charmrun> remote shell (10.200.1.17:74) started
> Charmrun> node programs all started
> Charmrun> error attaching to node '10.200.1.3':
> Timeout waiting for node-program to connect
>
>
> The NAMD output looks like this
> more BPTI-namdcompilecharm_allnodes80.out
> Charmrun remote shell(10.200.1.13.56)> remote responding...
> Charmrun remote shell(10.200.1.13.56)> starting node-program...
> Charmrun remote shell(10.200.1.13.56)> remote shell phase successful.
> Charmrun remote shell(10.200.1.17.74)> remote responding...
> Charmrun remote shell(10.200.1.16.68)> remote responding...
> Charmrun remote shell(10.200.1.6.14)> remote responding...
> Charmrun remote shell(10.200.1.17.74)> starting node-program...
> Charmrun remote shell(10.200.1.17.74)> remote shell phase successful.
> Charmrun remote shell(10.200.1.6.14)> starting node-program...
> Charmrun remote shell(10.200.1.6.14)> remote shell phase successful.
> Charmrun remote shell(10.200.1.7.21)> remote responding...
> ..
> Charmrun remote shell(10.200.1.15.62)> remote responding...
> Charmrun remote shell(10.200.1.5.7)> starting node-program...
> Charmrun remote shell(10.200.1.5.7)> remote shell phase successful.
> Charmrun remote shell(10.200.1.12.49)> remote responding...
> ...
> Charmrun remote shell(10.200.1.7.21)> starting node-program...
> Charmrun remote shell(10.200.1.9.35)> starting node-program...
> Charmrun remote shell(10.200.1.9.35)> remote shell phase successful.
> Charmrun remote shell(10.200.1.12.49)> starting node-program...
> Charmrun remote shell(10.200.1.12.49)> remote shell phase successful.
> Charmrun remote shell(10.200.1.3.0)> starting node-program...
> Charmrun remote shell(10.200.1.3.0)> remote shell phase successful.
> Charmrun> scalable start enabled.
> Charmrun> adding client 0: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 1: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 2: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 3: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 4: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 5: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 6: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 7: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 8: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 9: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 10: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 11: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 12: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 13: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 14: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 15: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 16: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 17: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 18: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 19: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 20: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 21: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 22: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 23: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 24: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 25: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 26: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 27: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 28: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 29: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 30: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 31: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 32: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 33: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 34: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 35: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 36: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 37: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 38: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 39: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 40: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 41: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 42: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 43: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 44: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 45: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 46: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 47: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 48: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 49: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 50: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 51: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 52: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 53: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 54: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 55: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 56: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 57: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 58: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 59: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 60: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 61: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 62: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 63: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 64: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 65: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 66: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 67: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 68: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 69: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 70: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 71: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 72: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 73: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 74: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 75: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 76: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 77: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 78: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 79: "10.200.1.17", IP:10.200.1.17
> Charmrun> Charmrun = 10.200.1.13, port = 60873
> start_nodes_ssh
> Charmrun> Sending "0 10.200.1.13 60873 24622 0" to client 0.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 0.
> Charmrun> Starting ssh 10.200.1.3 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "7 10.200.1.13 60873 24622 0" to client 7.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 7.
> Charmrun> Starting ssh 10.200.1.5 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "14 10.200.1.13 60873 24622 0" to client 14.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 14.
> Charmrun> Starting ssh 10.200.1.6 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "21 10.200.1.13 60873 24622 0" to client 21.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 21.
> Charmrun> Starting ssh 10.200.1.7 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "28 10.200.1.13 60873 24622 0" to client 28.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 28.
> Charmrun> Starting ssh 10.200.1.8 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "35 10.200.1.13 60873 24622 0" to client 35.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 35.
> Charmrun> Starting ssh 10.200.1.9 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "42 10.200.1.13 60873 24622 0" to client 42.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 42.
> Charmrun> Starting ssh 10.200.1.10 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "49 10.200.1.13 60873 24622 0" to client 49.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 49.
> Charmrun> Starting ssh 10.200.1.12 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "56 10.200.1.13 60873 24622 0" to client 56.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 56.
> Charmrun> Starting ssh 10.200.1.13 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "62 10.200.1.13 60873 24622 0" to client 62.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 62.
> Charmrun> Starting ssh 10.200.1.15 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "68 10.200.1.13 60873 24622 0" to client 68.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 68.
> Charmrun> Starting ssh 10.200.1.16 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "74 10.200.1.13 60873 24622 0" to client 74.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 74.
> Charmrun> Starting ssh 10.200.1.17 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Waiting for 0-th client to connect.
>
> here are my nodelist files. I have tried both
>
> cat nodelist
> group main ++shell ssh
> host 10.200.1.3
> host 10.200.1.5
> host 10.200.1.6
> host 10.200.1.7
> host 10.200.1.8
> host 10.200.1.9
> host 10.200.1.10
> host 10.200.1.12
> host 10.200.1.13
> host 10.200.1.15
> host 10.200.1.16
> host 10.200.1.17
> hpcc3:/home/hazards/NAMD: cat nodelist.Oct31
> group main ++shell ssh
> host compute000
> host compute002
> host compute003
> host compute004
> host compute005
> host compute006
> host compute007
> host compute009
> host compute010
> host compute012
> host compute013
> host compute013
> host compute014
>
>
>
> I have tried to understand the advice given here https://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2013-2014/0538.html
> I can ping my hostname from any and all nodes.
>
> I need some help. Thanks in advance
>
> Starr

This archive was generated by hypermail 2.1.6 : Sun Sep 22 2019 - 23:19:59 CDT