Re: Performance on GPU

From: Anup Prasad (anup.prasad_at_monash.edu)
Date: Fri Nov 08 2019 - 05:35:37 CST

Thank you very much, Vermaas!
Based on your suggestion I used the modified submission script, and it is
working well. The simulation performance improved from 3 ns/day to
22.44 ns/day, which is nearly equal to the published NAMD benchmark. Thanks
for your kind support.

Cheers
Anup
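
For reference, the key change was keeping NAMD's +p thread count and aprun's
-d depth in sync with the cores requested from PBS. A minimal sketch (the
value 18 is taken from the ncpus=18 request in my select line; the NCPUS
variable is just for illustration):

```shell
# The multicore CUDA build parallelizes via threads on one node, so +p must
# match both the aprun depth (-d) and the ncpus requested from PBS.
NCPUS=18   # must agree with ncpus=18 in the "#PBS -l select=..." line
EXEC=/home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2
# Print the launch line; on the cluster this would be run via aprun directly.
echo "aprun -n 1 -N 1 -d ${NCPUS} ${EXEC} +p${NCPUS} +idlepoll apoa1_npt_cuda.namd"
```

Without +p, the binary falls back to a single thread (the "Running on 1
processors" line in the log), which explains the 3 ns/day result.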

On Wed, 6 Nov 2019 at 20:30, Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
wrote:

> OK! This is a multicore build, so your runscript should look something
> like this:
>
> time aprun -n 1 -N 1 -d 18
> /home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2 +p18
> +idlepoll apoa1_npt_cuda.namd > prod_gpu.log
>
> The parallelism in your binary isn't coming from MPI or multi-node
> parallelism; instead, it was compiled so that NAMD internally handles
> multiple threads across multiple processors on the same node. You can see
> this from:
>
> Info: Running on 1 processors, 1 nodes, 1 physical nodes.
>
> While you have 18 processors on the node, NAMD is being asked to only use
> 1, which means your performance won't be great.
>
> -Josh
>
>
>
> On 2019-11-05 22:40:44-07:00 Anup Prasad wrote:
>
> *The starting output in the log:-*
> Charm++: standalone mode (not using charmrun)
> Charm++> Running in Multicore mode: 1 threads
> Charm++> Using recursive bisection (scheme 3) for topology aware partitions
> Converse/Charm++ Commit ID:
> v6.7.1-0-gbdf6a1b-namd-charm-6.7.1-build-2016-Nov-07-136676
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 1 unique compute nodes (36-way SMP).
> Charm++> cpu topology info is gathered in 0.000 seconds.
> Info: Built with CUDA version 8000
> Did not find +devices i,j,k,... argument, using all
> Pe 0 physical rank 0 binding to CUDA device 0 on physical node 0: 'Tesla
> P100-PCIE-16GB' Mem: 16276MB Rev: 6.0
> Info: NAMD 2.12 for CRAY-XC-multicore-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60701 for multicore-linux64-gcc
> Info: Built Wed Aug 8 15:27:39 IST 2018 by crayadm on clogin72
> Info: Running on 1 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.523414 s
> CkLoopLib is used in SMP with a simple dynamic scheduling (converse-level
> notification) but not using node-level queue
> Info: 110.766 MB of memory in use based on /proc/self/stat
> Info: Configuration file is apoa1_npt_cuda.namd
> Info: Working in the current directory
> /home/PolymerSimulationLab/souravray/systems/test/gpu_1/apoa1/gpu
> TCL: Suspending until startup complete.
> Warning: ALWAYS USE NON-ZERO MARGIN WITH CONSTANT PRESSURE!
> Warning: CHANGING MARGIN FROM 0 to 0.48
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 2
> Info: NUMBER OF STEPS 10000
> Info: STEPS PER CYCLE 20
> Info: PERIODIC CELL BASIS 1 108.861 0 0
> Info: PERIODIC CELL BASIS 2 0 108.861 0
> Info: PERIODIC CELL BASIS 3 0 0 77.758
> Info: PERIODIC CELL CENTER 0 0 0
> Info: LOAD BALANCER None
> Info: MIN ATOMS PER PATCH 40
> Info: INITIAL TEMPERATURE 298
> Info: CENTER OF MASS MOVING INITIALLY? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 ELECTROSTATICS SCALED BY 1
> Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
> Info: NO DCD TRAJECTORY OUTPUT
> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
> Info: NO VELOCITY DCD OUTPUT
> Info: NO FORCE DCD OUTPUT
> Info: OUTPUT FILENAME apoa1-output
> Info: BINARY OUTPUT FILES WILL BE USED
> Info: NO RESTART FILE
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 10
> Info: SWITCHING OFF 12
> Info: PAIRLIST DISTANCE 13.5
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLISTS ENABLED
> Info: MARGIN 0.48
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 16.48
> Info: ENERGY OUTPUT STEPS 500
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 500
> Info: LANGEVIN DYNAMICS ACTIVE
> Info: LANGEVIN TEMPERATURE 298
> Info: LANGEVIN USING BBK INTEGRATOR
> Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
> Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
> Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
> Info: TARGET PRESSURE IS 1.01325 BAR
> Info: OSCILLATION PERIOD IS 100 FS
> Info: DECAY TIME IS 50 FS
> Info: PISTON TEMPERATURE IS 298 K
> Info: PRESSURE CONTROL IS GROUP-BASED
> Info: INITIAL STRAIN RATE IS 0 0 0
> Info: CELL FLUCTUATION IS ISOTROPIC
> Info: PARTICLE MESH EWALD (PME) ACTIVE
> Info: PME TOLERANCE 1e-06
> Info: PME EWALD COEFFICIENT 0.257952
> Info: PME INTERPOLATION ORDER 4
> Info: PME GRID DIMENSIONS 108 108 80
> Info: PME MAXIMUM GRID SPACING 1.5
> Info: Attempting to read FFTW data from
> FFTW_NAMD_2.12_CRAY-XC-multicore-CUDA.txt
> Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
> Info: Writing FFTW data to FFTW_NAMD_2.12_CRAY-XC-multicore-CUDA.txt
> Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 2
> Info: USING VERLET I (r-RESPA) MTS SCHEME.
> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
> Info: RIGID BONDS TO HYDROGEN : ALL
> Info: ERROR TOLERANCE : 1e-08
> Info: MAX ITERATIONS : 100
> Info: RIGID WATER USING SETTLE ALGORITHM
> Info: RANDOM NUMBER SEED 74269
> Info: USE HYDROGEN BONDS? NO
> Info: COORDINATE PDB apoa1.pdb
> Info: STRUCTURE FILE apoa1.psf
> Info: PARAMETER file: XPLOR format! (default)
> Info: PARAMETERS par_all22_prot_lipid.xplor
> Info: PARAMETERS par_all22_popc.xplor
> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
> Info: SUMMARY OF PARAMETERS:
> Info: 177 BONDS
> Info: 435 ANGLES
> Info: 446 DIHEDRAL
> Info: 45 IMPROPER
> Info: 0 CROSSTERM
> Info: 83 VDW
> Info: 6 VDW_PAIRS
> Info: 0 NBTHOLE_PAIRS
> Info: TIME FOR READING PSF FILE: 0.870474
> Info: Reading pdb file apoa1.pdb
> Info: TIME FOR READING PDB FILE: 0.209462
> Info:
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 92224 ATOMS
> Info: 70660 BONDS
> Info: 74136 ANGLES
> Info: 74130 DIHEDRALS
> Info: 1402 IMPROPERS
> Info: 0 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 1568 DIHEDRALS WITH MULTIPLE PERIODICITY (BASED ON PSF FILE)
> Info: 80690 RIGID BONDS
> Info: 195982 DEGREES OF FREEDOM
> Info: 32992 HYDROGEN GROUPS
> Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
> Info: 32992 MIGRATION GROUPS
> Info: 4 ATOMS IN LARGEST MIGRATION GROUP
> Info: TOTAL MASS = 553785 amu
> Info: TOTAL CHARGE = -14 e
> Info: MASS DENSITY = 0.997951 g/cm^3
> Info: ATOM DENSITY = 0.100081 atoms/A^3
> Info: *****************************
> Info:
> Info: Entering startup at 23.7553 s, 139.582 MB of memory in use
> Info: Startup phase 0 took 5.50747e-05 s, 139.629 MB of memory in use
> Info: ADDED 218698 IMPLICIT EXCLUSIONS
> Info: Startup phase 1 took 0.0340278 s, 161.539 MB of memory in use
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000325096 AT 11.9556
> Info: INCONSISTENCY IN SCOR TABLE ENERGY VS FORCE: 0.000324844 AT 11.9556
> Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
> Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00150189 AT 0.251946
> Info: Startup phase 2 took 0.000448227 s, 161.648 MB of memory in use
> Info: Startup phase 3 took 4.57764e-05 s, 161.711 MB of memory in use
> Info: Startup phase 4 took 9.41753e-05 s, 161.711 MB of memory in use
> Info: Startup phase 5 took 4.19617e-05 s, 161.711 MB of memory in use
> Info: PATCH GRID IS 6 (PERIODIC) BY 6 (PERIODIC) BY 4 (PERIODIC)
> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
> Info: REMOVING COM VELOCITY 0.00117565 0.0288209 0.0202255
> Info: LARGEST PATCH (56) HAS 718 ATOMS
> Info: TORUS A SIZE 1 USING 0
> Info: TORUS B SIZE 1 USING 0
> Info: TORUS C SIZE 1 USING 0
> Info: TORUS MINIMAL MESH SIZE IS 1 BY 1 BY 1
> Info: Placed 100% of base nodes on same physical node as patch
> Info: Startup phase 6 took 0.0221109 s, 177.691 MB of memory in use
> Info: PME using 1 x 1 x 1 pencil grid for FFT and reciprocal sum.
> Info: Startup phase 7 took 0.000143051 s, 177.98 MB of memory in use
> Info: Startup phase 8 took 0.00597692 s, 179.91 MB of memory in use
> Info: Startup phase 9 took 0.356868 s, 361.457 MB of memory in use
> Info: CREATING 3031 COMPUTE OBJECTS
> Info: Updated CUDA force table with 4096 elements.
> Info: Updated CUDA LJ table with 83 x 83 elements.
> Info: Found 318 unique exclusion lists needing 1060 bytes
> Info: Startup phase 10 took 0.023706 s, 368.844 MB of memory in use
> Info: Startup phase 11 took 5.6982e-05 s, 368.906 MB of memory in use
> Info: Startup phase 12 took 0.000685215 s, 369.418 MB of memory in use
> Info: Finished startup at 24.1996 s, 369.539 MB of memory in use
>
> On Tue, 5 Nov 2019 at 23:03, Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
> wrote:
>
>> What is the output in the log? Usually when I see weird performance, it's
>> because NAMD didn't detect the hardware the way you expected it to. The
>> startup information at the top of the log will report how many processors
>> are being used and how the GPUs are being assigned.
>>
>> -Josh
>>
>>
>>
>> On 2019-11-05 06:36:29-07:00 owner-namd-l_at_ks.uiuc.edu wrote:
>>
>> Dear NAMD community,
>> I am using the NAMD platform for my MD simulations. I want to use the GPU
>> nodes on the HPC facility here (CRAY XE) to run my simulations, for which I
>> am trying to run the "apoa1" benchmark. I compared the simulation
>> performance on my HPC facility with the published NAMD benchmark results,
>> but got very poor performance. Based on the NAMD benchmarks for apoa1, I
>> should be getting nearly 30 ns/day on the hardware we have here.
>> However, I am getting only around 3 ns/day for the same system. I am
>> using the NAMD config files provided in the benchmark link below.
>> NAMD benchmark link: https://www.ks.uiuc.edu/Research/namd/benchmarks/
>> These are the specifications for the GPU nodes at my institute:
>>
>> HPC specifications
>> Operating System -- Cray Linux Environment Version 6.x
>> Cray Programming Environment (CPE) -- Unlimited
>> Intel Parallel Studio XE -- 5 Seats
>> PGI Accelerator -- 2 Seats
>> Workload Manager -- PBS Pro
>>
>> Compute Node - CPU+GPU Node
>> Processor -- 1X BDW 2.1 GHz 18C
>> Accelerator -- 1X P100 16 GiB
>> Memory Per Node -- 64 GB DDR4-2400 with Chipkill technology
>>
>>
>> This is the shell script I use to submit jobs:
>>
>> ##############################################################################
>> ## submitting shell script
>> ##############################################################################
>> ## Queue it will run in
>> #PBS -N gpu
>> #PBS -q gpuq
>> #PBS -l select=1:ncpus=18:accelerator=True:vntype=cray_compute
>> #PBS -l walltime=00:30:00
>> #PBS -l place=pack
>> #PBS -j oe
>>
>> module load craype-broadwell
>> module load craype-accel-nvidia60
>> module load namd/2.12/gpu-8.0
>>
>> EXEC=/home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2
>> cd $PBS_O_WORKDIR
>> time aprun -n 1 -N 1 -d 18 /home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2 +idlepoll apoa1_npt_cuda.namd > prod_gpu.log
>> ##############################################################################
>>
>>
>> Please help with suggestions.
>>
>> Kind regards
>>
>> Anup Kumar Prasad
>> Ph.D. scholar, IITB-Monash Research Academy
>> Indian Institute of Technology Bombay, INDIA
>>

This archive was generated by hypermail 2.1.6 : Sat Dec 07 2019 - 23:20:53 CST