Re: Replica exchange simulation with GPU Acceleration

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Fri Jan 26 2018 - 12:32:55 CST

The two multiple-walker schemes use different code. I wrote the one for
metadynamics a few years back before NAMD had multiple-copy capability,
using the file system. Jeff Comer and others at UIUC wrote the one for
ABF, using the network: for this reason, its use is subject to the
constraints of Charm++, where the simultaneous use of MPI and CUDA has so
far been difficult.

The network-based solution should be more scalable in large HPC clusters,
but for a small commodity cluster of single-node replicas the file-based
one should be OK.
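
As a rough illustration of the file-based multiple-walker metadynamics
scheme, the sketch below writes one Colvars "metadynamics" fragment per
walker, all pointing at the same registry file on a shared file system.
The colvar name "myCV", the registry path, the output file names and the
numeric values are assumptions made for this example; each fragment would
be appended to a Colvars configuration that already defines the colvar.

```shell
#!/bin/sh
# Sketch: one Colvars metadynamics fragment per walker. The walkers exchange
# hills through files listed in a registry on a shared file system.
# "myCV", the registry path and all numeric values are assumptions.
REGISTRY="${REGISTRY:-$PWD/replicas.registry.txt}"

# Write the fragment for walker $1 to walker<id>.colvars.in
write_walker_config() {
    id=$1
    cat > "walker${id}.colvars.in" <<EOF
metadynamics {
    colvars                 myCV
    hillWeight              0.01
    multipleReplicas        on
    replicaID               walker${id}
    replicasRegistry        ${REGISTRY}
    replicaUpdateFrequency  1000
}
EOF
}

# Two walkers here; use as many as there are independent NAMD runs.
for id in 0 1; do
    write_walker_config "$id"
done
```

Each walker is then an ordinary, separately launched NAMD run with its own
replicaID, so no MPI is involved.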

By the way, I just noticed that you are launching 4 copies of NAMD over 2
GPUs? Don't do that. GPUs must be assigned exclusively to one process, or
their benefits go out the window.
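
A minimal sketch of what "one process per GPU" could look like with the
multicore-CUDA build: two independent NAMD processes, each given its own
device through +devices. The core count and the replica0.namd/replica1.namd
file names are assumptions for illustration. +setcpuaffinity is left out on
purpose, since two processes sharing a node would probably need distinct
core maps to avoid being pinned to the same cores.

```shell
#!/bin/sh
# Sketch: one multicore-CUDA NAMD process per GPU, so that no GPU is shared.
# NGPUS, CORES_PER_REPLICA and the replica<i>.namd names are assumptions.
NGPUS=2
CORES_PER_REPLICA=8

# Print the command line for replica $1, which uses only GPU $1.
replica_cmd() {
    echo "namd2 +p${CORES_PER_REPLICA} +devices $1 replica$1.namd"
}

# Dry run by default: only print the commands. Set RUN=1 to start them.
i=0
while [ "$i" -lt "$NGPUS" ]; do
    if [ "${RUN:-0}" = "1" ]; then
        $(replica_cmd "$i") > "replica${i}.log" 2>&1 &
    else
        replica_cmd "$i"
    fi
    i=$((i + 1))
done
wait
```

With 4 replicas and 2 GPUs, this would mean running two replicas at a time
rather than all four at once.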

Giacomo

On Fri, Jan 26, 2018 at 1:24 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
wrote:

> Thanks for the replies. I understand that, in the present scenario, it
> will be hard to use the GPUs for my replica runs, because combining GPU
> usage with MPI execution is difficult.
>
> Is the replica exchange scheme for multiple-walker ABF implemented
> differently than the one for metadynamics or other NAMD replica exchange
> strategies? I am just curious, because my understanding in this regard is
> not quite up to the mark.
> On 26 Jan 2018 20:43, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com> wrote:
>
>> In general the multicore version (i.e. SMP with no network) is the best
>> approach for CUDA, provided that the system is small enough. With nearly
>> everything offloaded to the GPUs in the recent version, the CPUs are mostly
>> idle, and adding more CPU cores only clogs up the motherboard bus.
>>
>> Running CUDA jobs in parallel, particularly with MPI, is a whole other
>> endeavor.
>>
>> In Souvik's case, it is a setup that is difficult to run fast. You may
>> consider using the multicore version for multiple-replicas metadynamics
>> runs, which can communicate between replicas using files and do not need
>> MPI.
>>
>> Giacomo
>>
>> On Thu, Jan 25, 2018 at 2:40 PM, Renfro, Michael <Renfro_at_tntech.edu>
>> wrote:
>>
>>> I can’t speak for running replicas as such, but my usual way of running
>>> on a single node with GPUs is to use the multicore-CUDA NAMD build, and to
>>> run namd as:
>>>
>>> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL} +p${SLURM_NTASKS}
>>> ${INPUT} >& ${OUTPUT}
>>>
>>> Where ${GPU_DEVICE_ORDINAL} is “0”, “1”, or “0,1” depending on which GPU
>>> I reserve; ${SLURM_NTASKS} is the number of cores needed, and ${INPUT} and
>>> ${OUTPUT} are the NAMD input file and the file to record standard output.
>>>
>>> Using HECBioSym’s 3M atom benchmark model, a single K80 card (presented
>>> as 2 distinct GPUs) could keep 8 E5-2680v4 CPU cores busy. But 16 or 28
>>> cores (the maximum on a single node of ours) were hardly any faster with 2
>>> GPUs than 8 cores.
>>>
>>> --
>>> Mike Renfro / HPC Systems Administrator, Information Technology Services
>>> 931 372-3601 / Tennessee Tech University
>>>
>>> > On Jan 25, 2018, at 12:59 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
>>> wrote:
>>> >
>>> > Thanks for your reply.
>>> > I was wondering why 'idlepoll' can't even get the GPUs to work,
>>> even if the performance is likely to be poor.
>>> >
>>> > On 25 Jan 2018 19:53, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>> wrote:
>>> > Hi Souvik, this seems connected to the compilation options. Compiling
>>> with MPI + SMP + CUDA used to give very poor performance, although I haven't
>>> tried with the new CUDA kernels (2.12 and later).
>>> >
>>> > Giacomo
>>> >
>>> > On Thu, Jan 25, 2018 at 4:02 AM, Souvik Sinha <
>>> souvik.sinha893_at_gmail.com> wrote:
>>> > NAMD Users,
>>> >
>>> > I am trying to run replica exchange ABF simulations on a machine with
>>> 32 cores and 2 Tesla K40 cards. I am using NAMD_2.12, compiled from
>>> source.
>>> >
>>> > From this earlier thread,
>>> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2014-2015/2490.html,
>>> I found that using "twoAwayX" or "idlepoll" might help the GPUs to work, but
>>> in my situation neither gets the GPUs to work ("twoAwayX" does speed up the
>>> jobs, though). The 'idlepoll' switch generally works fine with CUDA builds
>>> of NAMD for non-replica jobs. From the aforesaid thread, I gather that
>>> running 4 replicas on 32 CPUs and 2 GPUs may not give my simulations a big
>>> boost, but I just want to check whether it works at all.
>>> >
>>> > The command I am running for the job is:
>>> > mpirun -np 32 /home/sgd/program/NAMD_2.12_Source/Linux-x86_64-g++/namd2
>>> +idlepoll +replicas 4 $inputfile +stdout log/job0.%d.log
>>> >
>>> > My understanding is not helping me much, so any advice will be helpful.
>>> >
>>> > Thank you
>>> >
>>> > --
>>> > Souvik Sinha
>>> > Research Fellow
>>> > Bioinformatics Centre (SGD LAB)
>>> > Bose Institute
>>> >
>>> > Contact: 033 25693275
>>> >
>>> >
>>> >
>>> > --
>>> > Giacomo Fiorin
>>> > Associate Professor of Research, Temple University, Philadelphia, PA
>>> > Contractor, National Institutes of Health, Bethesda, MD
>>> > http://goo.gl/Q3TBQU
>>> > https://github.com/giacomofiorin
>>>
>>>
>>>
>>
>>
>> --
>> Giacomo Fiorin
>> Associate Professor of Research, Temple University, Philadelphia, PA
>> Contractor, National Institutes of Health, Bethesda, MD
>> http://goo.gl/Q3TBQU
>> https://github.com/giacomofiorin
>>
>

-- 
Giacomo Fiorin
Associate Professor of Research, Temple University, Philadelphia, PA
Contractor, National Institutes of Health, Bethesda, MD
http://goo.gl/Q3TBQU
https://github.com/giacomofiorin
