Re: Replica exchange simulation with GPU Acceleration

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Fri Jan 26 2018 - 13:18:17 CST

I'm not familiar with how the new CUDA code manages concurrency with the
GPU between different processes. Eventually, somebody at UIUC will provide
some info.

For sure, sharing a GPU is much worse than what you may expect: you
wouldn't just divide its speed in half. Transferring data to/from the GPU
is one of the slowest operations. The kernel will try sharing time on the
GPU between two processes in a manner that is completely unaware of the
processes' compute loops. You may well end up with interrupted loops
on the GPUs, thus losing much more than half.
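
To illustrate what exclusive assignment could look like for runs that do
not need MPI (for example independent jobs, or the file-based
multiple-walker metadynamics mentioned further down in this thread), here
is a minimal dry-run sketch. It assumes the multicore-CUDA build; the
function name plan_replicas, the file names replica.N.namd and
replica.N.log, and the core count are placeholders that are not from this
thread. It only prints one namd2 command per replica, each with its own
+devices index, and refuses to plan more replicas than there are GPUs.

```shell
#!/bin/sh
# Dry-run sketch: print one multicore-CUDA namd2 command per replica,
# giving each replica a GPU of its own. Nothing is actually launched.

# Hypothetical settings (not from the thread); adjust to your machine.
NGPUS=2              # GPUs available on the node
NREPLICAS=2          # replicas to run; must not exceed NGPUS
CORES_PER_REPLICA=8  # CPU threads per replica, passed as +pN

plan_replicas() {
    ngpus=$1
    nreplicas=$2
    cores=$3
    # Sharing a GPU between processes is what we want to avoid.
    if [ "$nreplicas" -gt "$ngpus" ]; then
        echo "error: $nreplicas replicas would have to share $ngpus GPUs" >&2
        return 1
    fi
    i=0
    while [ "$i" -lt "$nreplicas" ]; do
        # replica.$i.namd and replica.$i.log are placeholder file names.
        echo "namd2 +p$cores +devices $i replica.$i.namd > replica.$i.log"
        i=$((i + 1))
    done
}

plan_replicas "$NGPUS" "$NREPLICAS" "$CORES_PER_REPLICA"
```

Calling plan_replicas 2 4 8 (4 replicas, 2 GPUs) prints an error and
returns non-zero instead of planning a shared-GPU run.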

With NAMD being a performance-oriented code, there may very well be
instructions that prevent you from doing that, either explicitly or
implicitly as a result of the Charm++ scheduler.

Giacomo

On Fri, Jan 26, 2018 at 2:02 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
wrote:

> OK, that sheds some light. I mentioned in my earlier post that I'm
> not expecting much of a boost from the GPUs for replicas. I was just
> checking whether the multiple-walker scheme can use the GPUs at all. I get
> that launching more processes than there are GPUs is completely
> useless. Earlier, with the multicore-CUDA binary, single-process performance
> was greatly improved by using 2 GPUs.
>
> Just one question: is it because I launched 4 replicas over 2 GPUs that
> the GPUs were not put to work at all? I mean, if I launch 2
> replicas over 2 cores, will it put the GPUs to work? Obviously I can check
> that myself and get back to you. Thanks again.
> On 27 Jan 2018 12:03 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
> wrote:
>
>> The two multiple-walker schemes use different code. I wrote the one for
>> metadynamics a few years back before NAMD had multiple-copy capability,
>> using the file system. Jeff Comer and others at UIUC wrote the one for
>> ABF, using the network: for this reason, its use is subject to the
>> constraints of Charm++, where the simultaneous use of MPI and CUDA has so
>> far been difficult.
>>
>> The network-based solution should be more scalable in large HPC clusters,
>> but for a small commodity cluster of single-node replicas the file-based
>> one should be OK.
>>
>> By the way, I just noticed that you are launching 4 copies of NAMD over 2
>> GPUs? Don't do that. GPUs must be assigned exclusively to one process, or
>> their benefits go out the window.
>>
>> Giacomo
>>
>> On Fri, Jan 26, 2018 at 1:24 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
>> wrote:
>>
>>> Thanks for the replies. I get that in the present scenario it is going to
>>> be hard to use the GPU resources for my replica runs, because GPU usage is
>>> difficult to parallelise under MPI execution.
>>>
>>> Is the replica exchange scheme for multiple-walker ABF implemented
>>> differently than for metadynamics or other NAMD replica exchange
>>> strategies? I am just curious because my understanding in this regard is
>>> limited.
>>> On 26 Jan 2018 20:43, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com> wrote:
>>>
>>>> In general the multicore version (i.e. SMP with no network) is the best
>>>> approach for CUDA, provided that the system is small enough. With nearly
>>>> everything offloaded to the GPUs in the recent version, the CPUs are mostly
>>>> idle, and adding more CPU cores only clogs up the motherboard bus.
>>>>
>>>> Running CUDA jobs in parallel, particularly with MPI, is a whole other
>>>> endeavor.
>>>>
>>>> In Souvik's case, it is a setup that is difficult to run fast. You may
>>>> consider using the multicore version for multiple-replicas metadynamics
>>>> runs, which can communicate between replicas using files and do not need
>>>> MPI.
>>>>
>>>> Giacomo
>>>>
>>>> On Thu, Jan 25, 2018 at 2:40 PM, Renfro, Michael <Renfro_at_tntech.edu>
>>>> wrote:
>>>>
>>>>> I can’t speak for running replicas as such, but my usual way of
>>>>> running on a single node with GPUs is to use the multicore-CUDA NAMD build,
>>>>> and to run namd as:
>>>>>
>>>>> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL} +p${SLURM_NTASKS} ${INPUT} >& ${OUTPUT}
>>>>>
>>>>> Where ${GPU_DEVICE_ORDINAL} is “0”, “1”, or “0,1” depending on which
>>>>> GPU I reserve; ${SLURM_NTASKS} is the number of cores needed, and ${INPUT}
>>>>> and ${OUTPUT} are the NAMD input file and the file to record standard
>>>>> output.
>>>>>
>>>>> Using HECBioSym’s 3M-atom benchmark model, a single K80 card (presented
>>>>> as 2 distinct GPUs) could keep 8 E5-2680v4 CPU cores busy. But 16 or 28
>>>>> cores (the maximum on a single node of ours) were hardly any faster with 2
>>>>> GPUs than 8 cores.
>>>>>
>>>>> --
>>>>> Mike Renfro / HPC Systems Administrator, Information Technology
>>>>> Services
>>>>> 931 372-3601 / Tennessee Tech University
>>>>>
>>>>> > On Jan 25, 2018, at 12:59 PM, Souvik Sinha <souvik.sinha893_at_gmail.com> wrote:
>>>>> >
>>>>> > Thanks for your reply.
>>>>> > I was wondering why 'idlepoll' can't even get the GPUs to work,
>>>>> > even if the performance is likely to be poor.
>>>>> >
>>>>> > On 25 Jan 2018 19:53, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com> wrote:
>>>>> > Hi Souvik, this seems connected to the compilation options.
>>>>> > Compiling with MPI + SMP + CUDA used to give very poor performance,
>>>>> > although I haven't tried with the new CUDA kernels (2.12 and later).
>>>>> >
>>>>> > Giacomo
>>>>> >
>>>>> > On Thu, Jan 25, 2018 at 4:02 AM, Souvik Sinha <souvik.sinha893_at_gmail.com> wrote:
>>>>> > NAMD Users,
>>>>> >
>>>>> > I am trying to run replica exchange ABF simulations on a machine
>>>>> > with 32 cores and 2 Tesla K40 cards. I am using NAMD 2.12, compiled
>>>>> > from source.
>>>>> >
>>>>> > From this earlier thread,
>>>>> > http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2014-2015/2490.html,
>>>>> > I found that using "twoAwayX" or "idlepoll" might help the GPUs to work,
>>>>> > but in my situation neither gets the GPUs working ("twoAwayX" is speeding
>>>>> > up the jobs, though). The 'idlepoll' switch generally works fine with CUDA
>>>>> > builds of NAMD for non-replica jobs. From the aforesaid thread, I gather
>>>>> > that running 4 replicas on 32 CPUs and 2 GPUs may not provide a big boost
>>>>> > to my simulations, but I just want to check whether it works at all.
>>>>> >
>>>>> > The command I am running for the job is:
>>>>> > mpirun -np 32 /home/sgd/program/NAMD_2.12_Source/Linux-x86_64-g++/namd2 +idlepoll +replicas 4 $inputfile +stdout log/job0.%d.log
>>>>> >
>>>>> > My own understanding is not getting me far, so any advice will be
>>>>> > helpful.
>>>>> >
>>>>> > Thank you
>>>>> >
>>>>> > --
>>>>> > Souvik Sinha
>>>>> > Research Fellow
>>>>> > Bioinformatics Centre (SGD LAB)
>>>>> > Bose Institute
>>>>> >
>>>>> > Contact: 033 25693275
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Giacomo Fiorin
>>>>> > Associate Professor of Research, Temple University, Philadelphia, PA
>>>>> > Contractor, National Institutes of Health, Bethesda, MD
>>>>> > http://goo.gl/Q3TBQU
>>>>> > https://github.com/giacomofiorin
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>

-- 
Giacomo Fiorin
Associate Professor of Research, Temple University, Philadelphia, PA
Contractor, National Institutes of Health, Bethesda, MD
http://goo.gl/Q3TBQU
https://github.com/giacomofiorin

This archive was generated by hypermail 2.1.6 : Fri Dec 06 2019 - 23:19:30 CST