Re: Replica exchange simulation with GPU Acceleration

From: Souvik Sinha (souvik.sinha893_at_gmail.com)
Date: Mon Jan 29 2018 - 05:30:39 CST

I have built NAMD following the instructions in the earlier thread. I can
now launch 4 replicas over 4 GPU cores, and it works.
Thanks again.
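
For readers landing on this thread later, here is a rough sketch of the "one GPU per process" rule discussed below. It is only a dry-run planner: it prints launch lines rather than running anything. It assumes a multicore-CUDA build whose binary is called namd2, and the walker_N.namd / walker_N.log file names are hypothetical. Independent processes like these fit the file-based multiple-walker metadynamics setup Giacomo describes. They are not the network-based build from Jeff's earlier post that multiple-walker ABF needs.

```shell
#!/bin/sh
# Dry-run planner: print one launch line per replica, one GPU each.
# Assumptions (not from the thread): the binary is named namd2 and is a
# multicore-CUDA build; walker_N.namd and walker_N.log are hypothetical names.

plan_replicas() {
    n_replicas=$1   # number of walkers to start
    n_gpus=$2       # GPUs available on the node
    n_cores=$3      # CPU cores available on the node

    if [ "$n_replicas" -lt 1 ]; then
        echo "error: need at least one replica" >&2
        return 1
    fi

    # Refuse to share a GPU between processes, as advised in the thread.
    if [ "$n_replicas" -gt "$n_gpus" ]; then
        echo "error: $n_replicas replicas but only $n_gpus GPUs" >&2
        return 1
    fi

    # Split the cores evenly; integer division drops any remainder.
    cores_each=$((n_cores / n_replicas))

    i=0
    while [ "$i" -lt "$n_replicas" ]; do
        # +p sets the thread count, +devices pins this process to GPU $i.
        echo "namd2 +p$cores_each +devices $i walker_$i.namd > walker_$i.log &"
        i=$((i + 1))
    done
}

# Example: 2 walkers on a node with 2 GPUs and 16 cores.
plan_replicas 2 2 16
```

Asking for 4 replicas on 2 GPUs (the situation in the original question) makes the function print an error and return non-zero instead of emitting launch lines.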

On Sat, Jan 27, 2018 at 10:50 AM, Souvik Sinha <souvik.sinha893_at_gmail.com>
wrote:

> Thanks for the reply. This earlier thread is really helpful. I will
> definitely try your suggestion for building NAMD to handle replica
> jobs.
> On 27 Jan 2018 03:16, "Jeff Comer" <jeffcomer_at_gmail.com> wrote:
>
> Dear Souvik,
>
> I routinely use GPUs for multiple-walker ABF with decent performance. I
> have a workstation with 3 GPUs and usually use 3 replicas. I haven't tried
> using more threads than GPUs. I posted on this mailing list about setting
> it up:
>
> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2016-2017/1721.html
>
> Jeff
>
>
> –––––––––––––––––––––––––––––––––––———————
> Jeffrey Comer, PhD
> Assistant Professor
> Institute of Computational Comparative Medicine
> Nanotechnology Innovation Center of Kansas State
> Kansas State University
> Office: P-213 Mosier Hall
> Phone: 785-532-6311
> Website: http://jeffcomer.us
>
> On Fri, Jan 26, 2018 at 1:35 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
> wrote:
>
>> Ok, I get that. Thanks.
>> On 27 Jan 2018 12:48 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>> wrote:
>>
>>> I'm not familiar with how the new CUDA code manages concurrency with the
>>> GPU between different processes. Eventually, somebody at UIUC will provide
>>> some info.
>>>
>>> For sure, sharing a GPU is much worse than what you may expect: you
>>> wouldn't just divide its speed in half. Transferring data to/from the GPU
>>> is one of the slowest operations. The kernel will try sharing time on the
>>> GPU between two processes in a manner that is completely unaware of the
>>> processes' compute loops. You may well end up with interrupted loops on
>>> the GPUs, thus losing much more than half.
>>>
>>> With NAMD being a performance-oriented code, there may very well be
>>> instructions that prevent you from doing that, either explicitly or
>>> implicitly as a result of the Charm++ scheduler.
>>>
>>> Giacomo
>>>
>>> On Fri, Jan 26, 2018 at 2:02 PM, Souvik Sinha <souvik.sinha893_at_gmail.com
>>> > wrote:
>>>
>>>> OK, that sheds some light. As I mentioned in my earlier post, I'm not
>>>> expecting much of a boost from the GPUs for replicas. I was just checking
>>>> whether the multiple-walker scheme can use GPUs at all. I get that
>>>> launching more processes than there are GPUs is completely useless.
>>>> Earlier, with the multicore-CUDA binary, single-process performance
>>>> improved greatly with the use of 2 GPUs.
>>>>
>>>> Just one question: is launching 4 replicas over 2 GPUs the reason the
>>>> GPU cores were not put to work at all? I mean, if I launch 2 replicas
>>>> over 2 cores, will that put the GPUs to work? Obviously I can check that
>>>> myself and get back to you. Thanks again.
>>>> On 27 Jan 2018 12:03 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>> wrote:
>>>>
>>>>> The two multiple-walker schemes use different code. I wrote the one
>>>>> for metadynamics a few years back before NAMD had multiple-copy capability,
>>>>> using the file system. Jeff Comer and others at UIUC wrote the one for
>>>>> ABF, using the network: for this reason, its use is subject to the
>>>>> constraints of Charm++, where the simultaneous use of MPI and CUDA has so
>>>>> far been difficult.
>>>>>
>>>>> The network-based solution should be more scalable in large HPC
>>>>> clusters, but for a small commodity cluster of single-node replicas it
>>>>> should be OK.
>>>>>
>>>>> By the way, I just noticed that you are launching 4 copies of NAMD
>>>>> over 2 GPUs? Don't do that. GPUs must be assigned exclusively to one
>>>>> process, or their benefits go out the window.
>>>>>
>>>>> Giacomo
>>>>>
>>>>> On Fri, Jan 26, 2018 at 1:24 PM, Souvik Sinha <
>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>
>>>>>> Thanks for the replies. I get that, at present, it is going to be
>>>>>> hard to use the GPU resources for my replica runs because of some
>>>>>> difficulty in parallelising GPU usage under MPI execution.
>>>>>>
>>>>>> Is the replica exchange scheme for multiple-walker ABF implemented
>>>>>> differently from the one for metadynamics or other NAMD replica exchange
>>>>>> strategies? I am just curious because my understanding in this regard is
>>>>>> limited.
>>>>>> On 26 Jan 2018 20:43, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In general the multicore version (i.e. SMP with no network) is the
>>>>>>> best approach for CUDA, provided that the system is small enough. With
>>>>>>> nearly everything offloaded to the GPUs in the recent version, the CPUs are
>>>>>>> mostly idle, and adding more CPU cores only clogs up the motherboard bus.
>>>>>>>
>>>>>>> Running CUDA jobs in parallel, particularly with MPI, is a whole
>>>>>>> other endeavor.
>>>>>>>
>>>>>>> In Souvik's case, it is a setup that is difficult to run fast. You
>>>>>>> may consider using the multicore version for multiple-replicas metadynamics
>>>>>>> runs, which can communicate between replicas using files and do not need
>>>>>>> MPI.
>>>>>>>
>>>>>>> Giacomo
>>>>>>>
>>>>>>> On Thu, Jan 25, 2018 at 2:40 PM, Renfro, Michael <Renfro_at_tntech.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I can’t speak for running replicas as such, but my usual way of
>>>>>>>> running on a single node with GPUs is to use the multicore-CUDA NAMD build,
>>>>>>>> and to run namd as:
>>>>>>>>
>>>>>>>> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL}
>>>>>>>> +p${SLURM_NTASKS} ${INPUT} >& ${OUTPUT}
>>>>>>>>
>>>>>>>> Where ${GPU_DEVICE_ORDINAL} is “0”, “1”, or “0,1” depending on
>>>>>>>> which GPU I reserve; ${SLURM_NTASKS} is the number of cores needed, and
>>>>>>>> ${INPUT} and ${OUTPUT} are the NAMD input file and the file to record
>>>>>>>> standard output.
>>>>>>>>
>>>>>>>> Using HECBioSim’s 3M atom benchmark model, a single K80 card
>>>>>>>> (presented as 2 distinct GPUs) could keep 8 E5-2680v4 CPU cores busy. But
>>>>>>>> 16 or 28 cores (the maximum on a single node of ours) were hardly any
>>>>>>>> faster with 2 GPUs than 8 cores.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Renfro / HPC Systems Administrator, Information Technology
>>>>>>>> Services
>>>>>>>> 931 372-3601 / Tennessee Tech University
>>>>>>>>
>>>>>>>> > On Jan 25, 2018, at 12:59 PM, Souvik Sinha <
>>>>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Thanks for your reply.
>>>>>>>> > I was wondering why 'idlepoll' can't even get the GPUs to work,
>>>>>>>> > even if the performance is likely to be poor.
>>>>>>>> >
>>>>>>>> > On 25 Jan 2018 19:53, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>>>>>> wrote:
>>>>>>>> > Hi Souvik, this seems connected to the compilation options.
>>>>>>>> > Compiling with MPI + SMP + CUDA used to give very poor performance,
>>>>>>>> > although I haven't tried with the new CUDA kernels (2.12 and later).
>>>>>>>> >
>>>>>>>> > Giacomo
>>>>>>>> >
>>>>>>>> > On Thu, Jan 25, 2018 at 4:02 AM, Souvik Sinha <
>>>>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>>>> > NAMD Users,
>>>>>>>> >
>>>>>>>> > I am trying to run replica exchange ABF simulations on a machine
>>>>>>>> > with 32 cores and 2 Tesla K40 cards. I am using NAMD_2.12, compiled
>>>>>>>> > from source.
>>>>>>>> >
>>>>>>>> > From this earlier thread,
>>>>>>>> > http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2014-2015/2490.html,
>>>>>>>> > I found that using "twoAwayX" or "idlepoll" might help the GPUs to
>>>>>>>> > work, but in my situation neither gets the GPUs working ("twoAwayX"
>>>>>>>> > does speed up the jobs, though). The 'idlepoll' switch generally works
>>>>>>>> > fine with CUDA builds of NAMD for non-replica jobs. From the
>>>>>>>> > aforesaid thread, I gather that running 4 replicas on 32 CPUs and 2
>>>>>>>> > GPUs may not give my simulations a big boost, but I just want to
>>>>>>>> > check whether it works at all.
>>>>>>>> >
>>>>>>>> > The command I am running for the job is:
>>>>>>>> > mpirun -np 32 /home/sgd/program/NAMD_2.12_Source/Linux-x86_64-g++/namd2 +idlepoll +replicas 4 $inputfile +stdout log/job0.%d.log
>>>>>>>> >
>>>>>>>> > My understanding is not helping me much, so any advice will be
>>>>>>>> helpful.
>>>>>>>> >
>>>>>>>> > Thank you
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Souvik Sinha
>>>>>>>> > Research Fellow
>>>>>>>> > Bioinformatics Centre (SGD LAB)
>>>>>>>> > Bose Institute
>>>>>>>> >
>>>>>>>> > Contact: 033 25693275
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Giacomo Fiorin
>>>>>>>> > Associate Professor of Research, Temple University, Philadelphia,
>>>>>>>> PA
>>>>>>>> > Contractor, National Institutes of Health, Bethesda, MD
>>>>>>>> > http://goo.gl/Q3TBQU
>>>>>>>> > https://github.com/giacomofiorin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>

-- 
Souvik Sinha
Research Fellow
Bioinformatics Centre (SGD LAB)
Bose Institute
Contact: 033 25693275
