From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Jul 01 2015 - 02:13:56 CDT
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Mitchell Gleed
Sent: Wednesday, July 01, 2015 1:09 AM
To: NAMD list
Subject: namd-l: replica exchange and GPU acceleration
I'm struggling to get GPU-accelerated replica-exchange jobs to run. I thought that maybe this wasn't possible, but I found some older posts in which some users reported success with it.
In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple parameters in the configuration files, if I try to launch a replica exchange job with a standard mpi NAMD build (mpirun -np 4 namd2 ++replicas 4 ... etc) things work fine. However, when I jump on a node with 4 GPU's (2* tesla k80) and use a CUDA-enabled mpi NAMD build (mpirun -np 4 namd2 +idlepoll +replicas 4 ...), I get atom velocity errors at startup. These are preceded by several warnings, after load balancing, similar to the following: "0:18(1) atom 18 found 6 exclusions but expected 4"
I’ve run a lot of GPU REMD jobs successfully, so it’s possible and running fine.
The CUDA-enabled mpi build runs fine for non-REMD runs and provides great acceleration to my simulations. I recognize the test case here might not be the best-suited for CUDA acceleration due to its small size, but it should at least run, right?
Now, I notice the log files for each replica all say "Pe 0 physical rank 0 binding to CUDA device 0," as if each replica is trying to use the same GPU (device 0). (Persists even with +devices 0,1,2,3.) I suspect I could solve the problem if I got each of four replicas to bind to different GPU's, but I don't know how to assign GPU's to a specific replica. I assume that each replica should have its own GPU, right, or can a GPU be shared by multiple replicas? Ideally I'd like to be able to use 1 GPU node with 4 GPU's to run 16 replicas, or 4 GPU nodes for 16 gpu's, one per replica, if the former isn't possible.
Sharing GPUs between replicas is ok. If you have multiple replica per node you can’t actually prevent it. You can of course oversubscribe your nodes as long as memory is sufficient, you just need to use #replicas*2 processes. You shouldn’t need to care about GPU binding at all.
Does anyone have any advice to resolve the problem I'm running into?
The warning and error you observe might be due starting a high temperature run without a leading minimization run. Try minimizing the system before starting the REMD. I actually extended the replica.namd script by a configurable minimization feature to get around such issues.
This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:56 CST