RE: Decreasing performance of cluster running FEP

From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Date: Wed Jul 11 2018 - 18:23:23 CDT

Colvars are indeed driven by a single CPU. Most of the colvars perform well if the number of atoms involved isn't too big, and bond lengths and angles are typical examples of that. But if you are asking for colvars that involve many atoms in a complicated relationship, performance isn't all that good. To me, the weird thing is that the performance degrades only as the lambda changes. Are you getting any absurd bonds as the trajectory progresses?

-Josh
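
A rough way to see why a component that runs on a single CPU can limit a 144-core job is Amdahl's law. The sketch below is only an illustration: the serial fractions are hypothetical values chosen for the example, not measurements of colvars in this system, and `amdahl_speedup` is a helper name introduced here.

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Ideal speedup on n_cores when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


if __name__ == "__main__":
    # HYPOTHETICAL serial fractions standing in for the single-CPU colvars work.
    for serial_fraction in (0.0, 0.001, 0.01, 0.05):
        # 36 cores = one node, 144 cores = four nodes (figures from the thread).
        for n_cores in (36, 144):
            speedup = amdahl_speedup(serial_fraction, n_cores)
            print(f"serial={serial_fraction:.3f} cores={n_cores:3d} speedup={speedup:6.1f}")
```

A fixed serial cost of this kind would cap scaling from the first window on, which is consistent with the remark above that a slowdown appearing only as lambda changes is the odd part.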

On 2018-07-11 15:32:12-06:00 Francesco Pietra wrote:

Thanks for your answer.

36-core Intel® Xeon® Broadwell per node, 115 GB memory per node, so the problem must lie elsewhere.

50555 atoms, including waters; 4 nodes proved to be the best choice for MD,
where the performance was excellent.

With the ligand alone in water the best choice proved to be one node.

In retrospect, are colvars driven by a single CPU? Is that the problem? I could not set fewer colvars than those I described
and still keep the ligand in place.
francesco

On Wed, Jul 11, 2018 at 7:21 PM Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov> wrote:
What is the hardware on your cluster? FEP is not accelerated with GPUs. Neither are colvars, which is I think where the problem may actually be. How many atoms are in your colvar definitions?

-Josh

On 2018-07-10 23:36:51-06:00 owner-namd-l_at_ks.uiuc.edu wrote:

Hello:
I am observing a marked decrease in the performance of a NextScale cluster running a FEP for a protein-ligand system previously equilibrated for over 100 ns. There were no such problems when running MD equilibration on the same system. The code is NAMD 2.12, compiled ad hoc in house with Intel 2016 (a NAMD 2.12 build compiled with a more recent Intel version, available as a module on the cluster, proved unable to run a FEP).

The system is made of ca. 460 residues in water, FEP from lambda 0.2 to 1.0 in steps of 0.025 (32 windows), preeq 175,000/numSteps 750,000, ts=1.0 fs.
FEP on 4 nodes/144 cores (optimal for scaling) starts at 0.0078/step until window 3, then 0.014/step until window 5, then 0.021/step up to the present window 9. The ligand, under modest r/angle/dih colvars, remains in place with no detectable rotation or distortion. The slowdown is such that carrying out a FEP becomes extremely expensive, even when divided into two sectors as now. The same problem occurs for FEP 0.0-0.2.
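
To put the figures above in perspective, here is a minimal sketch that turns them into per-window cost. It assumes the reported numbers are wall-clock seconds per step and that numSteps (750,000) already includes the 175,000 preeq steps; neither is stated explicitly in the message. The function and variable names are introduced for this example only.

```python
# Values quoted in the message.
LAMBDA_START, LAMBDA_END, LAMBDA_STEP = 0.2, 1.0, 0.025
NUM_STEPS = 750_000  # ASSUMPTION: numSteps per window already includes the preeq steps

# Reported cost per step at successive stages of the run.
# ASSUMPTION: the figures are wall-clock seconds per step.
SECONDS_PER_STEP = {
    "until window 3": 0.0078,
    "until window 5": 0.014,
    "until window 9": 0.021,
}


def window_count(start: float, end: float, step: float) -> int:
    """Number of lambda windows between start and end at the given spacing."""
    return round((end - start) / step)


def hours_per_window(seconds_per_step: float, steps: int = NUM_STEPS) -> float:
    """Wall-clock hours needed for one window at the given per-step cost."""
    return seconds_per_step * steps / 3600.0


n_windows = window_count(LAMBDA_START, LAMBDA_END, LAMBDA_STEP)
baseline = SECONDS_PER_STEP["until window 3"]
slowdown = SECONDS_PER_STEP["until window 9"] / baseline

if __name__ == "__main__":
    print(f"windows: {n_windows}")
    for label, cost in SECONDS_PER_STEP.items():
        print(f"{label}: {hours_per_window(cost):.2f} h/window, "
              f"{cost / baseline:.2f}x the initial cost")
```

Under these assumptions, the per-step cost by window 9 is a bit under three times the initial one, so each later window takes correspondingly longer than the first three.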

I observed the same problem when running FEP on the ligand alone in water on one node/36 cores.

In all cases, having the code write to disk less frequently did not help.
