RE: NAMD QM/MM multi nodes performance bad

From: James Kress
Date: Mon Dec 14 2020 - 13:21:33 CST



Why are you using 20 cores with NAMD but 60 with ORCA? Could this asymmetric use of cores be causing problems with core allocation and core swapping?


Also, what interconnect is being used between the nodes? (By the way, you say you requested 4 nodes, but your node file lists only 3.) Ethernet is infamous for latency issues. Hopefully InfiniBand is being used.


Are there any file system I/O issues?


Also, ORCA scales quite well across my systems.




James Kress Ph.D., President

The KressWorks® Institute

An IRS Approved 501 (c)(3) Charitable, Nonprofit Corporation

“Engineering The Cure” ©

(248) 573-5499



From: Josh Vermaas
Sent: Monday, December 14, 2020 10:43 AM
To: Chunli Yan; Josh Vermaas
Subject: Re: namd-l: NAMD QM/MM multi nodes performance bad


Hi Chunli,

You clearly get a performance win by running ORCA standalone here. How do the Slurm arguments compare? It could be that multiple nodes don't help you, since QM codes generally scale pretty poorly across nodes. What you are showing here is that, so long as NAMD can get the MM done in under a second, the NAMD part of the problem really doesn't matter.


On 12/13/20 9:21 PM, Chunli Yan wrote:

I took the input generated by NAMD and ran it with ORCA alone on 60 cores. The timing is below:


Timings for individual modules:

Sum of individual times         ...   32.070 sec (=   0.535 min)
GTO integral calculation        ...    4.805 sec (=   0.080 min)  15.0 %
SCF iterations                  ...   19.318 sec (=   0.322 min)  60.2 %
SCF Gradient evaluation         ...    7.947 sec (=   0.132 min)  24.8 %

                             ****ORCA TERMINATED NORMALLY****

TOTAL RUN TIME: 0 days 0 hours 0 minutes 39 seconds 917 msec

With NAMD and ORCA combined (60 cores for ORCA):


Timings for individual modules:

Sum of individual times         ...   77.582 sec (=   1.293 min)
GTO integral calculation        ...    5.404 sec (=   0.090 min)   7.0 %
SCF iterations                  ...   67.242 sec (=   1.121 min)  86.7 %
SCF Gradient evaluation         ...    4.937 sec (=   0.082 min)   6.4 %

                             ****ORCA TERMINATED NORMALLY****
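
As a rough worked comparison of the two ORCA timing summaries (standalone versus launched from NAMD), the per-module slowdown can be computed directly. This is only a sketch: the helper name slowdown is hypothetical, and the numbers are copied from the two summaries above.

```shell
#!/bin/sh
# Ratio of two timings (coupled run / standalone run), to two decimals.
# "slowdown" is an illustrative helper name, not part of NAMD or ORCA.
slowdown() {
    awk -v coupled="$1" -v alone="$2" 'BEGIN { printf "%.2f\n", coupled / alone }'
}

# Timings in seconds, taken from the two ORCA summaries in this message.
echo "Sum of individual times: $(slowdown 77.582 32.070)x"
echo "SCF iterations:          $(slowdown 67.242 19.318)x"
echo "SCF gradient evaluation: $(slowdown 4.937 7.947)x"
```

On these numbers the coupled run is about 2.4 times slower overall, and nearly all of that comes from the SCF iterations (about 3.5 times slower); the gradient evaluation is not slower at all.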









On Sun, Dec 13, 2020 at 10:53 PM Josh Vermaas wrote:

Just a quick question: how fast is the QM part of the calculation? I don't know what your expectation is, but each timestep is taking over a minute. The vast majority of that is likely the QM, as I'm sure you will find that an MM-only system with a handful of cores will calculate a timestep in under a second. My advice is to figure out the QM half of the calculation and get it running optimally. Even then, your performance is going to be awful compared with pure MM calculations, since you are trying to evaluate a much harder energy function.



On Sun, Dec 13, 2020, 7:49 PM Chunli Yan wrote:


NAMD QM/MM parallel runs across multiple nodes:

I wrote a nodelist file into the directory where ORCA runs. Below is the job submission script:


#SBATCH -A bip174

#SBATCH -J test


##SBATCH --tasks-per-node=32

##SBATCH --cpus-per-task=1

##SBATCH --mem=0

#SBATCH -t 48:00:00


#module load openmpi/3.1.4


export PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/bin/:$PATH"

export LD_LIBRARY_PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/lib/:$LD_LIBRARY_PATH"



# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from



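# Generate ORCA nodelist: one line per allocated node
# (>> appends, so remove any stale nodelist before resubmitting)

for n in $(scontrol show hostnames "$SLURM_NODELIST"); do

    echo "$n slots=20 max-slots=32" >> /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes

done


# Drop the first node, where NAMD launches, from the ORCA nodelist

sed -i '1d' /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes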


cd /gpfs/alpine/scratch/chunli/bip174/eABF/run.smd.dft5

/ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 +isomalloc_sync decarboxylase.1.conf > output.smd1.log

I also exclude the first node where NAMD launches to avoid competition between NAMD and ORCA.

The nodelist is below:


andes4 slots=20 max-slots=32

andes6 slots=20 max-slots=32

andes7 slots=20 max-slots=32
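
A quick way to check how many hosts and slots a nodelist like this actually offers to mpirun is to sum the slots= fields. The sketch below assumes the Open MPI hostfile layout shown above; count_slots is a hypothetical helper name, and the temporary file only stands in for the real qmmm_0.nodes.

```shell
#!/bin/sh
# Count hosts and sum the slots= values in an Open MPI style hostfile.
# "count_slots" is an illustrative helper, not a Slurm or Open MPI command.
count_slots() {
    awk 'NF > 0 {
        hosts++
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { split($i, kv, "="); total += kv[2] }
    }
    END { printf "%d hosts, %d slots\n", hosts, total }' "$1"
}

# Temporary stand-in for the nodelist shown in this message.
nodefile=$(mktemp)
cat > "$nodefile" <<'EOF'
andes4 slots=20 max-slots=32
andes6 slots=20 max-slots=32
andes7 slots=20 max-slots=32
EOF

count_slots "$nodefile"
rm -f "$nodefile"
```

For the three lines above this reports 3 hosts and 60 slots, which matches the 60 cores given to ORCA once the first node has been removed from the list.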


To pass the host file to mpirun, I edited the line that builds the ORCA command:

cmdline += orcaInFileName + " \"--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe \" " + " > " + orcaOutFileName


QM methods: B3LYP def2-SVP Grid4 EnGrad SlowConv TightSCF RIJCOSX D3BJ def2/J


I requested 4 nodes in total, with 60 cores for ORCA and 20 for NAMD, but the performance is really bad for a system of 48968 atoms in total, 32 of them QM atoms. The performance is below:


Info: Initial time: 30 CPUs 75.0565 s/step 1737.42 days/ns 2285.66 MB memory

Info: Initial time: 30 CPUs 81.1294 s/step 1877.99 days/ns 2286 MB memory

Info: Initial time: 30 CPUs 87.776 s/step 2031.85 days/ns 2286 MB memory
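
For reference, the days/ns figure in these log lines follows from the s/step figure once the timestep is known. The timestep is not stated in this message; the sketch below assumes 0.5 fs (2,000,000 steps per ns), a value inferred only because it reproduces the reported numbers to within rounding. The helper name days_per_ns is hypothetical.

```shell
#!/bin/sh
# Convert seconds per step into days per nanosecond of simulated time.
#   $1 = wall-clock seconds per step
#   $2 = timestep in fs (ASSUMED to be 0.5 here; not given in the thread)
days_per_ns() {
    awk -v s="$1" -v dt="$2" 'BEGIN { printf "%.2f\n", s * (1e6 / dt) / 86400 }'
}

days_per_ns 75.0565 0.5
days_per_ns 87.776 0.5
```

Under that assumption, 75.0565 s/step works out to roughly 1737 days/ns, in line with the first log line.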


Can someone help me find out whether I did something wrong, or whether NAMD QM/MM can scale well across nodes at all? I checked the ORCA MPI jobs on each node and found that CPU usage was only 50-70%.


NAMD was compiled with SMP and icc:

./build charm++ verbs-linux-x86_64 icc smp -with-production

./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-icc-smp





Chunli Yan


