RE: charmrun error: Work completion error in sendCq

From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Date: Sat Oct 19 2019 - 15:55:11 CDT

Oh hey! I thought I had done something wrong and was starting to dig into that error message myself. I've narrowed it down to something related to the system being large, possibly the exclusion lists being transferred, since the simulation dies somewhere in phase 1 of the setup. (Check your logs: for me, it gets past phase 0 and dies in phase 1.) My system is only 3M particles or so, but because the bonds are all out of order, the exclusion lists are much larger than they would be for a typical system. Does this system work on a single GPU node? Also, have there been any recent updates to your software? I can dig up more of my own notes on Monday.
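
If it helps with the log check, here is a minimal Python sketch that reports the last startup phase a log says it finished. It assumes the log marks each completed phase with a line containing "Startup phase N took", which is how my logs look but may differ between builds; the function name last_startup_phase is made up for illustration.

```python
import re

# Assumption: each completed phase is logged on a line containing
# "Startup phase <N> took". Adjust the pattern if your build differs.
PHASE_RE = re.compile(r"Startup phase (\d+) took")


def last_startup_phase(log_text):
    """Return the highest startup phase reported as completed, or None.

    None means no phase-completion line was found in the text at all.
    """
    phases = [int(m.group(1)) for m in PHASE_RE.finditer(log_text)]
    return max(phases) if phases else None
```

Read the log into a string and pass it in. If the result is 0, the run got through phase 0 and died before phase 1 finished, which is what I am seeing.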

-Josh

On 2019-10-19 10:20:11-06:00 owner-namd-l_at_ks.uiuc.edu wrote:

Dear all,
I get an error from charmrun with the precompiled NAMD 2.13 ibverbs and verbs builds. The error persists even with a self-compiled build using charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic. The error is pasted below:
[0] wc[0] status 9 wc[i].opcode 0
mlx5: login-hive1.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008a12 0a001e80 0036b1d2
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
[0] Stack Traceback:
  [0:0] [0x6176e3]
  [0:1] [0x617736]
  [0:2] [0x613a78]
  [0:3] [0x61383e]
  [0:4] [0x61c881]
  [0:5] [0x61ead9]
  [0:6] [0x61315f]
  [0:7] [0x617857]
  [0:8] [0x625f28]
  [0:9] [0x626d93]
  [0:10] [0x621671]
  [0:11] [0x621ac9]
  [0:12] [0x6219a0]
  [0:13] [0x6174b6]
  [0:14] [0x617337]
  [0:15] [0x4e2a6b]
  [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
  [0:17] [0x408ba9]
Our cluster uses MLX InfiniBand and RHEL 7, if that information helps. Any help will be appreciated!
Thank you!
Best,
Andrew Pang

This archive was generated by hypermail 2.1.6 : Sat Dec 07 2019 - 23:20:52 CST