Re: RE: charmrun error: Work completion error in sendCq

From: Julio Maia (jmaia_at_ks.uiuc.edu)
Date: Thu Oct 24 2019 - 15:30:24 CDT

Hi Andrew,

Can you try rebuilding Charm++ with UCX or MPI instead of verbs? We
recently discovered that the verbs layer of our runtime system is broken
on modern InfiniBand machines.
Please get back to us and let us know whether that fixes your problem.
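For reference, a rebuild along those lines might look like the sketch below. The build target names and flags are standard Charm++ options, but note that the UCX machine layer only exists in Charm++ versions newer than the 6.8.2 you are using (roughly 6.10 onward), so with 6.8.2 the MPI layer is the practical fallback; paths and the ++p count are taken from your earlier message and are illustrative:

```shell
# Fallback for Charm++ 6.8.2: build the MPI machine layer using the
# cluster's MPI stack instead of the broken verbs layer
cd charm-6.8.2
./build charm++ mpi-linux-x86_64 --with-production

# On a newer Charm++ (6.10+), the UCX machine layer is an option:
#   ./build charm++ ucx-linux-x86_64 --with-production

# Re-run the megatest against the new build to confirm the sendCq
# error is gone (mirrors the test from the message below)
cd mpi-linux-x86_64/tests/charm++/megatest
make
./charmrun ++p 4 ./pgm
```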

Thanks,

On Sat, Oct 19, 2019 at 4:31 PM Pang, Yui Tik <andrewpang_at_gatech.edu> wrote:

> Thanks for your help! In my case, I am pretty sure it has nothing to do
> with the system size, because I get the same error just by running the
> charmrun megatest
> (charm6.8.2/verbs-linux-x86_64-ifort-iccstatic/tests/charm++/megatest)
> (charmrun ++p 4 ./pgm). We are testing it on a single CPU node. It is a
> brand new cluster, so everything is new, and we are just installing NAMD on
> it for the first time. Other versions of NAMD (I tried net and smp) work
> fine, except the performance isn't as good. We really want to try the
> ibverbs version but run into that sendCq error. Thanks!
>
>
>
> Best,
>
> Andrew Pang
>
>
>
> *From: *Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
> *Sent: *Saturday, October 19, 2019 16:55
> *To: *Pang, Yui Tik <andrewpang_at_gatech.edu>; namd-l_at_ks.uiuc.edu
> *Subject: *RE: charmrun error: Work completion error in sendCq
>
>
>
> Oh hey! I thought I did something wrong and was starting to dig into that
> error message myself. I've narrowed it down to something related to the
> system being large, possibly having to do with the exclusion lists being
> transferred, since the simulation dies somewhere in phase 1 of the setup
> (check your logs; for me, it gets past phase 0 and dies in phase 1). My
> system is only about 3M particles, but because the bonds are all out of
> order, the exclusion lists are much larger than they would be for a typical
> system. Does this system work on a single GPU node? Also, have there been
> any recent updates to your software? I can dig up more of my own notes on
> Monday.
>
> -Josh
>
>
>
>
> On 2019-10-19 10:20:11-06:00 owner-namd-l_at_ks.uiuc.edu wrote:
>
> Dear all,
>
> I get an error from charmrun with both the precompiled NAMD 2.13 ibverbs
> and verbs versions. The error persists even with a self-compiled
> charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic. The error is pasted
> below:
>
> [0] wc[0] status 9 wc[i].opcode 0
>
> mlx5: login-hive1.pace.gatech.edu: got completion with error:
>
> 00000000 00000000 00000000 00000000
>
> 00000000 00000000 00000000 00000000
>
> 00000001 00000000 00000000 00000000
>
> 00000000 00008a12 0a001e80 0036b1d2
>
> ------------- Processor 0 Exiting: Called CmiAbort ------------
>
> Reason: Work completion error in sendCq
>
> [0] Stack Traceback:
>
> [0:0] [0x6176e3]
>
> [0:1] [0x617736]
>
> [0:2] [0x613a78]
>
> [0:3] [0x61383e]
>
> [0:4] [0x61c881]
>
> [0:5] [0x61ead9]
>
> [0:6] [0x61315f]
>
> [0:7] [0x617857]
>
> [0:8] [0x625f28]
>
> [0:9] [0x626d93]
>
> [0:10] [0x621671]
>
> [0:11] [0x621ac9]
>
> [0:12] [0x6219a0]
>
> [0:13] [0x6174b6]
>
> [0:14] [0x617337]
>
> [0:15] [0x4e2a6b]
>
> [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
>
> [0:17] [0x408ba9]
>
> Our cluster uses Mellanox InfiniBand and RHEL 7, if that information
> helps. Any help will be appreciated!
>
> Thank you!
>
> Best,
>
> Andrew Pang
>
>
>

-- 
*JULIO MAIA*
*Research Programmer*
Beckman Institute for Advanced Science and Technology
Vice Chancellor Research Institutes
University of Illinois at Urbana-Champaign
405 N. Mathews Avenue | M/C 251
Urbana, IL 61801
217-244-1928 | jmaia_at_ks.uiuc.edu
beckman.illinois.edu
*Under the Illinois Freedom of Information Act any written communication to
or from university employees regarding university business is a public
record and may be subject to public disclosure. *

This archive was generated by hypermail 2.1.6 : Sat Dec 07 2019 - 23:20:52 CST