[AMBER] CUDA errors on a 700k system

From: Dmitry Suplatov <genesup.gmail.com>
Date: Sun, 14 Jul 2019 21:00:51 +0300

Dear Amber Users,

I run a classical NVT simulation of a 700k-atom system on Tesla P100s. I run
10 MD replicas of the same system on different nodes.

All MDs generally run for 80-100 ns of production (i.e., after EM, EQ, etc.),
and then I run into problems.

When running on *two GPUs* in *peer-to-peer* mode (single node = 2 cards)
I get the following error:
gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was
encountered

When running on *one GPU* I get the following error:
Error: an illegal memory access was encountered launching kernel kNLSkinTest
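
For reference, here is roughly how I launch the runs in each mode (the file
names below are placeholders, not my exact files):

  # one GPU
  export CUDA_VISIBLE_DEVICES=0
  pmemd.cuda -O -i prod.in -p system.prmtop -c prev.rst \
      -o prod.out -r prod.rst -x prod.nc

  # two GPUs in peer-to-peer mode on one node
  export CUDA_VISIBLE_DEVICES=0,1
  mpirun -np 2 pmemd.cuda.MPI -O -i prod.in -p system.prmtop -c prev.rst \
      -o prod.out -r prod.rst -x prod.nc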

When I add *vlimit=20,* to my input file, *some MDs run normally on one
GPU* while others hit the same error. Nothing changes in peer-to-peer mode.

When I set *vlimit=10,* in my input file, *all MDs run normally on one
GPU*. Nothing changes in peer-to-peer mode.
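
For context, here is a minimal sketch of the kind of &cntrl block I use for
the NVT production, with the vlimit line added (the other values shown are
representative placeholders, not my exact settings):

  &cntrl
    imin=0, irest=1, ntx=5,            ! restart of production MD
    nstlim=500000, dt=0.002,           ! 1 ns per segment at a 2 fs step
    ntt=3, gamma_ln=2.0, temp0=300.0,  ! Langevin thermostat (NVT)
    ntb=1, cut=8.0,                    ! constant volume, 8 A cutoff
    ntc=2, ntf=2,                      ! SHAKE on bonds involving H
    ntpr=5000, ntwx=5000, ntwr=50000,
    vlimit=10,                         ! the option in question
  /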

My QUESTIONS are:

1/ What does the "vlimit" option do? I have searched the Amber mailing list
archives but cannot find its meaning.

2/ Does setting vlimit affect the performance or the biological output of my
simulations?

3/ What would you suggest?

Thank you,
Dmitry
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Jul 14 2019 - 11:30:03 PDT