Re: [AMBER] pmemd cuda error

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 18 Sep 2014 10:28:24 -0700

Hi Kshatresh,

Unfortunately there is not enough information in this message to
understand what is going on here, although an illegal memory access is
not a common error.

First, why are you using mpirun to run these calculations? Are you
running on 2 GPUs (and does the mdout file say Peer to Peer is enabled?),
or are you running on more than 2 GPUs, or on 2 GPUs that do not support
peer to peer? In the latter two cases you are probably running slower
than if you just used one GPU at a time. Non-peer-to-peer parallel is
also so unlikely to give a performance improvement, due to the slowness
of the CPU chipset (and, even worse, a node-to-node interconnect), that
it is not heavily tested anymore. So this could be your problem,
although I am guessing here...
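
As a rough sketch of the mdout check mentioned above, the following
Python helper scans the text of an mdout file for the peer to peer
status line. It is not part of AMBER. The helper name
peer_to_peer_status is hypothetical, and the exact wording of the line
("Peer to Peer support: ENABLED") is an assumption that may differ
between versions, so adjust the pattern to match your own output.

```python
import re
from typing import Optional

# Assumed wording of the status line in mdout; change it if your
# version of pmemd.cuda prints something different.
_P2P_LINE = re.compile(r"Peer to Peer support:\s*(\w+)", re.IGNORECASE)


def peer_to_peer_status(mdout_text: str) -> Optional[str]:
    """Return the reported status in upper case, or None if no such line exists."""
    match = _P2P_LINE.search(mdout_text)
    return match.group(1).upper() if match else None


# Small inline sample standing in for the contents of a real mdout file.
sample = "|   Peer to Peer support: ENABLED\n"
print(peer_to_peer_status(sample))
```

A return value of None means the line was not found at all, which would
be expected for a single-GPU run.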

It could also be that one of your GPUs is faulty. What type of GPUs are
they and who built, burnt in and validated this system?

I would suggest validating the GPUs with this:
https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
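
Separately from that test suite, one simple way to look for a
misbehaving card is to run the same short job on each GPU by itself,
without mpirun. The sketch below only builds the environment and command
line for such a run. The helper name single_gpu_job and the file names
(md.in, md.prmtop, md.inpcrd) are hypothetical placeholders.
CUDA_VISIBLE_DEVICES and the pmemd.cuda flags shown are the standard
ones.

```python
import os
from typing import Dict, List, Tuple


def single_gpu_job(gpu_id: int, prefix: str = "md") -> Tuple[Dict[str, str], List[str]]:
    """Build the environment and argument list for a one-GPU pmemd.cuda run."""
    env = dict(os.environ)
    # Expose only the chosen GPU to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    out = f"{prefix}_gpu{gpu_id}"  # separate output names per GPU
    argv = [
        "pmemd.cuda", "-O",
        "-i", f"{prefix}.in",       # hypothetical input file names
        "-p", f"{prefix}.prmtop",
        "-c", f"{prefix}.inpcrd",
        "-o", f"{out}.out",
        "-r", f"{out}.rst",
        "-x", f"{out}.nc",
    ]
    return env, argv


# To launch the job: subprocess.run(argv, env=env, check=True)
env, argv = single_gpu_job(0)
print(" ".join(argv))
```

If the job fails on one GPU but runs cleanly on the other, that points
towards hardware rather than the simulation.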

If all that works then take a more careful look at your simulation itself.
Maybe it blew up, maybe the parameters somewhere are inappropriate, maybe
the structure is bad - e.g. something sticking through a ring system etc.
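
A quick first check for a simulation that blew up is to scan the mdout
file for energy fields that have become NaN or overflowed into
asterisks. The sketch below is only a heuristic. The helper name
find_bad_energy_lines is hypothetical, and the assumption that bad
values appear directly after an "=" sign should be checked against your
own output.

```python
import re
from typing import List, Tuple

# A field value that is NaN, or a run of asterisks from a numeric overflow.
_BAD_VALUE = re.compile(r"=\s*(NaN\b|\*{3,})", re.IGNORECASE)


def find_bad_energy_lines(mdout_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs whose fields look like NaN or overflow."""
    bad = []
    for number, line in enumerate(mdout_text.splitlines(), start=1):
        if _BAD_VALUE.search(line):
            bad.append((number, line.strip()))
    return bad


# Inline sample standing in for part of a real mdout file.
sample = " Etot   =  -1234.5  EKtot   =  200.0\n EPtot  =  NaN\n"
for number, line in find_bad_energy_lines(sample):
    print(number, line)
```

The first step at which such a line appears is a reasonable place to
start looking at the structure and parameters.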

It will likely take some more debugging to figure out what is going on
here.

All the best
Ross


On 9/18/14, 4:47 AM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com> wrote:

>Dear Users,
>
> I am using Amber GPU 14 for my simulations. Some time ago I was
>successfully running several jobs on the same machine, but now I am
>getting an error like:
>
>Error: an illegal memory access was encountered launching kernel
>kAddAccumulators
>cudaIpcCloseMemHandle failed on gpu->pbPeerAccumulator an illegal memory
>access was encountered
>-------------------------------------------------------
>Primary job terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>-------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun detected that one or more processes exited with non-zero status,
>thus causing
>the job to be terminated. The first process to do so was:
>
> Process name: [[5660,1],0]
> Exit code: 255
>--------------------------------------------------------------------------
>
>Please help me to solve the issue.
>--
>With best regards
>**************************************************************************
>**********************
>Dr. Kshatresh Dutta Dubey
>Post Doctoral Researcher,
>c/o Prof Sason Shaik,
>Hebrew University of Jerusalem, Israel
>Jerusalem, Israel
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



Received on Thu Sep 18 2014 - 10:30:04 PDT