Re: [AMBER] pmemd cuda error

From: Kshatresh Dutta Dubey <kshatresh.gmail.com>
Date: Mon, 22 Sep 2014 15:51:33 +0300

Dear Prof Ross,

Thanks for your reply, and sorry for the late response. I am using 2 GPUs in
parallel, which is why I used mpirun. I found that the problem was in the
system minimization; the other jobs run fine without any problem.

Thank you again for your reply.
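
For anyone following the thread: the question quoted below about whether the
mdout file says Peer to Peer is enabled can be checked with a short script.
This is only a minimal sketch. It assumes the mdout file contains a line of
the form "Peer to Peer support: ENABLED"; that wording, the function name and
the sample text are assumptions for illustration, not output from the run
discussed here.

```python
import re

# Assumed wording of the mdout line; adjust the pattern if your Amber
# version prints the peer-to-peer status differently.
P2P_PATTERN = re.compile(r"Peer to Peer support:\s*(\w+)", re.IGNORECASE)


def peer_to_peer_status(mdout_text):
    """Return the reported status in upper case, or None if no such line exists."""
    match = P2P_PATTERN.search(mdout_text)
    return match.group(1).upper() if match else None


# Hypothetical excerpt, for illustration only.
sample = (
    "|---------------- GPU PEER TO PEER INFO -----------------\n"
    "|\n"
    "|   Peer to Peer support: ENABLED\n"
    "|\n"
)

print(peer_to_peer_status(sample))
```

To use it on a real run, read the mdout file into a string and pass that to
peer_to_peer_status. A result of None means no matching line was found, which
may simply mean the assumed wording does not match your output.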

On Thu, Sep 18, 2014 at 8:28 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Kshatresh,
>
> There is unfortunately not enough information in this message to be able
> to understand what is going on here, although an illegal memory access is
> not a common error.
>
> First, why are you using mpirun to run these calculations? Are you
> running on 2 GPUs (and does the mdout file say Peer to Peer is enabled?), or
> are you running on more than 2 GPUs, or on 2 GPUs that do not support peer
> to peer? In the latter two cases you are probably actually running
> slower than if you just used one GPU at a time. Non-peer to peer parallel
> is also so unlikely to give a performance improvement, due to the
> slowness of the CPU chipset (and even worse a node to node interconnect),
> that it is not heavily tested anymore. So this could be your problem,
> although I am guessing here...
>
> It could also be that one of your GPUs is faulty. What type of GPUs are
> they and who built, burnt in and validated this system?
>
> I would suggest validating the GPUs with this:
> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>
> If all that works then take a more careful look at your simulation itself.
> Maybe it blew up, maybe the parameters somewhere are inappropriate, maybe
> the structure is bad - e.g. something sticking through a ring system etc.
>
> It will likely take some more debugging to figure out what is going on
> here.
>
> All the best
> Ross
>
>
> On 9/18/14, 4:47 AM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com> wrote:
>
> >Dear Users,
> >
> > I am using Amber GPU 14 for my simulations. Until some time ago I was
> >successfully running several jobs on the same machine, but now I am
> >getting an error like:
> >
> >Error: an illegal memory access was encountered launching kernel
> >kAddAccumulators
> >cudaIpcCloseMemHandle failed on gpu->pbPeerAccumulator an illegal memory
> >access was encountered
> >-------------------------------------------------------
> >Primary job terminated normally, but 1 process returned
> >a non-zero exit code.. Per user-direction, the job has been aborted.
> >-------------------------------------------------------
> >--------------------------------------------------------------------------
> >mpirun detected that one or more processes exited with non-zero status,
> >thus causing
> >the job to be terminated. The first process to do so was:
> >
> > Process name: [[5660,1],0]
> > Exit code: 255
> >--------------------------------------------------------------------------
> >
> >Please help me to solve the issue.
> >--
> >With best regards
> >**************************************************************************
> >**********************
> >Dr. Kshatresh Dutta Dubey
> >Post Doctoral Researcher,
> >c/o Prof Sason Shaik,
> >Hebrew University of Jerusalem, Israel
> >Jerusalem, Israel
> >_______________________________________________
> >AMBER mailing list
> >AMBER.ambermd.org
> >http://lists.ambermd.org/mailman/listinfo/amber



-- 
With best regards
************************************************************************************************
Dr. Kshatresh Dutta Dubey
Post Doctoral Researcher,
c/o Prof Sason Shaik,
Hebrew University of Jerusalem, Israel
Jerusalem, Israel
Received on Mon Sep 22 2014 - 06:00:05 PDT