Re: [AMBER] pmemd.cuda.MPI interruption error!

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 2 Feb 2015 16:31:49 -0800

This means there is something wrong with the system you are trying to simulate. The download from the GPU fails because the memory contains corrupted data - most likely NaNs in the forces. Chances are the system you are simulating is unstable due to atom clashes or other inappropriate structures or settings. I would suggest running this simulation on the CPU with ntpr and ntwx set fairly low so you can see if anything blows up (e.g. *'s for the VDW energies, atoms flying off, high temperatures etc.). If that looks good then try a single GPU rather than multiple GPUs, and then, once your system has equilibrated, if need be and if it really gives you a performance boost, you can switch back to running on multiple GPUs.
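
For example, a quick CPU check along these lines (a sketch only - the output file names and most of the &cntrl settings below are placeholders, so adjust them to match your own equil.in):

mpirun -np 8 $AMBERHOME/bin/pmemd.MPI -O -i equil_cpu.in -o equil_cpu.out \
    -p 3vnochaindandewat.prmtop -c 3vnochaindandeheat.rst \
    -r equil_cpu.rst -x equil_cpu.mdcrd -ref 3vnochaindandeheat.rst

with frequent output requested in equil_cpu.in, e.g.:

&cntrl
  imin=0, irest=1, ntx=5,        ! restart from the heating restart file
  nstlim=5000, dt=0.002,         ! short diagnostic run
  ntb=2, ntp=1,                  ! constant pressure (match your own settings)
  ntt=3, gamma_ln=2.0, temp0=300.0,
  ntc=2, ntf=2, cut=8.0,
  ntpr=10, ntwx=10,              ! write energies and coordinates every 10 steps
/

Then watch the mdout for *'s or runaway EPtot/TEMP values, and look at the first few trajectory frames to spot the offending atoms.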

First though, make sure your system runs stably and can be equilibrated on the CPU.

All the best
Ross

> On Feb 2, 2015, at 4:13 PM, 汪盛 <shengwang.hust.edu.cn> wrote:
>
> Dear all,
> Recently I have found that Amber 14 keeps getting interrupted while running jobs, as follows:
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i equil.in -o md.out -p 3vnochaindandewat.prmtop -c 3vnochaindandeheat.rst -r 3vnochaindandeequil.rst -x 3vnochaindandeequil.mdcrd -ref 3vnochaindandeheat.rst
> cudaMemcpy GpuBuffer::Download failed unspecified launch failure
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> Image PC Routine Line Source
> libintlc.so.5 00002AFD2F06D229 Unknown Unknown Unknown
> libintlc.so.5 00002AFD2F06BBA0 Unknown Unknown Unknown
> libifcore.so.5 00002AFD2DD3433F Unknown Unknown Unknown
> libifcore.so.5 00002AFD2DC9BD7F Unknown Unknown Unknown
> libifcore.so.5 00002AFD2DCACF4E Unknown Unknown Unknown
> libpthread.so.0 0000003BFF80F4A0 Unknown Unknown Unknown
> mca_btl_vader.so 00002AFD32CFF1A1 Unknown Unknown Unknown
> libopen-pal.so.6 00002AFD2F57612A Unknown Unknown Unknown
> libmpi.so.1 00002AFD2D7760A4 Unknown Unknown Unknown
> mca_coll_tuned.so 00002AFD386570D6 Unknown Unknown Unknown
> libmpi.so.1 00002AFD2D78C63E Unknown Unknown Unknown
> pmemd.cuda.MPI 00000000005F0631 Unknown Unknown Unknown
> pmemd.cuda.MPI 000000000049D9A9 Unknown Unknown Unknown
> pmemd.cuda.MPI 00000000004CD1C3 Unknown Unknown Unknown
> pmemd.cuda.MPI 0000000000521B54 Unknown Unknown Unknown
> pmemd.cuda.MPI 0000000000415246 Unknown Unknown Unknown
> libc.so.6 0000003BFF41ECDD Unknown Unknown Unknown
> pmemd.cuda.MPI 0000000000415139 Unknown Unknown Unknown
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[34608,1],0]
> Exit code: 255
>
> What do the above lines mean? Is this a bug in Amber 14?
>
> Thanks
>


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Feb 02 2015 - 17:00:03 PST