On Wed, Aug 7, 2013 at 3:37 PM, Victor Ma <victordsmagift.gmail.com> wrote:
> Thanks for both replies. The simulation does run fine in serial. I checked
> em1.out. It says "CUDA (GPU): Minimization is NOT supported in parallel on
> GPUs. " The message is pretty clear. I then tested the parallel calculation
> with a production run. The command is
> mpirun --machinefile=nodefile -np 2 pmemd.cuda.MPI -O -i prod.in -o prod.out \
>   -c md3.rst -p complex_wat.prm -r prod.rst -x prod.crd -ref md3.rst &
>
> In the nodefile, I put
> localhost:2
> (My machine has 4 GPUs and 24 CPUs.)
>
> This time, the error message is "cudaMemcpyToSymbol: SetSim copy to cSim
> failed all CUDA-capable devices are busy or unavailable". But I do have 2
> idle GPUs in the machine. Any idea?
>
Maybe there is a rogue process 'occupying' the GPUs...?
You could check the output of "nvidia-smi" to see, although I'm not sure it
would necessarily show a process that has merely claimed the GPU without
actually using it yet. If you only run pmemd.cuda on the GPUs, you could use
the "ps" utility to look for lingering pmemd.cuda jobs (or CUDA NAMD,
GROMACS, etc.).
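Something like this, off the top of my head (adjust the process names to
whatever you actually run on those cards):

  # show GPU utilization, memory use, and any compute processes on the cards
  nvidia-smi

  # look for leftover GPU MD jobs by name
  ps aux | grep -E 'pmemd.cuda|namd|mdrun' | grep -v grep

If a crashed run left a process parked on one of the cards, killing it should
free the device.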
HTH,
Jason
>
> Thanks.
>
> Victor
>
>
> On Wed, Aug 7, 2013 at 1:50 PM, David A Case <case.biomaps.rutgers.edu
> >wrote:
>
> > On Wed, Aug 07, 2013, Victor Ma wrote:
> > >
> > > I have Amber12 and OpenMPI installed and configured on a 4-GPU machine.
> > > I'd like to run a multi-GPU Amber simulation. Here is the command I used:
> > > mpirun --machinefile=nodefile -np 2 pmemd.cuda.MPI -O -i em1.in -o em1.out \
> > >   -c complex_wat.inpcrd -p complex_wat.prm -r em1.rst -ref complex_wat.inpcrd &
> > >
> > > And the error message I got is:
> > > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> > >
> > > Does that mean my openmpi is not properly configured?
> >
> > Could be anything. First, run the test suite to see if you have a generic
> > problem (e.g. with the OpenMPI configuration).
> >
> > Second, look at the output files, especially em1.out. Most likely, there is
> > an error message there. The MPI_Abort message just informs you that the
> > process failed. You have to look at the actual outputs to find out why.
> >
> > ...dac
> >
> >
--
Jason M. Swails
BioMaPS, Rutgers University
Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber