Try 'unset CUDA_VISIBLE_DEVICES' - if you have that set and are only
exposing a single GPU, that would lead to this error.
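For example, a quick check from a bash shell (adjust the device indices to
whichever of your 4 GPUs are actually idle):

  echo $CUDA_VISIBLE_DEVICES     # see whether the variable is set at all
  unset CUDA_VISIBLE_DEVICES     # expose all GPUs to pmemd.cuda / pmemd.cuda.MPI
  # or restrict the run to two specific idle devices, e.g. 2 and 3:
  export CUDA_VISIBLE_DEVICES=2,3
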
All the best
Ross
On 8/7/13 12:51 PM, "Victor Ma" <victordsmagift.gmail.com> wrote:
>Thanks. I ran ps and there are just 2 pmemd.cuda jobs running. "nvidia-smi"
>also indicates that only 2 GPUs are occupied. Anyway, I am running these
>jobs in serial. Like you mentioned above, parallel scalability is currently
>limited, so there is not much advantage to running in parallel.
>
>Thanks again.
>
>Victor
>
>
>On Wed, Aug 7, 2013 at 2:44 PM, Jason Swails <jason.swails.gmail.com>
>wrote:
>
>> On Wed, Aug 7, 2013 at 3:37 PM, Victor Ma <victordsmagift.gmail.com>
>> wrote:
>>
>> > Thanks for both replies. The simulation does run fine in serial. I
>> > checked em1.out. It says "CUDA (GPU): Minimization is NOT supported in
>> > parallel on GPUs." The message is pretty clear. I then tested the
>> > parallel calculation with a production run. The command is
>> > mpirun --machinefile=nodefile -np 2 pmemd.cuda.MPI -O -i prod.in -o
>> > prod.out -c md3.rst -p complex_wat.prm -r prod.rst -x prod.crd -ref
>> > md3.rst &
>> >
>> > In the nodefile, I put
>> > localhost:2
>> > (My machine has 4 GPUs and 24 CPUs.)
>> >
>> > This time, the error message is "cudaMemcpyToSymbol: SetSim copy to
>> > cSim failed all CUDA-capable devices are busy or unavailable". But I
>> > do have 2 idle GPUs in the machine. Any idea?
>> >
>>
>> Maybe there is a rogue process running that is 'occupying' the GPU...?
>>
>> You could always check the output of "nvidia-smi" to see, although I
>> don't know that it would necessarily tell you if you had a process
>> claiming to need the GPU.  If you only run pmemd.cuda on the GPUs, you
>> could use the "ps" utility to look for pmemd.cuda jobs (or CUDA namd,
>> gromacs, etc.).
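>>
>> For example, something along these lines (just a rough sketch; any grep
>> pattern that matches your GPU codes will do):
>>
>>   nvidia-smi                     # list the GPUs and any processes on them
>>   ps aux | grep -i pmemd.cuda    # look for running pmemd.cuda jobs
>>   ps aux | grep -iE 'namd|gmx'   # or other CUDA-enabled MD engines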
>>
>> HTH,
>> Jason
>>
>>
>> >
>> > Thanks.
>> >
>> > Victor
>> >
>> >
>> > On Wed, Aug 7, 2013 at 1:50 PM, David A Case <case.biomaps.rutgers.edu>
>> > wrote:
>> >
>> > > On Wed, Aug 07, 2013, Victor Ma wrote:
>> > > >
>> > > > I have amber12 and openmpi installed and configured on a 4 GPU
>> > > > machine. I'd like to run a multi-GPU amber simulation. Here is the
>> > > > command I used:
>> > > > mpirun --machinefile=nodefile -np 2 pmemd.cuda.MPI -O -i em1.in -o
>> > > > em1.out -c complex_wat.inpcrd -p complex_wat.prm -r em1.rst -ref
>> > > > complex_wat.inpcrd &
>> > > >
>> > > > And the error message I got is,
>> > > > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>> > > >
>> > > > Does that mean my openmpi is not properly configured?
>> > >
>> > > Could be anything.  First, run the test suite to see if you have a
>> > > generic problem (e.g. with openmpi configuration).
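>> > > For example, a rough sketch (the exact make targets may vary between
>> > > AMBER versions; check $AMBERHOME/test/Makefile):
>> > >
>> > >   export DO_PARALLEL="mpirun -np 2"
>> > >   cd $AMBERHOME/test && make test.parallel
>> > >
>> > > If the plain MPI tests already fail, the problem is in the MPI setup
>> > > itself rather than in the GPU code.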
>> > >
>> > > Second, look at the output files, especially em1.out.  Most likely,
>> > > there is an error message there.  The MPI_Abort message just informs
>> > > you that the process failed.  You have to look at the actual outputs
>> > > to find out why.
>> > >
>> > > ...dac
>> > >
>> > >
>>
>>
>>
>> --
>> Jason M. Swails
>> BioMaPS,
>> Rutgers University
>> Postdoctoral Researcher
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Aug 07 2013 - 13:30:02 PDT