Re: [AMBER] Query about parallel GPU multijob

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 23 Jun 2014 14:15:11 -0700

Hi Kshatresh,

Well, there are definitely 4 GPUs in the node you are showing here. Two of
them are on one IOH controller connected to one CPU (devices 0 and 1) and
two are on the other controller connected to the other CPU (devices 2 and
3), but they are most definitely all in the same physical node. If you have
two physical nodes then you have 8 GPUs, not 4.

I will assume for now that you have one node with 2 CPUs and 4 GPUs (2 per
CPU).
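
A quick way to confirm that (a minimal sketch, assuming you can log in to
each machine) is:

hostname        # shows which physical node you are on
nvidia-smi -L   # lists every GPU visible on that node

If a single hostname lists four Tesla K40m entries then it really is one
node with 4 GPUs.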

In this case, if you want to run two calculations, each on 2 GPUs, then
given the output from gpuP2PCheck you should run as follows:

cd run1
export CUDA_VISIBLE_DEVICES=0,1
nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &

cd ../run2
export CUDA_VISIBLE_DEVICES=2,3
nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &

Check the mdout file to make sure it says peer to peer is enabled (it
should) and you should be golden.
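
A quick way to check this (a sketch; the exact wording in mdout may vary a
little between AMBER versions) is:

grep -i "peer to peer" run1/mdout run2/mdout

If you gave the runs output names other than the default mdout, adjust the
paths accordingly.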


Note that if you ran the first job with CUDA_VISIBLE_DEVICES unset then 'I
think' it will be running on GPUs 0 and 1. You can check this by running
nvidia-smi and looking at the GPU utilization %. In that case you are fine
to just run the second job, making sure you set CUDA_VISIBLE_DEVICES=2,3.
If you don't, it will start oversubscribing the GPUs, which will destroy
performance. The same goes for single-GPU runs - you should always specify
which GPU you want to use with CUDA_VISIBLE_DEVICES. The original approach
(or setting process-exclusive mode, nvidia-smi -c 3), where pmemd could
detect whether GPUs were in use or not, doesn't work if you want to be able
to run peer-to-peer parallel runs, since they require the GPUs to be set to
default compute mode (nvidia-smi -c 0).
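
For example (a sketch, assuming a reasonably recent nvidia-smi; changing the
compute mode usually needs root):

# per-GPU utilization and current compute mode
nvidia-smi --query-gpu=index,utilization.gpu,compute_mode --format=csv
# put every GPU back into default (shared) compute mode for P2P runs
sudo nvidia-smi -c 0

GPUs sitting near 0% utilization are free; a GPU already near 100% is most
likely running the first job.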

Hope that helps.

All the best
Ross


On 6/23/14, 1:54 PM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com> wrote:

>Hi Prof Ross,
>
>  I am sure it has 2 nodes, with 2 GPUs per node; I don't know why the
>output is showing like this. The output of lspci is:
>02:00.0 3D controller: NVIDIA Corporation Device 1023 (rev a1)
>03:00.0 3D controller: NVIDIA Corporation Device 1023 (rev a1)
>.......
>.......
>83:00.0 3D controller: NVIDIA Corporation Device 1023 (rev a1)
>84:00.0 3D controller: NVIDIA Corporation Device 1023 (rev a1)
>
>Each node has 2 Intel i7 processors and 2 GPUs.
>
>
>Regards
>Kshatresh
>
>
>On Mon, Jun 23, 2014 at 11:45 PM, Ross Walker <ross.rosswalker.co.uk>
>wrote:
>
>> This means you have 4 (four) GPUs in 1 (one) node. But your initial
>>email
>> said:
>>
>> "I have 2 nodes x 2GPU ( each node has 2 GPU) Tesla K 40 machine."
>>
>> You should first confirm exactly what hardware you have and how it is
>> configured. Then I can spend the time to help you run correctly on that
>> hardware configuration.
>>
>>
>> On 6/23/14, 1:36 PM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com>
>>wrote:
>>
>> >Hi Prof Ross,
>> >
>> > I did the above and the following is the output.
>> >CUDA_VISIBLE_DEVICES is unset.
>> >CUDA-capable device count: 4
>> > GPU0 " Tesla K40m"
>> > GPU1 " Tesla K40m"
>> > GPU2 " Tesla K40m"
>> > GPU3 " Tesla K40m"
>> >
>> >Two way peer access between:
>> > GPU0 and GPU1: YES
>> > GPU0 and GPU2: NO
>> > GPU0 and GPU3: NO
>> > GPU1 and GPU2: NO
>> > GPU1 and GPU3: NO
>> > GPU2 and GPU3: YES
>> >
>> >Does it mean I can simply submit the job with nohup
>> >$AMBERHOME/.../pmemd.cuda.MPI and it will automatically take the other
>> >free node (since one parallel job is already running)?
>> >
>> >Thanks and regards
>> >Kshatresh
>> >
>> >
>> >
>> >
>> >
>> >On Mon, Jun 23, 2014 at 11:25 PM, Kshatresh Dutta Dubey
>> ><kshatresh.gmail.com
>> >> wrote:
>> >
>> >> Thank you Dr. Ross, I am using Amber 14. I have one more query:
>> >> since I have already submitted one parallel job on 2 GPUs and it is
>> >> running fine, I want to utilize the other node for a parallel run. Is
>> >> there any way to tell whether the running job is using node 1 or node 2?
>> >>
>> >> Thank you once again.
>> >>
>> >> Best Regards
>> >> Kshatresh
>> >>
>> >>
>> >> On Mon, Jun 23, 2014 at 11:04 PM, Ross Walker <ross.rosswalker.co.uk>
>> >> wrote:
>> >>
>> >>> Hi Kshatresh,
>> >>>
>> >>> Are you using AMBER 12 or AMBER 14?
>> >>>
>> >>> If it is AMBER 12 you have little or no hope of seeing much speedup
>>on
>> >>> multiple GPUs with K40s. I'd stick to running 4 x 1 GPU.
>> >>>
>> >>> If it is AMBER 14 then you should first check if your GPUs in each
>>node
>> >>> are connected to the same processor and can communicate by peer to
>> >>>peer. I
>> >>> will update the website instructions shortly to explain this but in
>>the
>> >>> meantime you can download the following:
>> >>>
>> >>> https://dl.dropboxusercontent.com/u/708185/check_p2p.tar.bz2
>> >>>
>> >>> untar it, then cd to the directory and run make. Then run
>> >>>./gpuP2PCheck.
>> >>> It should give you something like:
>> >>>
>> >>> CUDA_VISIBLE_DEVICES is unset.
>> >>> CUDA-capable device count: 2
>> >>> GPU0 "Tesla K40"
>> >>> GPU1 "Tesla K40"
>> >>>
>> >>> Two way peer access between:
>> >>> GPU0 and GPU1: YES
>> >>>
>> >>> You need it to say YES here. If it says NO you will need to
>> >>> reorganize which PCI-E slots your GPUs are in so that they are on the
>> >>> same CPU socket; otherwise you will be stuck running single GPU runs.
>> >>>
>> >>> If it says YES then you are good to go. Just login to the first node
>> >>>and
>> >>> do:
>> >>>
>> >>> unset CUDA_VISIBLE_DEVICES
>> >>> nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &
>> >>>
>> >>> Logout and repeat the same on the other node. You want the two MPI
>> >>> processes to run on the same node. The GPUs will automagically be
>> >>> selected.
>> >>>
>> >>> If you are using a queuing system you'll need to check the manual
>>for
>> >>>your
>> >>> specific queuing system but typically this would be something like:
>> >>>
>> >>> #PBS -l nodes=1:ppn=2
>> >>>
>> >>> Which would make sure each of your two jobs gets allocated to its
>> >>> own node. There is no point trying to span nodes these days; infiniband
>> >>>just
>> >>> isn't fast enough to keep up with modern GPUs and AMBER's
>>superdooper
>> >>>GPU
>> >>> breaking lightning speed execution mode(TM).
>> >>>
>> >>> Hope that helps.
>> >>>
>> >>> All the best
>> >>> Ross
>> >>>
>> >>>
>> >>>
>> >>> On 6/23/14, 12:43 PM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com>
>> >>> wrote:
>> >>>
>> >>> >Dear Users,
>> >>> >
>> >>> > I have a 2 nodes x 2 GPUs (each node has 2 GPUs) Tesla K40
>> >>> >machine. I want to run 2 parallel jobs (on the 2 GPUs of each node).
>> >>> >I followed http://ambermd.org/gpus/ but am still unable to understand
>> >>> >how to submit the jobs. The link describes running a single job on
>> >>> >four GPUs, or 4 jobs with one on each GPU, but there is no
>> >>> >information about 2 parallel jobs on 2 nodes. The following is the
>> >>> >output of deviceQuery:
>> >>> >Device 0: "Tesla K40m"
>> >>> >Device 1: "Tesla K40m"
>> >>> >Device 2: "Tesla K40m"
>> >>> >Device 3: "Tesla K40m"
>> >>> >
>> >>> > I will be thankful for any suggestions.
>> >>> >
>> >>> >Regards
>> >>> >Kshatresh
>
>
>--
>With best regards
>*************************************************************************
>Dr. Kshatresh Dutta Dubey
>Post Doctoral Researcher,
>c/o Prof Sason Shaik,
>Hebrew University of Jerusalem, Israel
>Jerusalem, Israel



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Jun 23 2014 - 14:30:03 PDT