Re: [AMBER] Query about parallel GPU multijob

From: Kshatresh Dutta Dubey <kshatresh.gmail.com>
Date: Mon, 23 Jun 2014 23:44:14 +0300

Hi Prof Ross,

   Thank you again for your reply. I am submitting the jobs directly by
logging in through ssh, not through a PBS queuing system.

Regards
Kshatresh


On Mon, Jun 23, 2014 at 11:37 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Kshatresh,
>
> It's hard to offer definitive guidance without knowing how you are
> submitting jobs. It sounds like you are not simply ssh'ing into the node
> and running the job, but submitting it through some queuing system,
> perhaps? If so, there should be a way to query your queuing system to see
> which node has been allocated.
>
> If you can ssh into the machines then:
>
> ssh nodeXXX uptime
>
> Will give you the load average - should be 2.0 for the machine running the
> 2 GPU job and 0.0 for the one that is idle.
>
> Alternatively you can do
>
> ssh nodeXXX nvidia-smi
>
>
> Which will show you the GPU state on the node and you will be able to see
> which is running jobs.
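The per-node `uptime` check above is easy to script across several nodes. The helper below (a sketch, not part of the original email) just parses the 1-minute load average out of an `uptime` line:

```shell
#!/bin/sh
# Sketch: extract the 1-minute load average from an `uptime` line, so
# `ssh nodeXXX uptime` can be checked in a loop over nodes.
load_from_uptime() {
    # Take everything after "load average: " and keep the first
    # comma-separated field, stripping any stray spaces.
    printf '%s\n' "$1" | awk -F'load average: ' '{print $2}' | cut -d, -f1 | tr -d ' '
}
```

Usage might look like `for n in node001 node002; do echo "$n: $(load_from_uptime "$(ssh "$n" uptime)")"; done` (node names are placeholders for your cluster); a value near 2.0 marks the node running the 2-GPU job.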
>
> All the best
> Ross
>
> On 6/23/14, 1:25 PM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com> wrote:
>
> >Thank you Dr. Ross, I am using Amber 14. I have one more query,
> >since
> >I have already submitted one parallel job on 2 GPUs and it is running
> >fine, and I want to utilize the other node for a parallel run. Is there
> >any way to tell whether the running job is using node 1 or node 2?
> >
> >Thank you once again.
> >
> >Best Regards
> >Kshatresh
> >
> >
> >On Mon, Jun 23, 2014 at 11:04 PM, Ross Walker <ross.rosswalker.co.uk>
> >wrote:
> >
> >> Hi Kshatresh,
> >>
> >> Are you using AMBER 12 or AMBER 14?
> >>
> >> If it is AMBER 12 you have little or no hope of seeing much speedup on
> >> multiple GPUs with K40s. I'd stick to running 4 x 1 GPU.
> >>
> >> If it is AMBER 14 then you should first check whether the GPUs in each
> >> node are connected to the same processor and can communicate via peer
> >> to peer (P2P). I will update the website instructions shortly to
> >> explain this, but in the meantime you can download the following:
> >>
> >> https://dl.dropboxusercontent.com/u/708185/check_p2p.tar.bz2
> >>
> >> untar it, then cd to the directory and run make. Then run ./gpuP2PCheck.
> >> It should give you something like:
> >>
> >> CUDA_VISIBLE_DEVICES is unset.
> >> CUDA-capable device count: 2
> >> GPU0 "Tesla K40"
> >> GPU1 "Tesla K40"
> >>
> >> Two way peer access between:
> >> GPU0 and GPU1: YES
> >>
> >> You need it to say YES here. If it says NO you will need to reorganize
> >> which PCI-E slots your GPUs are in so that they are on the same CPU
> >> socket; otherwise you will be stuck running single-GPU runs.
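As a rough complement to gpuP2PCheck (a sketch, not part of the original thread), one can inspect which CPU socket each GPU hangs off via sysfs NUMA information. The helper below only builds the sysfs path from a PCI bus id; the `nvidia-smi --query-gpu` flags mentioned in the usage note are an assumption about your driver version:

```shell
#!/bin/sh
# Sketch: map a PCI bus id (as reported by nvidia-smi) to its sysfs
# device path. sysfs uses lowercase hex, so normalize the input.
# Reading <path>/numa_node then shows which CPU socket the device is
# attached to; P2P requires both GPUs to share a socket.
pci_to_sysfs_path() {
    printf '/sys/bus/pci/devices/%s\n' "$(printf '%s' "$1" | tr 'A-F' 'a-f')"
}
```

A hypothetical check: `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader | while read -r b; do cat "$(pci_to_sysfs_path "$b")/numa_node"; done` — two GPUs reporting the same NUMA node are P2P candidates, but gpuP2PCheck remains the authoritative test.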
> >>
> >> If it says YES then you are good to go. Just login to the first node and
> >> do:
> >>
> >> unset CUDA_VISIBLE_DEVICES
> >> nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i ... &
> >>
> >> Logout and repeat the same on the other node. You want the two MPI
> >> processes to run on the same node. The GPUs will automagically be
> >>selected.
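The per-node launch above can be wrapped in a tiny helper (a sketch; the mdin file name is a placeholder) so the same command string can be handed to `ssh` once per node:

```shell
#!/bin/sh
# Sketch: compose the single-node launch command from the email as one
# string. $AMBERHOME is deliberately left unexpanded (single quotes) so
# it resolves on the remote node; the input file name is a placeholder.
launch_cmd() {
    printf 'unset CUDA_VISIBLE_DEVICES; nohup mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -i %s &' "$1"
}
```

Then something like `ssh node001 "$(launch_cmd md.in)"` followed by the same for node002 (node names hypothetical) starts one 2-GPU job per node.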
> >>
> >> If you are using a queuing system you'll need to check the manual for
> >>your
> >> specific queuing system but typically this would be something like:
> >>
> >> #PBS -l nodes=1:ppn=2
> >>
> >> Which would make sure each of your two jobs get allocated to their own
> >> node. There is no point trying to span nodes these days; InfiniBand
> >> just isn't fast enough to keep up with modern GPUs and AMBER's
> >> superdooper GPU-breaking lightning speed execution mode(TM).
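For Torque/PBS specifically, a minimal job script along these lines might do it (a sketch under assumed site defaults; resource-request syntax varies between PBS flavours, and the file names are placeholders, so check your site's manual):

```shell
#!/bin/sh
#PBS -N amber_gpu_job
#PBS -l nodes=1:ppn=2        # one node, two MPI tasks (Torque-style syntax)
#PBS -l walltime=24:00:00

cd "$PBS_O_WORKDIR"          # run from the submission directory
unset CUDA_VISIBLE_DEVICES   # let pmemd.cuda.MPI pick the GPUs
mpirun -np 2 "$AMBERHOME/bin/pmemd.cuda.MPI" -O -i md.in -o md.out
```

Submitting this script twice would give two single-node, 2-GPU jobs, each confined to its own node by the scheduler.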
> >>
> >> Hope that helps.
> >>
> >> All the best
> >> Ross
> >>
> >>
> >>
> >> On 6/23/14, 12:43 PM, "Kshatresh Dutta Dubey" <kshatresh.gmail.com>
> >>wrote:
> >>
> >> >Dear Users,
> >> >
> >> >    I have a 2-node x 2-GPU (each node has 2 GPUs) Tesla K40 machine.
> >> >I want to run 2 parallel jobs (one on the 2 GPUs of each node). I
> >> >followed http://ambermd.org/gpus/ but am still unable to understand
> >> >how to submit the jobs. The link describes running a single job on
> >> >four GPUs, or 4 jobs, one per GPU, but there is no information about 2
> >> >parallel jobs on 2 nodes. Following is the output of deviceQuery:
> >> >Device 0: "Tesla K40m"
> >> >Device 1: "Tesla K40m"
> >> >Device 2: "Tesla K40m"
> >> >Device 3: "Tesla K40m"
> >> >
> >> >    I will be thankful for any suggestions.
> >> >
> >> >Regards
> >> >Kshatresh
> >> >_______________________________________________
> >> >AMBER mailing list
> >> >AMBER.ambermd.org
> >> >http://lists.ambermd.org/mailman/listinfo/amber
> >>
> >>
> >>
> >
> >
> >
> >--
> >With best regards
> >************************************************************************************************
> >Dr. Kshatresh Dutta Dubey
> >Post Doctoral Researcher,
> >c/o Prof Sason Shaik,
> >Hebrew University of Jerusalem, Israel
> >Jerusalem, Israel
>
>
>
>



-- 
With best regards
************************************************************************************************
Dr. Kshatresh Dutta Dubey
Post Doctoral Researcher,
c/o Prof Sason Shaik,
Hebrew University of Jerusalem, Israel
Jerusalem, Israel
Received on Mon Jun 23 2014 - 14:00:04 PDT