Re: [AMBER] GPUs job and GUI issue

From: Ross Walker <ross.rosswalker.co.uk>
Date: Sun, 9 Jul 2017 00:28:53 -0400

This is a very good set of suggestions, Pratul. I doubt very much that it is power related though, or anything physical with the GPU - that would result in nvidia-smi reporting the GPU as lost rather than just the job being killed. My guess would be that it is some incompatibility with Schrodinger's software - it may not like sharing GPUs - or, more likely, that it grabs all the memory on the GPU and causes the AMBER run to crash. Hirdesh, you really need to capture stderr from the AMBER pmemd runs, since this would provide a hint as to why the job is dying. When you run pmemd.cuda, run it as follows:

nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout ........... >& gpu_xx.log &

You'll then be able to look in gpu_xx.log for any error messages when the job dies.
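If a run does get killed, something along these lines should pull out any CUDA or memory allocation errors from that captured log (just a sketch; gpu_xx.log is the same placeholder name as above):

grep -iE 'error|cuda|alloc' gpu_xx.log | tail -20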

As Pratul suggests, also look at reproducing rather than predicting. Does this only happen when you run something from the Schrodinger suite? Or does it only happen when you run a GPU-heavy program, such as Maestro but also things like VMD? Or does it happen when you do nothing more than log in and run Firefox?

It would be helpful to try to tie down the problem. My bet, though, is on Maestro (or some other similar program) grabbing all the GPU memory and causing pmemd to crash with a GPU memory allocation failure.
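One low-tech way to catch it in the act (just a sketch, nothing AMBER-specific) is to leave nvidia-smi running in a loop in another terminal while you start Maestro, and watch whether the memory use on the busy cards jumps or one of the pmemd.cuda processes disappears:

watch -n 5 nvidia-smi

or, to keep a record, something like "nvidia-smi -l 5 > smi_watch.log 2>&1 &" (the log name is arbitrary).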

All the best
Ross

> On Jul 7, 2017, at 20:02, Pratul K. Agarwal <agarwalpk.ornl.gov> wrote:
>
> Hirdesh,
>
> This seems like a highly unusual situation. Here are a few pointers to help
> you debug.
>
> 1. Instead of predicting, try reproducing. Do you remember the last
> application (and which feature of it) you started before the jobs got
> killed? Try doing that again. Look at the system log and Xorg.log for
> hints. Are you capturing the standard error from the AMBER jobs? Maybe
> that will give you a hint. Memory shouldn't be an issue, as your jobs
> are not using much of it.
>
> 2. What I find intriguing is that you have the X server running on the 1st
> card (GPU 0) and an application on it kills jobs on the other cards. This
> shouldn't happen unless it is a driver issue. You have a fairly new
> driver, but it doesn't hurt to update to the latest (375.66). (PS: Are you
> running 4 separate jobs or 1 MPI job using 4 GPU cards? It does look
> like you have 4 separate jobs, but please confirm.)
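>
> If they are 4 separate jobs, one way to make sure each run stays on a known
> card (just a sketch; mdin/mdout and the log names are placeholders) is to
> pin each run with CUDA_VISIBLE_DEVICES, e.g.:
>
> CUDA_VISIBLE_DEVICES=0 nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout ... >& gpu_0.log &
> CUDA_VISIBLE_DEVICES=1 nohup $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout ... >& gpu_1.log &
>
> and likewise for GPUs 2 and 3, so each pmemd.cuda run is tied to one card.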
>
> 3. Exxact does a pretty good job with their systems, but did you change
> the power supply or the power cabling to the GPUs? If one GPU is not
> getting enough power (or one power rail is being overloaded), this could
> cause problems.
>
> 4. Lastly, as Ross mentioned, this could be a driver/BIOS/Xorg setting
> issue. Have the cards been set to exclusive mode? Or did you try to
> configure Xorg for multiple monitors?
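>
> (To check, something like "nvidia-smi -q -d COMPUTE" will report the compute
> mode per card, and "sudo nvidia-smi -c DEFAULT" should put them all back in
> the shared/Default mode; just a sketch, adjust as needed.)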
>
> Pratul
>
> Pratul K. Agarwal, Ph.D.
> (Editorial Board Member: PLoS ONE, Microbial Cell Factories)
> Computational Biology Institute &
> Computer Science and Engineering Division
> Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
> Phone: +1-865-574-7184
> Web: http://www.agarwal-lab.org/
>
> On 7/7/2017 5:14 PM, Hirdesh Kumar wrote:
>> Thanks Ross and Pratul for your responses.
>>
>> GUI means whenever I use this system interactively: for browsing the
>> internet with Mozilla, PyMOL, Maestro (Schrodinger), a gene editor
>> (SnapGene Viewer), etc.
>>
>> Four days ago, I submitted 4 GPU jobs from the command line and never
>> logged in to this system in display mode. All went fine.
>>
>> But today I logged in to this system and started Schrodinger, and the jobs
>> got killed (3 out of 4). This has happened several times in the past. The
>> job killing is random and I cannot predict it: sometimes the job on 1 GPU
>> gets killed, other times the jobs on more than 1 GPU.
>>
>>
>> I just restarted my 4 pmemd jobs on the 4 GPUs, and here is the output of
>> nvidia-smi:
>>
>>
>>
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
>> | 72%   84C    P2    92W / 180W |   1103MiB /  8112MiB |     99%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
>> | 66%   82C    P2   116W / 180W |    549MiB /  8114MiB |     99%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   2  GeForce GTX 1080    Off  | 0000:82:00.0     Off |                  N/A |
>> | 65%   82C    P2   131W / 180W |    549MiB /  8114MiB |     99%      Default |
>> +-------------------------------+----------------------+----------------------+
>> |   3  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
>> | 66%   82C    P2   122W / 180W |    541MiB /  8114MiB |     99%      Default |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                               Usage      |
>> |=============================================================================|
>> |    0      1775     G   /usr/lib/xorg/Xorg                            141MiB |
>> |    0     28072     G   /usr/lib/xorg/Xorg                            234MiB |
>> |    0     28659     G   compiz                                        102MiB |
>> |    0     29171     G   /usr/lib/firefox/firefox                        2MiB |
>> |    0     37664     C   pmemd.cuda                                    545MiB |
>> |    1     37826     C   pmemd.cuda                                    545MiB |
>> |    2     38073     C   pmemd.cuda                                    545MiB |
>> |    3     37949     C   pmemd.cuda                                    537MiB |
>> +-----------------------------------------------------------------------------+
>>
>>
>>
>> On Fri, Jul 7, 2017 at 1:28 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>>
>>> Hi Hirdesh,
>>>
>>> What does nvidia-smi report? It may be that the cards are set to exclusive
>>> mode and that is killing the jobs, although this shouldn't normally happen.
>>> What do you mean by GUI? Just an X Windows login, or something else?
>>>
>>> This is the first time I've heard of this issue, so it might take some
>>> debugging to figure it out. Are your jobs using almost all of the GPU
>>> memory? It's possible you are running them out of memory. Does AMBER give
>>> you any error messages, either to stdout or to nohup.out if you are
>>> nohupping the jobs?
>>>
>>> All the best
>>> Ross
>>>
>>>> On Jul 7, 2017, at 1:28 PM, Hirdesh Kumar <hirdesh.iitd.gmail.com> wrote:
>>>> Hi All,
>>>>
>>>> I am using my Exxact system (4 GPUs: GTX 1080) to run my Amber16 jobs
>>>> (operating system: Ubuntu 16).
>>>>
>>>> On this system, whenever I use the GUI to do some other task, my AMBER
>>>> jobs get killed. I believe the GUI is randomly using any of these 4 GPUs.
>>>>
>>>> Please let me know how I can get rid of this issue.
>>>>
>>>> Thanks,
>>>> Hirdesh


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Jul 08 2017 - 21:30:04 PDT