Hirdesh,
This seems like a highly unusual situation. Here are a few pointers to help you
debug.
1. Instead of predicting, try reproducing. Do you remember which application
(and which feature of that application) you started last, right before the
jobs got killed? Try doing that again. Look at the system log and Xorg.0.log
for hints. Are you capturing the standard error from your AMBER jobs? Maybe
that will give you a hint (example commands below). Memory shouldn't be an
issue, as your jobs are not using much memory.
2. What I find intriguing is that you have the X server running on the 1st
card (GPU 0), yet an application on it kills jobs on the other cards. This
shouldn't happen unless it is a driver issue. You have a fairly new driver,
but it doesn't hurt to update to the latest (375.66). (PS: Are you running 4
separate jobs or 1 MPI job using 4 GPU cards? It does look like you have 4
separate jobs, but please confirm -- the pinning example below may help keep
them clearly separated.)
3. Exxact does a pretty good job with their systems, but did you change the
power supply or the power cabling to the GPUs? If one GPU is not getting
enough power (or one power rail is being overloaded), this could cause
problems. You can watch the power draw with nvidia-smi while the jobs run;
see the example below.
4. Lastly, as Ross mentioned, this could be a driver/BIOS/Xorg settings
issue. Have the cards been set to exclusive compute mode (there is a quick
check below)? Or did you try to configure Xorg for multiple monitors?
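
For point 1, here is a minimal sketch of how you could capture the standard
error and then dig through the logs after a kill. The AMBER file names
(prod.in, system.prmtop, etc.) are just placeholders for whatever your runs
actually use:

  # launch one pmemd.cuda run pinned to GPU 0, keeping stdout and stderr
  export CUDA_VISIBLE_DEVICES=0
  nohup pmemd.cuda -O -i prod.in -p system.prmtop -c system.rst \
      -o prod.out -r prod.restrt -x prod.nc \
      > run_gpu0.log 2> run_gpu0.err &

  # after a job dies, look for OOM-killer and NVIDIA driver (NVRM/Xid) messages
  dmesg | grep -i -E 'oom|xid|nvrm'
  grep -i -E 'killed process|nvidia' /var/log/syslog
  less /var/log/Xorg.0.log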
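
For point 2, you can confirm the driver version and make sure each of the 4
independent jobs is pinned to its own card with CUDA_VISIBLE_DEVICES (file
names are again placeholders; before updating the driver, it is worth checking
what Exxact recommends -- 375.66 is available from NVIDIA's download page):

  nvidia-smi --query-gpu=index,name,driver_version --format=csv

  # one script/shell per card; each job then sees only "its" GPU as device 0
  CUDA_VISIBLE_DEVICES=0 pmemd.cuda -O -i prod.in -p system.prmtop -c system.rst -o prod_gpu0.out &
  CUDA_VISIBLE_DEVICES=1 pmemd.cuda -O -i prod.in -p system.prmtop -c system.rst -o prod_gpu1.out &
  # ...and likewise for GPUs 2 and 3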
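
For point 3, nvidia-smi can log the power draw against the 180 W cap on each
card while all four jobs are running, which should show whether one card is
being starved:

  # refresh every 5 seconds while the jobs run
  nvidia-smi --query-gpu=index,power.draw,power.limit,temperature.gpu --format=csv -l 5

  # more detailed power readings for a single card, e.g. GPU 1
  nvidia-smi -q -d POWER -i 1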
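
For point 4, a quick way to check (and, if needed, reset) the compute mode
from the command line:

  # show the compute mode of each card
  nvidia-smi --query-gpu=index,compute_mode --format=csv

  # reset all cards to the shared/Default mode (0 = Default); needs root
  sudo nvidia-smi -c 0

In your nvidia-smi output (quoted below) all four cards already report
"Default", so if that hasn't changed, I would look more closely at the Xorg
configuration.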
Pratul
Pratul K. Agarwal, Ph.D.
(Editorial Board Member: PLoS ONE, Microbial Cell Factories)
Computational Biology Institute &
Computer Science and Engineering Division
Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
Phone: +1-865-574-7184
Web: http://www.agarwal-lab.org/
On 7/7/2017 5:14 PM, Hirdesh Kumar wrote:
> Thanks Ross and Pratul for your responses.
>
> By GUI I mean whenever I use this system interactively: browsing the
> internet with Mozilla, PyMOL, Maestro (Schrodinger), a gene editor
> (SnapGene Viewer), etc.
>
> Four days ago, I submitted 4 GPU jobs using the command line and never
> logged in to this system in display mode. All went fine.
>
> But today I logged in to this system and started Schrodinger, and the jobs
> got killed (3 out of 4). This has happened several times in the past. The
> job killing is random and I cannot predict it. Sometimes a job on 1 GPU
> gets killed; other times, jobs on more than 1 GPU get killed.
>
>
> I just restarted my 4 PMEMD jobs on the 4 GPUs, and here is the output from
> nvidia-smi:
>
>
>
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
> | 72%   84C    P2    92W / 180W |   1103MiB /  8112MiB |     99%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
> | 66%   82C    P2   116W / 180W |    549MiB /  8114MiB |     99%      Default |
> +-------------------------------+----------------------+----------------------+
> |   2  GeForce GTX 1080    Off  | 0000:82:00.0     Off |                  N/A |
> | 65%   82C    P2   131W / 180W |    549MiB /  8114MiB |     99%      Default |
> +-------------------------------+----------------------+----------------------+
> |   3  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
> | 66%   82C    P2   122W / 180W |    541MiB /  8114MiB |     99%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      1775    G   /usr/lib/xorg/Xorg                             141MiB |
> |    0     28072    G   /usr/lib/xorg/Xorg                             234MiB |
> |    0     28659    G   compiz                                         102MiB |
> |    0     29171    G   /usr/lib/firefox/firefox                         2MiB |
> |    0     37664    C   pmemd.cuda                                     545MiB |
> |    1     37826    C   pmemd.cuda                                     545MiB |
> |    2     38073    C   pmemd.cuda                                     545MiB |
> |    3     37949    C   pmemd.cuda                                     537MiB |
> +-----------------------------------------------------------------------------+
>
>
>
> On Fri, Jul 7, 2017 at 1:28 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Hirdesh,
>>
>> What does nvidia-smi report? It may be that the cards are set to exclusive
>> mode and that is killing the jobs, although this shouldn't normally happen.
>> What do you mean by GUI? Just an Xwindows login? Or something else?
>>
>> This is the first time I've heard of this issue, so it might take some
>> debugging to figure it out. Are your jobs using almost all of the GPU
>> memory? It's possible you are running them out of memory. Does AMBER give
>> you any error messages? To stdout, or to nohup.out if you are nohupping the
>> jobs?
>>
>> All the best
>> Ross
>>
>>> On Jul 7, 2017, at 1:28 PM, Hirdesh Kumar <hirdesh.iitd.gmail.com>
>> wrote:
>>> Hi All,
>>>
>>> I am using my Exxact system (4 GPUs: GTX 1080) to submit my Amber16 jobs
>>> (operating system: Ubuntu 16).
>>>
>>> On this system, whenever I use the GUI to do some other task, my Amber
>>> jobs get killed. I believe the GUI is randomly using any of these 4 GPUs.
>>>
>>> Please let me know how I can get rid of this issue.
>>>
>>> Thanks,
>>> Hirdesh
>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Jul 07 2017 - 17:30:02 PDT