Re: [AMBER] Anomalous Termination of PMEMD.CUDA jobs from kurisaki on 2013-02-15 (Amber Archive Feb 2013)

From: kurisaki <kurisaki.ncube.human.nagoya-u.ac.jp>
Date: Sat, 16 Feb 2013 14:09:40 +0900

Dear Professor Walker

Thank you for your advice.

Fortunately, the GPU-based MD simulations have now been running for more than
two days, and I have not seen any GPU overheating: both GPUs are holding at 80C.
These simulations are being run in a local directory, /work.

So I suspect the anomalous termination is due not to a
GPU hardware problem but to the system settings in our lab.

Yours sincerely,

Ikuo KURISAKI

PS
The GPU temperature is monitored with "nvidia-smi"
once every two minutes.
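
For reference, a minimal sketch of this kind of periodic temperature logging.
It assumes a driver whose nvidia-smi supports the --query-gpu option (older
drivers may not). The log file name and the warning threshold are illustrative
choices, not values taken from this thread.

```python
import subprocess
import time

# Assumed warning threshold, for illustration only (not from this thread).
WARN_CELSIUS = 90


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_temperatures(out)


def monitor(interval_seconds=120, logfile="gpu_temp.log"):
    """Append one timestamped line per GPU every interval_seconds, forever."""
    while True:
        temps = read_temperatures()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as log:
            for gpu, temp in sorted(temps.items()):
                flag = " WARN" if temp >= WARN_CELSIUS else ""
                log.write("%s gpu%d %dC%s\n" % (stamp, gpu, temp, flag))
        time.sleep(interval_seconds)
```

Calling monitor() in a separate terminal while a job runs leaves a log whose
last entries can be compared against the time at which the job stopped.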

-----Original Message-----
From: Ross Walker [mailto:ross.rosswalker.co.uk]
Sent: Saturday, February 16, 2013 3:36 AM
To: AMBER Mailing List
Subject: Re: [AMBER] Anomalous Termination of PMEMD.CUDA jobs

Hi Ikuo,

>PS.
>
>I am now executing pmemd.cuda in local directory.
>This job has been running for 1 day.
>Thus, I suspect the communication between the GPU machine and the other
>machines forming a PC cluster.
>

I think it is unlikely it is the communication to other machines. It is most
likely a flakey GPU.

>Yes,
>the model of GPU (GTX680 x 2)
>CPU (Intel Xeon E5-1620(3.6 GHz))
>Compiler (gfortran)
>OS (CentOS release 6.2 (Final))
>and driver (CUDA-4.2)
>are exactly same between two machines.
>(Moreover, I bought them from an agency in the same purchase.)
>
>Although Amber12 was compiled on only one of them (the one working properly),
>the Amber GPU tests passed successfully on both machines.

The test cases won't normally tell you if there is something wrong with the
hardware unless it is grossly misbehaving. They don't run long enough: problems
such as overheating often show up only after several hours of running.

>So I consider that the MD runs themselves were performed correctly.

The test cases tell you things compiled correctly and the installation appears
to be good. They won't unfortunately tell you if the hardware is a little
screwy.

>>the GPU is faulty, perhaps faulty memory or it is overheating but it
>>is hard to be sure with the details you give.
>
>At the end of last year, I sent the machine back to the agency for
>maintenance, and it has since come back.
>The technical staff replaced all devices (CPU, GPU and motherboard)
>and recompiled Amber with the same settings.
>They performed 5 AMBER-GPU MD simulations (each takes 1 day) and
>obtained 4 simulations that ran without stopping.
>Their result is much better,
>because my simulations usually terminate within a few hours.

That is VERY worrying. ALL 5 of the MD simulations should have run without
issue. To get 20% failure over 5 days is terrible. I suspect that there is
something very wrong with one of the GPUs.

A few other things to check. Is it always the same GPU that the job fails on?
You have two GTX680s in each node. Does it matter which one the calculation
is running on, or does it only fail if you use a specific GPU?
This is very important. If it is always the same GPU then I suggest identifying
which GPU it is in the machine and swapping it with a GPU in the other machine.
Then see if the problem 'follows' the GPU. If it does then there's your answer:
something is wrong with that GPU, most likely its memory.
If it stays with the node then it could be the motherboard or power supply on
that specific machine, especially since the software stack is identical between
the two machines.
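
A rough sketch of how the per-GPU test and the swap test above could be
organised. CUDA_VISIBLE_DEVICES is the standard CUDA environment variable for
restricting a process to one device. The input file names (md.in, prmtop,
inpcrd), the node and card labels, and the helper names are hypothetical. It
also assumes a failed job returns a nonzero exit status, which may not hold
for every kind of anomalous termination.

```python
import os
import subprocess


def build_job(gpu_id, tag):
    """Return (command, environment) for one pmemd.cuda run pinned to gpu_id."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # expose only this GPU to the job
    cmd = ["pmemd.cuda", "-O",
           "-i", "md.in", "-p", "prmtop", "-c", "inpcrd",  # placeholder inputs
           "-o", "md_%s.out" % tag,
           "-r", "md_%s.rst" % tag,
           "-x", "md_%s.nc" % tag]
    return cmd, env


def run_job(gpu_id, tag):
    """Run the job and report True if it exited with status 0."""
    cmd, env = build_job(gpu_id, tag)
    return subprocess.call(cmd, env=env) == 0


def diagnose(runs):
    """Classify failures from (node, card, succeeded) records.

    'card' is a label for the physical card (e.g. its serial number), so it
    stays the same when the card is moved to the other machine.
    """
    failures = [(node, card) for node, card, ok in runs if not ok]
    if not failures:
        return "no failures observed"
    nodes = {node for node, _ in failures}
    cards = {card for _, card in failures}
    if len(cards) == 1 and len(nodes) > 1:
        return "follows the card: suspect GPU %s" % cards.pop()
    if len(nodes) == 1 and len(cards) > 1:
        return "stays with the node: suspect motherboard or PSU of %s" % nodes.pop()
    return "inconclusive: more runs or a swap needed"
```

For example, run_job(0, "gpu0") and run_job(1, "gpu1") can be repeated on each
machine before and after the swap, and the collected records passed to
diagnose().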

My money is still on one of your GPUs being faulty however.

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not be
read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Fri Feb 15 2013 - 21:30:02 PST