Re: [AMBER] Anomalous Termination of PMEMD.CUDA jobs

From: Ross Walker <ross.rosswalker.co.uk>
Date: Fri, 15 Feb 2013 10:36:14 -0800

Hi Ikuo,


>PS.
>
>I am now executing pmemd.cuda in a local directory.
>This job has been running for 1 day.
>Thus, I suspect there is a problem with the communication between the GPU
>machine and the other machines forming the PC cluster.
>

I think it is unlikely to be the communication with the other machines. It
is most likely a flaky GPU.

>Yes,
>the model of GPU (GTX680 x 2)
>CPU (Intel Xeon E5-1620(3.6 GHz))
>Compiler (gfortran)
>OS (CentOS release 6.2 (Final))
>and driver (CUDA-4.2)
>are exactly the same between the two machines.
>(Moreover, I bought them from an agency in the same purchase.)
>
>Although Amber12 was compiled on only one of them (the one working
>properly), the Amber GPU tests passed successfully on both machines.

The test cases won't normally tell you if there is something wrong with
the hardware unless it is grossly misbehaving. They don't run for long
enough, and problems such as overheating often only show up after several
hours of running.
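In case it is useful, below is a rough Python sketch (not part of AMBER,
just a wrapper around nvidia-smi, which ships with the NVIDIA driver) that
logs GPU temperatures at regular intervals. Leaving something like this
running alongside a long pmemd.cuda job makes it easier to spot
overheating:

# Rough sketch (not part of AMBER): poll nvidia-smi, which ships with the
# NVIDIA driver, and log GPU temperatures at regular intervals.
import subprocess
import time

def log_gpu_temps(interval_s=60):
    # One timestamped line per poll, listing index, name and temperature
    # for every GPU that nvidia-smi can see.
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,name,temperature.gpu",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True).stdout
        print(time.strftime("%Y-%m-%d %H:%M:%S"),
              out.strip().replace("\n", " | "))
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu_temps()

If the logged temperatures climb steadily right up to the point where the
job dies, overheating is a likely culprit.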

>Therefore, I assumed that the MD runs themselves were being performed
>accurately.

The test cases tell you that things compiled correctly and that the
installation appears to be good. Unfortunately, they won't tell you if the
hardware is a little screwy.

>>the GPU is faulty, perhaps faulty memory or it is overheating but it
>> is hard to be sure with the details you give.
>
>At the end of last year, I sent the machine back to the agency for
>maintenance, and it has just come back.
>The technical staff replaced all the devices (CPU, GPU and motherboard)
>and recompiled Amber with the same settings.
>They performed 5 AMBER-GPU MD simulations (each taking 1 day) and 4 of
>them ran without stopping.
>Their result is much better than mine,
>because my simulations usually quit within a few hours.


That is VERY worrying. ALL 5 of the MD simulations should have run without
issue. A 20% failure rate over 5 days is terrible. I suspect that there is
something very wrong with one of the GPUs.

A few other things to check: is it always the same GPU that the job fails
on? I.e., you have two GTX680s in each node; does it matter which one the
calculation is running on, or does it only fail when you use one specific
GPU? This is very important. If it is always the same GPU, then I suggest
identifying which physical card it is in the machine and swapping it with
a GPU from the other machine. Then see if the problem 'follows' the GPU.
If it does, there's your answer: something is wrong, most likely the
memory on that GPU. If it stays with the node, then it could be the
motherboard or power supply in that specific machine, especially since the
software stack is identical between the two machines. One way to pin the
calculation to a specific GPU is shown in the sketch below.
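
For example, something along these lines (a rough sketch, not an AMBER
utility; the input file names are placeholders for your own) will run the
same job pinned to each GPU in turn via CUDA_VISIBLE_DEVICES, so you can
tie a failure to a specific card:

# Rough sketch (not an AMBER utility): run the same pmemd.cuda job pinned
# to each GPU in turn via CUDA_VISIBLE_DEVICES, so that a failure can be
# tied to a specific card. md.in, prmtop and inpcrd are placeholder names.
import os
import subprocess

def run_on_gpu(gpu_index):
    # pmemd.cuda only sees the one device exposed to it here.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    result = subprocess.run(
        ["pmemd.cuda", "-O",
         "-i", "md.in", "-p", "prmtop", "-c", "inpcrd",
         "-o", "md_gpu%d.out" % gpu_index],
        env=env)
    return result.returncode

for gpu in (0, 1):
    print("GPU %d finished with exit code %d" % (gpu, run_on_gpu(gpu)))

The default nvidia-smi output also reports each card's PCI bus ID, which
tells you which physical slot a given device index sits in when you come
to swap the cards between machines.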

My money is still on one of your GPUs being faulty however.

All the best
Ross

/\
\/
|\oss Walker

---------------------------------------------------------
| Assistant Research Professor |
| San Diego Supercomputer Center |
| Adjunct Assistant Professor |
| Dept. of Chemistry and Biochemistry |
| University of California San Diego |
| NVIDIA Fellow |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
---------------------------------------------------------

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 15 2013 - 11:00:02 PST