Re: [AMBER] Anomalous Termination of PMEMD.CUDA jobs

From: kurisaki <kurisaki.ncube.human.nagoya-u.ac.jp>
Date: Fri, 15 Feb 2013 18:27:38 +0900

PS.

I am now executing pmemd.cuda in a local directory, and this job has been
running for one day. I therefore suspect a problem in the communication
between the GPU machine and the other machines that form the PC cluster.
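
For reference, my local-directory test follows roughly the pattern sketched
below; the directory and file names are only placeholders, not my actual
job setup:

#!/usr/bin/env python
# Sketch of the local-disk test: stage the inputs onto the machine's local
# disk, run pmemd.cuda there, then copy the results back, so that NFS
# traffic to the rest of the cluster plays no part during the MD run.
# All directory and file names below are placeholders.
import os
import shutil
import subprocess

nfs_dir = "/home/kurisaki/md_job"   # job directory on the shared filesystem
scratch = "/tmp/md_job_local"       # directory on the local disk

# Stage the inputs onto the local disk.
if os.path.exists(scratch):
    shutil.rmtree(scratch)
shutil.copytree(nfs_dir, scratch)

# Run the MD entirely on the local disk.
ret = subprocess.call(
    ["pmemd.cuda", "-O",
     "-i", "md.in", "-p", "prmtop", "-c", "inpcrd",
     "-o", "md.out", "-r", "md.rst", "-x", "md.nc"],
    cwd=scratch)
print("pmemd.cuda exited with status %d" % ret)

# Copy the results back to the shared filesystem afterwards.
for name in ("md.out", "md.rst", "md.nc"):
    path = os.path.join(scratch, name)
    if os.path.exists(path):
        shutil.copy(path, nfs_dir)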




Thank you for your advice.

First, I am sorry that I did not provide sufficient information about my
problem. Let me answer your questions.

> Specifically what is different between the machine it works on and the
> one it doesn't. Are they the same model of GPU? Same driver version,
> same compilers / executables etc?

Yes,
the GPU model (GTX 680 x 2),
CPU (Intel Xeon E5-1620, 3.6 GHz),
compiler (gfortran),
OS (CentOS release 6.2 (Final)),
and driver (CUDA 4.2)
are exactly the same on both machines.
(Moreover, I bought them from the same vendor in a single purchase.)

> Do you only see this issue on the one simulation you are running or
> does it happen with all simulations?

Amber12 was compiled on one of the machines (the one that works properly),
and the Amber GPU test suite passed successfully on both machines. I
therefore consider that the MD runs themselves are performed accurately.
Accordingly, I guess the problem lies in the machine itself, whether in
hardware or in software. Such terminations occur regardless of which MD
system is simulated; a job simply quits in the middle of a run.

> the GPU is faulty, perhaps faulty memory or it is overheating but it
> is hard to be sure with the details you give.

At the end of last year, I sent the machine back to the vendor for
maintenance, and it has since come back. The technical staff replaced all
the devices (CPU, GPU, and motherboard) and recompiled Amber with the same
settings. They performed 5 AMBER GPU MD simulations (each taking 1 day),
and 4 of them ran to completion without stopping. Their result is much
better than mine, because my simulations usually quit within a few hours.
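
To test the overheating hypothesis on my side, the GPU temperatures could
be logged alongside a running job; a minimal sketch follows (it assumes an
nvidia-smi new enough to support the --query-gpu interface; older drivers
would need "nvidia-smi -q" instead):

#!/usr/bin/env python
# Minimal sketch: poll nvidia-smi once a minute and log the GPU
# temperatures while an MD job runs, to see whether the crashes
# correlate with heat. Assumes nvidia-smi supports --query-gpu.
import subprocess
import time

with open("gpu_temp.log", "a") as log:
    while True:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,temperature.gpu,utilization.gpu",
             "--format=csv,noheader"])
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for line in out.decode().strip().splitlines():
            log.write("%s  %s\n" % (stamp, line))
        log.flush()
        time.sleep(60)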

I would be most grateful for any further advice.

Thank you, as always, for your kind support.

Yours sincerely,

                                     Ikuo Kurisaki

PS.

I will also try exchanging the power supply unit, and I will ask the
vendor to examine the machine's configuration.




-----Original Message-----
From: Ross Walker [mailto:ross.rosswalker.co.uk]
Sent: Thursday, February 14, 2013 2:17 AM
To: AMBER Mailing List
Subject: Re: [AMBER] Anomalous Termination of PMEMD.CUDA jobs

Hi Ikuo,

There is very limited information for us to go on here to be able to help you.
Can you please provide some more details? Specifically, what is different
between the machine it works on and the one it doesn't? Are they the same
model of GPU? Same driver version, same compilers / executables etc.?

Do you only see this issue on the one simulation you are running or does it
happen with all simulations?

My guess, based on what you describe and assuming the two machines are
identical, would be that the GPU is faulty: perhaps it has faulty memory or
it is overheating, but it is hard to be sure from the details you give. Can
you try one simple test? Swap the GPUs between the two machines and see if
the error 'follows' the GPU or stays with the machine itself.
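
As a complementary software-side check, since each box has two GTX 680s you
could also pin a short run to each card in turn with CUDA_VISIBLE_DEVICES,
so that a crash points at a specific GPU. A minimal sketch along these
lines (the input file names are placeholders for your own system):

#!/usr/bin/env python
# Minimal sketch: run the same short MD once per GPU, pinning each run
# to a single card with CUDA_VISIBLE_DEVICES, so that a crash can be
# tied to a specific GPU rather than to the machine as a whole.
import os
import subprocess

for gpu in ("0", "1"):
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpu   # expose only this card to pmemd.cuda
    ret = subprocess.call(
        ["pmemd.cuda", "-O",
         "-i", "md.in", "-p", "prmtop", "-c", "inpcrd",
         "-o", "md_gpu%s.out" % gpu, "-r", "md_gpu%s.rst" % gpu],
        env=env)
    print("GPU %s: pmemd.cuda exited with status %d" % (gpu, ret))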

All the best
Ross



On 2/13/13 2:17 AM, "kurisaki" <kurisaki.ncube.human.nagoya-u.ac.jp> wrote:

>
>Dear Amber developers and users,
>
>I'm sorry, that was a typo.
>
>"SFDP" must be "SPFP".
>
>Thank you for your kind support.
>
>
>!! question starts here..
>
>I have been having trouble with anomalous termination of PMEMD.CUDA when
>I use Amber12 with a GTX 680 at the SFDP level on my machine.
>
>Although an MD job normally runs for several hours, I often encounter
>anomalous termination of MD jobs, where a "segmentation fault" occurs.
>
>Curiously, such anomalous termination never happens on another GPU
>machine (which is exactly the same as the first one in terms of machine
>spec).
>
>I would be glad if anyone who has had a similar experience could tell me
>how to overcome this situation.
>
>Sincerely yours,
>
> Ikuo KURISAKI
>
>PS. I attached the messages saved in /var/log/messages for reference.
> Is this a system problem, e.g. s
>
>Feb 12 11:37:02 gps102 kernel: imklog 4.6.2, log source = /proc/kmsg started.
>Feb 12 11:37:02 gps102 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="2051" x-info="http://www.rsyslog.com"] (re)start
>Feb 12 12:28:30 gps102 kernel: pmemd.cuda[32307]: segfault at 2e3000002eb7 ip 00007f2939843248 sp 00007ffff927ef40 error 4 in libgfortran.so.3.0.0[7f2939784000+f0000]
>Feb 12 12:28:31 gps102 abrt[32309]: saved core dump of pid 32307 (/home/kurisaki/amber/amber12gpu/amber12/bin/pmemd.cuda_SPFP) to /var/spool/abrt/ccpp-2013-02-12-12:28:30-32307.new/coredump (84373504 bytes)
>Feb 12 12:28:31 gps102 abrtd: Directory 'ccpp-2013-02-12-12:28:30-32307' creation detected
>Feb 12 12:28:31 gps102 abrtd: Executable '/home/kurisaki/amber/amber12gpu/amber12/bin/pmemd.cuda_SPFP' doesn't belong to any package
>Feb 12 12:28:31 gps102 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2013-02-12-12:28:30-32307 (res:2), deleting
>
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Feb 15 2013 - 01:30:04 PST