Re: [AMBER] Anomalous Termination of PMEMD.CUDA jobs from Ross Walker on 2013-02-13 (Amber Archive Feb 2013)

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 13 Feb 2013 09:16:41 -0800

Hi Ikuo,

There is very limited information here for us to go on, which makes it
hard to help you. Can you please provide some more details? Specifically,
what is different between the machine it works on and the one it doesn't?
Are they the same model of GPU? Same driver version, same compilers /
executables, etc.?

Do you only see this issue on the one simulation you are running or does
it happen with all simulations?

My guess, based on what you describe and assuming the two machines are
identical, would be that the GPU is faulty (perhaps faulty memory, or it
is overheating), but it is hard to be sure with the details you give. Can
you try one simple test? Swap the GPUs between the two machines and see if
the error 'follows' the GPU or stays with the machine itself.
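
As a rough sketch of how the comparison asked for above could be gathered,
the script below queries GPU model, driver version, VBIOS version and
temperature through nvidia-smi and lists the fields that differ between
two machines. It assumes an nvidia-smi recent enough to support
--query-gpu (older drivers and some GeForce cards may not report every
field); FIELDS, query_gpus, parse_report and diff_reports are names chosen
for this illustration and are not part of Amber.

```python
import csv
import io
import subprocess

# Fields passed to nvidia-smi's --query-gpu option (illustrative selection).
FIELDS = ["name", "driver_version", "vbios_version", "temperature.gpu"]


def query_gpus():
    """Run nvidia-smi and return its CSV output, one line per GPU."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_report(text):
    """Turn the CSV text into one dict per GPU, keyed by FIELDS."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row)))
            for row in rows if row]


def diff_reports(good, bad, ignore=("temperature.gpu",)):
    """Return (gpu_index, field, good_value, bad_value) for each mismatch."""
    diffs = []
    for index, (g, b) in enumerate(zip(good, bad)):
        for field in FIELDS:
            if field not in ignore and g[field] != b[field]:
                diffs.append((index, field, g[field], b[field]))
    return diffs


if __name__ == "__main__":
    # Run on each machine, save the output, then compare with diff_reports().
    print(query_gpus(), end="")
```

Running it on both machines and passing the two saved outputs through
parse_report and diff_reports should show at a glance whether the driver
or VBIOS differs. Temperature is ignored in the diff by default because it
changes from moment to moment, but it is worth reading while a job runs.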

All the best
Ross

On 2/13/13 2:17 AM, "kurisaki" <kurisaki.ncube.human.nagoya-u.ac.jp> wrote:

>
>Dear Amber developers and users,
>
>I'm sorry, that was a typo.
>
>"SFDP" must be "SPFP".
>
>Thank you for your kind support.
>
>
>!! question starts here..
>
>I have been having trouble with anomalous termination of PMEMD.CUDA when
>I use Amber12 with a GTX680 at the SFDP level on my machine.
>
>Although an MD job normally runs for several hours, I often encounter
>anomalous termination of MD jobs, where a "segmentation fault" occurs.
>
>Curiously, such anomalous termination never happens on another GPU
>machine (which has exactly the same machine spec as the first one).
>
>I would be glad if anyone with similar experience could tell me how to
>overcome this situation.
>
>Sincerely yours,
>
> Ikuo KURISAKI
>
>PS. I have attached the messages saved in /var/log/messages for reference.
> Is this a system problem, e.g. s
>
>Feb 12 11:37:02 gps102 kernel: imklog 4.6.2, log source = /proc/kmsg started.
>Feb 12 11:37:02 gps102 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="2051" x-info="http://www.rsyslog.com"] (re)start
>Feb 12 12:28:30 gps102 kernel: pmemd.cuda[32307]: segfault at 2e3000002eb7 ip 00007f2939843248 sp 00007ffff927ef40 error 4 in libgfortran.so.3.0.0[7f2939784000+f0000]
>Feb 12 12:28:31 gps102 abrt[32309]: saved core dump of pid 32307 (/home/kurisaki/amber/amber12gpu/amber12/bin/pmemd.cuda_SPFP) to /var/spool/abrt/ccpp-2013-02-12-12:28:30-32307.new/coredump (84373504 bytes)
>Feb 12 12:28:31 gps102 abrtd: Directory 'ccpp-2013-02-12-12:28:30-32307' creation detected
>Feb 12 12:28:31 gps102 abrtd: Executable '/home/kurisaki/amber/amber12gpu/amber12/bin/pmemd.cuda_SPFP' doesn't belong to any package
>Feb 12 12:28:31 gps102 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2013-02-12-12:28:30-32307 (res:2), deleting
>
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber

Received on Wed Feb 13 2013 - 09:30:03 PST
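
The kernel "segfault at ..." line in the quoted /var/log/messages excerpt
can be decoded mechanically. The following is a minimal sketch, assuming
the usual x86 Linux format for that message, in which the error value is
the page-fault error code (bit 0: protection violation, bit 1: write,
bit 2: user mode) and the bracketed pair after the library name is its
load address and size. SEGFAULT_RE and decode_segfault are names made up
for this illustration.

```python
import re

# Pattern for the x86 Linux "segfault at ..." kernel message (all numbers hex).
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    info = {
        "program": m["prog"],
        "pid": int(m["pid"]),
        "fault_address": hex(int(m["addr"], 16)),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "cause": "protection violation" if err & 1 else "page not mapped",
        "library": m["lib"],
    }
    if m["lib"]:
        # Offset of the faulting instruction from the library's load address.
        info["offset_in_library"] = hex(int(m["ip"], 16) - int(m["base"], 16))
    return info


if __name__ == "__main__":
    line = ("Feb 12 12:28:30 gps102 kernel: pmemd.cuda[32307]: segfault at "
            "2e3000002eb7 ip 00007f2939843248 sp 00007ffff927ef40 error 4 in "
            "libgfortran.so.3.0.0[7f2939784000+f0000]")
    for key, value in decode_segfault(line).items():
        print(key, value)
```

For the line in this thread, error 4 decodes to a user-mode read of an
unmapped address, and the instruction pointer lies 0xbf248 bytes into
libgfortran.so.3.0.0. That offset is the kind of value one could hand to a
tool such as addr2line if debug symbols for the library are available. By
itself it does not separate a hardware fault from a software one, so the
GPU swap test is still the more direct check.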