Re: [AMBER] Does it mean the card is damaged?

From: Karolina Markowska <markowska.kar.gmail.com>
Date: Fri, 28 Aug 2015 10:04:13 +0200

Thank you very much, Ross!

I suspect that our only Titan Z card is having problems with its fans,
because once, when I ran a simulation on it, the fan speed fell to 0%
and the simulation just went down. But after restarting the machine,
everything is OK for now.
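
To catch a fan drop like this while a job is running, something like the
sketch below could be used. It assumes nvidia-smi is on the PATH and that
the card and driver report the fan.speed and temperature.gpu query fields
(some report "[Not Supported]" instead). The helper names and the 60 C
threshold are made up for illustration, not taken from AMBER or NVIDIA.

```python
import subprocess

# nvidia-smi query options; field support varies by card and driver.
QUERY = ["nvidia-smi",
         "--query-gpu=index,name,fan.speed,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_rows(csv_text):
    """Parse 'index, name, fan, temp' lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, name, fan, temp = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            # Unsupported fields come back as text, so store None for them.
            "fan": int(fan) if fan.isdigit() else None,
            "temp": int(temp) if temp.isdigit() else None,
        })
    return rows


def stalled_fans(rows, hot_celsius=60):
    """Indices of GPUs reporting a 0% fan while at or above hot_celsius.

    The threshold is an arbitrary illustrative value.
    """
    return [r["index"] for r in rows
            if r["fan"] == 0
            and r["temp"] is not None
            and r["temp"] >= hot_celsius]


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_rows(result.stdout)
```

Calling stalled_fans(query_gpus()) every minute or so during a run would
show whether the fan reading drops to zero before the simulation dies.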

The second machine also gives that kind of error, so I'll run the script
on it too.
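
For going through the logs afterwards, here is a rough sketch of the check
Ross describes in the message quoted below (all final energies identical,
none missing). I have not seen the log format yet, so this assumes each
test writes a line containing "Etot = <number>"; the function names and
the expected_runs parameter are hypothetical.

```python
import re
from collections import Counter

# Assumption: each test's final energy sits on a line with "Etot = <number>".
ETOT = re.compile(r"\bEtot\s*=\s*(-?\d+\.\d+)")


def final_energies(log_text):
    """Return every Etot value in one log, kept as strings for exact comparison."""
    return ETOT.findall(log_text)


def check_logs(logs, expected_runs):
    """Check a {log name: log text} mapping and return a list of problems.

    An empty list means no energies are missing and all of them agree.
    """
    problems = []
    seen = []
    for name, text in sorted(logs.items()):
        values = final_energies(text)
        if len(values) < expected_runs:
            problems.append(
                f"{name}: {len(values)} of {expected_runs} energies present")
        seen.extend((name, value) for value in values)

    counts = Counter(value for _, value in seen)
    if len(counts) > 1:
        # Treat the most common value as the reference and flag the rest.
        reference = counts.most_common(1)[0][0]
        for name, value in seen:
            if value != reference:
                problems.append(f"{name}: Etot {value} differs from {reference}")
    return problems
```

The values are compared as strings on purpose, since the test expects
identical output rather than energies that agree within a tolerance.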

I'll probably send you the outputs on Monday.

Best regards,
Karolina

2015-08-27 17:06 GMT+02:00 Ross Walker <ross.rosswalker.co.uk>:

> Hi Karolina,
>
> A bad GPU is one possible explanation for the error you see, although there
> are many others, particularly if it occurs on more than one GPU: for
> example, driver issues, problems with your simulation, or something unique in
> your simulation that is triggering a bug in the AMBER code. The first thing
> to do is check whether all your GPUs are behaving themselves. Please download
> the following:
>
> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>
> Untar it and edit the run script to specify the number of GPUs you have in
> your machine (I would suggest making separate copies for each machine).
>
> Then, after making sure nothing else is running on the machine, do:
>
> nohup ./run_test_4gpu.x >& run_test_4gpu.log &
>
> Leave it running. It will take about 12 hours or so and will produce a
> number of log files in the GPU_Validation_Test directory. Take a look at
> these log files. They will report a final energy for each test, and these
> should all be identical. If they aren't, or some are missing, then it
> points to a bad GPU.
>
> Let me know how it goes.
>
> All the best
> Ross
>
> > On Aug 27, 2015, at 1:40 AM, Karolina Markowska <markowska.kar.gmail.com>
> wrote:
> >
> > Dear Amber Users,
> >
> > I'm having problems with my GPUs. I have a cluster with Titan Blacks and
> > Titan Z cards, and sometimes I get errors like the one
> > below:
> >
> > cudaMemcpy GpuBuffer::Download failed an illegal memory access was
> > encountered
> >
> > Can this error be related to some errors during simulation?
> > Or maybe it means that the card could be broken?
> > What should I do to find out if the card is OK?
> >
> > Best regards,
> > Karolina Markowska
> > PhD student
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
>
Received on Fri Aug 28 2015 - 01:30:03 PDT