Re: [AMBER] Does it mean the card is damaged?

From: Karolina Markowska <markowska.kar.gmail.com>
Date: Mon, 31 Aug 2015 10:37:44 +0200

Hello Ross,

I've got some results from the card tests.
We're using the NVIDIA drivers from CUDA Toolkit 6.5 (driver version 340.29).
The nvidia-smi output looks like this:
NVIDIA-SMI 340.29     Driver Version: 340.29
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0     N/A |                  N/A |
| 26%   40C    P8    N/A /  N/A |     15MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:02:00.0     N/A |                  N/A |
| 26%   42C    P8    N/A /  N/A |     15MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:03:00.0     N/A |                  N/A |
| 26%   41C    P8    N/A /  N/A |     15MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:04:00.0     N/A |                  N/A |
| 28%   44C    P8    N/A /  N/A |     15MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
|    3                  Not Supported                                         |
+-----------------------------------------------------------------------------+

The card I suspected was broken computed everything correctly, without any
errors. There is probably something blocking its fan; I need to check that.
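
For the fan check, something like the line below should show whether the fan
spins up and the temperature stays reasonable while a job is running (this is
just a sketch; I'm assuming these query fields are supported by driver 340.29,
and on GeForce cards some of them may come back as N/A):

# Poll fan speed and temperature of every GPU every 5 seconds
nvidia-smi --query-gpu=index,name,fan.speed,temperature.gpu --format=csv -l 5
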
But one card in the other machine, the one that also gives us errors, produced
this:

3.0:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.1:
3.2:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.3:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.4:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.5:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.6:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.7:
3.8:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.9:   Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.10:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.11:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.12:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.13:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.14:
3.15:
3.16:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.17:
3.18:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
3.19:  Etot = -58224.7039  EKtot = 14401.6602  EPtot = -72626.3640
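
A quick way to see whether any of the reported energies differ or are missing
is something along these lines (assuming the per-run logs for this GPU are the
files 3.0 ... 3.19 shown above; adjust the glob to the actual file names):

# Collect the reported energy lines and count the distinct ones; anything
# other than one identical line per run points to a problem on that GPU.
grep -h "Etot" 3.* | sort | uniq -c

# List any logs that are missing the energy line entirely
grep -L "Etot" 3.*
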

And some errors like this:
cudaMemcpy GpuBuffer::Download failed an illegal memory access was encountered

and:
ERROR: Calculation halted. Periodic box dimensions have changed too much
  from their initial values.
  Your system density has likely changed by a large amount, probably from
  starting the simulation from a structure a long way from equilibrium.

  [Although this error can also occur if the simulation has blown up for
  some reason]

  The GPU code does not automatically reorganize grid cells and thus you
  will need to restart the calculation from the previous restart file.
  This will generate new grid cells and allow the calculation to continue.
  It may be necessary to repeat this restarting multiple times if your
  system is a long way from an equilibrated density.

  Alternatively you can run with the CPU code until the density has
  converged and then switch back to the GPU code.
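
As a side note, restarting from the previous restart file the way the message
suggests would look roughly like the line below with pmemd.cuda; all of the
file names here are only placeholders for the actual input, topology, restart
and output files:

# Continue from the last good restart file so new grid cells are built
pmemd.cuda -O -i md.in -p system.prmtop -c previous.rst -r next.rst -o next.out -x next.nc
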

So it looks like this card also has some issues.
I'll run the tests on our other two machines to check whether everything is
fine there.

Thank you very much for your help.

Best regards,
Karolina

2015-08-28 10:04 GMT+02:00 Karolina Markowska <markowska.kar.gmail.com>:

> Thank you very much, Ross!
>
> I suspect that our only Titan Z card is having problems with its fans,
> because once, when I ran a simulation on it, the fan speed fell to 0% and
> the simulation just died. But after restarting the machine, everything is
> OK for now.
>
> The second machine also gives that kind of error, so I'll run the script
> on it too.
>
> I'll give you the outputs probably on Monday.
>
> Best regards,
> Karolina
>
> 2015-08-27 17:06 GMT+02:00 Ross Walker <ross.rosswalker.co.uk>:
>
>> Hi Karolina,
>>
>> A bad GPU is one possible explanation for the error you see, although
>> there are many others, particularly if it occurs on more than one GPU: for
>> example, driver issues, problems with your simulation, or something unique
>> in your simulation that is triggering a bug in the AMBER code. First things
>> first is to check whether all your GPUs are behaving themselves. Please
>> download the following:
>>
>> https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
>>
>> Untar it and edit the run script to specify the number of GPUs you have
>> in your machine (I would suggest making separate copies for each machine).
>>
>> Then, after making sure nothing else is running on the machine do:
>>
>> nohup ./run_test_4gpu.x >& run_test_4gpu.log &
>>
>> Leave it running - it will take about 12 hours or so and will produce a
>> number of log files in the GPU_Validation_Test directory. Take a look at
>> these log files - they will report a final energy for each test, and they
>> should all be identical. If they aren't, or some are missing, then it
>> points to a bad GPU.
>>
>> Let me know how it goes.
>>
>> All the best
>> Ross
>>
>> > On Aug 27, 2015, at 1:40 AM, Karolina Markowska <
>> markowska.kar.gmail.com> wrote:
>> >
>> > Dear Amber Users,
>> >
>> > I'm having problems with my GPUs. I have a cluster with Titan Blacks and
>> > Titan Z cards and sometimes I'm experiencing some errors, like the one
>> > below:
>> >
>> > cudaMemcpy GpuBuffer::Download failed an illegal memory access was
>> > encountered
>> >
>> > Can this error be related to some errors during simulation?
>> > Or maybe it means that the card could be broken?
>> > What should I do to find out if the card is OK?
>> >
>> > Best regards,
>> > Karolina Markowska
>> > PhD student
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER.ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Aug 31 2015 - 02:00:03 PDT