Hi Naomi.
Does this happen on multiple cards and machines or just one? It sounds like a bad GPU to me.
Try downloading the following: https://dl.dropboxusercontent.com/u/708185/GPU_Validation_Test.tar.gz
Then:
tar xvzf GPU_Validation_Test.tar.gz
cd GPU_Validation_Test
Edit run_test_4gpu.x and set the number of GPUs in your system at the top of the file.
Then run:
nohup ./run_test_4gpu.x >& run_test_4gpu.log
It will take about 12 hours to run. Once it is done, post the run_test_4gpu.log and GPUx.log files to the list.
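If you are unsure how many GPUs the driver sees, or want to check whether more Xid events get logged while the test runs, something along these lines may help. It is only a rough sketch: it assumes nvidia-smi is on your PATH, and count_gpus and count_xid are helper names made up here for illustration, not part of the validation test.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GPU_Validation_Test.

# Count the GPUs the NVIDIA driver reports.
# Assumes nvidia-smi is installed; "nvidia-smi -L" prints one "GPU n: ..." line per card.
count_gpus() {
    nvidia-smi -L | grep -c '^GPU '
}

# Count NVRM Xid lines in a saved kernel log file given as the first argument.
count_xid() {
    grep -c 'NVRM: Xid' "$1"
}
```

For example, after sourcing the above, run count_gpus to get the number to put at the top of run_test_4gpu.x, and run "dmesg > kernel.log" followed by "count_xid kernel.log" to see how many Xid lines the kernel has logged (reading dmesg may need root on some systems).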
All the best
Ross
> On Jul 17, 2015, at 7:48 PM, Latorraca, Naomi Rose <nlatorra.stanford.edu> wrote:
>
> Hi Amber mailing list,
>
> Several Amber simulations that we have been running on Titan X GPUs (pmemd.cuda, CUDA version 6.5) have been crashing with this error: "gpu_allreduce cudaDeviceSynchronize failed an illegal memory access was encountered". Our system administrator has described these errors as Xid 31 errors, which NVIDIA describes as an MMU error. The full error logged is:
>
> NVRM: Xid (PCI:0000:88:00): 31, Ch 00000001, engmask 00000101, intr 10000000
>
> We are writing to ask whether there are known issues with running Amber simulations on Titan X GPUs and whether there are any suggested fixes.
>
> Thanks,
>
> Naomi Latorraca & AJ Venkatakrishnan
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Jul 17 2015 - 21:30:02 PDT