Re: [AMBER] parallel cuda test errors

From: Ross Walker <ross.rosswalker.co.uk>
Date: Wed, 03 Sep 2014 13:00:23 -0700

Hi Gard,

Have you applied all the latest patches? In particular update.6, which
addresses 8-GPU runs and may affect 4-GPU runs as well.

The errors for 4-GPU runs should not be any larger than for 2 GPUs, so
something is clearly wrong here. The problem is that I don't have access
to any 4-way C2050 systems anymore - haven't for a long time - and the old
Fermi cards take a very different code path from modern cards, so it is
possible the problem is confined to that code path.

Firstly, are you actually seeing a speedup on 4 GPUs for explicit solvent
runs? This seems unlikely, since the only systems that support 4-way peer
to peer are the custom-built CirraScale machines we are developing. For 4
GPUs here the traffic must be going through the chipset, which would slow
things down a lot - although maybe S2050s are slow enough that that is not
an issue.
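
As a quick check, here is a minimal CUDA sketch (nothing Amber-specific,
just the standard runtime API) that asks the driver whether each GPU pair
can reach the other over peer-to-peer. If it reports no P2P for your
pairs, all GPU-to-GPU traffic is being staged through the chipset:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);           // number of visible GPUs
        printf("%d CUDA device(s)\n", n);
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (i == j) continue;
                int ok = 0;
                // ok == 1 if device i can directly map memory on device j
                cudaDeviceCanAccessPeer(&ok, i, j);
                printf("GPU %d -> GPU %d: %s\n", i, j,
                       ok ? "peer capable" : "no P2P");
            }
        }
        return 0;
    }

Compile with nvcc (e.g. "nvcc p2pcheck.cu -o p2pcheck") and run it on the
node in question.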

If you do see a worthwhile speed improvement on 4 GPUs then we can have a
go at fixing it, but that will likely require access to your machine or to
identical hardware.

All the best
Ross




On 9/3/14, 12:49 PM, "Gard Nelson" <Gard.Nelson.NantBio.com> wrote:

>Hi all,
>
>I've recently installed Amber14 on my local cluster. The serial and
>parallel CPU versions both pass all of the included tests without any
>errors or failures. The serial GPU version reports a few possible
>failures, but manual inspection shows that these are all infrequent and
>likely harmless (maximum relative errors <= 1e-3). The parallel GPU code
>passes the tests (similar to the serial GPU version) if I use 2 GPUs.
>However, when I run the same tests with 4 GPUs I see frequent differences
>with relative errors around 1-2. This often corresponds to energy
>differences on the order of tens to hundreds of kcal/mol.
>
>I realize that the highly parallel nature of GPU calculations will result
>in small test differences, but what I'm seeing seems too large to be
>caused by order-of-operations or round-off errors. Does anyone have any
>idea what could be causing this behavior?
>
>I'm running this on Tesla S2050 GPUs with driver version 331.62. The code
>was built with the GNU 4.8 and CUDA 6.0 compilers.
>
>Thanks for your help,
>
>Gard
>



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 03 2014 - 13:30:02 PDT