Re: [AMBER] parallel cuda test errors

From: Gard Nelson <Gard.Nelson.NantBio.com>
Date: Wed, 3 Sep 2014 20:46:43 +0000

Hi Ross,

You're exactly right - it is much slower with 4 GPUs than with 2, so I probably won't use 4 anyway. I just wanted to make sure that this wasn't a symptom of a deeper problem (or that the slow speed was somehow related to the errors).

You're also right about the version. I'm surprised by this since I only installed the code yesterday and the configure script didn't pick up update.6. Anyway, I'll download the update and stick to 1 or 2 GPUs.

Thanks,
Gard

________________________________________
From: Ross Walker [ross.rosswalker.co.uk]
Sent: Wednesday, September 03, 2014 1:00 PM
To: AMBER Mailing List
Subject: Re: [AMBER] parallel cuda test errors

Hi Gard,

Have you applied all the latest patches? In particular update.6, which
addresses runs on 8 GPUs and may affect 4 GPU runs as well.

The errors for 4 GPU runs should not be any larger than for 2 GPUs, so
something is clearly wrong here. The problem is that I haven't had access
to any 4-way C2050 systems for a long time, and the old Fermi cards take a
very different code path from modern cards, so it is possible the problem
is confined to that code path.

First, are you actually seeing a speedup with 4 GPUs for explicit solvent
runs? This seems unlikely, since the only systems that support 4-way peer
to peer are the custom-built CirraScale machines we are developing. With 4
GPUs on your system the traffic must be going through the chipset, which
slows things down a lot - although the S2050s may be slow enough that this
is not an issue.
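As a quick sanity check you can ask the driver directly which device pairs
support peer to peer. A minimal sketch (compiled with nvcc; the file name
p2p_check.cu is just an example, not anything shipped with AMBER):

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void)
  {
      int n = 0;
      cudaGetDeviceCount(&n);  /* number of visible CUDA devices */
      for (int i = 0; i < n; ++i) {
          for (int j = 0; j < n; ++j) {
              if (i == j) continue;
              int ok = 0;
              /* ok is set to 1 if device i can address device j's memory */
              cudaDeviceCanAccessPeer(&ok, i, j);
              printf("GPU %d -> GPU %d : peer-to-peer %s\n",
                     i, j, ok ? "supported" : "NOT supported");
          }
      }
      return 0;
  }

Build and run with something like 'nvcc p2p_check.cu -o p2p_check &&
./p2p_check'. If most pairs report NOT supported, your 4 GPU runs are
indeed going through the chipset.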

If you do see a worthwhile speed improvement on 4 GPUs then we can have a
go at fixing it, but that is likely to require access to your machine or
to identical hardware.

All the best
Ross

On 9/3/14, 12:49 PM, "Gard Nelson" <Gard.Nelson.NantBio.com> wrote:

>Hi all,
>
>I've recently installed Amber14 on my local cluster. The serial and
>parallel CPU versions both pass all of the included tests without any
>errors or failures. The serial GPU version reports a few possible
>failures, but manual inspection shows that these are all infrequent and
>likely harmless (maximum relative errors <= 1e-3). The parallel GPU code
>passes the tests (similar to the serial GPU version) if I use 2 GPUs.
>However, when I run the same tests with 4 GPUs I see frequent differences
>with relative errors around 1-2. This often corresponds to energy
>differences on the order of tens to hundreds of kcal/mol.
>
>I realize that the highly parallel nature of GPU calculations will result
>in test differences, but what I'm seeing seems too large to be caused by
>order-of-operations or round-off errors. Does anyone have any idea what
>could be causing this behavior?
>
>I'm running this on Tesla S2050 GPUs with driver version 331.62. The code
>was built with GNU 4.8 and CUDA 6.0 compilers.
>
>Thanks for your help,
>
>Gard

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 03 2014 - 14:00:03 PDT