Re: [AMBER] transition metal system works on pmemd crashes on pmemd.cuda

From: Scott Le Grand <varelse2005.gmail.com>
Date: Tue, 11 Sep 2012 10:19:45 -0700

The good news: I think I know what's going on...

The bad news: there's probably nothing I can do about it...

That cluster is running RHAT 5.3:
[xt629.u04n034 B12_cuda]$ cat /etc/redhat-release
Scientific Linux SL release 5.3 (Boron)
[xt629.u04n034 B12_cuda]$

And I think you're hitting some *weird* RHAT 5.3/gcc/CUDA bug... CUDA sets
a minimum expectation of RHAT 5.5. I'm going to ask my friends at NVIDIA
about this, but they're likely to say they don't support this configuration
anymore.
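
For anyone who wants to check their own nodes the same way, here is a minimal sketch in Python. It only parses the release line and compares it against the 5.5 minimum mentioned above; that minimum is taken from this message, not from NVIDIA's documentation, and the function names are made up for illustration.

```python
import re

# Minimum release as stated in this thread (an assumption, not an official value).
MIN_RELEASE = (5, 5)


def parse_release(text):
    """Return (major, minor) from a line like 'Scientific Linux SL release 5.3 (Boron)'."""
    match = re.search(r"release\s+(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no release number found in: {text!r}")
    # A missing minor number (e.g. 'release 6') is treated as minor version 0.
    return int(match.group(1)), int(match.group(2) or 0)


def meets_minimum(text, minimum=MIN_RELEASE):
    """True if the parsed release is at least the given (major, minor) minimum."""
    return parse_release(text) >= minimum


# The release line reported by the cluster in this thread.
line = "Scientific Linux SL release 5.3 (Boron)"
print(line, "->", "OK" if meets_minimum(line) else "older than minimum")
```

On a real node the string would come from /etc/redhat-release instead of being hard-coded.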

The experiment to run would be a single node with Scientific Linux 5.5 to
see if that fixes things. I have seen equally weird behavior like this
before on weird builds of SUSE.
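
To judge whether such a test run is clean, one could scan the energy output for the NaN values reported earlier in the thread. The sketch below is deliberately generic: it assumes nothing about the output layout beyond "NaN" appearing as a standalone token, and the helper name and sample lines are hypothetical.

```python
import re

# Match 'nan' in any letter case, but only as a standalone token,
# so that words such as 'nanosecond' are not flagged.
NAN_TOKEN = re.compile(r"(?<![A-Za-z0-9_])nan(?![A-Za-z0-9_])", re.IGNORECASE)


def first_nan_line(lines):
    """Return the 1-based number of the first line containing a NaN token, or None."""
    for number, line in enumerate(lines, start=1):
        if NAN_TOKEN.search(line):
            return number
    return None


# Hypothetical sample lines, not real output from this system.
sample = [
    "step 1  energy = -1234.5",
    "step 2  energy = NaN",
]
print(first_nan_line(sample))
```

To check a real run, pass an open file object for the output file as `lines`.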

On Mon, Sep 10, 2012 at 8:18 AM, Patrick von Glehn <
patrickvonglehn.gmail.com> wrote:

> Hi Scott,
>
> We are in the process of getting you a temporary account on our
> clusters. Details to follow off list shortly.
>
> All the best,
> Patrick
>
> On 5 September 2012 13:50, Scott Le Grand <varelse2005.gmail.com> wrote:
> > Yep, it runs all 25,000 steps. Looking at the output you guys sent,
> > whatever's happening to Patrick is hosed from the get-go...
> >
> >
> >
> > On Wed, Sep 5, 2012 at 4:18 AM, Marc van der Kamp
> > <marcvanderkamp.gmail.com> wrote:
> >
> >> Hi again,
> >>
> >> The old driver was used by mistake (not commented out). Now 295.41 is
> >> installed, but Patrick's problem persists, i.e. NaN values appear in the
> >> energy output after a few steps. (With Amber 11 pmemd.cuda the simulation
> >> kept running even though it was producing NaN values.)
> >> I assume that when you say "it runs at my end", you mean there aren't
> >> any NaN values?
> >>
> >> I'll ask about getting you an account on the machine in question. It is our
> >> institution-wide (University of Bristol) HPC cluster (BlueCrystal), so I'm
> >> not sure if that is going to be possible.
> >>
> >> --Marc
> >>
> >> On 4 September 2012 22:34, Scott Le Grand <varelse2005.gmail.com> wrote:
> >>
> >> > Well, can your sysadmin give me an account on the machine in question?
> >> >
> >> > Or please point him to http://developer.nvidia.com/cuda/cuda-downloads
> >> >
> >> > It indicates 295.41 as the latest driver...
> >> >
> >> > Your point about passing test.cuda is valid, but since it runs at my end,
> >> > there's not much else I can suggest...
> >> >
> >> >
> >> > On Tue, Sep 4, 2012 at 1:31 PM, Marc van der Kamp
> >> > <marcvanderkamp.gmail.com> wrote:
> >> >
> >> > > PS As pmemd.cuda essentially passed make test.cuda completely, I'd think
> >> > > the driver shouldn't be the issue?
> >> > > It is only Patrick's test cases with transition metals in co-factors that
> >> > > are failing. So you can run these same test cases without problems?
> >> > > --Marc
> >> > >
> >> > > On 4 September 2012 21:21, Marc van der Kamp <marcvanderkamp.gmail.com> wrote:
> >> > >
> >> > > > Thanks Scott,
> >> > > > The driver was installed by the sysadmin - I don't have root access on
> >> > > > this cluster.
> >> > > > He wrote to me that it was "the latest driver", but apparently not. I'll
> >> > > > ask him to download a fresh driver from nvidia.
> >> > > >
> >> > > > Can pmemd.cuda post-bugfix.9 still be compiled with the 4.0 toolkit? I
> >> > > > initially tried this, but got an error saying I needed to use 4.2.
> >> > > >
> >> > > > Thanks,
> >> > > > --Marc
> >> > > >
> >> > > > On 4 September 2012 21:15, Scott Le Grand <varelse2005.gmail.com> wrote:
> >> > > >
> >> > > >> Your driver is far too old for 4.2. Either install a newer driver or use
> >> > > >> the 4.0 toolkit...
> >> > > >> On Sep 4, 2012 1:08 PM, "Marc van der Kamp" <marcvanderkamp.gmail.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >> > Hi Scott,
> >> > > >> >
> >> > > >> > I compiled pmemd.cuda_SPFP for Patrick and ran make test.cuda. All tests
> >> > > >> > passed, apart from a few (6 I think) that only had minor differences in
> >> > > >> > values (different 4th digit).
> >> > > >> >
> >> > > >> > The CUDA Toolkit:
> >> > > >> > $ nvcc -V
> >> > > >> > nvcc: NVIDIA (R) Cuda compiler driver
> >> > > >> > Copyright (c) 2005-2012 NVIDIA Corporation
> >> > > >> > Built on Thu_Apr__5_00:24:31_PDT_2012
> >> > > >> > Cuda compilation tools, release 4.2, V0.2.1221
> >> > > >> >
> >> > > >> > The driver was freshly installed today:
> >> > > >> > devdriver_4.0_linux_64_270.41.19
> >> > > >> >
> >> > > >> > Cards on the node where both test.cuda and Patrick's jobs ran:
> >> > > >> > $ nvidia-smi -L
> >> > > >> > GPU 0: Tesla M2050 (S/N: 0322310084063)
> >> > > >> > GPU 1: Tesla M2050 (S/N: 0322310082367)
> >> > > >> >
> >> > > >> > Hope this helps,
> >> > > >> > Marc
> >> > > >> >
> >> > > >> >
> >> > > >> > On 4 September 2012 19:44, Scott Le Grand <varelse2005.gmail.com> wrote:
> >> > > >> >
> >> > > >> > > Does your build of pmemd.cuda pass a make test.cuda?
> >> > > >> > >
> >> > > >> > > Also what CUDA Toolkit/Display driver are you using?
> >> > > >> > >
> >> > > >> > > On Tue, Sep 4, 2012 at 10:27 AM, Patrick von Glehn <
> >> > > >> > > patrickvonglehn.gmail.com> wrote:
> >> > > >> > >
> >> > > >> > > > Hi Jason and Scott,
> >> > > >> > > >
> >> > > >> > > > Unfortunately bugfix 9 has not solved the problem.
> >> > > >> > > >
> >> > > >> > > > To reiterate for anyone else who is interested, molecular dynamics on
> >> > > >> > > > my system of interest runs smoothly with pmemd but the system blows up
> >> > > >> > > > when run with pmemd.cuda on GPUs (a few atoms in the region of the
> >> > > >> > > > hexacoordinated cobalt fly off in different directions). This happens
> >> > > >> > > > with either a 0.002 ps timestep or a 0.000002 ps timestep.
> >> > > >> > > >
> >> > > >> > > > I initially ran the calculations on NVIDIA Tesla M2090 GPUs with
> >> > > >> > > > pmemd.cuda_SPDP and then I tried again on NVIDIA Fermi M2050 GPUs with
> >> > > >> > > > bugfix.9 applied.
> >> > > >> > > >
> >> > > >> > > > Input files can be found attached to the first message in this thread.
> >> > > >> > > >
> >> > > >> > > > Any help would be greatly appreciated,
> >> > > >> > > >
> >> > > >> > > > Patrick von Glehn
> >> > > >> > > > PhD student in the Harvey and Mulholland groups
> >> > > >> > > > Centre for Computational Chemistry
> >> > > >> > > > University of Bristol
> >> > > >> > > >
> >> > > >> > > > On 22 August 2012 15:50, Jason Swails <jason.swails.gmail.com> wrote:
> >> > > >> > > > > On Wed, Aug 22, 2012 at 10:28 AM, Patrick von Glehn <
> >> > > >> > > > > patrickvonglehn.gmail.com> wrote:
> >> > > >> > > > >
> >> > > >> > > > >> Hi Scott,
> >> > > >> > > > >>
> >> > > >> > > > >> Thanks for your reply.
> >> > > >> > > > >>
> >> > > >> > > > >> Do you have reason to believe that the new patch will resolve this
> >> > > >> > > > >> error? Were you able to reproduce the error with an unpatched version
> >> > > >> > > > >> of amber? Also, forgive my ignorance, but what does TOT mean?
> >> > > >> > > > >>
> >> > > >> > > > >
> >> > > >> > > > > Top Of Tree, I think :). What this means is that he doesn't see the error
> >> > > >> > > > > with the soon-to-be-released pmemd.cuda upgrade (I don't think the current
> >> > > >> > > > > version of amber was tested, but the upcoming patch is known to have fixed
> >> > > >> > > > > a handful of bugs).
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > >> What sort of timescale are we talking about here for the new patch
> >> > > >> > > > >> release? Days/weeks/months? I am very keen to get my GPU simulations
> >> > > >> > > > >> going!
> >> > > >> > > > >>
> >> > > >> > > > >
> >> > > >> > > > > No promises here, but in conversations I've had with Ross, I would say
> >> > > >> > > > > we're aiming for 'days'. The patch is a large one, and has to be handled
> >> > > >> > > > > with care, but we're taking a crack at generating the patch tonight. If
> >> > > >> > > > > the merge goes smoothly and everything tests out correctly the first time
> >> > > >> > > > > through, you probably will not have more than a few days to wait.
> >> > > >> > > > >
> >> > > >> > > > > HTH,
> >> > > >> > > > > Jason
> >> > > >> > > > >
> >> > > >> > > > > --
> >> > > >> > > > > Jason M. Swails
> >> > > >> > > > > Quantum Theory Project,
> >> > > >> > > > > University of Florida
> >> > > >> > > > > Ph.D. Candidate
> >> > > >> > > > > 352-392-4032
> >> > > >> > > > > _______________________________________________
> >> > > >> > > > > AMBER mailing list
> >> > > >> > > > > AMBER.ambermd.org
> >> > > >> > > > > http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Sep 11 2012 - 10:30:04 PDT