Re: [AMBER] transition metal system works on pmemd crashes on pmemd.cuda

From: Scott Le Grand <varelse2005.gmail.com>
Date: Wed, 12 Sep 2012 09:15:24 -0700

Confirmed - from NVIDIA:

"Confirmed that CUDA 3.0 is the last release that officially supported RHEL
5.3."


On Tue, Sep 11, 2012 at 10:19 AM, Scott Le Grand <varelse2005.gmail.com> wrote:

> The good news: I think I know what's going on...
>
> The bad news: there's probably nothing I can do about it...
>
> That cluster is running RHAT 5.3:
> [xt629.u04n034 B12_cuda]$ cat /etc/redhat-release
> Scientific Linux SL release 5.3 (Boron)
> [xt629.u04n034 B12_cuda]$
>
> And I think you're hitting some *weird* RHAT 5.3/gcc/CUDA bug... CUDA
> sets a minimum expectation of RHAT 5.5. I'm going to ask my friends at
> NVIDIA about this, but they're likely to say they don't support this
> configuration anymore.
>
> The experiment to run would be a single node with Scientific Linux 5.5 to
> see if that fixes things. I have seen equally weird behavior like this
> before on weird builds of SUSE.
>
> On Mon, Sep 10, 2012 at 8:18 AM, Patrick von Glehn <
> patrickvonglehn.gmail.com> wrote:
>
>> Hi Scott,
>>
>> We are in the process of getting you a temporary account on our
>> clusters. Details to follow off list shortly.
>>
>> All the best,
>> Patrick
>>
>> On 5 September 2012 13:50, Scott Le Grand <varelse2005.gmail.com> wrote:
>> > Yep, it runs all 25,000 steps. Looking at the output you guys sent,
>> > whatever's happening to Patrick is hosed from the get-go...
>> >
>> >
>> >
>> > On Wed, Sep 5, 2012 at 4:18 AM, Marc van der Kamp
>> > <marcvanderkamp.gmail.com> wrote:
>> >
>> >> Hi again,
>> >>
>> >> The old driver was used by mistake (not commented out). Now 295.41 is
>> >> installed, but Patrick's problem persists, i.e. NaN values appear in the
>> >> energy output after a few steps. (With Amber 11 pmemd.cuda the simulation
>> >> kept running even though it was producing NaN values.)
>> >> I assume that when you say "it runs at my end", you mean there aren't any
>> >> NaN values?
>> >>
>> >> I'll ask about getting you an account on the machine in question. It is our
>> >> institution-wide (University of Bristol) HPC cluster (BlueCrystal), so I'm
>> >> not sure if that is going to be possible.
>> >>
>> >> --Marc
>> >>
>> >> On 4 September 2012 22:34, Scott Le Grand <varelse2005.gmail.com> wrote:
>> >>
>> >> > Well, can your sysadmin give me an account on the machine in question?
>> >> >
>> >> > Or please point him to http://developer.nvidia.com/cuda/cuda-downloads
>> >> >
>> >> > It indicates 295.41 as the latest driver...
>> >> >
>> >> > Your point about passing test.cuda is valid, but since it runs at my end,
>> >> > there's not much else I can suggest...
>> >> >
>> >> >
>> >> > On Tue, Sep 4, 2012 at 1:31 PM, Marc van der Kamp
>> >> > <marcvanderkamp.gmail.com> wrote:
>> >> >
>> >> > > PS As pmemd.cuda essentially passed make test.cuda completely, I'd think
>> >> > > the driver shouldn't be the issue?
>> >> > > It is only Patrick's test cases with transition metals in co-factors that
>> >> > > are failing. So you can run these same test cases without problems?
>> >> > > --Marc
>> >> > >
>> >> > > On 4 September 2012 21:21, Marc van der Kamp <marcvanderkamp.gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > Thanks Scott,
>> >> > > > The driver was installed by the sysadmin - I don't have root access on
>> >> > > > this cluster.
>> >> > > > He wrote to me that it was "the latest driver", but apparently not. I'll
>> >> > > > ask him to download a fresh driver from NVIDIA.
>> >> > > >
>> >> > > > Can pmemd.cuda post-bugfix.9 still be compiled with the 4.0 toolkit? I
>> >> > > > initially tried this, but got an error saying I needed to use 4.2.
>> >> > > >
>> >> > > > Thanks,
>> >> > > > --Marc
>> >> > > >
>> >> > > > On 4 September 2012 21:15, Scott Le Grand <varelse2005.gmail.com> wrote:
>> >> > > >
>> >> > > >> Your driver is far too old for 4.2. Either install a newer driver or use
>> >> > > >> the 4.0 toolkit...
>> >> > > >> On Sep 4, 2012 1:08 PM, "Marc van der Kamp" <marcvanderkamp.gmail.com>
>> >> > > >> wrote:
>> >> > > >>
>> >> > > >> > Hi Scott,
>> >> > > >> >
>> >> > > >> > I compiled pmemd.cuda_SPFP for Patrick and ran make test.cuda. All tests
>> >> > > >> > passed, apart from a few (6 I think) that only had minor differences in
>> >> > > >> > values (different 4th digit).
>> >> > > >> >
>> >> > > >> > The CUDA Toolkit:
>> >> > > >> > $ nvcc -V
>> >> > > >> > nvcc: NVIDIA (R) Cuda compiler driver
>> >> > > >> > Copyright (c) 2005-2012 NVIDIA Corporation
>> >> > > >> > Built on Thu_Apr__5_00:24:31_PDT_2012
>> >> > > >> > Cuda compilation tools, release 4.2, V0.2.1221
>> >> > > >> >
>> >> > > >> > The driver was freshly installed today:
>> >> > > >> > devdriver_4.0_linux_64_270.41.19
>> >> > > >> >
>> >> > > >> > Cards on the node where both test.cuda and Patrick's jobs ran:
>> >> > > >> > $ nvidia-smi -L
>> >> > > >> > GPU 0: Tesla M2050 (S/N: 0322310084063)
>> >> > > >> > GPU 1: Tesla M2050 (S/N: 0322310082367)
>> >> > > >> >
>> >> > > >> > Hope this helps,
>> >> > > >> > Marc
>> >> > > >> >
>> >> > > >> >
>> >> > > >> > On 4 September 2012 19:44, Scott Le Grand <varelse2005.gmail.com> wrote:
>> >> > > >> >
>> >> > > >> > > Does your build of pmemd.cuda pass a make test.cuda?
>> >> > > >> > >
>> >> > > >> > > Also what CUDA Toolkit/Display driver are you using?
>> >> > > >> > >
>> >> > > >> > > On Tue, Sep 4, 2012 at 10:27 AM, Patrick von Glehn <
>> >> > > >> > > patrickvonglehn.gmail.com> wrote:
>> >> > > >> > >
>> >> > > >> > > > Hi Jason and Scott,
>> >> > > >> > > >
>> >> > > >> > > > Unfortunately bugfix 9 has not solved the problem.
>> >> > > >> > > >
>> >> > > >> > > > To reiterate for anyone else who is interested, molecular dynamics on
>> >> > > >> > > > my system of interest runs smoothly with pmemd but the system blows up
>> >> > > >> > > > when run with pmemd.cuda on GPUs (a few atoms in the region of the
>> >> > > >> > > > hexacoordinated cobalt fly off in different directions). This happens
>> >> > > >> > > > with either a 0.002 ps timestep or a 0.000002 ps timestep.
>> >> > > >> > > >
>> >> > > >> > > > I initially ran the calculations on NVIDIA Tesla M2090 GPUs with
>> >> > > >> > > > pmemd.cuda_SPDP and then I tried again on NVIDIA Fermi M2050 GPUs with
>> >> > > >> > > > bugfix.9 applied.
>> >> > > >> > > >
>> >> > > >> > > > Input files can be found attached to the first message in this thread.
>> >> > > >> > > >
>> >> > > >> > > > Any help would be greatly appreciated,
>> >> > > >> > > >
>> >> > > >> > > > Patrick von Glehn
>> >> > > >> > > > PhD student in the Harvey and Mulholland groups
>> >> > > >> > > > Centre for Computational Chemistry
>> >> > > >> > > > University of Bristol
>> >> > > >> > > >
>> >> > > >> > > > On 22 August 2012 15:50, Jason Swails <jason.swails.gmail.com> wrote:
>> >> > > >> > > > > On Wed, Aug 22, 2012 at 10:28 AM, Patrick von Glehn <
>> >> > > >> > > > > patrickvonglehn.gmail.com> wrote:
>> >> > > >> > > > >
>> >> > > >> > > > >> Hi Scott,
>> >> > > >> > > > >>
>> >> > > >> > > > >> Thanks for your reply.
>> >> > > >> > > > >>
>> >> > > >> > > > >> Do you have reason to believe that the new patch will resolve this
>> >> > > >> > > > >> error? Were you able to reproduce the error with an unpatched version
>> >> > > >> > > > >> of amber? Also, forgive my ignorance, but what does TOT mean?
>> >> > > >> > > > >>
>> >> > > >> > > > >
>> >> > > >> > > > > Top Of Tree, I think :). What this means is that he doesn't see the error
>> >> > > >> > > > > with the soon-to-be-released pmemd.cuda upgrade (I don't think the current
>> >> > > >> > > > > version of amber was tested, but the upcoming patch is known to have fixed
>> >> > > >> > > > > a handful of bugs).
>> >> > > >> > > > >
>> >> > > >> > > > >
>> >> > > >> > > > >> What sort of timescale are we talking about here for the new patch
>> >> > > >> > > > >> release? Days/weeks/months? I am very keen to get my GPU simulations
>> >> > > >> > > > >> going!
>> >> > > >> > > > >>
>> >> > > >> > > > >
>> >> > > >> > > > > No promises here, but in conversations I've had with Ross, I would say
>> >> > > >> > > > > we're aiming for 'days'. The patch is a large one, and has to be handled
>> >> > > >> > > > > with care, but we're taking a crack at generating the patch tonight. If
>> >> > > >> > > > > the merge goes smoothly and everything tests out correctly the first time
>> >> > > >> > > > > through, you probably will not have more than a few days to wait.
>> >> > > >> > > > >
>> >> > > >> > > > > HTH,
>> >> > > >> > > > > Jason
>> >> > > >> > > > >
>> >> > > >> > > > > --
>> >> > > >> > > > > Jason M. Swails
>> >> > > >> > > > > Quantum Theory Project,
>> >> > > >> > > > > University of Florida
>> >> > > >> > > > > Ph.D. Candidate
>> >> > > >> > > > > 352-392-4032
>> >> > > >> > > > > _______________________________________________
>> >> > > >> > > > > AMBER mailing list
>> >> > > >> > > > > AMBER.ambermd.org
>> >> > > >> > > > > http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Sep 12 2012 - 09:30:04 PDT
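
For anyone hitting the same symptom described in this thread (NaN values appearing in the energy output after a few steps, with the run continuing anyway), a small script can flag the first bad step instead of spotting it by eye. This is only a rough sketch: the "LABEL = value" layout it parses and the sample lines in the usage example are assumptions made for illustration, not output from Patrick's runs, and `find_bad_energies` is a hypothetical helper name, not part of Amber.

```python
import math
import re
import sys

# Matches "LABEL = value" pairs, e.g. "Etot   =   -1234.5678" or "TEMP(K) = NaN".
# The exact layout of the energy output is an assumption made for this sketch.
FIELD = re.compile(r"([A-Za-z][A-Za-z0-9_.()\- ]*?)\s*=\s*(\S+)")


def find_bad_energies(lines):
    """Return (line_number, label, raw_value) for every non-finite field."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        for label, raw in FIELD.findall(line):
            try:
                value = float(raw)
            except ValueError:
                # Fixed-width Fortran output prints overflowing fields as '*' runs.
                if set(raw) == {"*"}:
                    bad.append((lineno, label.strip(), raw))
                continue
            if not math.isfinite(value):
                bad.append((lineno, label.strip(), raw))
    return bad


if __name__ == "__main__":
    # Usage: python check_nan.py <energy output file>
    with open(sys.argv[1]) as handle:
        for lineno, label, raw in find_bad_energies(handle):
            print(f"line {lineno}: {label} = {raw}")
```

Running it on the output of both the pmemd and pmemd.cuda runs of the same input would show whether the GPU run goes bad from the very first steps, which is what the thread suggests.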