Re: [AMBER] Failure kReduceSoluteCOM with GPU

From: Scott Le Grand <varelse2005.gmail.com>
Date: Tue, 9 Aug 2011 09:03:23 -0700

This bug is fixed in the upcoming patch.
On Aug 9, 2011 7:47 AM, "Ismail, Mohd F." <farid.ou.edu> wrote:
> I have also stumbled into this error. My system is a 64,000 atom system
> made of 4000 dimethoxyethane molecules.
>
> My system is a dual Opteron with a GTX 590, CUDA 3.2, NVIDIA driver 280.16,
> and gfortran.
>
> If I run the system in the NPT ensemble, it fails. But if I run in the NVT
> ensemble, it works fine. I have not tested an NVE calculation. Both runs
> work on the CPU. The weird thing is that the benchmark system
> Factorial....NPT runs fine. The only difference I found is that the benchmark
> system uses the old-format parameter file, whereas my system uses the new
> format (the part in the mdout that says "New format PARM file being
> parsed.").
>
> I assume the old format PARM is for AmberTools <1.5, no?
>
> Best,
> Farid Ismail
> Graduate Student
> University of Oklahoma
>
> ________________________________________
> From: Scott Le Grand [varelse2005.gmail.com]
> Sent: Monday, August 01, 2011 9:09 PM
> To: amber.ambermd.org
> Subject: Re: [AMBER] Failure kReduceSoluteCOM with GPU
>
> OK I found the problem. It's a pretty obscure corner case you have there -
> lots and lots of solute molecules. I never expected anyone to take this as
> far as you have. But since I expect the situation to only get worse as
> people throw more and more membrane simulations at it, I am addressing this
> once and for all to automagically handle your system and any other one that
> may come its way.
>
> I need a few more days for this but the upcoming patch *will* address this.
> You have 1835 or so solute molecules - that blows out the L1 cache on C20xx
> which can only hold enough data for 1535. I'm modifying things to spill out
> of L1 when this happens instead of getting weird and wrong as they currently
> do.
>
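A rough way to check whether a system is likely to hit this corner case is to count its solute molecules before running. The sketch below is only an illustration: it assumes a new-format prmtop whose SOLVENT_POINTERS section holds IPTRES, NSPM and NSPSOL in that order, and it assumes that "solute molecules" means every molecule before the first solvent molecule (NSPSOL - 1). The 1535 figure is the one quoted above; the function names are hypothetical and not part of AMBER.

```python
# Limit quoted in this thread for C20xx-class cards (taken from the email,
# not from any AMBER source file).
SOLUTE_LIMIT = 1535


def read_flag_ints(prmtop_text, flag):
    """Return the integers stored under '%FLAG <flag>' in prmtop text."""
    values, active = [], False
    for line in prmtop_text.splitlines():
        if line.startswith("%FLAG"):
            # Start collecting only when the requested section begins.
            active = line.split()[1] == flag
        elif active and not line.startswith("%"):
            # Data lines; %FORMAT and %COMMENT lines are skipped.
            values.extend(int(tok) for tok in line.split())
    return values


def solute_molecule_count(prmtop_text):
    """Assumed definition: molecules before the first solvent molecule."""
    pointers = read_flag_ints(prmtop_text, "SOLVENT_POINTERS")
    if len(pointers) < 3:
        raise ValueError("no SOLVENT_POINTERS section found")
    _iptres, _nspm, nspsol = pointers[:3]
    return nspsol - 1


def exceeds_limit(prmtop_text, limit=SOLUTE_LIMIT):
    """True if the solute count is above the limit quoted in the thread."""
    return solute_molecule_count(prmtop_text) > limit
```

Typical use would be `exceeds_limit(open("micel2.3.prmtop").read())`. Whitespace splitting is a simplification of the fixed-width format and would break if adjacent fields ever ran together.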
> GPU AMBER is in some ways a victim of its own success I guess :-)...
>
> Scott
>
>
>
>
>
> 2011/7/30 Fabrício Bracht <bracht.iq.ufrj.br>
>
>> Hi Scott. This calculation works fine on amber cpu. I haven't had this
>> type of problem with the restraint energy. But I'll look into it.
>>
>> As for Ross' answer. I'm glad to see that the problem reproduces with
>> you. If you need anything, just let me know.
>> Thank you again.
>> Fabrício
>>
>> 2011/7/30 Scott Le Grand <varelse2005.gmail.com>:
>> > First, try running this on CPU AMBER. It looks to me like it's broken there
>> > as well. This is because your restraint energy starts off somewhere in the
>> > proximity of Neptune on the first iteration. What happens from there is
>> > dependent on whether you're on a GPU or CPU to some extent but it's separate
>> > but equally bad.
>> >
>> > To see what I mean, set ntpr=1 in your md.in file and compare step 1
>> > energies on both CPU and GPU.
>> >
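The step 1 comparison suggested above can be done by eye, but a small script makes the outliers obvious. This is a minimal sketch, assuming the usual mdout layout in which each energy block starts with a line containing "NSTEP =", consists of "NAME = value" pairs, and ends with a dashed line. The function names and the tolerance are illustrative choices, not part of AMBER.

```python
import re

# "NAME = value" pairs such as "Etot   =   -100.0000" or "1-4 NB = 12.5".
_PAIR = re.compile(r"([^=\n]+?)\s*=\s*([-+]?\d+\.?\d*(?:[eE][-+]?\d+)?)")
_NSTEP = re.compile(r"NSTEP\s*=")


def first_step_energies(mdout_text):
    """Collect the name/value pairs of the first energy block in an mdout."""
    energies, in_block = {}, False
    for line in mdout_text.splitlines():
        if not in_block and _NSTEP.search(line):
            in_block = True
        if in_block:
            if line.strip().startswith("---"):
                break  # the dashed line closes the first block
            for name, value in _PAIR.findall(line):
                energies[name.strip()] = float(value)
    return energies


def compare_step1(cpu_text, gpu_text, rel_tol=1e-4):
    """Return {name: (cpu, gpu)} for terms that differ beyond rel_tol."""
    cpu = first_step_energies(cpu_text)
    gpu = first_step_energies(gpu_text)
    diffs = {}
    for name in cpu.keys() & gpu.keys():
        a, b = cpu[name], gpu[name]
        if abs(a - b) > rel_tol * max(abs(a), abs(b), 1.0):
            diffs[name] = (a, b)
    return diffs
```

With ntpr=1 set, reading the CPU and GPU mdout files and passing their contents to `compare_step1` lists any term, RESTRAINT included, that disagrees at the first printed step. Fields that overflow to asterisks are not parsed by this sketch.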
>> > Also, I got a file from Ross to replicate this but your command line in this
>> > thread uses different file names than are in the archive. Could you send me
>> > your exact command line based on what you sent Ross?
>> >
>> > Scott
>> >
>> >
>> >
>> > 2011/7/29 Fabrício Bracht <bracht.iq.ufrj.br>
>> >
>> >> Hi Ross. Just checking to see if you received my last email with the
>> >> file attached.
>> >> Thank you
>> >> Fabrício
>> >>
>> >> 2011/7/27 Ross Walker <ross.rosswalker.co.uk>:
>> >> > Hi Fabricio,
>> >> >
>> >> > If they are identical this means that this may be a new bug, although we may
>> >> > have already inadvertently fixed it in the development version. Can you send
>> >> > me your input files please (direct to me is fine) so I can try it here and
>> >> > see if I can reproduce it.
>> >> >
>> >> > All the best
>> >> > Ross
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Fabrício Bracht [mailto:bracht.iq.ufrj.br]
>> >> >> Sent: Wednesday, July 27, 2011 12:05 PM
>> >> >> To: AMBER Mailing List
>> >> >> Subject: Re: [AMBER] Failure kReduceSoluteCOM with GPU
>> >> >>
>> >> >> Hi Ross. Here is my result to md5sum *.
>> >> >> md5sum: B40C: Is a directory
>> >> >> f4ed79de194d836246009d5c29051574 cuda_info.fpp
>> >> >> a9e4f660fcb5347b1273a8e3f76d3e74 gpu.cpp
>> >> >> 307e64e078aa5f1f22bd78fd224c9f4b gpu.h
>> >> >> 9e6a4f93e46046cda29369feb0dd32e8 gputypes.cpp
>> >> >> 46f8ccf2bbee063ff35a73945b16a3a2 gputypes.h
>> >> >> 90ba8d068522a00074707a529469f5ea kCalculateGBBornRadii.cu
>> >> >> 97fbbcfb8a3833509d94072ecab05643 kCalculateGBNonbondEnergy1.cu
>> >> >> 79fb7a5bba2a19ba351a7dd5996d31fc kCalculateGBNonbondEnergy2.cu
>> >> >> 67a458e51a76162edbcc907e7135500c kCalculateLocalForces.cu
>> >> >> ce308f4fbe9468d5505beb0099d58e76 kCalculatePMENonbondEnergy.cu
>> >> >> 9b240d418e391a71b590e6dc3bc3b0ff kCCF.h
>> >> >> 5561a56bc236291cb87b4770453d67a4 kCLF.h
>> >> >> 86f220029e3a943a186ebcfd16e2dcd9 kCPNE.h
>> >> >> 9905ed2e705bccf1ae705279d85d0e57 kForcesUpdate.cu
>> >> >> edf2d74af7a4d401ccecc7bfa6d036c3 kNeighborList.cu
>> >> >> fd65d023597024a68565c5a0e5ffd86c kNTPKernels.h
>> >> >> 49f952b429618228fca8e23f44223c58 kPGGW.h
>> >> >> 4aea91b87cbb3cf62b9fddafe607ab48 kPGS.h
>> >> >> 9c5951cdf94402d2c0396b74498f72f5 kPMEInterpolation.cu
>> >> >> 46f01611524128ea428c069ef58bd421 kPSSE.h
>> >> >> ada7d510598c88ed4adb8d32a9dbf73d kRandom.h
>> >> >> eefe9bd32e04ba2bbe2eb5611a6464bd kShake.cu
>> >> >> b07e184d2840ffae27d8af5415fae04a kU.h
>> >> >> 6947e1fae477c0bb9c637062a0ddbfd8 Makefile
>> >> >> e5a6173273e6812669c21abcd1530226 Makefile.advanced
>> >> >> They are exactly the same. Now I really don't know what to do. What do
>> >> >> you suggest?
>> >> >> Fabrício Bracht
>> >> >>
>> >> >> 2011/7/27 Ross Walker <ross.rosswalker.co.uk>:
>> >> >> > Hi Fabricio,
>> >> >> >
>> >> >> > Please take a look at the following which explains what md5sums are:
>> >> >> > http://en.wikipedia.org/wiki/Md5sum
>> >> >> >
>> >> >> > In summary it creates an 'almost' unique fingerprint of a file. Thus if I
>> >> >> > run md5sum on the files in my directory and you run md5sum on the files in
>> >> >> > your directory one can compare the fingerprints produced. If they are the
>> >> >> > same then we know the files are identical. The following is the list of
>> >> >> > md5sums for the files in my cuda directory which represents the currently
>> >> >> > fully up to date released copy of AMBER with all bugfixes applied. You
>> >> >> > should go to your machine and do the following:
>> >> >> >
>> >> >> > cd $AMBERHOME/src
>> >> >> > make clean
>> >> >> > cd pmemd/src/cuda
>> >> >> > md5sum *
>> >> >> >
>> >> >> > And then see if the fingerprint given (the bunch of letters and numbers
>> >> >> > before each file) matches those I list below for each file. If they do then
>> >> >> > we know your patch was all applied correctly and your system may be
>> >> >> > highlighting a real bug in the code. Note the GTX275 and GTX460 are VERY
>> >> >> > different chip architectures, which is why a subtle bug such as this may
>> >> >> > only manifest itself on one card and not the other.
>> >> >> >
>> >> >> > All the best
>> >> >> > Ross
>> >> >> >
>> >> >> > foo.linux-jh9j:~/amber11_as_of_jul_22/src/pmemd/src/cuda> md5sum *
>> >> >> > md5sum: B40C: Is a directory
>> >> >> > f4ed79de194d836246009d5c29051574 cuda_info.fpp
>> >> >> > a9e4f660fcb5347b1273a8e3f76d3e74 gpu.cpp
>> >> >> > 307e64e078aa5f1f22bd78fd224c9f4b gpu.h
>> >> >> > 9e6a4f93e46046cda29369feb0dd32e8 gputypes.cpp
>> >> >> > 46f8ccf2bbee063ff35a73945b16a3a2 gputypes.h
>> >> >> > 90ba8d068522a00074707a529469f5ea kCalculateGBBornRadii.cu
>> >> >> > 97fbbcfb8a3833509d94072ecab05643 kCalculateGBNonbondEnergy1.cu
>> >> >> > 79fb7a5bba2a19ba351a7dd5996d31fc kCalculateGBNonbondEnergy2.cu
>> >> >> > 67a458e51a76162edbcc907e7135500c kCalculateLocalForces.cu
>> >> >> > ce308f4fbe9468d5505beb0099d58e76 kCalculatePMENonbondEnergy.cu
>> >> >> > 9b240d418e391a71b590e6dc3bc3b0ff kCCF.h
>> >> >> > 5561a56bc236291cb87b4770453d67a4 kCLF.h
>> >> >> > 86f220029e3a943a186ebcfd16e2dcd9 kCPNE.h
>> >> >> > 9905ed2e705bccf1ae705279d85d0e57 kForcesUpdate.cu
>> >> >> > edf2d74af7a4d401ccecc7bfa6d036c3 kNeighborList.cu
>> >> >> > fd65d023597024a68565c5a0e5ffd86c kNTPKernels.h
>> >> >> > 49f952b429618228fca8e23f44223c58 kPGGW.h
>> >> >> > 4aea91b87cbb3cf62b9fddafe607ab48 kPGS.h
>> >> >> > 9c5951cdf94402d2c0396b74498f72f5 kPMEInterpolation.cu
>> >> >> > 46f01611524128ea428c069ef58bd421 kPSSE.h
>> >> >> > ada7d510598c88ed4adb8d32a9dbf73d kRandom.h
>> >> >> > eefe9bd32e04ba2bbe2eb5611a6464bd kShake.cu
>> >> >> > b07e184d2840ffae27d8af5415fae04a kU.h
>> >> >> > 6947e1fae477c0bb9c637062a0ddbfd8 Makefile
>> >> >> > e5a6173273e6812669c21abcd1530226 Makefile.advanced
>> >> >> >
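Comparing two long md5sum listings by eye is error prone, so the check Ross describes can also be scripted. The following is a minimal sketch, assuming both listings are saved as text in the plain `md5sum *` output format shown above (digest, whitespace, file name). The function names are illustrative, and file names containing spaces are not handled.

```python
HEX_DIGITS = set("0123456789abcdef")


def parse_md5sum(text):
    """Map file name -> digest, ignoring lines that are not digest lines."""
    digests = {}
    for line in text.splitlines():
        parts = line.split()
        # A digest line is "<32 hex characters> <file name>"; messages such as
        # "md5sum: B40C: Is a directory" do not match and are skipped.
        if len(parts) == 2 and len(parts[0]) == 32 and set(parts[0].lower()) <= HEX_DIGITS:
            # md5sum prefixes the name with '*' in binary mode; drop it.
            digests[parts[1].lstrip("*")] = parts[0].lower()
    return digests


def compare_md5(reference_text, local_text):
    """Return (mismatched, missing) file names relative to the reference."""
    ref = parse_md5sum(reference_text)
    loc = parse_md5sum(local_text)
    mismatched = sorted(n for n in ref if n in loc and ref[n] != loc[n])
    missing = sorted(n for n in ref if n not in loc)
    return mismatched, missing
```

If both returned lists are empty, every file in the reference listing is present locally with an identical fingerprint.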
>> >> >> >> -----Original Message-----
>> >> >> >> From: Fabrício Bracht [mailto:bracht.iq.ufrj.br]
>> >> >> >> Sent: Wednesday, July 27, 2011 8:53 AM
>> >> >> >> To: AMBER Mailing List; Scott Brozell
>> >> >> >> Subject: Re: [AMBER] Failure kReduceSoluteCOM with GPU
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >> I've only found $AMBERHOME/AmberTools/src/configure.rej .
>> >> >> >> I've checked the files that were supposed to be patched by bugfix.11,
>> >> >> >> but wasn't able to confirm if they were patched or not due to my lack
>> >> >> >> of programming knowledge. Any tips here?
>> >> >> >> One other thing. Why is it that this simulation ran successfully on my
>> >> >> >> GTX275 computer but has problems with my GTX460?
>> >> >> >> Thank you
>> >> >> >> Fabrício
>> >> >> >>
>> >> >> >> 2011/7/27 Scott Brozell <sbrozell.rci.rutgers.edu>:
>> >> >> >> > Hi,
>> >> >> >> >
>> >> >> >> > If part of a patch failed to apply, the patch command should have
>> >> >> >> > created a reject file: blabla.rej.
>> >> >> >> > So look for files with a rej extension.
>> >> >> >> > Also since in bugfix 11 there are only a few files to be patched in
>> >> >> >> > src/pmemd/src/cuda, you could look at those files to see if the
>> >> >> >> > patch has been applied:
>> >> >> >> > http://ambermd.org/bugfixes/11.0/bugfix.11
>> >> >> >> >
>> >> >> >> > scott
>> >> >> >> >
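Searching for reject files by hand gets tedious in a tree the size of AMBER. As a small sketch of the check described above, the function below walks a directory and lists every file ending in .rej. The function name is hypothetical, and pointing it at $AMBERHOME is an assumption about where the patches were applied.

```python
from pathlib import Path


def find_rejects(root):
    """Return the sorted relative paths of all *.rej files under root."""
    root = Path(root)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*.rej")
        if path.is_file()
    )
```

For example, `find_rejects(os.environ["AMBERHOME"])` (after `import os`) returns an empty list when no reject files are present. Any entry it does return names a file where at least one hunk was rejected.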
>> >> >> >> > On Tue, Jul 26, 2011 at 10:07:28AM -0300, Fabrício Bracht wrote:
>> >> >> >> >> Hi Scott. How do I check if this specific bugfix has been applied
>> >> >> >> >> correctly? Would it be something like md5sum * in
>> >> >> >> >> $AMBERHOME/src/pmemd/src/cuda/ . And what should I look for?
>> >> >> >> >> Thank you
>> >> >> >> >> Fabrício
>> >> >> >> >>
>> >> >> >> >> 2011/7/26 Scott Brozell <sbrozell.rci.rutgers.edu>:
>> >> >> >> >> > Hi,
>> >> >> >> >> >
>> >> >> >> >> > This looks like a problem addressed by bugfix.11.
>> >> >> >> >> > I have not been following your threads closely,
>> >> >> >> >> > but I read that you were having problems with the bugfixes.
>> >> >> >> >> > You might inspect the files listed in bugfix.11 to determine
>> >> >> >> >> > whether the bugfixes were really applied, while you are waiting
>> >> >> >> >> > for someone who has been following your threads closely to reply.
>> >> >> >> >> >
>> >> >> >> >> > scott
>> >> >> >> >> >
>> >> >> >> >> > On Tue, Jul 26, 2011 at 12:44:10AM -0300, Fabrício Bracht wrote:
>> >> >> >> >> >> Since I was finally able to compile amber11 with cuda support
>> >> >> >> >> >> for my gtx460, I thought everything was fine, but it seems that now I
>> >> >> >> >> >> have to set a few things in order to get my system running again. Let
>> >> >> >> >> >> me explain more.
>> >> >> >> >> >> I was simulating a protein inside a micelle. I had a few tens of
>> >> >> >> >> >> nanoseconds simulated on a gtx275. The system is composed of water,
>> >> >> >> >> >> organic solvent, surfactant, counterions and my protein (approx. 60000
>> >> >> >> >> >> atoms). When I tried to start a simulation using my restart files from
>> >> >> >> >> >> the GTX275 on my gtx460 machine, I got the following error.
>> >> >> >> >> >> Error: unspecified launch failure launching kernel kReduceSoluteCOM
>> >> >> >> >> >> cudaFree GpuBuffer::Deallocate failed unspecified launch failure
>> >> >> >> >> >>
>> >> >> >> >> >> I thought it might have something to do with a problem in the restart
>> >> >> >> >> >> file or something like this, so I recreated the inpcrd and prmtop
>> >> >> >> >> >> files for the last configuration and tried to start a fresh run on
>> >> >> >> >> >> my gtx460 machine. Well, it didn't work out. I got the same error
>> >> >> >> >> >> lines again.
>> >> >> >> >> >> Here is my configuration file.
>> >> >> >> >> >> MD parameters
>> >> >> >> >> >> &cntrl
>> >> >> >> >> >> imin = 0,
>> >> >> >> >> >> irest = 1,
>> >> >> >> >> >> ntx = 7,
>> >> >> >> >> >> ntb = 2, pres0 = 1.0, ntp = 1, taup = 2.0,
>> >> >> >> >> >> cut = 9.0,
>> >> >> >> >> >> ntr = 1,
>> >> >> >> >> >> ntc = 2,
>> >> >> >> >> >> ntf = 2,
>> >> >> >> >> >> tempi = 300.0,
>> >> >> >> >> >> temp0 = 300.0,
>> >> >> >> >> >> ntt = 3,
>> >> >> >> >> >> gamma_ln = 1.0,
>> >> >> >> >> >> nstlim = 5000000, dt = 0.002,
>> >> >> >> >> >> ntpr = 10000, ntwx = 10000, ntwr = 1000
>> >> >> >> >> >> /
>> >> >> >> >> >> Restraints
>> >> >> >> >> >> 5.0
>> >> >> >> >> >> RES 1 317
>> >> >> >> >> >> END
>> >> >> >> >> >> END
>> >> >> >> >> >>
>> >> >> >> >> >> And here is the command line:
>> >> >> >> >> >> pmemd.cuda -O -i md.in -c micel2.3.inpcrd -p micel2.3.prmtop -r md3.rst -o md3.out -ref micela2.3.inpcrd -inf md3.info -x md3.mdcrd
>> >> >> >> >> >
>> >> >> >> >
>> >> >> >> > _______________________________________________
>> >> >> >> > AMBER mailing list
>> >> >> >> > AMBER.ambermd.org
>> >> >> >> > http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Aug 09 2011 - 09:30:02 PDT