Re: [AMBER] GB simulation on GPU freezes

From: Scott Le Grand <varelse2005.gmail.com>
Date: Mon, 17 Oct 2011 14:49:18 -0700

It's hardware... You either have some sort of defective motherboard or GPU
or BIOS, a crappy power supply, or some combo thereof.

There's nothing else to understand here. And the only people who can debug
this work at NVIDIA. And I suspect their recommendation would be to replace
parts until it works.

Try plugging those GPUs in another motherboard and/or known working GPUs in
the suspect motherboard. That's the extent of debugging you can do.

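While you swap parts, a small watchdog can at least flag a hung run early instead of two days later. The following is only a minimal sketch, not anything shipped with AMBER: it assumes the job rewrites its output files (for example mdinfo or the *.out file) at regular intervals, and the function names and the idle threshold are made up for illustration.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return the number of seconds since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def find_stalled(paths, max_idle_s, now=None):
    """Return the files in `paths` that have not been modified
    within the last `max_idle_s` seconds."""
    return [p for p in paths if seconds_since_update(p, now) > max_idle_s]
```

For example, calling find_stalled(["mdinfo", "md.out"], 3600) from a cron job would list the files that have been idle for more than an hour. Pick a threshold comfortably larger than the time the run needs between writes (ntpr and ntwx steps), otherwise a slow but healthy job will be reported as hung.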


On Mon, Oct 17, 2011 at 1:46 PM, E. Nihal Korkmaz
<enihalkorkmaz.gmail.com> wrote:

> The simulations do not freeze on a different cluster (with 2070s), so
> apparently it is something specific to that particular cluster. Are there
> any specific programs with which Amber would have an incompatibility, for
> instance while writing the output files?
>
> Or is there any way to increase the verbosity of the *.out files or the
> mdinfo file that would help me detect what causes my simulations to lock up?
>
>
> On Fri, Oct 14, 2011 at 6:15 PM, E. Nihal Korkmaz
> <enihalkorkmaz.gmail.com>wrote:
>
> > Unfortunately, it locks up in both 2070 and 480.
> >
> > I am currently trying same files on a different GPU cluster of 2070s,
> > fingers crossed!
> >
> >
> >
> > On Fri, Oct 14, 2011 at 5:29 PM, Scott Le Grand
> > <varelse2005.gmail.com> wrote:
> >
> >> If it doesn't lock up on the 2070, but does on the 480, it is likely
> >> defective HW.
> >>
> >> If it locks up on the 2070, and Ross can repro it on his 20xxs, I know
> >> what I'll be doing this weekend :-)...
> >>
> >> But shooting from the hip, I'm guessing this is a bad GPU.
> >>
> >>
> >> On Fri, Oct 14, 2011 at 2:55 PM, Ross Walker <rosscwalker.gmail.com>
> >> wrote:
> >>
> >> > Can you send me the input files for one of the simulations that locks
> >> > please so I can try to reproduce it.
> >> >
> >> > Does it lock up on both the GTX480 and C2070?
> >> >
> >> > All the best
> >> > Ross
> >> >
> >> >
> >> >
> >> > On Oct 14, 2011, at 15:28, "E. Nihal Korkmaz"
> >> > <enihalkorkmaz.gmail.com> wrote:
> >> >
> >> > > Yes, I applied the bugfix patches during the first configuration of
> >> > > Amber on the cluster, as directed on the Amber website.
> >> > >
> >> > > Not at the exact same point, but always after 500 ns for that
> >> > > particular simulation.
> >> > > I just realized it also locked up for different (shorter) proteins
> >> > > at around 200 ns. I simulate a series of the same protein under
> >> > > different conditions (T and salt conc.); some go smoothly, some get
> >> > > locked up. I checked the energy logs in the *.out file: nothing seems
> >> > > unusual, and nothing is drastically different between the simulations
> >> > > that go smoothly and those that freeze.
> >> > >
> >> > > Thanks,
> >> > > Nihal
> >> > >
> >> > > On Fri, Oct 14, 2011 at 2:15 PM, Ross Walker
> >> > > <rosscwalker.gmail.com> wrote:
> >> > >
> >> > >> There are a lot of unnecessary defaults in your input file, like
> >> > >> specifying taup for a GB run. You can probably also set ntwr much
> >> > >> larger to improve performance. And a gamma_ln of 20 is probably a
> >> > >> bit high. None of these should cause a lockup though.
> >> > >>
> >> > >> Can you confirm that you are running with the latest bugfixes? In
> >> > >> particular, bugfix.17 for Amber 11.
> >> > >>
> >> > >> Also, does the calculation always lock up at the exact same point?
> >> > >>
> >> > >> All the best
> >> > >> Ross
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Oct 14, 2011, at 14:17, "E. Nihal Korkmaz"
> >> > >> <enihalkorkmaz.gmail.com> wrote:
> >> > >>
> >> > >>> Amber 11. I tried on GeForce GTX 480 and Tesla C2070 processors, on
> >> > >>> Linux (CentOS release 5.6). We have CUDA 4 for the NVIDIA compiler.
> >> > >>> I am running with pmemd.cuda.
> >> > >>>
> >> > >>> And that's my input file below (although the same file works OK with
> >> > >>> the homologous structure):
> >> > >>> &cntrl
> >> > >>> imin=0,
> >> > >>>
> >> > >>> ntb=0,
> >> > >>> ntx=5,
> >> > >>> irest=1,
> >> > >>>
> >> > >>> ntpr=200,
> >> > >>> ntwr=200,
> >> > >>> ntwx=200,
> >> > >>> ntwe=200,
> >> > >>>
> >> > >>> nstlim=5000000,
> >> > >>> dt=0.002,
> >> > >>>
> >> > >>> ntt=3,
> >> > >>>
> >> > >>> temp0=300,
> >> > >>> tempi=300,
> >> > >>> ig=-1,
> >> > >>> tautp=1,
> >> > >>> gamma_ln=20,
> >> > >>>
> >> > >>> ntp=0,
> >> > >>> pres0=1,
> >> > >>> taup=1,
> >> > >>>
> >> > >>> ntc=2,
> >> > >>> tol=0.00001,
> >> > >>>
> >> > >>> ntf=2,
> >> > >>> ntb=0,
> >> > >>> dielc=1,
> >> > >>> cut=9999,
> >> > >>> rgbmax=12,
> >> > >>> ipol=0,
> >> > >>> ifqnt=0,
> >> > >>> igb=5,
> >> > >>> saltcon=0.15,
> >> > >>> ioutfm=1,
> >> > >>> nscm=100,
> >> > >>> &end
> >> > >>>
> >> > >>>
> >> > >>> On Fri, Oct 14, 2011 at 1:05 PM, Scott Le Grand
> >> > >>> <varelse2005.gmail.com> wrote:
> >> > >>>
> >> > >>>> What revision of AMBER? What GPU? What OS? What driver? What
> >> > >>>> toolkit did you compile with?
> >> > >>>>
> >> > >>>>
> >> > >>>>
> >> > >>>> On Fri, Oct 14, 2011 at 10:55 AM, E. Nihal Korkmaz
> >> > >>>> <enihalkorkmaz.gmail.com> wrote:
> >> > >>>>
> >> > >>>>> Dear all,
> >> > >>>>>
> >> > >>>>> I keep having a problem where, for one particular protein only, the
> >> > >>>>> simulation "freezes". By freeze I mean it looks like the job is
> >> > >>>>> running, but no changes are made to the output files even if you
> >> > >>>>> wait 2 days. I am using igb=5 on GPU. It is a 114 amino acid long
> >> > >>>>> protein; I have the homologous structure (112 amino acids long)
> >> > >>>>> running without a problem. But that specific one stops without being
> >> > >>>>> dropped from the queue and without any error messages at all. I
> >> > >>>>> checked the output files; no '*' or 'NaN' are present. I also tried
> >> > >>>>> running on different machines, and the same thing happens. I tried
> >> > >>>>> starting from a different restart file, and nothing changes. It
> >> > >>>>> always freezes, although at different time steps.
> >> > >>>>>
> >> > >>>>> Has anyone had such a problem before? What could the causes be? I'd
> >> > >>>>> appreciate any comments or suggestions.
> >> > >>>>>
> >> > >>>>> Thanks,
> >> > >>>>>
> >> > >>>>> --
> >> > >>>>> Elif Nihal Korkmaz
> >> > >>>>>
> >> > >>>>> Research Assistant
> >> > >>>>> University of Wisconsin - Biophysics
> >> > >>>>> Member of Qiang Cui & Thomas Record Labs
> >> > >>>>> 1101 University Ave, Rm. 8359
> >> > >>>>> Madison, WI 53706
> >> > >>>>> Phone: 608-265-3644
> >> > >>>>> Email: korkmaz.wisc.edu
> >> > >>>>> _______________________________________________
> >> > >>>>> AMBER mailing list
> >> > >>>>> AMBER.ambermd.org
> >> > >>>>> http://lists.ambermd.org/mailman/listinfo/amber
> >> > >>>>>
> >> > >>>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> > >
> >> > >
> >> >
> >> >
> >>
> >
> >
> >
> >
> >
> >
>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Oct 17 2011 - 15:00:03 PDT