Re: [AMBER] REMD replicas blowing up

From: Janzsó Gábor <janzso.brc.hu>
Date: Thu, 09 Dec 2010 18:18:19 +0100

Dear Dr. Simmerling,

Yes, strange indeed. I've run some tests since then, with quite
interesting results.
I ran a 2 CPU/replica setup with a different parallel environment.
This one does not assign the 4 CPUs of one machine to each job; it just
fills up the available CPUs with the jobs. This run also crashed, but
much later than the previous one: it ran without an error until the 46th
exchange attempt and then exited, while the previous run showed the
strange temperatures within the first 10 attempts and did not exit
right away. No strange temperatures emerged this time, although the
output of replica09 said:
"vlimit exceeded for step 6; vmax = 20.6728
vlimit exceeded for step 7; vmax = 95.9115
vlimit exceeded for step 8; vmax = 109.2372
vlimit exceeded for step 9; vmax = 50.7357
vlimit exceeded for step 10; vmax = 194.2924
vlimit exceeded for step 11; vmax = 79.3722
vlimit exceeded for step 12; vmax = 176.8686
vlimit exceeded for step 13; vmax = 77.9261
vlimit exceeded for step 14; vmax = 23.0898
vlimit exceeded for step 15; vmax = 100.9211
vlimit exceeded for step 16; vmax = 25.0991
vlimit exceeded for step 17; vmax = 21.9034
vlimit exceeded for step 18; vmax = 33.6312
vlimit exceeded for step 19; vmax = 29.5904
vlimit exceeded for step 20; vmax = 45.6197
vlimit exceeded for step 21; vmax = 33.1269
vlimit exceeded for step 22; vmax = 54.2559
vlimit exceeded for step 23; vmax = 1478.6107
vlimit exceeded for step 24; vmax = 24.5792
vlimit exceeded for step 25; vmax = 101.1165

      Coordinate resetting (SHAKE) cannot be accomplished,
      deviation is too large
      NITER, NIT, LL, I and J are : 0 0 1569 4418 4419"

So I guess it is a similar outcome: the 9th replica had the problem,
just like before, and the kinetic energies seem to have increased in a
peculiar way.
I cannot tell this for sure, though, because the run exited, the
trajectory looks fine, and the output does not contain the increased
values because of the exit. The .out file contains the vlimit warnings
quoted above, but the .info file has not been updated. It contains the
energy values up to 90 ps, and this error happened in the next
sander call.
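
For anyone who wants to compare the replicas quickly, here is a minimal
Python sketch that pulls the "vlimit exceeded" warnings out of a .out
file and reports the worst one. The regular expression is written
against the lines quoted above only; the function names and the sample
list are made up for illustration and are not part of Amber.

```python
import re

# Matches warnings of the form quoted above, e.g.
# "vlimit exceeded for step 23; vmax = 1478.6107"
VLIMIT_RE = re.compile(r"vlimit exceeded for step\s+(\d+);\s*vmax\s*=\s*([0-9.]+)")


def parse_vlimit(lines):
    """Return a list of (step, vmax) tuples, one per vlimit warning."""
    hits = []
    for line in lines:
        match = VLIMIT_RE.search(line)
        if match:
            hits.append((int(match.group(1)), float(match.group(2))))
    return hits


def summarize(hits):
    """Return the warning count, first step and largest vmax, or None."""
    if not hits:
        return None
    worst = max(hits, key=lambda hit: hit[1])
    return {
        "count": len(hits),
        "first_step": hits[0][0],
        "worst_step": worst[0],
        "worst_vmax": worst[1],
    }


# Three lines copied from the replica09 output quoted above.
sample = [
    "vlimit exceeded for step 22; vmax = 54.2559",
    "vlimit exceeded for step 23; vmax = 1478.6107",
    "vlimit exceeded for step 24; vmax = 24.5792",
]
print(summarize(parse_vlimit(sample)))
```

To use it on a real run, pass an open file instead of the sample list,
for example parse_vlimit(open("replica09.out")), where the file name is
only a placeholder for whatever the replica output is called.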

In another test, I ran a 4 CPU/replica setup using the first
parallel environment (the one that places the four threads of a job
on the four cores of the same machine). This test run finished
without an error.
I will check whether the potential energies overlap enough, although I
can already tell that in rem.log the exchange rates are between 0.2
and 0.4, which is in the desirable range.
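
As a rough cross-check of the overlap, the expected exchange rate
between two neighboring temperatures can be estimated from the
potential energies of independent MD runs, using the standard
Metropolis criterion for temperature REMD,
P = min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]). The sketch below is only
an illustration: the energies are synthetic Gaussian samples, the
function names are made up, and extracting the energies from the
output files is not shown.

```python
import math
import random

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exchange_probability(e_i, t_i, e_j, t_j):
    """Metropolis acceptance probability for swapping two temperature replicas."""
    beta_i = 1.0 / (KB * t_i)
    beta_j = 1.0 / (KB * t_j)
    exponent = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def mean_acceptance(energies_i, t_i, energies_j, t_j):
    """Average the acceptance probability over all pairs of sampled energies."""
    total = 0.0
    for e_i in energies_i:
        for e_j in energies_j:
            total += exchange_probability(e_i, t_i, e_j, t_j)
    return total / (len(energies_i) * len(energies_j))


# Synthetic potential energies (kcal/mol) standing in for two MD runs;
# the means, widths and temperatures are arbitrary illustration values.
random.seed(0)
low = [random.gauss(-1000.0, 15.0) for _ in range(200)]
high = [random.gauss(-985.0, 15.0) for _ in range(200)]
print(round(mean_acceptance(low, 300.0, high, 303.0), 2))
```

If the estimate for a neighboring pair is far below the rates seen in
rem.log, or close to zero, the two energy distributions probably do not
overlap enough.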

Gabor Janzso

Quoting "Carlos Simmerling" <carlos.simmerling.gmail.com>:

> it seems strange that it passes the parallel REMD tests, but fails for your
> system. As you say, it indicates a bug in the code but I don't have much
> info to try and locate it.
>
>
> On Wed, Dec 8, 2010 at 11:57 AM, Janzsó Gábor <janzso.brc.hu> wrote:
>
>> Dear Dr. Simmerling,
>>
>> Yes, the build passed the parallel tests. As far as I know, it is at
>> least a two-year-old build, and many parallel simulations have been
>> run on it successfully.
>> I would try it with 4 cores for each replica, but with as many replicas
>> as I need, the simulation would require more cores than the cluster has.
>>
>>
>> Gabor Janzso
>>
>>
>> Quoting "Carlos Simmerling" <carlos.simmerling.gmail.com>:
>>
>> > did your amber build pass the parallel tests? this is worrisome.
>> >
>> >
>> > On Wed, Dec 8, 2010 at 11:44 AM, Janzsó Gábor <janzso.brc.hu> wrote:
>> >
>> >> Hi Everyone,
>> >>
>> >> >> So finally it looks like I've found the solution to my problem. It
>> >> >> had something to do with the parallel environment. The sysadmin
>> >> >> created a p.e. to make better use of the cluster. The cluster is
>> >> >> built from 4-core AMD processors, and that particular p.e. assigns
>> >> >> jobs so that the threads of one job run on the four cores of one CPU
>> >> >> instead of on different machines. Since I used a 64-core job (2 cores
>> >> >> for each of the 32 replicas), the p.e. accepted it (since 64 is
>> >> >> divisible by 4), but during the exchanges something got messed up. I
>> >> >> changed the input so that each replica runs on one core, and the
>> >> >> issue of the temperatures racing up never emerged again.
>> >> >>
>> >> >> So the problem wasn't with the input or the parameters, which now seem
>> >> >> to be alright, since the replicas exchange in the expected fashion. It
>> >> >> was some kind of computing-environment issue; maybe it is a sign of a
>> >> >> hidden bug, but only the Amber experts could tell that.
>> >> >> Anyway, I just wrote this so that this thread has some conclusion
>> >> >> for someone browsing the archives with the same problem.
>> >>
>> >> Take care,
>> >>
>> >> Gabor Janzso
>> >>
>> >>
>> >> Quoting "Adrian Roitberg" <roitberg.qtp.ufl.edu>:
>> >>
>> >> > Dear Gabor
>> >> >
>> >> > Have you tried plotting the distribution of potential energies for the
>> >> > replicas, before they blow up ? They should be basically identical to
>> >> > the ones you get from the individual MD runs.
>> >> >
>> >> > Adrian
>> >> >
>> >> >
>> >> > On 11/30/10 2:59 PM, Carlos Simmerling wrote:
>> >> >> using the same structures at the start can be dangerous since they
>> >> >> are not equilibrated at the right T.
>> >> >> this can cause weird things in exchanges. i suggest using the restart
>> >> >> files from the runs you just described and initiating remd from that.
>> >> >> On Mon, Nov 29, 2010 at 2:19 PM, Janzsó Gábor<janzso.brc.hu> wrote:
>> >> >>
>> >> >>> Dear Dr. Simmerling,
>> >> >>>
>> >> >>> The replicas have the same input coordinate file, namely the restart
>> >> >>> file from the NPT run I used for relaxing the system. So there is no
>> >> >>> way the box sizes could be different.
>> >> >>>
>> >> >>> Following your advice, I've run a 5 ns MD simulation at each
>> >> >>> temperature, and all of the simulations finished correctly. I have
>> >> >>> created the energy distribution histogram of each run as you
>> >> >>> suggested, and there is sufficient overlap between the potential
>> >> >>> energies (as far as I can tell). I have enclosed an image of the
>> >> >>> histograms.
>> >> >>> Since the MD runs never crashed, I think the problem is related to
>> >> >>> the replica exchange step.
>> >> >>> Any advice on what I should look into next?
>> >> >>>
>> >> >>> Thank you in advance,
>> >> >>>
>> >> >>>
>> >> >>> Gabor Janzso
>> >> >>>
>> >> >>> Quoting "Carlos Simmerling"<carlos.simmerling.gmail.com>:
>> >> >>>
>> >> >>>> it's still unclear to me if the initial structures have different
>> >> >>>> volumes or not - if yes, this can make exchanges very difficult.
>> >> >>>>
>> >> >>>>
>> >> >>>> I suggest running the identical simulation without remd - meaning set
>> >> >>>> up all of the replicas and temperatures, but do not use remd. check
>> >> >>>> to make sure it is still stable (and verify that REMD is the
>> >> >>>> problem). from this, extract potential energies from the output
>> >> >>>> files and histogram all of them to ensure that there is overlap
>> >> >>>> between neighbors.
>> >> >>>>
>> >> >>>>
>> >> >>>> On Tue, Nov 23, 2010 at 1:07 PM, Janzsó Gábor<janzso.brc.hu> wrote:
>> >> >>>>
>> >> >>>>> Dear Mr. Simmerling,
>> >> >>>>>
>> >> >>>>> I am sorry if I wasn't clear, my goal is to run an NVT study. The
>> >> >>>>> NPT part was only to relax the system after solvating the peptide
>> >> >>>>> in the TFE, just as the tutorials and the manual suggest.
>> >> >>>>>
>> >> >>>>> Regarding your second piece of advice, I am not sure how to create
>> >> >>>>> the histogram of the potential energies if the replicas do not
>> >> >>>>> behave as expected. Should I run simple MD runs at each temperature
>> >> >>>>> instead? How long should such a run be?
>> >> >>>>>
>> >> >>>>> I am also almost sure that the phase transition is not the cause of
>> >> >>>>> my problem, since I also tried to run my simulation between 300K
>> >> >>>>> and 350K (with 32 replicas), and 350K is just below the boiling
>> >> >>>>> point of TFE.
>> >> >>>>> My first guess was that the replicas were too far away from each
>> >> >>>>> other, and because I have only limited computational capacity at my
>> >> >>>>> disposal, my only option for sampling the temperatures more densely
>> >> >>>>> was to decrease the temperature range. Regardless, at lower
>> >> >>>>> temperatures, with smaller deltaT values, the same behavior was
>> >> >>>>> observed.
>> >> >>>>>
>> >> >>>>> best regards,
>> >> >>>>>
>> >> >>>>> Gabor Janzso
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Quoting "Carlos Simmerling"<carlos.simmerling.gmail.com>:
>> >> >>>>>
>> >> >>>>>> it's very important to study REMD examples in the literature before
>> >> >>>>>> trying something very complex like what you want. First, most
>> >> >>>>>> studies are done at NVT. Check work by Angel Garcia if you want to
>> >> >>>>>> include pressure effects.
>> >> >>>>>> Second, it is important to carefully histogram your potential
>> >> >>>>>> energies for the replicas. Likely you are trying to sample across a
>> >> >>>>>> phase transition, which is quite challenging. Almost certainly this
>> >> >>>>>> was not included in your method for selecting the replica
>> >> >>>>>> temperatures (which you have not told us about).
>> >> >>>>>>
>> >> >>>>>> perhaps there is something else going on - but I think the first step
>> >> >>>>>> is to try NVT.
>> >> >>>>>>
>> >> >>>>>> 2010/11/23 Janzsó Gábor<janzso.brc.hu>
>> >> >>>>>>
>> >> >>>>>>> Dear Amber Users!
>> >> >>>>>>>
>> >> >>>>>>> I ran into a problem with Amber REMD. I am using Amber 9, and I do
>> >> >>>>>>> not have the option to upgrade to 11, so any solution that works
>> >> >>>>>>> with Amber 9 would be much appreciated.
>> >> >>>>>>> I am trying to run an NVT simulation of amyloid beta 1-42 (Ab1-42)
>> >> >>>>>>> in explicit TFE solvent.
>> >> >>>>>>>
>> >> >>>>>>> I downloaded the mol2 file I found on REDDB (project code W-16),
>> >> >>>>>>> used packmol to put 256 molecules into a cubic box with a=30.125,
>> >> >>>>>>> and then relaxed the box at 300 K (first heated it up with NVT,
>> >> >>>>>>> then relaxed it with NPT). I saved the output as a lib file, then
>> >> >>>>>>> used it as the solvent box to solvate the peptide. I ran some NVT
>> >> >>>>>>> and NPT dynamics to see if it is stable, and it was, at least up
>> >> >>>>>>> to 400K. At 450K or 500K the simulation stopped; the output said
>> >> >>>>>>> SANDER BOMB stopped the run or something like that. I figured it
>> >> >>>>>>> might be ok, because the boiling point of TFE is 78°C, and the
>> >> >>>>>>> studies I have found used the temperature range of 300K-400K for
>> >> >>>>>>> TFE solvent simulations.
>> >> >>>>>>>
>> >> >>>>>>> So, I set up a REMD run using 32 replicas between 300K and 400K,
>> >> >>>>>>> with the Berendsen thermostat (1 ps coupling). SHAKE is on,
>> >> >>>>>>> exchanges are attempted every 2 ps, and chirality restraints and
>> >> >>>>>>> trans-omega restraints are applied.
>> >> >>>>>>> The simulation starts normally, but within the first ten to twenty
>> >> >>>>>>> exchange attempts some replicas heat up enormously. The REMD keeps
>> >> >>>>>>> on running, but three replicas are at ~600 000K (!), and obviously
>> >> >>>>>>> they don't participate in the exchanges anymore, so the
>> >> >>>>>>> simulation does not stop.
>> >> >>>>>>> The curious thing is that it always happens after a successful
>> >> >>>>>>> exchange, and it always happens to the same replicas. What I mean
>> >> >>>>>>> is that in the rem.log file, where all the replicas and the
>> >> >>>>>>> relevant info are listed, the 9th, 17th and 25th replicas heat up.
>> >> >>>>>>> Always these three. I tried it with different parameters, for
>> >> >>>>>>> example the timestep was reduced to 1 fs, the iwrap option was
>> >> >>>>>>> turned off, the vlimit was reduced to 10, but nothing helped; the
>> >> >>>>>>> same replicas systematically went wild every time.
>> >> >>>>>>>
>> >> >>>>>>> If anyone has any idea what could be the reason for this
>> >> >>>>>>> phenomenon, it would be much appreciated.
>> >> >>>>>>>
>> >> >>>>>>> Thanks in advance
>> >> >>>>>>>
>> >> >>>>>>> Gabor P. Janzso
>> >> >>>>>>> PhD student
>> >> >>>>>>> Institute of Biophysics,
>> >> >>>>>>> Biological Research Center
>> >> >>>>>>> H-6726, Szeged, Temesvári krt. 62.
>> >> >>>>>>>
>> >> >>>>>>> Janzsó Gábor Péter
>> >> >>>>>>> PhD hallgató
>> >> >>>>>>> Szegedi Biológiai Központ,
>> >> >>>>>>> Biofizikai Intézet
>> >> >>>>>>> 6726, Szeged, Temesvári krt. 62.
>> >> >>>>>>>
>> >> >>>>>>> ----------------------------------------------------------------
>> >> >>>>>>> This message was sent using IMP, the Internet Messaging Program.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> _______________________________________________
>> >> >>>>>>> AMBER mailing list
>> >> >>>>>>> AMBER.ambermd.org
>> >> >>>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>> >> >>>>>>>
>> >> >
>> >> > --
>> >> > Dr. Adrian E. Roitberg
>> >> > Associate Professor
>> >> > Quantum Theory Project, Department of Chemistry
>> >> > University of Florida
>> >> >
>> >> > Senior Editor. Journal of Physical Chemistry.
>> >> >
>> >> > on Sabbatical in Barcelona until August 2011.
>> >> > Email roitberg.ufl.edu
>> >> >



Gabor P. Janzso
PhD student
Institute of Biophysics,
Biological Research Centre
Hungarian Academy of Sciences Szeged
H-6726, Szeged, Temesvári krt. 62.

Janzsó Gábor Péter
PhD hallgató
Szegedi Biológiai Központ,
Biofizikai Intézet
6726, Szeged, Temesvári krt. 62.

Received on Thu Dec 09 2010 - 09:30:02 PST