Re: AMBER: problems with Replica Exchange from rebeca.mmb.pcb.ub.es on 2008-02-28 (Amber Archive Feb 2008)

From: <rebeca.mmb.pcb.ub.es>
Date: Thu, 28 Feb 2008 16:16:27 +0100

Thanks for your reply. I am using only 2 replicas, with one processor per
replica, since it is easier to get the method working with just 2 replicas
first. Once it works, I will add more, of course.
I have tried your suggestion: the multisander job works fine with the same
restart and topology files but DIFFERENT inputs (those for a standard
molecular dynamics run). So do you think it could be a problem with the
inputs? I am using the ones that work for the tests:

rem.in.001:

Title Line
&cntrl
        imin = 0, nstlim = 100, dt = 0.002,
        ntx = 5, tempi = 0.0, temp0 = 325.0,
        ntt = 3, tol = 0.000001, gamma_ln = 1.0,
        ntc = 2, ntf = 1, ntb = 0,
        ntwx = 500, ntwe = 0, ntwr =500, ntpr = 100,
        scee = 1.2, cut = 99.0,
        ntr = 0, tautp = 0.1, offset = 0.09,
        nscm = 500, igb = 5, irest=1,
        ntave = 0, numexchg=5,
&end

rem.in.002:

Title Line
&cntrl
        imin = 0, nstlim = 100, dt = 0.002,
        ntx = 5, tempi = 0.0, temp0 = 350.0,
        ntt = 3, tol = 0.000001, gamma_ln = 1.0,
        ntc = 2, ntf = 1, ntb = 0,
        ntwx = 500, ntwe = 0, ntwr =500, ntpr = 100,
        scee = 1.2, cut = 99.0,
        ntr = 0, tautp = 0.1, offset = 0.09,
        nscm = 500, igb = 5, irest=1,
        ntave = 0, numexchg=5,
&end
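
The two inputs above differ only in temp0, so further replicas could be generated rather than copied by hand. A minimal sketch, assuming bash; write_rem_inputs is a hypothetical helper name (not part of Amber), and the &cntrl settings are simply copied from the inputs quoted above:

```shell
#!/bin/bash
# Sketch only: write_rem_inputs is a hypothetical helper, not part of Amber.
# It writes rem.in.001, rem.in.002, ... into the current directory, one file
# per temperature argument, copying the &cntrl settings quoted above.
write_rem_inputs() {
    local i=0 temp name
    for temp in "$@"; do
        i=$((i + 1))
        name=$(printf 'rem.in.%03d' "$i")
        # Unquoted heredoc delimiter so that $temp is expanded.
        cat > "$name" <<EOF
Title Line
&cntrl
        imin = 0, nstlim = 100, dt = 0.002,
        ntx = 5, tempi = 0.0, temp0 = $temp,
        ntt = 3, tol = 0.000001, gamma_ln = 1.0,
        ntc = 2, ntf = 1, ntb = 0,
        ntwx = 500, ntwe = 0, ntwr =500, ntpr = 100,
        scee = 1.2, cut = 99.0,
        ntr = 0, tautp = 0.1, offset = 0.09,
        nscm = 500, igb = 5, irest=1,
        ntave = 0, numexchg=5,
&end
EOF
    done
}

# The two temperatures used in this message.
write_rem_inputs 325.0 350.0
```

Adding a replica would then only mean appending another temperature to the last line; the groupfile still needs one matching line per generated input.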

As the groupfile I use:

#
#
-O -rem 1 -remlog rem.log -i ./rem.in.001 -p ./1ftg_wat.top -c ./md_prod_5.r -o ./rem.out.001 -inf reminfo.001 -r ./rem.r.001
-O -rem 1 -remlog rem.log -i ./rem.in.002 -p ./1ftg_wat.top -c ./md_prod_5.r -o ./rem.out.002 -inf reminfo.002 -r ./rem.r.002
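
Because multisander takes one line of sander arguments per group from the groupfile, a quick sanity check before submitting can rule out a wrapped line or a mistyped path. A minimal sketch, assuming bash; check_groupfile is a hypothetical helper (not an Amber tool), and it only checks that each group line carries -i, -p and -c and that the files named after them are readable:

```shell
#!/bin/bash
# Sketch only: check_groupfile is a hypothetical helper, not an Amber tool.
# Usage: check_groupfile GROUPFILE NGROUPS
# Returns 0 if GROUPFILE has NGROUPS group lines, each with readable
# -i, -p and -c files; prints a message and returns 1 otherwise.
check_groupfile() {
    local groupfile=$1 expected=$2
    local count=0 status=0 line flag prev tok
    local skip_re='^[[:space:]]*(#|$)'   # comment or blank line
    local -a toks
    while IFS= read -r line || [ -n "$line" ]; do
        [[ $line =~ $skip_re ]] && continue
        count=$((count + 1))
        # Split on whitespace without glob expansion (no quoting support).
        read -r -a toks <<< "$line"
        prev=''
        for tok in "${toks[@]}"; do
            case $prev in
                -i|-p|-c)
                    if [ ! -r "$tok" ]; then
                        echo "group $count: cannot read '$tok' (given after $prev)"
                        status=1
                    fi
                    ;;
            esac
            prev=$tok
        done
        # Every group line needs an input, a topology and coordinates.
        for flag in -i -p -c; do
            case " ${toks[*]} " in
                *" $flag "*) ;;
                *) echo "group $count: missing $flag"; status=1 ;;
            esac
        done
    done < "$groupfile"
    if [ "$count" -ne "$expected" ]; then
        echo "found $count group lines, expected $expected"
        status=1
    fi
    return $status
}
```

For the setup in this message the call would be `check_groupfile groupfile 2`, run from the directory the job starts in, since the paths in the groupfile are relative.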

And the script for executing the calculation is:

#!/bin/bash
# . class = bsc_ls
# . job_name = test_parallel
# . initialdir = .
# . output = OUTPUT/mpi_%j.out
# . error = OUTPUT/mpi_%j.err
# . total_tasks = 2
# . wall_clock_limit = 00:01:00

export XLFRTEOPTS="namelist=old:xrf_messages=no"

srun /gpfs/apps/AMBER/src/9/exe/sander.MPI -O -ng 2 -groupfile groupfile < /dev/null

As I said, the restart and topology files work well for a multisander job
with standard molecular dynamics. When I try to run these inputs for Replica
Exchange calculations, only the EMPTY files rem.out.001 and rem.out.002 are
generated, and I get this error in the error file:

[0] MPI Abort by user Aborting program !
[0] Aborting program!
[1] MPI Abort by user Aborting program !
[1] Aborting program!
srun: error: s26c2b12: task[0-1]: Exited with exit code 255
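
One way to run the test suggested in the quoted reply below (REMD off, everything else identical) is to derive a second groupfile with only the REMD flags removed. A minimal sketch, assuming bash and a sed that supports -E; strip_rem_flags and the output name groupfile.norem are hypothetical, and the input files themselves are left untouched, so whether sander accepts them unchanged without -rem still has to be checked:

```shell
#!/bin/bash
# Sketch only: strip_rem_flags is a hypothetical helper, not an Amber tool.
# Usage: strip_rem_flags IN_GROUPFILE OUT_GROUPFILE
# Copies the groupfile, dropping "-rem N" and "-remlog FILE" from each line
# and leaving every other argument (inputs, topology, restart) as it was.
strip_rem_flags() {
    local in=$1 out=$2
    sed -E \
        -e 's/(^|[[:space:]])-rem[[:space:]]+[0-9]+//' \
        -e 's/(^|[[:space:]])-remlog[[:space:]]+[^[:space:]]+//' \
        "$in" > "$out"
}
```

After `strip_rem_flags groupfile groupfile.norem`, the same srun line can be pointed at groupfile.norem. If that run also aborts with empty output files, the problem is unlikely to be specific to replica exchange.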

The output file gives:

  Running multisander version of sander amber9
     Total processors = 2
     Number of groups = 2

     Looping over processors:
        WorldRank is the global PE rank
        NodeID is the local PE rank in current group

        Group = 0
        WorldRank = 0
        NodeID = 0

        Group = 1
        WorldRank = 1
        NodeID = 0

Any idea? Something wrong with the inputs?

Rebeca García Fandiño Ph. D.
Parc Cientific de Barcelona
Barcelona Spain
rebeca.mmb.pcb.ub.es

Quoting Carlos Simmerling <carlos.simmerling.gmail.com>:

> the thing to try first is 1 processor per group. this way you
> know that output from shake errors etc will get written to the
> output file, which only the master process for each replica can do.
> this is the same situation in normal MD- if there is a problem with no
> error msg in the output always try to run single processor to test it.
> you should not need anything special in the restart file from sander,
> it can be used directly for remd. it's hard to help more since you haven't
> told us much of anything about how you are doing the calculation.
>
> are you using only 2 replicas?
>
> does the same multisander job work fine if you just turn remd off (but
> otherwise use exactly the same input files)?
>
> On Thu, Feb 28, 2008 at 7:29 AM, <rebeca.mmb.pcb.ub.es> wrote:
>> Hello,
>> I am trying to do Replica Exchange calculations using Amber 9. When I try
>> with the files from the test examples, it works, but when I try with my
>> protein I have problems. Using the usual restart file from a sander
>> calculation directly, I get errors of the type
>>
>> [1] MPI Abort by user Aborting program !
>> [1] Aborting program!
>> [0] MPI Abort by user Aborting program !
>> [0] Aborting program!
>> srun: error: s30c1b04: task[0-1]: Exited with exit code 255
>>
>> However, when I create the restart file from the trajectory file with
>> ptraj, the calculation stops with no errors, but it stops writing at this
>> point (in the rem.out files):
>>
>> ...................
>> trajectory generated by ptraj
>> begin time read from input coords = 0.000 ps
>>
>> Number of triangulated 3-point waters found: 0
>> | Atom division among processors:
>> | 0 2573
>> | Running AMBER/MPI version on 1 nodes
>>
>> | MULTISANDER: 2 groups. 1 processors out of 2 total.
>> ....................
>>
>> It creates the corresponding reminfo and rem.log files, but they are all
>> empty. In the error file I can only see "srun: Force Terminated job".
>>
>> Since the same calculation works with the protein from the test examples,
>> could it be a format problem? Should I apply any special treatment to the
>> restart file I use for the calculations?
>>
>> Thank you very much for your help, in advance.
>>
>> Rebeca García Fandiño Ph. D.
>> Parc Cientific de Barcelona
>> Barcelona Spain
>> rebeca.mmb.pcb.ub.es
>>
>> -----------------------------------------------------------------------
>> The AMBER Mail Reflector
>> To post, send mail to amber.scripps.edu
>> To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
>>

Received on Sun Mar 02 2008 - 06:07:25 PST