Re: [AMBER] REMD error

From: koushik kasavajhala <koushik.sbiiit.gmail.com>
Date: Wed, 3 Jul 2019 15:55:59 -0400

   I do not have any other issues except REMD on GPUs. REMD on CPUs runs
fine.

Sorry, it wasn't clear from the above line. Do other parallel GPU tests
pass?
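
If not, it might be worth running the rest of the parallel GPU test suite to
see whether the failures are specific to REMD. Something along these lines
should do it (the exact script name may differ between Amber versions, so
treat this only as a sketch):

cd $AMBERHOME/test
export DO_PARALLEL="mpirun -np 2"
./test_amber_cuda_parallel.sh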



On Wed, Jul 3, 2019 at 3:45 PM Marcela Madrid <mmadrid.psc.edu> wrote:

> hi, thanks Koushik.
>
> I do not think it is just the test case as the user is having the same
> problem.
> I do not have any other issues except REMD on GPUs. REMD on CPUs runs fine.
> That is why I fear that I may be assigning the GPUs incorrectly.
> I wonder if you have an account at the PSC and would like to try it, or we
> can give you one?
>
> I asked for an interactive session with 2 GPUs and 2 tasks per node:
>
> interact -p GPU --gres=gpu:p100:2 -n 2
> export DO_PARALLEL="mpirun -np 2"
> ./Run.rem.sh
>
> I diffed the files that you sent me against mine, and they are identical.
> I commented out the last lines in the Run script, but no other output is
> produced, only:
>
> > rem_2rep_gb]$ ./Run.rem.sh
> > No precision model specified. Defaulting to DPFP.
> >
> > --------------------------------------------------------------------------------
> > Two replica GB REMD test.
> >
> > Running multipmemd version of pmemd Amber18
> > Total processors = 2
> > Number of groups = 2
> >
> >
> > Unit 5 Error on OPEN: rem.in.001
>
>
>
> >
> > Unit 5 Error on OPEN: rem.in.001
>
>
>
> > Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> > ./Run.rem.sh: Program error
>
> and the two input files that are not deleted because I commented out the
> line to delete them:
>
> more rem.in.001
> Ala3 GB REMD
> &cntrl
> imin = 0, nstlim = 100, dt = 0.002,
> ntx = 5, irest = 1, ig = -71277,
> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
> ioutfm = 0, ntxo = 1,
> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
> cut = 9999.0, nscm = 500,
> igb = 5, offset = 0.09,
> numexchg = 5,
> &end
>
> and rem.in.000
>
> thanks, Marcela
>
> > On Jul 2, 2019, at 4:23 PM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
> >
> > Interesting. I have attached my rem_2rep_gb test directory as a reference.
> > I just want to make sure you do not have a corrupted version of AMBER.
> > Check if there are differences between your files and the attached files.
> > After that, can you comment out lines 52-57, run the test, and let us
> > know the output of the rem.out.000 file? There is usually more information
> > in that file if a program fails.
> >
> > Also, do you have any issues running other tests besides REMD tests?
> >
> > On Tue, Jul 2, 2019 at 2:59 PM Marcela Madrid <mmadrid.psc.edu> wrote:
> >
> >> Thanks Koushik,
> >>
> >> The user is getting this same error about not finding the input files.
> >> That is why I am doing these tests for her.
> >> I run from the directory $AMBERHOME/test/cuda/remd/rem_2rep_gb
> >> export DO_PARALLEL="mpirun -np 2"
> >> with 2 GPUs and number of tasks =2
> >> And this is the error that I am getting:
> >>
> >>> ./Run.rem.sh
> >>> No precision model specified. Defaulting to DPFP.
> >>>
> >>
> >>> --------------------------------------------------------------------------------
> >>> Two replica GB REMD test.
> >>>
> >>> Running multipmemd version of pmemd Amber18
> >>> Total processors = 2
> >>> Number of groups = 2
> >>>
> >>>
> >>> Unit 5 Error on OPEN: rem.in.001
> >>
> >>
> >>
> >>>
> >>> Unit 5 Error on OPEN: rem.in.001
> >>
> >>
> >>
> >>> Abort(1) on node 1 (rank 1 in comm 0): application called
> >> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >>> ./Run.rem.sh: Program error
> >>>
> >> I commented out the line at the end of Run.rem.sh so that it does not
> >> erase the input files and they are there:
> >> more rem.in.001
> >>
> >> Ala3 GB REMD
> >> &cntrl
> >> imin = 0, nstlim = 100, dt = 0.002,
> >> ntx = 5, irest = 1, ig = -71277,
> >> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
> >> ioutfm = 0, ntxo = 1,
> >> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
> >> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
> >> cut = 9999.0, nscm = 500,
> >> igb = 5, offset = 0.09,
> >> numexchg = 5,
> >> &end
> >>
> >> So at least I know that you are running the same way we are, and not
> >> getting an error message.
> >> It is quite puzzling.
> >>
> >> Marcela
> >>
> >>
> >>
> >>> On Jul 2, 2019, at 11:56 AM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
> >>>
> >>> Hi Marcela,
> >>>
> >>> Our lab also uses a slurm queuing system. We use the script below, which
> >>> is similar to your script, to submit 2 replica REMD jobs to one node.
> >>>
> >>> #!/bin/bash
> >>> #SBATCH -N 1
> >>> #SBATCH --tasks-per-node 2
> >>> #SBATCH --gres=gpu:2
> >>>
> >>> mpirun -np 2 /opt/amber/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupremd
> >>>
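> >>> For reference, the groupfile for a 2-replica run is just one line per
> >>> replica, roughly like this (the file names below are only placeholders
> >>> for whatever your replicas actually use):
> >>>
> >>> -O -i rem.in.000 -o rem.out.000 -p prmtop -c rem.crd.000 -r rem.rst.000 -inf rem.mdinfo.000
> >>> -O -i rem.in.001 -o rem.out.001 -p prmtop -c rem.crd.001 -r rem.rst.001 -inf rem.mdinfo.001
> >>>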
> >>> So, I do not see anything wrong with your submission script. Since your
> >>> CPU jobs run fine, I think there might be some issue with the way the
> >>> GPUs are configured on your cluster. Note: CPU REMD jobs require 2 cpus
> >>> per replica whereas the GPU REMD jobs require only 1 gpu per replica.
> >>>
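> >>> One quick sanity check, independent of Amber, is to confirm what each
> >>> MPI rank actually sees on the GPU node, for example:
> >>>
> >>> mpirun -np 2 bash -c 'hostname; echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'
> >>>
> >>> If a rank reports no visible devices, the problem is in the GPU
> >>> allocation rather than in pmemd.
> >>>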
> >>> I just ran the test cases with 2 and 4 replicas; they all pass for me.
> >>> If you are having issues with the test cases, I think something might
> >>> be wrong with the way files are being sourced. I don't think it is a
> >>> compiler issue either. We use gnu compilers on our cluster and all
> >>> tests pass for us.
> >>>
> >>> Can you run the test cases inside the directory that David Case pointed
> >>> out? There is a Run.rem.sh file inside the
> >>> AMBERHOME/test/cuda/remd/rem_2rep_gb directory. Executing this file
> >>> should not give the error message that input files were not found. If
> >>> this doesn't work, then can you post the error the user had? They might
> >>> have had a different error instead of input files not being found.
> >>>
> >>> @David Case: I looked at the files in the test/cuda/remd folder.
> >>> rem_gb_2rep, rem_gb_4rep, and rem_wat are not used at all. Deleting
> >>> those folders did not affect any of the test cases; they all passed.
> >>>
> >>> Best,
> >>> Koushik
> >>> Carlos Simmerling Lab
> >>>
> >>>
> >>>
> >>> On Tue, Jul 2, 2019 at 9:57 AM Marcela Madrid <mmadrid.psc.edu> wrote:
> >>>
> >>>> hi Dave,
> >>>>
> >>>> thanks for your answer. It is not just a problem with the test
> >>>> examples; it is a problem whenever we try to run REMD on the GPUs on
> >>>> Bridges at the PSC.
> >>>> The reason why I am looking at it is that a user wants to run it. REMD
> >>>> on the CPUs works fine (with the corresponding executable, of course);
> >>>> it is just a problem with the GPUs. So it occurred to me to see whether
> >>>> it passes the tests and we get the same error messages. The user has
> >>>> her input files in the directory where she runs.
> >>>>
> >>>> I think it is either a problem with the configuration of the GPU nodes
> >>>> on Bridges or a bug.
> >>>> Each Bridges node has 24 cores and 2 P100 GPUs. I have asked for 1
> >>>> node, ntasks-per-node=2, and the 2 GPUs,
> >>>> but I get the error message about not finding the input files.
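> >>>>
> >>>> In Slurm terms the request boils down to roughly the following (only a
> >>>> sketch; the real script also sets the working directory, modules, and
> >>>> the actual group file name):
> >>>>
> >>>> #!/bin/bash
> >>>> #SBATCH -N 1
> >>>> #SBATCH --ntasks-per-node=2
> >>>> #SBATCH --gres=gpu:p100:2
> >>>>
> >>>> mpirun -np 2 pmemd.cuda.MPI -O -ng 2 -groupfile groupfile
> >>>>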
> >>>> Amber on GPUs was compiled with
> >>>> ./configure -cuda -mpi gnu
> >>>> Attempting to compile with intel compilers instead of gnu gave error
> >>>> messages.
> >>>>
> >>>> O3 -ccbin icpc -o cuda_mg_wrapper.o -c cuda_mg_wrapper.cu
> >>>> In file included from
> >>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/host_config.h(50),
> >>>> from
> >>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/cuda_runtime.h(78),
> >>>> from cuda_mg_wrapper.cu(0):
> >>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/crt/host_config.h(79):
> >>>> error: #error directive: -- unsupported ICC configuration! Only
> >>>> ICC 15.0, ICC 16.0, and ICC 17.0 on Linux x86_64 are supported!
> >>>> #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, and
> >>>> ICC 17.0 on Linux x86_64 are supported!
> >>>>
> >>>> We do not have such old versions of the compilers. Any hints as to how
> >>>> to run REMD on the GPUs will be appreciated. Thanks so much,
> >>>>
> >>>> Marcela
> >>>>
> >>>>
> >>>>> On Jul 2, 2019, at 9:01 AM, David A Case <david.case.rutgers.edu> wrote:
> >>>>>
> >>>>> On Mon, Jul 01, 2019, Marcela Madrid wrote:
> >>>>>
> >>>>>>> Two replica GB REMD test.
> >>>>>>>
> >>>>>>>
> >>>>>>> Unit 5 Error on OPEN: rem.in.001
> >>>>>
> >>>>> OK: query for the REMD experts: in AMBERHOME/test/cuda/remd there are
> >>>>> two directories: rem_2rep_gb and rem_gb_2rep. The rem.in.00? files are
> >>>>> in the former, but the tests actually get run in the latter directory.
> >>>>>
> >>>>> Same general problem for rem_2rep_pme: the needed rem.in.00? files are
> >>>>> in rem_wat_2 (or maybe in rem_wat).
> >>>>>
> >>>>> I'm probably missing something here, but cleaning up (or at least
> >>>>> commenting) the cuda/remd test folder seems worthwhile: there are
> >>>>> folders that seem never to be used, and input files that seem to be in
> >>>>> the wrong place.
> >>>>>
> >>>>> Marcela: I'd ignore these failures for now; something should get posted
> >>>>> here that either fixes the problem, or figures out a problem with your
> >>>>> inputs. (My money is on the former.)
> >>>>>
> >>>>> ...dac
> >>>>>
> >>>>>
> > <rem_2rep_gb.tar.gz>
>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jul 03 2019 - 13:00:03 PDT