Re: [AMBER] REMD error

From: koushik kasavajhala <koushik.sbiiit.gmail.com>
Date: Wed, 3 Jul 2019 16:53:44 -0400

Sorry, I don’t see the attachments. Can you resend? The 2 GPU test log file
(makecudatestmpi2.log?) should be sufficient.

I just want to make sure it is only a REMD issue and does not affect other
parallel GPU jobs.

On Wed, Jul 3, 2019 at 4:25 PM Marcela Madrid <mmadrid.psc.edu> wrote:

> Which ones would those be? There are some other failures on 4 GPUs, but I am
> not sure.
> I am attaching the log files for the tests for 2 and 4 GPUs.
>
> Marcela
>
>
> > On Jul 3, 2019, at 3:55 PM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
> >
> > I do not have any other issues except REMD on GPUs. REMD on CPUs runs
> > fine.
> >
> > Sorry, it wasn't clear from the above line. Do other parallel GPU tests
> > pass?
> >
> >
> >
> > On Wed, Jul 3, 2019 at 3:45 PM Marcela Madrid <mmadrid.psc.edu> wrote:
> >
> >> Hi, thanks Koushik.
> >>
> >> I do not think it is just the test case, as the user is having the same
> >> problem.
> >> I do not have any other issues except REMD on GPUs. REMD on CPUs runs
> >> fine.
> >> That is why I fear that I may be assigning the GPUs incorrectly.
> >> I wonder if you have an account at the PSC and would like to try, or we
> >> can give you one?
> >>
> >> I asked for an interactive session with 2 GPUs and 2 tasks per node:
> >>
> >> interact -p GPU --gres=gpu:p100:2 -n 2
> >> export DO_PARALLEL="mpirun -np 2"
> >> ./Run.rem.sh
> >>
> >> I diffed the files that you sent me and they are identical. I commented out
> >> the last lines in the Run script,
> >> but no other output is produced, only:
> >>
> >>> rem_2rep_gb]$ ./Run.rem.sh
> >>> No precision model specified. Defaulting to DPFP.
> >>>
> >>> --------------------------------------------------------------------------------
> >>> Two replica GB REMD test.
> >>>
> >>> Running multipmemd version of pmemd Amber18
> >>> Total processors = 2
> >>> Number of groups = 2
> >>>
> >>>
> >>> Unit 5 Error on OPEN: rem.in.001
> >>
> >>
> >>
> >>>
> >>> Unit 5 Error on OPEN: rem.in.001
> >>
> >>
> >>
> >>> Abort(1) on node 1 (rank 1 in comm 0): application called
> >> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >>> ./Run.rem.sh: Program error
> >>
> >> and here are the two input files, which are not deleted because I
> >> commented out the line that deletes them:
> >>
> >> more rem.in.001
> >> Ala3 GB REMD
> >> &cntrl
> >> imin = 0, nstlim = 100, dt = 0.002,
> >> ntx = 5, irest = 1, ig = -71277,
> >> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
> >> ioutfm = 0, ntxo = 1,
> >> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
> >> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
> >> cut = 9999.0, nscm = 500,
> >> igb = 5, offset = 0.09,
> >> numexchg = 5,
> >> &end
> >>
> >> and rem.in.000
> >>
> >> thanks, Marcela
> >>
> >>> On Jul 2, 2019, at 4:23 PM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
> >>>
> >>> Interesting. I have attached my rem_2rep_gb test directory as a
> >>> reference.
> >>> I just want to make sure you do not have a corrupted version of AMBER.
> >>> Check if there are differences between your files and the attached
> >>> files.
> >>> After that, can you comment out lines 52-57, rerun the test, and let us
> >>> know the output of the rem.out.000 file? There is usually more
> >>> information in that file if a program fails.
> >>>
> >>> Also, do you have any issues running other tests besides REMD tests?
> >>>
> >>> On Tue, Jul 2, 2019 at 2:59 PM Marcela Madrid <mmadrid.psc.edu> wrote:
> >>>
> >>>> Thanks Koushik,
> >>>>
> >>>> The user is getting this same error about not finding the input
> >>>> files.
> >>>> That is why I am doing these tests for her.
> >>>> I run from the directory $AMBERHOME/test/cuda/remd/rem_2rep_gb
> >>>> export DO_PARALLEL="mpirun -np 2"
> >>>> with 2 GPUs and number of tasks = 2.
> >>>> And this is the error that I am getting:
> >>>>
> >>>>> ./Run.rem.sh
> >>>>> No precision model specified. Defaulting to DPFP.
> >>>>>
> >>>>> --------------------------------------------------------------------------------
> >>>>> Two replica GB REMD test.
> >>>>>
> >>>>> Running multipmemd version of pmemd Amber18
> >>>>> Total processors = 2
> >>>>> Number of groups = 2
> >>>>>
> >>>>>
> >>>>> Unit 5 Error on OPEN: rem.in.001
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> Unit 5 Error on OPEN: rem.in.001
> >>>>
> >>>>
> >>>>
> >>>>> Abort(1) on node 1 (rank 1 in comm 0): application called
> >>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >>>>> ./Run.rem.sh: Program error
> >>>>>
> >>>> I commented out the line at the end of Run.rem.sh so that it does not
> >>>> erase the input files, and they are there:
> >>>> more rem.in.001
> >>>>
> >>>> Ala3 GB REMD
> >>>> &cntrl
> >>>> imin = 0, nstlim = 100, dt = 0.002,
> >>>> ntx = 5, irest = 1, ig = -71277,
> >>>> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
> >>>> ioutfm = 0, ntxo = 1,
> >>>> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
> >>>> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
> >>>> cut = 9999.0, nscm = 500,
> >>>> igb = 5, offset = 0.09,
> >>>> numexchg = 5,
> >>>> &end
> >>>>
> >>>> So at least I know that you are running the same way we are, and not
> >>>> getting an error message.
> >>>> It is quite puzzling.
> >>>>
> >>>> Marcela
> >>>>
> >>>>
> >>>>
> >>>>> On Jul 2, 2019, at 11:56 AM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
> >>>>>
> >>>>> Hi Marcela,
> >>>>>
> >>>>> Our lab also uses a Slurm queuing system. We use the script below,
> >>>>> which is similar to yours, to submit 2-replica REMD jobs to one node.
> >>>>>
> >>>>> #!/bin/bash
> >>>>> #SBATCH -N 1
> >>>>> #SBATCH --tasks-per-node 2
> >>>>> #SBATCH --gres=gpu:2
> >>>>>
> >>>>> mpirun -np 2 /opt/amber/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupremd
> >>>>>
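> >>>>>
> >>>>> In case the groupfile format is unfamiliar: groupremd is just a
> >>>>> plain-text file with one line of pmemd arguments per replica (for
> >>>>> temperature REMD, a -rem 1 flag also goes on the pmemd command
> >>>>> line). A minimal two-replica sketch, with placeholder file names:
> >>>>>
> >>>>> -O -i rem.in.000 -o rem.out.000 -p prmtop -c rem.rst.000 -r rem.r.000 -x rem.x.000 -inf rem.info.000
> >>>>> -O -i rem.in.001 -o rem.out.001 -p prmtop -c rem.rst.001 -r rem.r.001 -x rem.x.001 -inf rem.info.001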
> >>>>> So, I do not see anything wrong with your submission script. Since
> >>>>> your CPU jobs run fine, I think there might be some issue with the
> >>>>> way the GPUs are configured on your cluster. Note: CPU REMD jobs
> >>>>> require 2 CPUs per replica, whereas GPU REMD jobs require only 1 GPU
> >>>>> per replica.
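> >>>>>
> >>>>> As a quick sanity check of the GPU configuration, from inside an
> >>>>> interactive job on one of your GPU nodes you could run something
> >>>>> like:
> >>>>>
> >>>>> nvidia-smi                   # should list both P100s
> >>>>> echo $CUDA_VISIBLE_DEVICES   # with --gres=gpu:2, Slurm should expose both devices, e.g. 0,1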
> >>>>>
> >>>>> I just ran the test cases with 2 and 4 replicas; they all pass for
> >>>>> me. If you are having issues with the test cases, I think something
> >>>>> might be wrong with the way files are being sourced. I don't think it
> >>>>> is a compiler issue either. We use GNU compilers on our cluster and
> >>>>> all tests pass for us.
> >>>>>
> >>>>> Can you run the test cases inside the directory that David Case
> >>>>> pointed out? There is a Run.rem.sh file inside the
> >>>>> AMBERHOME/test/cuda/remd/rem_2rep_gb directory. Executing this file
> >>>>> should not give the error message that input files were not found. If
> >>>>> this doesn't work, then can you post the error the user had? They
> >>>>> might have had a different error instead of input files not being
> >>>>> found.
> >>>>>
> >>>>> .David Case: I looked at the files in the test/cuda/remd folder.
> >>>>> rem_gb_2rep, rem_gb_4rep, and rem_wat are not used at all. Deleting
> >>>>> those folders did not affect any of the test cases; they all passed.
> >>>>>
> >>>>> Best,
> >>>>> Koushik
> >>>>> Carlos Simmerling Lab
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Jul 2, 2019 at 9:57 AM Marcela Madrid <mmadrid.psc.edu> wrote:
> >>>>>
> >>>>>> Hi Dave,
> >>>>>>
> >>>>>> thanks for your answer. It is not just a problem with the test
> >>>>>> examples. It is a problem whenever we try to run REMD on the GPUs on
> >>>>>> Bridges at the PSC.
> >>>>>> The reason I am looking at it is that a user wants to run it. REMD
> >>>>>> on the CPUs works fine (with the corresponding executable, of
> >>>>>> course); it is just a problem with the GPUs. So it occurred to me to
> >>>>>> see if it passes the tests, and we get the same error messages. The
> >>>>>> user has her input files in the directory where she runs.
> >>>>>>
> >>>>>> I think it is either a problem with the configuration of the GPU
> >>>>>> nodes on Bridges or a bug.
> >>>>>> Each Bridges node has 24 cores and 2 P100 GPUs. I have asked for 1
> >>>>>> node, ntasks-per-node=2, and the 2 GPUs, but I get the error message
> >>>>>> about not finding the input files.
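> >>>>>>
> >>>>>> The relevant part of my batch script looks roughly like this (a
> >>>>>> sketch; module loading and the mpirun line are omitted):
> >>>>>>
> >>>>>> #SBATCH -N 1
> >>>>>> #SBATCH --ntasks-per-node=2
> >>>>>> #SBATCH --gres=gpu:p100:2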
> >>>>>> Amber on GPUs was compiled with
> >>>>>> ./configure -cuda -mpi gnu
> >>>>>> Attempting to compile with the Intel compilers instead of gnu gave
> >>>>>> error messages:
> >>>>>>
> >>>>>> O3 -ccbin icpc -o cuda_mg_wrapper.o -c cuda_mg_wrapper.cu
> >>>>>> In file included from /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/host_config.h(50),
> >>>>>>                  from /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/cuda_runtime.h(78),
> >>>>>>                  from cuda_mg_wrapper.cu(0):
> >>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/crt/host_config.h(79):
> >>>>>> error: #error directive: -- unsupported ICC configuration! Only
> >>>>>> ICC 15.0, ICC 16.0, and ICC 17.0 on Linux x86_64 are supported!
> >>>>>> #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, and
> >>>>>> ICC 17.0 on Linux x86_64 are supported!
> >>>>>>
> >>>>>> We do not have such old versions of the compilers. Any hints as to
> >>>>>> how to run REMD on the GPUs will be appreciated. Thanks so much,
> >>>>>>
> >>>>>> Marcela
> >>>>>>
> >>>>>>
> >>>>>> On Jul 2, 2019, at 9:01 AM, David A Case <david.case.rutgers.edu> wrote:
> >>>>>>>
> >>>>>>> On Mon, Jul 01, 2019, Marcela Madrid wrote:
> >>>>>>>
> >>>>>>>>> Two replica GB REMD test.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Unit 5 Error on OPEN: rem.in.001
> >>>>>>>
> >>>>>>> OK, a query for the REMD experts: in AMBERHOME/test/cuda/remd there
> >>>>>>> are two directories, rem_2rep_gb and rem_gb_2rep. The rem.in.00?
> >>>>>>> files are in the former, but the tests actually get run in the
> >>>>>>> latter directory.
> >>>>>>>
> >>>>>>> Same general problem for rem_2rep_pme: the needed rem.in.00? files
> >>>>>>> are in rem_wat_2 (or maybe in rem_wat).
> >>>>>>>
> >>>>>>> I'm probably missing something here, but cleaning up (or at least
> >>>>>>> commenting) the cuda/remd test folder seems worthwhile: there are
> >>>>>>> folders that seem never to be used, and input files that seem to
> >>>>>>> be in the wrong place.
> >>>>>>>
> >>>>>>> Marcela: I'd ignore these failures for now; something should get
> >>>>>>> posted here that either fixes the problem or figures out a problem
> >>>>>>> with your inputs. (My money is on the former.)
> >>>>>>>
> >>>>>>> ...dac
> >>>>>>>
> >>>>>>>
> >>> <rem_2rep_gb.tar.gz>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jul 03 2019 - 14:00:02 PDT