Ohh!! I see what the issue is. REMD jobs use multiple input files - one
file for each replica. In your case, it is always the second input file
(rem.in.001) that isn’t found. Have you tried it on the K80 nodes on your
cluster?
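
As a quick sanity check, and just a sketch that assumes you launch it from the
rem_2rep_gb test directory after the rem.in.00? files have been written, you
could verify that both MPI ranks land in the same working directory and can
see both per-replica inputs:

mpirun -np 2 bash -c 'echo "$(hostname): $(pwd)"; ls -l rem.in.000 rem.in.001'

If one of the two ranks reports a different directory or cannot list
rem.in.001, that would explain the "Unit 5 Error on OPEN".
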
On Wed, Jul 3, 2019 at 4:53 PM koushik kasavajhala <koushik.sbiiit.gmail.com>
wrote:
> Sorry, I don’t see the attachments. Can you resend? The 2 GPU test log
> file (makecudatestmpi2.log?) should be sufficient.
>
> I just want to make sure it is only a REMD issue and does not affect other
> parallel gpu jobs.
>
> On Wed, Jul 3, 2019 at 4:25 PM Marcela Madrid <mmadrid.psc.edu> wrote:
>
>> Which ones would those be? There are some other failures on 4 GPUs, but I
>> am not sure.
>> I am attaching the log files for the tests for 2 and 4 GPUs.
>>
>> Marcela
>>
>>
>> > On Jul 3, 2019, at 3:55 PM, koushik kasavajhala <
>> koushik.sbiiit.gmail.com> wrote:
>> >
>> > I do not have any other issues except REMD on GPUs. REMD on CPUs runs
>> > fine.
>> >
>> > Sorry, it wasn't clear from the above line. Do other parallel GPU tests
>> > pass?
>> >
>> >
>> >
>> > On Wed, Jul 3, 2019 at 3:45 PM Marcela Madrid <mmadrid.psc.edu> wrote:
>> >
>> >> hi, thanks Koushik.
>> >>
>> >> I do not think it is just the test case, as the user is having the same
>> >> problem.
>> >> I do not have any other issues except REMD on GPUs. REMD on CPUs runs
>> fine.
>> >> That is why I fear I may be assigning the GPUs incorrectly.
>> >> I wonder if you have an account at the PSC and would like to try, or we
>> >> can give you one?
>> >>
>> >> I asked for an interactive session with 2 GPUs and 2 tasks per node:
>> >>
>> >> interact -p GPU --gres=gpu:p100:2 -n 2
>> >> export DO_PARALLEL="mpirun -np 2"
>> >> ./Run.rem.sh
>> >>
>> >> I diffed the files you sent me against mine and they are identical. I
>> >> commented out the last lines in the Run script,
>> >> but no other output is produced, only:
>> >>
>> >>> rem_2rep_gb]$ ./Run.rem.sh
>> >>> No precision model specified. Defaulting to DPFP.
>> >>>
>> >>
>> --------------------------------------------------------------------------------
>> >>> Two replica GB REMD test.
>> >>>
>> >>> Running multipmemd version of pmemd Amber18
>> >>> Total processors = 2
>> >>> Number of groups = 2
>> >>>
>> >>>
>> >>> Unit 5 Error on OPEN: rem.in.001
>> >>
>> >>
>> >>
>> >>>
>> >>> Unit 5 Error on OPEN: rem.in.001
>> >>
>> >>
>> >>
>> >>> Abort(1) on node 1 (rank 1 in comm 0): application called
>> >> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>> >>> ./Run.rem.sh: Program error
>> >>
>> >> and the two input files that are not deleted because I commented out
>> the
>> >> line to delete them:
>> >>
>> >> more rem.in.001
>> >> Ala3 GB REMD
>> >> &cntrl
>> >> imin = 0, nstlim = 100, dt = 0.002,
>> >> ntx = 5, irest = 1, ig = -71277,
>> >> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
>> >> ioutfm = 0, ntxo = 1,
>> >> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
>> >> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
>> >> cut = 9999.0, nscm = 500,
>> >> igb = 5, offset = 0.09,
>> >> numexchg = 5,
>> >> &end
>> >>
>> >> and rem.in.000 is also present.
>> >>
>> >> thanks, Marcela
>> >>
>> >>> On Jul 2, 2019, at 4:23 PM, koushik kasavajhala <
>> >> koushik.sbiiit.gmail.com> wrote:
>> >>>
>> >>> Interesting. I have attached my rem_2rep_gb test directory as a
>> >> reference.
>> >>> I just want to make sure you do not have a corrupted version of AMBER.
>> >>> Check if there are differences between your files and the attached
>> files.
>> >>> After that, can you comment out lines 52-57, run the test, and let us
>> >>> know the output of the rem.out.000 file? There is usually more information
>> >>> in that file when a program fails.
>> >>>
>> >>> Also, do you have any issues running other tests besides REMD tests?
>> >>>
>> >>> On Tue, Jul 2, 2019 at 2:59 PM Marcela Madrid <mmadrid.psc.edu>
>> wrote:
>> >>>
>> >>>> Thanks Koushik,
>> >>>>
>> >>>> The user is getting this same error about not finding the input
>> files.
>> >>>> That is why I am doing these tests for her.
>> >>>> I run from the directory $AMBERHOME/test/cuda/remd/rem_2rep_gb
>> >>>> export DO_PARALLEL="mpirun -np 2"
>> >>>> with 2 GPUs and number of tasks =2
>> >>>> And this is the error that I am getting:
>> >>>>
>> >>>>> ./Run.rem.sh
>> >>>>> No precision model specified. Defaulting to DPFP.
>> >>>>>
>> >>>>
>> >>
>> --------------------------------------------------------------------------------
>> >>>>> Two replica GB REMD test.
>> >>>>>
>> >>>>> Running multipmemd version of pmemd Amber18
>> >>>>> Total processors = 2
>> >>>>> Number of groups = 2
>> >>>>>
>> >>>>>
>> >>>>> Unit 5 Error on OPEN: rem.in.001
>> >>>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Unit 5 Error on OPEN: rem.in.001
>> >>>>
>> >>>>
>> >>>>
>> >>>>> Abort(1) on node 1 (rank 1 in comm 0): application called
>> >>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>> >>>>> ./Run.rem.sh: Program error
>> >>>>>
>> >>>> I commented out the line at the end of Run.rem.sh so that it does
>> not
>> >>>> erase the input files and they are there:
>> >>>> more rem.in.001
>> >>>>
>> >>>> Ala3 GB REMD
>> >>>> &cntrl
>> >>>> imin = 0, nstlim = 100, dt = 0.002,
>> >>>> ntx = 5, irest = 1, ig = -71277,
>> >>>> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
>> >>>> ioutfm = 0, ntxo = 1,
>> >>>> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
>> >>>> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
>> >>>> cut = 9999.0, nscm = 500,
>> >>>> igb = 5, offset = 0.09,
>> >>>> numexchg = 5,
>> >>>> &end
>> >>>>
>> >>>> So at least I know that you are running the same way we are and not
>> >>>> getting the error message.
>> >>>> It is quite puzzling.
>> >>>>
>> >>>> Marcela
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Jul 2, 2019, at 11:56 AM, koushik kasavajhala <
>> >>>> koushik.sbiiit.gmail.com> wrote:
>> >>>>>
>> >>>>> Hi Marcela,
>> >>>>>
>> >>>>> Our lab also uses a slurm queuing system. We use the below script,
>> >> which
>> >>>> is
>> >>>>> similar to your script, to submit 2 replica REMD jobs to one node.
>> >>>>>
>> >>>>> #!/bin/bash
>> >>>>> #SBATCH -N 1
>> >>>>> #SBATCH --tasks-per-node 2
>> >>>>> #SBATCH --gres=gpu:2
>> >>>>>
>> >>>>> mpirun -np 2 /opt/amber/bin/pmemd.cuda.MPI -O -ng 2 -groupfile
>> >>>> groupremd
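>> >>>>>
>> >>>>> For reference, the groupfile is just one line of pmemd command-line
>> >>>>> arguments per replica. A minimal two-replica sketch (the topology and
>> >>>>> coordinate file names below are placeholders, not the actual test files)
>> >>>>> looks like:
>> >>>>>
>> >>>>> -O -i rem.in.000 -o rem.out.000 -p prmtop -c inpcrd.000 -r restrt.000 -x mdcrd.000 -inf mdinfo.000
>> >>>>> -O -i rem.in.001 -o rem.out.001 -p prmtop -c inpcrd.001 -r restrt.001 -x mdcrd.001 -inf mdinfo.001
>> >>>>>
>> >>>>> Each replica opens only the -i file named on its own line.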
>> >>>>>
>> >>>>> So, I do not see anything wrong with your submission script. Since
>> your
>> >>>> CPU
>> >>>>> jobs run fine, I think there might be some issue with the way the
>> GPUs
>> >>>> are
>> >>>>> configured on your cluster. Note: CPU REMD jobs require 2 CPUs per
>> >>>>> replica, whereas GPU REMD jobs require only 1 GPU per replica.
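>> >>>>>
>> >>>>> As a rough sketch of what that means for the CPU case (numbers assumed for
>> >>>>> a 2-replica job; adjust to your node), the submission would request at
>> >>>>> least 4 tasks and use pmemd.MPI instead:
>> >>>>>
>> >>>>> #SBATCH -N 1
>> >>>>> #SBATCH --tasks-per-node 4
>> >>>>> mpirun -np 4 /opt/amber/bin/pmemd.MPI -O -ng 2 -groupfile groupremd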
>> >>>>>
>> >>>>> I just ran the test cases with 2 and 4 replicas, and they all pass for me. If
>> >>>>> you are having issues with the test cases, I think something might
>> be
>> >>>> wrong
>> >>>>> with the way files are being sourced. I don't think it is a compiler
>> >>>> issue
>> >>>>> either. We use gnu compilers on our cluster and all tests pass for
>> us.
>> >>>>>
>> >>>>> Can you run the test cases inside the directory that David Case
>> pointed
>> >>>>> out? There is a Run.rem.sh file inside the
>> >>>>> AMBERHOME/test/cuda/remd/rem_2rep_gb directory. Executing this file
>> >>>>> should not give the error message that the input files were not found.
>> >>>>> If this doesn't work, then can you post the error the user had? They
>> >>>>> might have had a different error instead of the input files not being found.
>> >>>>>
>> >>>>> @David Case: I looked at the files in the test/cuda/remd folder.
>> >> rem_gb_2rep,
>> >>>>> rem_gb_4rep, rem_wat are not used at all. Deleting those folders did
>> >> not
>> >>>>> affect any of the test cases; they all passed.
>> >>>>>
>> >>>>> Best,
>> >>>>> Koushik
>> >>>>> Carlos Simmerling Lab
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Jul 2, 2019 at 9:57 AM Marcela Madrid <mmadrid.psc.edu>
>> wrote:
>> >>>>>
>> >>>>>> hi Dave,
>> >>>>>>
>> >>>>>> thanks for your answer. It is not just a problem with the test
>> >> examples.
>> >>>>>> It is a problem whenever we try to run REMD on the GPUs on Bridges
>> at
>> >>>> the
>> >>>>>> PSC.
>> >>>>>> The reason I am looking at it is that a user wants to run it. REMD on the
>> >>>>>> CPUs works fine (with the corresponding executable, of course);
>> >>>>>> it is only a problem on the GPUs. So it occurred to me to check whether it
>> >>>>>> passes the tests and gives the same error
>> >>>>>> messages. The user has her input files in the directory where she runs.
>> >>>>>>
>> >>>>>> I think it is either a problem with the configuration of the GPU nodes
>> >>>>>> on Bridges or a bug.
>> >>>>>> Each Bridges node has 24 cores and 2 P100 GPUs. I have asked for 1 node,
>> >>>>>> ntasks-per-node=2 and the 2 GPUs,
>> >>>>>> but I get the error message about not finding the input files.
>> >>>>>> Amber on GPUs was compiled with
>> >>>>>> ./configure -cuda -mpi gnu
>> >>>>>> Attempting to compile with the Intel compilers instead of gnu gave error
>> >>>>>> messages:
>> >>>>>>
>> >>>>>> O3 -ccbin icpc -o cuda_mg_wrapper.o -c cuda_mg_wrapper.cu
>> >>>>>> In file included from
>> >>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/host_config.h(50),
>> >>>>>> from
>> >>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/cuda_runtime.h(78),
>> >>>>>> from cuda_mg_wrapper.cu(0):
>> >>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/crt/host_config.h(79):
>> >>>>>> error: #error directive: -- unsupported ICC configuration! Only
>> >>>>>> ICC 15.0, ICC 16.0, and ICC 17.0 on Linux x86_64 are supported!
>> >>>>>> #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, and
>> >>>>>> ICC 17.0 on Linux x86_64 are supported!
>> >>>>>>
>> >>>>>> We do not have such old versions of the compilers. Any hints on how to
>> >>>>>> run REMD on the GPUs will be appreciated. Thanks so much,
>> >>>>>>
>> >>>>>> Marcela
>> >>>>>>
>> >>>>>>
>> >>>>>>> On Jul 2, 2019, at 9:01 AM, David A Case <david.case.rutgers.edu>
>> >>>> wrote:
>> >>>>>>>
>> >>>>>>> On Mon, Jul 01, 2019, Marcela Madrid wrote:
>> >>>>>>>
>> >>>>>>>>> Two replica GB REMD test.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Unit 5 Error on OPEN: rem.in.001
>> >>>>>>>
>> >>>>>>> OK: query for the REMD experts: in AMBERHOME/test/cuda/remd there
>> are
>> >>>>>>> two directories: rem_2rep_gb and rem_gb_2rep. The rem.in.00?
>> files
>> >> are
>> >>>>>>> in the former, but the tests actually get run in the latter
>> >> directory.
>> >>>>>>>
>> >>>>>>> Same general problem for rem_2rep_pme: the needed rem.in.00? files
>> >> are
>> >>>>>>> in rem_wat_2 (or maybe in rem_wat).
>> >>>>>>>
>> >>>>>>> I'm probably missing something here, but cleaning up (or at least
>> >>>>>>> commenting) the cuda/remd test folder seems worthwhile: there are
>> >>>>>>> folders that seem never to be used, and input files that seem to
>> be
>> >> in
>> >>>>>>> the wrong place.
>> >>>>>>>
>> >>>>>>> Marcela: I'd ignore these failures for now; something should get
>> >> posted
>> >>>>>>> here that either fixes the problem, or figures out a problem with
>> >> your
>> >>>>>>> inputs. (My money is on the former.)
>> >>>>>>>
>> >>>>>>> ...dac
>> >>>>>>>
>> >>>>>>>
>> >>> <rem_2rep_gb.tar.gz>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jul 03 2019 - 14:30:02 PDT