Which ones would those be? There are some other failures on 4 GPUs, but I am not sure which ones they are.
I am attaching the log files for the tests for 2 and 4 GPUs.
Marcela
> On Jul 3, 2019, at 3:55 PM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
>
> I do not have any other issues except REMD on GPUs. REMD on CPUs runs
> fine.
>
> Sorry, it wasn't clear from the above line. Do other parallel GPU tests
> pass?
>
>
>
> On Wed, Jul 3, 2019 at 3:45 PM Marcela Madrid <mmadrid.psc.edu> wrote:
>
>> hi, thanks Koushik.
>>
>> I do not think it is just the test case, as the user is having the same
>> problem.
>> I do not have any other issues except REMD on GPUs; REMD on CPUs runs fine.
>> That is why I fear I may be assigning the GPUs incorrectly.
>> I wonder if you have an account at the PSC and would like to try it there, or
>> we could give you one.
>>
>> I asked for an interactive session with 2 GPUs and 2 tasks per node:
>>
>> interact -p GPU --gres=gpu:p100:2 -n 2
>> export DO_PARALLEL="mpirun -np 2"
>> ./Run.rem.sh
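>>
>> To check whether the GPUs are being assigned the way I think (just a sketch;
>> it assumes nvidia-smi is on the path inside the interactive session), I can
>> also run:
>>
>> echo $CUDA_VISIBLE_DEVICES   # the devices SLURM handed to this job
>> nvidia-smi -L                # the GPUs the node actually exposes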
>>
>> I diffed the files that you sent me and they are identical. I commented out
>> the last lines in the Run script, but no other output is produced, only:
>>
>>> rem_2rep_gb]$ ./Run.rem.sh
>>> No precision model specified. Defaulting to DPFP.
>>>
>>> --------------------------------------------------------------------------------
>>> Two replica GB REMD test.
>>>
>>> Running multipmemd version of pmemd Amber18
>>> Total processors = 2
>>> Number of groups = 2
>>>
>>>
>>> Unit 5 Error on OPEN: rem.in.001
>>
>>
>>
>>>
>>> Unit 5 Error on OPEN: rem.in.001
>>
>>
>>
>>> Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>> ./Run.rem.sh: Program error
>>
>> and the two input files, which are not deleted because I commented out the
>> line that deletes them:
>>
>> more rem.in.001
>> Ala3 GB REMD
>> &cntrl
>> imin = 0, nstlim = 100, dt = 0.002,
>> ntx = 5, irest = 1, ig = -71277,
>> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
>> ioutfm = 0, ntxo = 1,
>> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
>> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
>> cut = 9999.0, nscm = 500,
>> igb = 5, offset = 0.09,
>> numexchg = 5,
>> &end
>>
>> and rem.in.000
>>
>> thanks, Marcela
>>
>>> On Jul 2, 2019, at 4:23 PM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
>>>
>>> Interesting. I have attached my rem_2rep_gb test directory as a reference.
>>> I just want to make sure you do not have a corrupted version of AMBER.
>>> Check whether there are differences between your files and the attached files.
>>> After that, can you comment out lines 52-57, run the test, and let us know
>>> the output of the rem.out.000 file? There is usually more information in that
>>> file when a program fails.
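>>>
>>> Something along these lines should do it (just a sketch; adjust the 52,57
>>> range if your copy of Run.rem.sh differs):
>>>
>>> sed -i.bak '52,57 s/^/#/' Run.rem.sh   # comment out the cleanup block, keeping a backup
>>> ./Run.rem.sh
>>> cat rem.out.000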
>>>
>>> Also, do you have any issues running other tests besides REMD tests?
>>>
>>> On Tue, Jul 2, 2019 at 2:59 PM Marcela Madrid <mmadrid.psc.edu> wrote:
>>>
>>>> Thanks Koushik,
>>>>
>>>> The user is getting this same error about not finding the input files.
>>>> That is why I am doing these tests for her.
>>>> I run from the directory $AMBERHOME/test/cuda/remd/rem_2rep_gb
>>>> export DO_PARALLEL="mpirun -np 2"
>>>> with 2 GPUs and number of tasks = 2
>>>> And this is the error that I am getting:
>>>>
>>>>> ./Run.rem.sh
>>>>> No precision model specified. Defaulting to DPFP.
>>>>>
>>>>
>>>>> --------------------------------------------------------------------------------
>>>>> Two replica GB REMD test.
>>>>>
>>>>> Running multipmemd version of pmemd Amber18
>>>>> Total processors = 2
>>>>> Number of groups = 2
>>>>>
>>>>>
>>>>> Unit 5 Error on OPEN: rem.in.001
>>>>
>>>>
>>>>
>>>>>
>>>>> Unit 5 Error on OPEN: rem.in.001
>>>>
>>>>
>>>>
>>>>> Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>>>> ./Run.rem.sh: Program error
>>>>>
>>>> I commented out the line at the end of Run.rem.sh so that it does not erase
>>>> the input files, and they are there:
>>>> more rem.in.001
>>>>
>>>> Ala3 GB REMD
>>>> &cntrl
>>>> imin = 0, nstlim = 100, dt = 0.002,
>>>> ntx = 5, irest = 1, ig = -71277,
>>>> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
>>>> ioutfm = 0, ntxo = 1,
>>>> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
>>>> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
>>>> cut = 9999.0, nscm = 500,
>>>> igb = 5, offset = 0.09,
>>>> numexchg = 5,
>>>> &end
>>>>
>>>> So at least I know that you are running it the same way we are and not
>>>> getting an error message.
>>>> It is quite puzzling.
>>>>
>>>> Marcela
>>>>
>>>>
>>>>
>>>>> On Jul 2, 2019, at 11:56 AM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
>>>>>
>>>>> Hi Marcela,
>>>>>
>>>>> Our lab also uses a slurm queuing system. We use the below script, which is
>>>>> similar to your script, to submit 2 replica REMD jobs to one node.
>>>>>
>>>>> #!/bin/bash
>>>>> #SBATCH -N 1
>>>>> #SBATCH --tasks-per-node 2
>>>>> #SBATCH --gres=gpu:2
>>>>>
>>>>> mpirun -np 2 /opt/amber/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupremd
>>>>>
>>>>> So, I do not see anything wrong with your submission script. Since your CPU
>>>>> jobs run fine, I think there might be some issue with the way the GPUs are
>>>>> configured on your cluster. Note: CPU REMD jobs require 2 cpus per replica,
>>>>> whereas the GPU REMD jobs require only 1 gpu per replica.
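>>>>>
>>>>> Concretely, for a two-replica run the launch commands would look roughly
>>>>> like this (a sketch only; the groupfile name and paths are placeholders):
>>>>>
>>>>> # CPU build: 2 replicas x 2 cpus per replica = 4 MPI ranks
>>>>> mpirun -np 4 $AMBERHOME/bin/pmemd.MPI -O -ng 2 -groupfile groupremd
>>>>> # GPU build: 2 replicas x 1 gpu per replica = 2 MPI ranks on 2 GPUs
>>>>> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupremd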
>>>>>
>>>>> I just ran the test cases with 2 and 4 replicas; they all pass for me. If
>>>>> you are having issues with the test cases, I think something might be wrong
>>>>> with the way files are being sourced. I don't think it is a compiler issue
>>>>> either. We use gnu compilers on our cluster and all tests pass for us.
>>>>>
>>>>> Can you run the test cases inside the directory that David Case pointed out?
>>>>> There is a Run.rem.sh file inside the AMBERHOME/test/cuda/remd/rem_2rep_gb
>>>>> directory. Executing this file should not give the error message that input
>>>>> files were not found. If this doesn't work, then can you post the error the
>>>>> user had? They might have had a different error instead of input files not
>>>>> being found.
>>>>>
>>>>> @David Case: I looked at the files in the test/cuda/remd folder. rem_gb_2rep,
>>>>> rem_gb_4rep, and rem_wat are not used at all. Deleting those folders did not
>>>>> affect any of the test cases; they all passed.
>>>>>
>>>>> Best,
>>>>> Koushik
>>>>> Carlos Simmerling Lab
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 2, 2019 at 9:57 AM Marcela Madrid <mmadrid.psc.edu> wrote:
>>>>>
>>>>>> hi Dave,
>>>>>>
>>>>>> thanks for your answer. It is not just a problem with the test examples; it
>>>>>> is a problem whenever we try to run REMD on the GPUs on Bridges at the PSC.
>>>>>> The reason I am looking at it is that a user wants to run it. REMD on the
>>>>>> CPUs works fine (with the corresponding executable, of course); it is just a
>>>>>> problem with the GPUs. So it occurred to me to see whether it passes the
>>>>>> tests and whether we get the same error messages. The user has her input
>>>>>> files in the directory where she runs.
>>>>>>
>>>>>> I think it is either a problem with the configuration of the GPU nodes on
>>>>>> Bridges or a bug.
>>>>>> Each Bridges node has 24 cores and 2 P100 GPUs. I have asked for 1 node,
>>>>>> ntasks-per-node=2, and the 2 GPUs, but I get the error message about not
>>>>>> finding the input files.
>>>>>> Amber on GPUs was compiled with
>>>>>> ./configure -cuda -mpi gnu
>>>>>> Attempting to compile with the intel compilers instead of gnu gave error
>>>>>> messages:
>>>>>>
>>>>>> O3 -ccbin icpc -o cuda_mg_wrapper.o -c cuda_mg_wrapper.cu
>>>>>> In file included from
>>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/host_config.h(50),
>>>>>> from
>>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/cuda_runtime.h(78),
>>>>>> from cuda_mg_wrapper.cu(0):
>>>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/crt/host_config.h(79):
>>>>>> error: #error directive: -- unsupported ICC configuration! Only
>>>>>> ICC 15.0, ICC 16.0, and ICC 17.0 on Linux x86_64 are supported!
>>>>>> #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, and
>>>>>> ICC 17.0 on Linux x86_64 are supported!
>>>>>>
>>>>>> We do not have such old versions of the compilers. Any hints as to how to
>>>>>> run REMD on the GPUs would be appreciated. Thanks so much,
>>>>>>
>>>>>> Marcela
>>>>>>
>>>>>>
>>>>>>> On Jul 2, 2019, at 9:01 AM, David A Case <david.case.rutgers.edu> wrote:
>>>>>>>
>>>>>>> On Mon, Jul 01, 2019, Marcela Madrid wrote:
>>>>>>>
>>>>>>>>> Two replica GB REMD test.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Unit 5 Error on OPEN: rem.in.001
>>>>>>>
>>>>>>> OK: query for the REMD experts: in AMBERHOME/test/cuda/remd there are
>>>>>>> two directories: rem_2rep_gb and rem_gb_2rep. The rem.in.00? files are
>>>>>>> in the former, but the tests actually get run in the latter directory.
>>>>>>>
>>>>>>> Same general problem for rem_2rep_pme: the needed rem.in.00? files are
>>>>>>> in rem_wat_2 (or maybe in rem_wat).
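>>>>>>>
>>>>>>> A quick way to check (a sketch, run from AMBERHOME/test/cuda/remd):
>>>>>>>
>>>>>>> find . -name 'rem.in.*'                 # which subdirectories hold the inputs
>>>>>>> grep -rl --include='Run.*' 'rem.in' .   # which Run scripts reference them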
>>>>>>>
>>>>>>> I'm probably missing something here, but cleaning up (or at least
>>>>>>> commenting) the cuda/remd test folder seems worthwhile: there are folders
>>>>>>> that seem never to be used, and input files that seem to be in the wrong
>>>>>>> place.
>>>>>>>
>>>>>>> Marcela: I'd ignore these failures for now; something should get posted
>>>>>>> here that either fixes the problem, or figures out a problem with your
>>>>>>> inputs. (My money is on the former.)
>>>>>>>
>>>>>>> ...dac
>>>>>>>
>>>>>>>
>>> <rem_2rep_gb.tar.gz>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Wed Jul 03 2019 - 13:30:02 PDT