Re: [AMBER] REMD error from Marcela Madrid on 2019-07-03 (Amber Archive Jul 2019)

From: Marcela Madrid <mmadrid.psc.edu>
Date: Wed, 3 Jul 2019 15:44:47 -0400

Hi, thanks Koushik.

I do not think it is just the test case, as the user is having the same problem.
I do not have any other issues except REMD on GPUs; REMD on CPUs runs fine.
That is why I fear that I may be assigning the GPUs incorrectly.
I wonder whether you have an account at the PSC and would like to try, or we can give you one.

I asked for an interactive session with 2 GPUs and 2 tasks per node:

interact -p GPU --gres=gpu:p100:2 -n 2
export DO_PARALLEL="mpirun -np 2"
./Run.rem.sh
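
One way to test the concern about GPU assignment is to have every MPI rank report what it can see. The following is a minimal sketch, not part of Amber: `rank_report` is a hypothetical helper that prints the host name, the working directory, whether the given input file is readable, and the value of `CUDA_VISIBLE_DEVICES` (which the scheduler may or may not set on a given cluster).

```shell
#!/bin/sh
# Hypothetical helper (not an Amber tool): print what this process can see.
rank_report() {
    input=$1
    if [ -r "$input" ]; then
        status=readable
    else
        status=NOT-readable
    fi
    # One line per process: host, working directory, file status, visible GPUs.
    printf '%s cwd=%s %s=%s CUDA_VISIBLE_DEVICES=%s\n' \
        "$(uname -n)" "$(pwd)" "$input" "$status" \
        "${CUDA_VISIBLE_DEVICES:-unset}"
}

# Default to the file named in the error message if no argument is given.
rank_report "${1:-rem.in.001}"
```

Saved as a script (the name `rank_report.sh` is an assumption) and started with the same launcher as the test, for example `mpirun -np 2 sh rank_report.sh rem.in.001`, it would show whether both ranks start in the same directory and can read the file.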

I diffed my files against the ones that you sent me and they are identical. I commented out the last lines in the Run script,
but no other output is produced, only:

> rem_2rep_gb]$ ./Run.rem.sh
> No precision model specified. Defaulting to DPFP.
> --------------------------------------------------------------------------------
> Two replica GB REMD test.
>
> Running multipmemd version of pmemd Amber18
> Total processors = 2
> Number of groups = 2
>
>
> Unit 5 Error on OPEN: rem.in.001
>
> Unit 5 Error on OPEN: rem.in.001
> Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> ./Run.rem.sh: Program error

and the two input files that are not deleted because I commented out the line to delete them:

more rem.in.001
Ala3 GB REMD
&cntrl
   imin = 0, nstlim = 100, dt = 0.002,
   ntx = 5, irest = 1, ig = -71277,
   ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
   ioutfm = 0, ntxo = 1,
   ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
   ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
   cut = 9999.0, nscm = 500,
   igb = 5, offset = 0.09,
   numexchg = 5,
&end

and rem.in.000
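
Since the failure is an OPEN error on an input file that does exist, it may help to check every file a groupfile names from the directory where the job actually starts. This is a minimal sketch, assuming a groupfile in which each non-empty line holds one replica's command-line options and the input, topology, and coordinate files follow `-i`, `-p`, and `-c`. `check_groupfile` is a hypothetical helper, not an Amber tool.

```shell
#!/bin/sh
# Hypothetical helper: report files named after -i, -p or -c in a groupfile
# that cannot be read from the current working directory.
# Returns 0 if every file is readable, 1 otherwise.
check_groupfile() {
    groupfile=$1
    missing=0
    while read -r line || [ -n "$line" ]; do
        # Skip empty lines and lines starting with '#'.
        case $line in ''|'#'*) continue ;; esac
        # Split the line into words (assumes no spaces inside file names).
        set -- $line
        while [ $# -gt 1 ]; do
            case $1 in
                -i|-p|-c)
                    if [ ! -r "$2" ]; then
                        echo "cannot read: $2"
                        missing=1
                    fi
                    shift ;;
            esac
            shift
        done
    done < "$groupfile"
    return $missing
}
```

Run as `check_groupfile groupfile` from the test directory (the file name `groupfile` is a placeholder), it prints one line per unreadable file. Relative names are resolved against the current directory, which is the same way a process started there would look for them.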

thanks, Marcela

> On Jul 2, 2019, at 4:23 PM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
>
> Interesting. I have attached my rem_2rep_gb test directory as a reference.
> I just want to make sure you do not have a corrupted version of AMBER.
> Check if there are differences between your files and the attached files.
> After that, can you comment out lines 52-57 and run the test and let us
> know the output of rem.out.000 file? There is usually more information in
> that file if a program fails.
>
> Also, do you have any issues running other tests besides REMD tests?
>
> On Tue, Jul 2, 2019 at 2:59 PM Marcela Madrid <mmadrid.psc.edu> wrote:
>
>> Thanks Koushik,
>>
>> The user is getting this same error about not finding the input files.
>> That is why I am doing these tests for her.
>> I run from the directory $AMBERHOME/test/cuda/remd/rem_2rep_gb
>> export DO_PARALLEL="mpirun -np 2"
>> with 2 GPUs and number of tasks =2
>> And this is the error that I am getting:
>>
>>> ./Run.rem.sh
>>> No precision model specified. Defaulting to DPFP.
>>> --------------------------------------------------------------------------------
>>> Two replica GB REMD test.
>>>
>>> Running multipmemd version of pmemd Amber18
>>> Total processors = 2
>>> Number of groups = 2
>>>
>>>
>>> Unit 5 Error on OPEN: rem.in.001
>>>
>>> Unit 5 Error on OPEN: rem.in.001
>>> Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>> ./Run.rem.sh: Program error
>>>
>> I commented out the line at the end of Run.rem.sh so that it does not
>> erase the input files and they are there:
>> more rem.in.001
>>
>> Ala3 GB REMD
>> &cntrl
>> imin = 0, nstlim = 100, dt = 0.002,
>> ntx = 5, irest = 1, ig = -71277,
>> ntwx = 500, ntwe = 0, ntwr = 500, ntpr = 100,
>> ioutfm = 0, ntxo = 1,
>> ntt = 1, tautp = 5.0, tempi = 0.0, temp0 = 350.0,
>> ntc = 2, tol = 0.000001, ntf = 2, ntb = 0,
>> cut = 9999.0, nscm = 500,
>> igb = 5, offset = 0.09,
>> numexchg = 5,
>> &end
>>
>> So at least I know that you are running the same way we are, and not
>> getting an error message.
>> It is quite puzzling.
>>
>> Marcela
>>
>>
>>
>>> On Jul 2, 2019, at 11:56 AM, koushik kasavajhala <koushik.sbiiit.gmail.com> wrote:
>>>
>>> Hi Marcela,
>>>
>>> Our lab also uses a Slurm queuing system. We use the script below, which is
>>> similar to your script, to submit 2-replica REMD jobs to one node.
>>>
>>> #!/bin/bash
>>> #SBATCH -N 1
>>> #SBATCH --tasks-per-node 2
>>> #SBATCH --gres=gpu:2
>>>
>>> mpirun -np 2 /opt/amber/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupremd
>>>
>>> So, I do not see anything wrong with your submission script. Since your CPU
>>> jobs run fine, I think there might be some issue with the way the GPUs are
>>> configured on your cluster. Note: CPU REMD jobs require 2 CPUs per replica,
>>> whereas GPU REMD jobs require only 1 GPU per replica.
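
The rule of thumb in the note above can be written down as a small calculation. This is only a sketch of that statement (2 CPUs per replica for CPU runs, 1 GPU per replica for GPU runs), and `ranks_for_remd` is a hypothetical name, not an Amber command.

```shell
#!/bin/sh
# Hypothetical helper: MPI rank count implied by the rule of thumb quoted above.
# Usage: ranks_for_remd cpu|gpu NREPLICAS
ranks_for_remd() {
    case $1 in
        cpu) echo $(( 2 * $2 )) ;;   # 2 CPU ranks per replica
        gpu) echo $(( $2 )) ;;       # 1 rank (one GPU) per replica
        *)   echo "usage: ranks_for_remd cpu|gpu NREPLICAS" >&2
             return 2 ;;
    esac
}
```

For the two-replica GPU test, `ranks_for_remd gpu 2` gives 2, which matches the `mpirun -np 2` and `-ng 2` values in the script above.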
>>>
>>> I just ran the test cases with 2 and 4 replicas; they all pass for me. If
>>> you are having issues with the test cases, I think something might be wrong
>>> with the way files are being sourced. I don't think it is a compiler issue
>>> either. We use GNU compilers on our cluster and all tests pass for us.
>>>
>>> Can you run the test cases inside the directory that David Case pointed
>>> out? There is a Run.rem.sh file inside the AMBERHOME/test/cuda/remd/rem_2rep_gb
>>> directory. Executing this file should not give the error message that input
>>> files were not found. If this doesn't work, can you post the error the
>>> user had? They might have had a different error instead of input files not
>>> being found.
>>>
>>> @David Case: I looked at the files in the test/cuda/remd folder. rem_gb_2rep,
>>> rem_gb_4rep, and rem_wat are not used at all. Deleting those folders did not
>>> affect any of the test cases; they all passed.
>>>
>>> Best,
>>> Koushik
>>> Carlos Simmerling Lab
>>>
>>>
>>>
>>> On Tue, Jul 2, 2019 at 9:57 AM Marcela Madrid <mmadrid.psc.edu> wrote:
>>>
>>>> hi Dave,
>>>>
>>>> thanks for your answer. It is not just a problem with the test examples.
>>>> It is a problem whenever we try to run REMD on the GPUs on Bridges at the PSC.
>>>> The reason I am looking at it is that a user wants to run it. REMD on the
>>>> CPUs works fine (with the corresponding executable, of course);
>>>> it is just a problem with the GPUs. So it occurred to me to see whether it
>>>> passes the tests, and we get the same error
>>>> messages. The user has her input files in the directory where she runs.
>>>>
>>>> I think it is either a problem with the configuration of the GPU nodes on
>>>> Bridges or a bug.
>>>> Each Bridges node has 24 cores and 2 P100 GPUs. I have asked for 1 node,
>>>> ntasks-per-node=2 and the 2 GPUs,
>>>> but I get the error message about not finding the input files.
>>>> Amber on GPUs was compiled with
>>>> ./configure -cuda -mpi gnu
>>>> Attempting to compile with intel compilers instead of gnu gave error
>>>> messages.
>>>>
>>>> O3 -ccbin icpc -o cuda_mg_wrapper.o -c cuda_mg_wrapper.cu
>>>> In file included from /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/host_config.h(50),
>>>>                  from /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/cuda_runtime.h(78),
>>>>                  from cuda_mg_wrapper.cu(0):
>>>> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/crt/host_config.h(79):
>>>> error: #error directive: -- unsupported ICC configuration! Only
>>>> ICC 15.0, ICC 16.0, and ICC 17.0 on Linux x86_64 are supported!
>>>> #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, and
>>>> ICC 17.0 on Linux x86_64 are supported!
>>>>
>>>> We do not have such old versions of the compilers. Any hints as to how
>>>> to run REMD on the GPUs will be appreciated. Thanks so much,
>>>>
>>>> Marcela
>>>>
>>>>
>>>>> On Jul 2, 2019, at 9:01 AM, David A Case <david.case.rutgers.edu> wrote:
>>>>>
>>>>> On Mon, Jul 01, 2019, Marcela Madrid wrote:
>>>>>
>>>>>>> Two replica GB REMD test.
>>>>>>>
>>>>>>>
>>>>>>> Unit 5 Error on OPEN: rem.in.001
>>>>>
>>>>> OK: query for the REMD experts: in AMBERHOME/test/cuda/remd there are
>>>>> two directories: rem_2rep_gb and rem_gb_2rep. The rem.in.00? files are
>>>>> in the former, but the tests actually get run in the latter directory.
>>>>>
>>>>> Same general problem for rem_2rep_pme: the needed rem.in.00? files are
>>>>> in rem_wat_2 (or maybe in rem_wat).
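
The question of which test folders actually hold the `rem.in.00?` files can be checked directly on disk. This is a minimal sketch using only `find`, `dirname`, and `sort`. `dirs_with_rem_inputs` is a hypothetical helper name, and the result only reflects the files present at the moment it is run.

```shell
#!/bin/sh
# Hypothetical helper: print each directory under ROOT that contains
# at least one file named like rem.in.000, rem.in.001, ...
dirs_with_rem_inputs() {
    find "$1" -type f -name 'rem.in.00?' -exec dirname {} \; | sort -u
}
```

For the tree discussed here it could be called as `dirs_with_rem_inputs "$AMBERHOME/test/cuda/remd"`, and the output compared with the directory each Run script is executed in.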
>>>>>
>>>>> I'm probably missing something here, but cleaning up (or at least
>>>>> commenting) the cuda/remd test folder seems worthwhile: there are
>>>>> folders that seem never to be used, and input files that seem to be in
>>>>> the wrong place.
>>>>>
>>>>> Marcela: I'd ignore these failures for now; something should get posted
>>>>> here that either fixes the problem, or figures out a problem with your
>>>>> inputs. (My money is on the former.)
>>>>>
>>>>> ...dac
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> AMBER mailing list
>>>>> AMBER.ambermd.org
>>>>> http://lists.ambermd.org/mailman/listinfo/amber
>>>>
>>>>
>>
>>
> <rem_2rep_gb.tar.gz>

Received on Wed Jul 03 2019 - 13:00:02 PDT