Hi Marcela,
Our lab also uses a Slurm queuing system. We use the script below, which is
similar to yours, to submit 2-replica REMD jobs to a single node.
#!/bin/bash
#SBATCH -N 1                  # one node
#SBATCH --tasks-per-node 2    # one MPI task per replica
#SBATCH --gres=gpu:2          # one GPU per replica
mpirun -np 2 /opt/amber/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupremd
So I do not see anything wrong with your submission script. Since your CPU
jobs run fine, I think there might be an issue with the way the GPUs are
configured on your cluster. Note: CPU REMD jobs require 2 CPUs per replica,
whereas GPU REMD jobs require only 1 GPU per replica.
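For comparison, a CPU REMD submission of the same two replicas would need 4
MPI tasks (2 CPUs per replica). A minimal sketch, assuming pmemd.MPI lives in
the same /opt/amber/bin as above (adjust paths and counts for your cluster):
#!/bin/bash
#SBATCH -N 1
#SBATCH --tasks-per-node 4    # 2 replicas x 2 CPUs per replica
mpirun -np 4 /opt/amber/bin/pmemd.MPI -O -ng 2 -groupfile groupremd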
I just ran the test cases with 2 and 4 replicas, and they all pass for me.
If you are having issues with the test cases, I think something might be
wrong with the way the files are being sourced. I don't think it is a
compiler issue either: we use the GNU compilers on our cluster and all tests
pass for us.
Can you run the test case inside the directory that David Case pointed
out? There is a Run.rem.sh file inside the
AMBERHOME/test/cuda/remd/rem_2rep_gb directory. Executing that file should
not give the error message that the input files were not found. If this
doesn't work, can you post the error the user got? They might have hit a
different error than the input files not being found.
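For example, something like this from an interactive GPU session should
reproduce (or avoid) the error; DO_PARALLEL is the launcher variable the
Amber test scripts use, so adjust it to match your MPI and queue:
cd $AMBERHOME/test/cuda/remd/rem_2rep_gb
export DO_PARALLEL="mpirun -np 2"
./Run.rem.sh    # some Amber versions expect a precision argument here, e.g. SPFP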
David Case: I looked at the files in the test/cuda/remd folder. rem_gb_2rep,
rem_gb_4rep, and rem_wat are not used at all. Deleting those folders did not
affect any of the test cases; they all passed.
Best,
Koushik
Carlos Simmerling Lab
On Tue, Jul 2, 2019 at 9:57 AM Marcela Madrid <mmadrid.psc.edu> wrote:
> hi Dave,
>
> Thanks for your answer. It is not just a problem with the test examples;
> it is a problem whenever we try to run REMD on the GPUs on Bridges at the
> PSC.
> The reason I am looking at it is that a user wants to run it. REMD on the
> CPUs works fine (with the corresponding executable, of course); it is only
> a problem with the GPUs. So it occurred to me to check whether it passes
> the tests and whether we get the same error messages. The user has her
> input files in the directory where she runs.
>
> I think it is either a problem with the configuration of the GPU nodes on
> Bridges or a bug.
> Each Bridges node has 24 cores and 2 P100 GPUs. I have asked for 1 node,
> ntasks-per-node=2, and the 2 GPUs,
> but I get the error message about not finding the input files.
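> For reference, the request looks roughly like this (executable path
> abbreviated; partition and account flags omitted):
> #!/bin/bash
> #SBATCH -N 1
> #SBATCH --ntasks-per-node=2
> #SBATCH --gres=gpu:2
> mpirun -np 2 $AMBERHOME/bin/pmemd.cuda.MPI -O -ng 2 -groupfile groupfile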
> Amber on GPUs was compiled with
> ./configure -cuda -mpi gnu
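> That was part of roughly the standard sequence, the serial CUDA build
> first and then the MPI build that REMD needs:
> ./configure -cuda gnu && make install
> ./configure -cuda -mpi gnu && make install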
> Attempting to compile with the Intel compilers instead of GNU gave the
> following error messages:
>
> O3 -ccbin icpc -o cuda_mg_wrapper.o -c cuda_mg_wrapper.cu
> In file included from
> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/host_config.h(50),
> from
> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/cuda_runtime.h(78),
> from cuda_mg_wrapper.cu(0):
> /opt/packages/cuda/9.2/bin/../targets/x86_64-linux/include/crt/host_config.h(79):
> error: #error directive: -- unsupported ICC configuration! Only
> ICC 15.0, ICC 16.0, and ICC 17.0 on Linux x86_64 are supported!
> #error -- unsupported ICC configuration! Only ICC 15.0, ICC 16.0, and
> ICC 17.0 on Linux x86_64 are supported!
>
> We do not have such old versions of the compilers. Any hints as to how to
> run REMD on the GPUs would be appreciated. Thanks so much,
>
> Marcela
>
>
> > On Jul 2, 2019, at 9:01 AM, David A Case <david.case.rutgers.edu> wrote:
> >
> > On Mon, Jul 01, 2019, Marcela Madrid wrote:
> >
> >>> Two replica GB REMD test.
> >>>
> >>>
> >>> Unit 5 Error on OPEN: rem.in.001
> >
> > OK: query for the REMD experts: in AMBERHOME/test/cuda/remd there are
> > two directories: rem_2rep_gb and rem_gb_2rep. The rem.in.00? files are
> > in the former, but the tests actually get run in the latter directory.
> >
> > Same general problem for rem_2rep_pme: the needed rem.in.00? files are
> > in rem_wat_2 (or maybe in rem_wat).
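> > (For example, comparing
> >   ls $AMBERHOME/test/cuda/remd/rem_2rep_gb/rem.in.00*
> >   ls $AMBERHOME/test/cuda/remd/rem_gb_2rep/
> > makes the GB-case mismatch easy to see.)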
> >
> > I'm probably missing something here, but cleaning up (or at least
> > commenting) the cuda/remd test folder seems worthwhile: there are
> > folders that seem never to be used, and input files that seem to be in
> > the wrong place.
> >
> > Marcela: I'd ignore these failures for now; something should get posted
> > here that either fixes the problem, or figures out a problem with your
> > inputs. (My money is on the former.)
> >
> > ...dac
> >
> >
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber