Re: [AMBER] Running AmberTools21 on HPC cluster using distributed memory

From: Manuel Fernandez Merino <manuel.fernandez.crg.eu>
Date: Thu, 13 Jan 2022 14:54:46 +0000

Hello,

Yes, I did read everything. Sorry, I was not very clear at the end of my previous email: I meant that I was also going to try skipping the parallel nab tests (I did, and that part of the testing was skipped, but the rest of the results were the same). Indeed, I do not plan to run nab or cpptraj on the HPC system. At the moment I am mostly interested in getting sander.MPI to run on the cluster, and as soon as I get the Amber license I plan to use pmemd. However, I figured it would make sense to check that the tests pass on the HPC system while I wait for Amber and the pmemd features.

I think that your guess about calling mpirun within mpirun might be right. Indeed, in the job script I defined DO_PARALLEL="mpirun -np 2" to run the parallel test. My queue submission script also includes mpirun, following what I understood had to be done to launch a job using the MPI environment in the cluster.

I am in contact with my cluster support service to see if they can help me figure this out (though they have not been very quick to answer, I have to say). I was not sure whether it was a problem with my installation of the software, so I decided to use the mailing list. Now I am almost 100% sure that I am mostly dealing with a problem in how I submit the job.

Best regards and thank you,
Manuel

-----Original Message-----
From: David A Case <david.case.rutgers.edu>
Sent: Thursday, January 13, 2022 3:18 PM
To: AMBER Mailing List <amber.ambermd.org>
Subject: Re: [AMBER] Running AmberTools21 on HPC cluster using distributed memory

On Thu, Jan 13, 2022, Manuel Fernandez Merino wrote:
>
>Thanks a lot for your answer.

I hope you read all of it, especially the part about asking for local help.
Some person knowledgeable about your cluster, and with access to it, might be able to fix many of these things, but it's quite hard to do remotely.

>mpirun does not support recursive calls

This sounds like you are calling mpirun within mpirun. My guess is that you have "mpirun" in your DO_PARALLEL variable, and your queueing system itself is invoking mpirun (via a command like "srun" or similar). You can see how hard it is to figure things out based only on the final error.
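The situation above can be sketched with a hypothetical SLURM-style submit script (the scheduler, its directives, and the task count are assumptions; adapt them to your site). The point is that only one layer may start MPI ranks:

```shell
#!/bin/bash
# Hypothetical SLURM submit script (sketch, not site-specific advice):
# DO_PARALLEL carries the one and only mpirun invocation, so the test
# command below must NOT itself be wrapped in mpirun or srun.
#SBATCH --job-name=amber-test
#SBATCH --ntasks=2

export DO_PARALLEL="mpirun -np 2"   # mpirun appears here, once
cd $AMBERHOME/test
make test.parallel                  # plain invocation; the Run.* test
                                    # scripts prepend $DO_PARALLEL themselves
```

Wrapping `make test.parallel` (or the individual Run scripts) in a second `mpirun`/`srun` layer is what produces a "recursive calls" style error.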

Are you really planning to run nab programs in parallel on a high-performance cluster via a job submission script? Same question for cpptraj.MPI. If not, consider changing your test strategy. Many (most?) Amber users would use a big cluster for pmemd jobs, and do the rest of preparation and analysis on a local PC. I gave suggestions for that option in my earlier email.

Also, can you run short parallel jobs via the command line, without submitting them to the queueing system? Run a few tests, e.g. Run.dhfr.
If the codes work, you can ask a local administrator for help with the job submission system.
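For example, a single short sander.MPI test can be run interactively along these lines (a sketch assuming a standard AmberTools layout, with AMBERHOME already pointing at the installation):

```shell
# Sketch: run one short parallel test from an interactive shell,
# bypassing the queueing system entirely.
source $AMBERHOME/amber.sh          # set up the Amber environment
export DO_PARALLEL="mpirun -np 2"   # how the test scripts launch MPI jobs
cd $AMBERHOME/test/dhfr             # a short sander test case
./Run.dhfr                          # should report PASSED on success
```

If this works interactively but fails under the queueing system, the problem is in the submission script, not the Amber build.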

...good luck....dac


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber

Received on Thu Jan 13 2022 - 07:00:02 PST