Hi, this looks like a problem with the MPI and/or UCX library. Try running a test job with ORCA alone, using more than 1 CPU, and check whether that works.
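For example, something along these lines (a minimal sketch only: it reuses the ORCA/OpenMPI paths from your Slurm script below, the input geometry and file names are placeholders, and it should be run inside an allocation with at least 8 cores):

#!/bin/bash
# Standalone ORCA parallel smoke test (hypothetical file names)
export PATH=/apps/gcc/12.2.0/openmpi/4.1.1/orca/5.0.4:$PATH
export PATH=/apps/mpi/gcc/12.2.0/openmpi/4.1.1/bin:$PATH
export LD_LIBRARY_PATH=/apps/gcc/12.2.0/openmpi/4.1.1/orca/5.0.4:/apps/mpi/gcc/12.2.0/openmpi/4.1.1/lib:$LD_LIBRARY_PATH

# Small energy+gradient input mirroring the &orc settings (BP86/SV(P), 8 processes, 2000 MB per core)
cat > test.inp << 'EOF'
! BP86 SV(P) EnGrad
%pal nprocs 8 end
%maxcore 2000
* xyz 0 1
O   0.000000   0.000000   0.000000
H   0.758602   0.000000   0.504284
H  -0.758602   0.000000   0.504284
*
EOF

# For a parallel run, ORCA has to be called with its full path so its internal mpirun calls resolve correctly
/apps/gcc/12.2.0/openmpi/4.1.1/orca/5.0.4/orca test.inp > test.out

If that standalone run reproduces the same UCX shared-memory errors, the problem is in the ORCA/OpenMPI/UCX installation rather than in the AMBER interface.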
Best,
Martin
Sent from Outlook for Android <https://aka.ms/AAb9ysg>
________________________________
From: Ramdhan,Peter A via AMBER <amber.ambermd.org>
Sent: Friday, October 11, 2024 3:08:47 PM
To: AMBER Mailing List <amber.ambermd.org>
Subject: [AMBER] AMBER-ORCA Interface
Hi everyone,
I have a question about using QM/MM with ORCA as the external program for AMBER. When I run this calculation in serial it works fine, but when I run ORCA in parallel it fails after a couple of steps. Does anyone have experience with this?
Here is my mdin file:
&cntrl
 imin = 0,          ! Perform MD, not minimization
 irest = 1,         ! Restart simulation from previous run
 ntx = 5,           ! Coordinates and velocities from the restart file
 nstlim = 100,      ! Number of MD steps (100 x 0.5 fs = 0.05 ps)
 dt = 0.0005,        ! Time step in picoseconds
 cut = 8.0,         ! Non-bonded cutoff in angstroms
 ntr = 0,           ! No positional restraints
 restraint_wt = 0.0, ! Weight of restraint (no restraints applied)
 ntb = 2,           ! Constant pressure periodic boundary conditions
 ntp = 1,           ! Isotropic position scaling (NPT ensemble)
 barostat = 1,      ! Berendsen pressure control
 ntc = 2,           ! SHAKE on bonds involving hydrogen
 ntf = 2,           ! Bond interactions with hydrogens excluded
 ntt = 3,           ! Langevin thermostat
 gamma_ln = 5.0,    ! Collision frequency for Langevin dynamics
 tempi = 310,    ! Initial temperature
 temp0 = 310,    ! Target temperature
 ioutfm = 1,        ! Write binary trajectory file
 ntpr = 1,        ! Print energy information every step
 ntwx = 1,        ! Write coordinates to trajectory file every step
 ntwr = 1,        ! Write restart file every step
 ifqnt=1,           ! Turn on QM/MM
/
&qmmm
 qmmask=':CYP.SG,CB|:HEM.FE,O1,NA,NB,NC,ND,C1C,C2C,C3C,C4C,CHD,HHD,C1D,C2D,C3D,C4D,CHA,HHA,C1A,C2A,C3A,C4A,CHB,HHB,C1B,C2B,C3B,C4B,CHC,HHC',   ! QM region: CYP SG/CB and HEM atoms (residues 1 and 465)
 qmmm_int=1,       !
 qm_theory='EXTERN',   !
 qmcharge=-2,
 spin=4,
 qmshake=0,
 qm_ewald = 0,
 qm_pme=0,
/
&orc
 method = 'bp86',
 basis = 'sv(p)',
 num_threads=8,
 maxcore=2000,
/
&wt
 type='END'
&end
And here is my Slurm file:
#!/bin/bash
#SBATCH --job-name=clop_qm
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --time=4-00:00:00
#SBATCH --output=job.%j.out
#SBATCH --error=job.%j.err
module purge
ml gcc
#ml openmpi/4.1.1
export PATH=/apps/gcc/12.2.0/openmpi/4.1.1/orca/5.0.4:$PATH
export LD_LIBRARY_PATH=/apps/gcc/12.2.0/openmpi/4.1.1/orca/5.0.4:$LD_LIBRARY_PATH
export PATH=/apps/mpi/gcc/12.2.0/openmpi/4.1.1/bin:$PATH
export LD_LIBRARY_PATH=/apps/mpi/gcc/12.2.0/openmpi/4.1.1/lib:$LD_LIBRARY_PATH
source $AMBERHOME/amber.sh
$AMBERHOME/bin/sander -O -i step8_qm.mdin -o step8_qm.out -p com.parm7 -c step6.ncrst -r step7_qm.ncrst -x step7_qm.nc -ref step6.ncrst -inf step7_qm.info
I am encountering this error after a couple of steps:
-------------------------   --------------------
FINAL SINGLE POINT ENERGY     -2762.899076990582
-------------------------   --------------------
[1728651922.027067] [c0800a-s17:1471414:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471416/fd/71 flags=0x0) failed: No such file or directory
[1728651922.028197] [c0800a-s17:1471414:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471416/fd/71 flags=0x0) failed: No such file or directory
[1728651922.028735] [c0800a-s17:1471414:0]         mm_sysv.c:59   UCX  ERROR   shmat(shmid=3145743) failed: Invalid argument
[1728651922.028742] [c0800a-s17:1471414:0]           mm_ep.c:189  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x30000f: Shared memory error
[1728651922.025391] [c0800a-s17:1471415:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471416/fd/71 flags=0x0) failed: No such file or directory
[1728651922.026517] [c0800a-s17:1471415:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471416/fd/71 flags=0x0) failed: No such file or directory
[1728651922.027067] [c0800a-s17:1471415:0]         mm_sysv.c:59   UCX  ERROR   shmat(shmid=3145743) failed: Invalid argument
[1728651922.027073] [c0800a-s17:1471415:0]           mm_ep.c:189  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x30000f: Shared memory error
[1728651922.032593] [c0800a-s17:1471417:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471426/fd/71 flags=0x0) failed: No such file or directory
[1728651922.033773] [c0800a-s17:1471417:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471426/fd/71 flags=0x0) failed: No such file or directory
[1728651922.034341] [c0800a-s17:1471417:0]         mm_sysv.c:59   UCX  ERROR   shmat(shmid=3145746) failed: Invalid argument
[1728651922.034347] [c0800a-s17:1471417:0]           mm_ep.c:189  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x300012: Shared memory error
[1728651922.030778] [c0800a-s17:1471419:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471426/fd/71 flags=0x0) failed: No such file or directory
[1728651922.031930] [c0800a-s17:1471419:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471426/fd/71 flags=0x0) failed: No such file or directory
[1728651922.032472] [c0800a-s17:1471419:0]         mm_sysv.c:59   UCX  ERROR   shmat(shmid=3145746) failed: Invalid argument
[1728651922.032479] [c0800a-s17:1471419:0]           mm_ep.c:189  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x300012: Shared memory error
[1728651922.025267] [c0800a-s17:1471420:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471426/fd/71 flags=0x0) failed: No such file or directory
[1728651922.026408] [c0800a-s17:1471420:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471426/fd/71 flags=0x0) failed: No such file or directory
[1728651922.026952] [c0800a-s17:1471420:0]         mm_sysv.c:59   UCX  ERROR   shmat(shmid=3145746) failed: Invalid argument
[1728651922.026959] [c0800a-s17:1471420:0]           mm_ep.c:189  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x300012: Shared memory error
[1728651922.032055] [c0800a-s17:1471413:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471416/fd/71 flags=0x0) failed: No such file or directory
[1728651922.033186] [c0800a-s17:1471413:0]        mm_posix.c:234  UCX  ERROR   open(file_name=/proc/1471416/fd/71 flags=0x0) failed: No such file or directory
[1728651922.033720] [c0800a-s17:1471413:0]         mm_sysv.c:59   UCX  ERROR   shmat(shmid=3145743) failed: Invalid argument
[1728651922.033727] [c0800a-s17:1471413:0]           mm_ep.c:189  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x30000f: Shared memory error
ORCA finished by error termination in SCF gradient
Calling Command: mpirun -np 8  /apps/gcc/12.2.0/openmpi/4.1.1/orca/5.0.4/orca_scfgrad_mpi orc_job.scfgrad.inp orc_job
[file orca_tools/qcmsg.cpp, line 465]:
  .... aborting the run
Am I not allocating enough memory? According to the job file, ORCA uses about 200-300 MB per step, and since maxcore is the memory in MB allocated per CPU (ntasks is 8 and cpus-per-task is 1, so 8 processes total), I figured a maxcore of 2000 would be enough.
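(Working through the numbers, assuming maxcore applies per ORCA MPI process: 8 x 2000 MB = 16 GB for ORCA, versus the 8 x 4 GB = 32 GB requested from Slurm, so on paper the allocation should be sufficient.)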
Sincerely,
Peter Ramdhan, PharmD
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber