Re: [AMBER] pmemd.MPI fails to run

From: Ryan Novosielski <novosirj.ca.rutgers.edu>
Date: Sun, 28 Dec 2014 12:06:48 -0500

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think with an interactive job, that is to be expected. You're
running that mpirun process on one node and asking for 24 processes. I
don't know how it would ever get anything started on the other node.
Don't interactive jobs, more or less, simply block off the nodes and
leave you to execute any code yourself?
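
If the Open MPI on that machine was built without Torque (tm) support,
mpirun only knows about the node it was started on unless you hand it
the host list yourself. Purely as a sketch (the paths are the ones from
the setup quoted below, and $PBS_NODEFILE is the node list Torque
normally provides inside a job), the launch would look something like:

  export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
  export PATH=/usr/lib64/openmpi/bin:$PATH
  source /usr/local/amber14/amber.sh

  # hand the PBS-assigned node list to Open MPI explicitly so the
  # 24 ranks can be spread across both nodes (12 per node)
  mpirun -np 24 -hostfile $PBS_NODEFILE \
      pmemd.MPI -O -i min.in -o min.out \
      -p ../hep1_system_ETA_ETA_1.prmtop \
      -c ../hep1_system_ETA_ETA_1.prmcrd \
      -r min.rst -ref ../hep1_system_ETA_ETA_1.prmcrd

An Open MPI built with tm support would pick the allocation up from
Torque automatically, without -hostfile.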

On 12/28/2014 11:32 AM, Fabian Glaser wrote:
> Hi Jason,
>
> We have followed your suggestion to recompile Amber 14, but our
> systems people are still having problems compiling and running the
> parallel pmemd.MPI from Amber 14 across more than one node. I am
> forwarding you their detailed trials and questions below; we would
> highly appreciate your help and suggestions on what to do next:
>
> Thanks a lot, and happy new year,
>
> Fabian Glaser Technion, Israel
>
> ===
>
> Shalom Fabian,
>
> After several compilation attempts, Amber 14 has been compiled
> with OpenMPI 1.5.4 (the system default).
>
> The PBS script minz.q in your directory
> /u/fglaser/projects/IsraelVlodavsky/hep1_v2/MD/ETA_1/min contains
> the setup that provides access to the corresponding "mpirun"
> command. I have submitted the job on 2 nodes with
>
>> qsub -I minz.q
>
> - an interactive batch submission - to see the output after the
> following setup:
>
>> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>> export PATH=/usr/lib64/openmpi/bin:$PATH
>> source /usr/local/amber14/amber.sh
>
>> which mpirun
> mpirun is /usr/lib64/openmpi/bin/mpirun
>
>> mpirun -version
> mpirun (Open MPI) 1.5.4
>
>> which pmemd.MPI
> pmemd.MPI is /usr/local/amber14/bin/pmemd.MPI
>
>> ldd /usr/local/amber14/bin/pmemd.MPI
> linux-vdso.so.1 => (0x00002aaaaaaab000)
> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000039b4400000)
> libmpi_cxx.so.1 => /usr/lib64/openmpi/lib/libmpi_cxx.so.1 (0x00002aaaaaac1000)
> libmpi_f90.so.1 => /usr/lib64/openmpi/lib/libmpi_f90.so.1 (0x00002aaaaacdb000)
> libmpi_f77.so.1 => /usr/lib64/openmpi/lib/libmpi_f77.so.1 (0x00002aaaaaedf000)
> libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00000039b4000000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00000039b2400000)
> libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00000039b3400000)
> libm.so.6 => /lib64/libm.so.6 (0x00000039b1c00000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000039b3800000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039b2800000)
> libc.so.6 => /lib64/libc.so.6 (0x00000039b2000000)
> /lib64/ld-linux-x86-64.so.2 (0x00000039b1800000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x00000039b4c00000)
> libutil.so.1 => /lib64/libutil.so.1 (0x00000039b3c00000)
> libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x00002aaaab115000)
>
> the run command
>
>> mpirun -np 24 pmemd.MPI -O -i min.in -o min.out -p
>> ../hep1_system_ETA_ETA_1.prmtop -c
>> ../hep1_system_ETA_ETA_1.prmcrd -r min.rst -ref
>> ../hep1_system_ETA_ETA_1.prmcrd
>
> starts execution and eventually produces the min.out file. However,
> all 24 processes are executed on ONE NODE instead of being distributed
> as 12 processes on each of the TWO nodes (a quick placement check is
> sketched after the top output below).
>> qstat -1nu fglaser
>                                                              Req'd  Req'd   Elap
> Job ID          Username Queue    Jobname    SessID NDS TSK Memory   Time S   Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 1076133.tamnun  fglaser  amir_q   ETA_1_minz   6320   2  24     --  168:0 R  00:10   n101/0*12+n102/0*12
>
>> hostname
> n101
>
>> top
> top - 17:42:18 up 13 days, 2:51, 1 user, load average: 9.52, 2.31, 0.77
> Tasks: 378 total, 25 running, 353 sleeping, 0 stopped, 0 zombie
> Cpu(s): 65.8%us, 34.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem:  99063900k total,  3033516k used, 96030384k free,   221696k buffers
> Swap:  8191992k total,        0k used,  8191992k free,   864036k cached
>
>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  6452 fglaser  20  0  229m  57m  19m R 59.1  0.1  0:18.88 pmemd.MPI
>  6451 fglaser  20  0  230m  58m  20m R 58.4  0.1  0:17.26 pmemd.MPI
>  6445 fglaser  20  0  230m  58m  21m R 56.4  0.1  0:18.26 pmemd.MPI
>  6461 fglaser  20  0  230m  56m  18m R 56.4  0.1  0:18.21 pmemd.MPI
>  6449 fglaser  20  0  230m  57m  20m R 56.1  0.1  0:17.58 pmemd.MPI
>  6459 fglaser  20  0  230m  56m  19m R 56.1  0.1  0:18.00 pmemd.MPI
>  6460 fglaser  20  0  230m  55m  19m R 54.8  0.1  0:17.30 pmemd.MPI
>  6453 fglaser  20  0  229m  57m  20m R 52.5  0.1  0:17.63 pmemd.MPI
>  6444 fglaser  20  0  240m  70m  25m R 50.1  0.1  0:17.49 pmemd.MPI
>  6448 fglaser  20  0  230m  58m  20m R 50.1  0.1  0:16.04 pmemd.MPI
>  6462 fglaser  20  0  230m  55m  18m R 50.1  0.1  0:14.81 pmemd.MPI
>  6455 fglaser  20  0  230m  56m  19m R 49.8  0.1  0:15.50 pmemd.MPI
>  6457 fglaser  20  0  231m  55m  19m R 49.8  0.1  0:16.08 pmemd.MPI
>  6446 fglaser  20  0  230m  58m  21m R 49.5  0.1  0:16.58 pmemd.MPI
>  6464 fglaser  20  0  230m  56m  19m R 49.5  0.1  0:17.54 pmemd.MPI
>  6450 fglaser  20  0  230m  57m  20m R 48.1  0.1  0:14.53 pmemd.MPI
>  6447 fglaser  20  0  230m  58m  21m R 47.8  0.1  0:15.20 pmemd.MPI
>  6458 fglaser  20  0  230m  56m  19m R 47.1  0.1  0:14.37 pmemd.MPI
>  6454 fglaser  20  0  230m  56m  19m R 46.5  0.1  0:13.97 pmemd.MPI
>  6466 fglaser  20  0  230m  56m  19m R 45.8  0.1  0:14.74 pmemd.MPI
>  6456 fglaser  20  0  230m  56m  19m R 41.8  0.1  0:14.55 pmemd.MPI
>  6467 fglaser  20  0  230m  57m  20m R 41.8  0.1  0:14.36 pmemd.MPI
>  6463 fglaser  20  0  230m  55m  18m R 40.2  0.1  0:14.40 pmemd.MPI
>  6465 fglaser  20  0  230m  56m  19m R 39.8  0.1  0:14.84 pmemd.MPI
>    55 root     20  0     0    0    0 S  0.3  0.0  0:20.28 events/4
>  6570 fglaser  20  0 13396 1408  896 R  0.3  0.0  0:00.03 top
>     1 root     20  0 23588 1660 1312 S  0.0  0.0  0:02.66 init
>     2 root     20  0     0    0    0 S  0.0  0.0  0:00.01 kthreadd
>     3 root     RT  0     0    0    0 S  0.0  0.0  0:00.04 migration/0
> ............................................................................
>> hostname
> n102
>
>> top
> top - 17:46:09 up 13 days, 2:55, 1 user, load average: 0.07, 0.03, 0.00
> Tasks: 349 total, 1 running, 348 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem:  99063900k total,  1962460k used, 97101440k free,   220228k buffers
> Swap:  8191992k total,        0k used,  8191992k free,   746280k cached
>
>   PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>   182 root     39 19     0    0    0 S  0.3  0.0 47:28.09 kipmi0
>     1 root     20  0 23592 1660 1312 S  0.0  0.0  0:02.64 init
>     2 root     20  0     0    0    0 S  0.0  0.0  0:00.01 kthreadd
>     3 root     RT  0     0    0    0 S  0.0  0.0  0:00.04 migration/0
>     4 root     20  0     0    0    0 S  0.0  0.0  0:00.36 ksoftirqd/0
>     5 root     RT  0     0    0    0 S  0.0  0.0  0:00.00 migration/0
>     6 root     RT  0     0    0    0 S  0.0  0.0  0:00.80 watchdog/0
>     7 root     RT  0     0    0    0 S  0.0  0.0  0:00.75 migration/1
>     8 root     RT  0     0    0    0 S  0.0  0.0  0:00.00 migration/1
> ..............................................................................
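>
> A quick way to confirm the placement, independent of pmemd.MPI, is to
> launch a trivial command through the same mpirun (just a diagnostic
> sketch):
>
>    mpirun -np 24 hostname | sort | uniq -c   # counts how many ranks start on each node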
>
> As you can see, the parallel execution in this way is not effective.
> I suggest sending this output to the AMBER support forum and asking
> for their recommendations.
>
> To save time, I should mention that we have encountered an apparently
> similar problem with a couple of other applications; at the time it
> was solved by recompiling and running with Intel MPI. Can AMBER 14
> work with Intel MPI in general? So far our attempt to compile it with
> Intel MPI (Intel version 14.0.2) has failed.
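>
> For reference, my understanding is that Amber itself is not tied to a
> particular MPI: the parallel binaries are simply built against
> whichever MPI compiler wrappers (mpif90/mpicc) are first in PATH.
> Only as a sketch (the Open MPI path is our current setup; "intel"
> would replace "gnu" when building with the Intel compilers):
>
>    export AMBERHOME=/usr/local/amber14
>    export PATH=/usr/lib64/openmpi/bin:$PATH   # or the Intel MPI bin directory
>    cd $AMBERHOME
>    ./configure -mpi gnu                       # "intel" for an Intel-compiler build
>    make install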
>
> Any recommendations would be deeply appreciated.
>
> Regards,
> Yulia Halupovich
> Technion - CIS, TAMNUN Team
> phone: 972-4-8292654, fax: 972-4-8236212
> Reply-to: hpc.technion.ac.il
>
>
>
>
>
>
> _______________________________
> Fabian Glaser, PhD
>
> Technion - Israel Institute of Technology
> Haifa 32000, ISRAEL
>
> fglaser.technion.ac.il
> Tel: +972 4 8293701
> Fax: +972 4 8225153
>
>> On Dec 21, 2014, at 5:39 PM, Jason Swails
>> <jason.swails.gmail.com> wrote:
>>
>> On Sun, Dec 21, 2014 at 9:02 AM, Fabian Glaser
>> <fabian.glaser.gmail.com> wrote:
>>
>>> Hi Amber experts,
>>>
>>>
>>> We had pmemd.MPI (Amber 14) installed correctly and running, but
>>> after a disk was added to our cluster it fails to run. We try to
>>> run pmemd.MPI with the following setup:
>>>
>>>> source /usr/local/amber14/setup.csh
>>> which contains the following definitions
>>>
>>>> more /usr/local/amber14/setup.csh
>>> #!/bin/csh -f
>>> #
>>> # Setup for Amber 14
>>> #
>>> setenv AMBERHOME /usr/local/amber14
>>> setenv PATH $AMBERHOME/bin:$PATH
>>> setenv LD_LIBRARY_PATH $AMBERHOME/lib:$LD_LIBRARY_PATH
>>>
>>> and sets the MPI path as follows:
>>>
>>>> which mpirun
>>> /usr/local/amber14/bin/mpirun
>>>
>>> we get the following error message:
>>>
>>>> mpirun -np 12 pmemd.MPI
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> MPI version of PMEMD must be used with 2 or more processors!
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>>
>>> it seems that the "numtasks" parameter is not passed to the
>>> pmemd.MPI executable.
>>>
>>> We would greatly appreciate any help and are ready to provide
>>> any additional information regarding the AMBER 14 installation
>>> and compilation on our system.
>>>
>>> All the other non-MPI programs including pmemd or sander run
>>> fine.
>>>
>>
>> This happens when you compile Amber with one MPI and try to use
>> the "mpirun" from a different MPI. This is true of every MPI
>> program, not just Amber.
>>
>> You need to make sure that Amber is compiled with the same MPI
>> installation you intend on using to run it.
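>>
>> A quick way to check (just a sketch) is to compare where mpirun comes
>> from with the MPI library that pmemd.MPI is actually linked against:
>>
>>    which mpirun pmemd.MPI
>>    mpirun --version
>>    ldd $AMBERHOME/bin/pmemd.MPI | grep -i mpi   # should point into the
>>                                                 # same installation as mpirun
>>
>> If they belong to different installations, rebuild the parallel
>> binaries with that MPI's compiler wrappers on your PATH.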
>>
>> HTH,
>> Jason
>>
>> --
>> Jason M. Swails
>> BioMaPS, Rutgers University
>> Postdoctoral Researcher
>> _______________________________________________ AMBER mailing
>> list AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
> _______________________________________________ AMBER mailing list
> AMBER.ambermd.org http://lists.ambermd.org/mailman/listinfo/amber
>

- --
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj.rutgers.edu - 973/972.0922 (2x0922)
|| \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEARECAAYFAlSgOJwACgkQmb+gadEcsb6FtgCgm9r5X/J/ZM1u1QCnrjRuITN7
Jp0AnR6mvOfk86gV9DqaXLqlFdz2zl5K
=a2/P
-----END PGP SIGNATURE-----

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Dec 28 2014 - 09:30:02 PST