Re: [AMBER] crash of pmemd.cuda

From: Enrico Martinez via AMBER <amber.ambermd.org>
Date: Mon, 5 Sep 2022 09:45:26 +0200

Thanks a lot, David!
I believe this was due to the GPU overheating, since I could rerun this
job from the checkpoint (restart) file without any issues.
Cheers
Enrico

On Fri, Sep 2, 2022 at 22:19 David A Case
<david.case.rutgers.edu> wrote:
>
> On Fri, Sep 02, 2022, Enrico Martinez via AMBER wrote:
>
> >For the second time I have had a crash of pmemd.cuda after 370 ns of the
> >production run, without any apparent reason for it. I checked the system
> >and did not observe any instabilities (such as a jump in RMSD or other
> >parameters).
> >
> >In the nohup log I could find only this:
> >
> >/home/enrico/Desktop/test_md/NMR/output_testDIM77/MDtest/dim_310K/runjob_gpu.sh:
> >line 24: 1625945 Killed pmemd.cuda -O -i
> >./in/md_production_310K.in -o prod_.out -p protein.prmtop -c
> >equil1p.rst -ref equil1p.rst -r prod_310K.rst -x prod_310K.netcdf
> >
> >Note that at that moment two different simulations were running on 2
> >separate GPUs, and both of them were killed.
>
> This has some of the characteristics of a hardware failure, but it's hard to
> be sure. Consider using shorter runs, and restarting more often. Then, if
> the system fails at 370 nsec, you can restart shortly before the failure, to
> see if bad things still happen at 370 nsec. Ideally, you could get a
> restart close enough to the failure that you could compare GPU and CPU runs
> (or compare cuda_SPDP with cuda_DPFP, etc.)
>
> ....dac
>

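A minimal sketch of the segmented-restart scheme David describes above, assuming
nstlim in md_production_310K.in has been reduced to the per-segment length; the
segment count and the *_seg* file names are illustrative, not taken from the
original runjob_gpu.sh:

#!/bin/bash
# Sketch only: run the production as a chain of short segments so that a crash
# near 370 ns can be reproduced by rerunning just the segment that failed.
set -e

prev=equil1p.rst
for i in $(seq 1 40); do          # e.g. 40 short segments instead of one long run
    pmemd.cuda -O -i ./in/md_production_310K.in \
        -o prod_310K_seg${i}.out -p protein.prmtop \
        -c ${prev} -ref equil1p.rst \
        -r prod_310K_seg${i}.rst -x prod_310K_seg${i}.netcdf
    prev=prod_310K_seg${i}.rst    # next segment restarts from this one
done

# If segment N fails, rerun it from the previous segment's restart file with
# pmemd.cuda_DPFP (or the CPU pmemd) and compare against the GPU output.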
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon Sep 05 2022 - 01:00:03 PDT