[AMBER] pmemd24 problems with RTX5090

From: Oscar Conchillo-Solé via AMBER <amber.ambermd.org>
Date: Tue, 15 Jul 2025 09:48:54 +0200

Dear Amber people

We've just acquired an Nvidia RTX5090, which we plan to use mainly with
Amber. We already have a 3080 and a 4090, and both work great with Amber.

#### Summary of the following long mail ######
Amber24 crashes when running pmemd.cuda (pmemd.cuda_SPFP) with this
error in stderr:
of length = 42Failed an illegal memory access was encountered
#### END of summary #####

However, on the computer with the Nvidia RTX5090 I could not manage to
run Amber23: neither the build we already had worked, nor could we
recompile it, due to a Miniconda failure with broken dependencies.
Since I saw that there is a new version of Amber, I went for it.

The first thing I noticed is that after executing:
./update_amber --update
I got this error:
Downloading updates for Amber 24
Downloading Amber 24/update.1 (1.10 KB)
Applying Amber 24/update.1
Downloading Amber 24/update.2 (317.64 KB)
Downloading: [::::::::::::::::::::::::::::::::::::::::::::::::::] 100.0%
Done.
Applying Amber 24/update.2
PatchingError: .patches/Amber24_Unapplied_Patches/update.2 failed to apply. No changes made from this patch
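
A dry run of the failing patch should show which hunks reject (just a
sketch; the patch path is the one from the error above, and the -p0
strip level is an assumption about how the patch was generated):

cd <amber24 source tree>    # the directory update_amber was run from
patch -p0 --dry-run < .patches/Amber24_Unapplied_Patches/update.2

The --dry-run flag makes patch report the rejects without changing any
files.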

(All went well for AmberTools24.)

I decided to keep going to see what happened, and I managed to compile
the code. This was my cmake command:

cmake ../pmemd24_src \
    -DCMAKE_INSTALL_PREFIX=/sharelab/labapps/AMBER/AMBER25/Amber25-GPU_MPI \
    -DCOMPILER=GNU -DMPI=TRUE -DOPENMP=TRUE -DCUDA=TRUE \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.8 -DINSTALL_TESTS=TRUE \
    -DDOWNLOAD_MINICONDA=FALSE -DBUILD_PYTHON=FALSE -DBUILD_PERL=FALSE \
    -DBUILD_GUI=FALSE -DPMEMD_ONLY=TRUE -DCHECK_UPDATES=FALSE


There were lots of warnings, but it did compile.

(CUDA 12.6 does not support Blackwell GPUs; I tried it anyway and,
indeed, it did not work.)
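
For reference, Blackwell GeForce cards report compute capability 12.0,
which is why CUDA 12.8 is required; this can be double-checked with
nvidia-smi (assuming a driver new enough to support the compute_cap
query field):

nvidia-smi --query-gpu=name,compute_cap --format=csv
# expected output would be something like:
# name, compute_cap
# NVIDIA GeForce RTX 5090, 12.0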

make test.serial and make test.cuda.serial went well, but I realized
that test.cuda.serial uses PREC_MODEL = DPFP, i.e. "bin/pmemd.cuda_DPFP"
as the executable.

Last lines of make test.cuda.serial:
Finished CUDA test suite for Amber 24 at Fri Jul  4 11:56:53 CEST 2025.

273 file comparisons passed
15 file comparisons failed (8 of which can be ignored)
4 tests experienced errors


These were the errors:

==============================================================
cd tip4pew/ && ./Run.tip4pew_box_npt  DPFP yes
ERROR: Calculation halted.  Periodic box dimensions have changed too much
from their initial values.
   Your system density has likely changed by a large amount, probably from
   starting the simulation from a structure a long way from equilibrium.

   [Although this error can also occur if the simulation has blown up
   for some reason]

   The GPU code does not automatically reorganize grid cells and thus you
   will need to restart the calculation from the previous restart file.
   This will generate new grid cells and allow the calculation to continue.
   It may be necessary to repeat this restarting multiple times if your system
   is a long way from an equilibrated density.

   Alternatively you can run with the CPU code until the density has converged
   and then switch back to the GPU code.

   ./Run.tip4pew_box_npt:  Program error
make[1]: [Makefile:198: test.pmemd.cuda.pme] Error 1 (ignored)
cd tip4pew/ && ./Run.tip4pew_oct_nvt  DPFP yes
diffing mdout.tip4pew_oct_nvt.GPU_DPFP with mdout.tip4pew_oct_nvt
PASSED
==============================================================

==============================================================
cd gamd/PPIGaMD && ./Run.ppigamd DPFP yes
ERROR: Calculation halted.  Periodic box dimensions have changed too much
from their initial values.
   Your system density has likely changed by a large amount, probably from
   starting the simulation from a structure a long way from equilibrium.

   [Although this error can also occur if the simulation has blown up
   for some reason]

   The GPU code does not automatically reorganize grid cells and thus you
   will need to restart the calculation from the previous restart file.
   This will generate new grid cells and allow the calculation to continue.
   It may be necessary to repeat this restarting multiple times if your system
   is a long way from an equilibrated density.

   Alternatively you can run with the CPU code until the density has converged
   and then switch back to the GPU code.

   ./Run.ppigamd:  Program error
make[1]: [Makefile:227: test.pmemd.cuda.pme.sgamd] Error 1 (ignored)
#Begin tests
------------------------------------
Running CUDA Virtual Site tests.
cd virtual_sites/tip4p && ./Run.ec DPFP
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
diffing etot.type5_DPFP.save with etot.type5
PASSED
==============================================================


==============================================================
STOP PMEMD Terminated Abnormally!
   ./Run.npt:  Program error
make[1]: [Makefile:424: test.pmemd.cuda.VirtualSites] Error 1 (ignored)
cd virtual_sites/tip5p && ./Run.ec DPFP
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
diffing etot.type8_DPFP.save with etot.type8
PASSED
==============================================================

==============================================================
cd virtual_sites/BromoBenzene && ./Run.npt DPFP
ERROR: Calculation halted.  Periodic box dimensions have changed too much
from their initial values.
   Your system density has likely changed by a large amount, probably from
   starting the simulation from a structure a long way from equilibrium.

   [Although this error can also occur if the simulation has blown up
   for some reason]

   The GPU code does not automatically reorganize grid cells and thus you
   will need to restart the calculation from the previous restart file.
   This will generate new grid cells and allow the calculation to continue.
   It may be necessary to repeat this restarting multiple times if your system
   is a long way from an equilibrated density.

   Alternatively you can run with the CPU code until the density has converged
   and then switch back to the GPU code.

   ./Run.npt:  Program error
make[1]: [Makefile:430: test.pmemd.cuda.VirtualSites] Error 1 (ignored)
cd virtual_sites/DimethylEther && ./Run.ec DPFP
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DENORMAL
diffing etot.type8_DPFP.save with etot.type8
PASSED
==============================================================

But since there are only 4 of them, and the comparisons around them are
reported as "PASSED", I have a strong feeling that I can ignore them.

However, since the pmemd.cuda that gets installed is a symlink to
bin/pmemd.cuda_SPFP (as in previous versions), I also executed, inside
the test directory:
./test_amber_cuda_serial.sh SPFP

In this case it reported many errors:
Finished CUDA test suite for Amber 24 at Fri Jul  4 12:11:47 CEST 2025.

146 file comparisons passed
44 file comparisons failed (9 of which can be ignored)
101 tests experienced errors

I can see in the logfile that from the line that says "Running CUDA GTI
free energy tests." onwards, all the tests fail.
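
To pull the failing cases out of the logfile, something like the
following can be used (a sketch; substitute the logfile name that the
test script prints at the end; "possible FAILURE" and "Program error"
are the markers the Amber test harness writes):

grep -n "possible FAILURE\|Program error" <logfile>   # list failures
grep -c "Program error" <logfile>                     # count hard errors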

Apart from that, we also ran two MD simulations that we had previously
run on a computer with an Nvidia 4090, with the same drivers.
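
For reference, the runs below use the standard pmemd.cuda command line;
a minimal sketch, where all the file names are placeholders:

pmemd.cuda -O -i md.in -p system.prmtop -c system.rst7 \
           -o md.out -r md.rst7 -x md.nc -inf mdinfo

(-O overwrites existing outputs; -inf names the mdinfo file discussed
below.)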

* 1st simulation:
The first one completed without error in both cases, but we observed
some events we think are worth noting:

4090 SPFP:
     time: real 14m13.784s; Total wall time: 854 seconds (0.24 hours)
     the following message appeared in stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
     final mdinfo:
 NSTEP =   950000   TIME(PS) =   16450.000  TEMP(K) =   304.24  PRESS =    31.5
 Etot   =   -221832.0665  EKtot   =     74523.8516  EPtot      =   -296355.9181
 BOND   =      5825.3681  ANGLE   =     22477.4361  DIHED      =     13736.1496
 UB     =         0.0000  IMP     =         0.0000  CMAP       =       401.9350
 1-4 NB =      5548.2509  1-4 EEL =    -26704.3318  VDWAALS    =     11817.9771
 EELEC  =   -329679.5838  EHBOND  =         0.0000  RESTRAINT  =       220.8807
 EAMBER (non-restraint)  =   -296576.7988
 EKCMT  =     21371.2401  VIRIAL  =     20620.6276  VOLUME     =   1103684.8218
                                                    SURFTEN    =      -147.0103
                                                    Density    =         1.0234
 ------------------------------------------------------------------------------
| Current Timing Info
| -------------------
| Total steps:   1000000 | Completed:    950000 ( 95.0%) | Remaining:     50000
|
| Average timings for last   75000 steps:
|     Elapsed(s) =      63.72 Per Step(ms) =       0.85
|         ns/day =     101.69   seconds/ns =     849.65
|
| Average timings for all steps:
|     Elapsed(s) =     808.09 Per Step(ms) =       0.85
|         ns/day =     101.57   seconds/ns =     850.62
|
| Estimated time remaining:      42.5 seconds.
 ------------------------------------------------------------------------------


4090 DPFP:
     time: real 326m8.975s; Total wall time: 19569 seconds (5.44 hours)
     the following message appeared in stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
     final mdinfo:
 NSTEP =  1000000   TIME(PS) =   16500.000  TEMP(K) =   304.20  PRESS =    98.9
 Etot   =   -221528.4593  EKtot   =     74512.1291  EPtot      =   -296040.5883
 BOND   =      5783.8532  ANGLE   =     22242.1897  DIHED      =     13737.2298
 UB     =         0.0000  IMP     =         0.0000  CMAP       =       389.4306
 1-4 NB =      5605.6012  1-4 EEL =    -26765.6143  VDWAALS    =     11568.7499
 EELEC  =   -328834.0661  EHBOND  =         0.0000  RESTRAINT  =       232.0378
 EAMBER (non-restraint)  =   -296272.6262
 EKCMT  =     21224.3892  VIRIAL  =     18867.0417  VOLUME     =   1103916.7228
                                                    SURFTEN    =        10.9442
                                                    Density    =         1.0231

|  Final Performance Info:
|     -----------------------------------------------------
|     Average timings for last    5000 steps:
|     Elapsed(s) =      97.90 Per Step(ms) =      19.58
|         ns/day =       4.41   seconds/ns =   19579.20
|
|     Average timings for all steps:
|     Elapsed(s) =   19566.12 Per Step(ms) =      19.57
|         ns/day =       4.42   seconds/ns =   19566.12
|     -----------------------------------------------------

5090 SPFP:
     time: real 12m0.566s; Total wall time: 720 seconds (0.20 hours)
     the following message appeared in stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
     final mdinfo:
 NSTEP =   940000   TIME(PS) =   16440.000  TEMP(K) =   302.39  PRESS =    38.1
 Etot   =   -222309.6997  EKtot   =     74070.1875  EPtot      =   -296379.8872
 BOND   =      5820.7564  ANGLE   =     22301.2271  DIHED      =     13729.4669
 UB     =         0.0000  IMP     =         0.0000  CMAP       =       404.1852
 1-4 NB =      5591.7641  1-4 EEL =    -26747.9411  VDWAALS    =     11968.3334
 EELEC  =   -329676.0689  EHBOND  =         0.0000  RESTRAINT  =       228.3895
 EAMBER (non-restraint)  =   -296608.2767
 EKCMT  =     21188.2013  VIRIAL  =     20278.1851  VOLUME     =   1105365.1992
                                                    SURFTEN    =        37.9072
                                                    Density    =         1.0218
 ------------------------------------------------------------------------------
| Current Timing Info
| -------------------
| Total steps:   1000000 | Completed:    940000 ( 94.0%) | Remaining:     60000
|
| Average timings for last   85000 steps:
|     Elapsed(s) =      61.10 Per Step(ms) =       0.72
|         ns/day =     120.19   seconds/ns =     718.83
|
| Average timings for all steps:
|     Elapsed(s) =     676.44 Per Step(ms) =       0.72
|         ns/day =     120.06   seconds/ns =     719.62
|
| Estimated time remaining:      43.2 seconds.
 ------------------------------------------------------------------------------

5090 DPFP:
     time: real 251m50.459s; Total wall time: 15110 seconds (4.20 hours)
     final mdinfo:
 NSTEP =  1000000   TIME(PS) =   16500.000  TEMP(K) =   302.87  PRESS =   -34.9
 Etot   =   -221858.7506  EKtot   =     74185.9438  EPtot      =   -296044.6944
 BOND   =      5945.5673  ANGLE   =     22373.4519  DIHED      =     13670.1717
 UB     =         0.0000  IMP     =         0.0000  CMAP       =       400.3730
 1-4 NB =      5552.6117  1-4 EEL =    -26732.7382  VDWAALS    =     11747.7378
 EELEC  =   -329234.9688  EHBOND  =         0.0000  RESTRAINT  =       233.0991
 EAMBER (non-restraint)  =   -296277.7935
 EKCMT  =     21040.2882  VIRIAL  =     21873.4599  VOLUME     =   1105929.0761
                                                    SURFTEN    =        58.1391
                                                    Density    =         1.0213

|  Final Performance Info:
|     -----------------------------------------------------
|     Average timings for last    5000 steps:
|     Elapsed(s) =      75.16 Per Step(ms) =      15.03
|         ns/day =       5.75   seconds/ns =   15032.95
|
|     Average timings for all steps:
|     Elapsed(s) =   15109.55 Per Step(ms) =      15.11
|         ns/day =       5.72   seconds/ns =   15109.55
|     -----------------------------------------------------



* 2nd simulation:

4090 SPFP:
     time: real 432m41.851s; Total wall time: 25962 seconds (7.21 hours)
     the following message appeared in stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
     final mdinfo:
 NSTEP = 10000000   TIME(PS) =  177500.000  TEMP(K) =   299.26  PRESS =   129.1
 Etot   =   -810728.8221  EKtot   =    222621.7188  EPtot      =  -1033350.5408
 BOND   =     19292.7860  ANGLE   =     74966.4717  DIHED      =     45941.9703
 UB     =         0.0000  IMP     =         0.0000  CMAP       =       996.5214
 1-4 NB =     18251.2200  1-4 EEL =    -16867.0356  VDWAALS    =     49981.1683
 EELEC  =  -1225913.6430  EHBOND  =         0.0000  RESTRAINT  =         0.0000
 EKCMT  =     58240.5720  VIRIAL  =     48950.1407  VOLUME     =   3334060.8314
                                                    SURFTEN    =       -66.6101
                                                    Density    =         1.0175

|  Final Performance Info:
|     -----------------------------------------------------
|     Average timings for last   30000 steps:
|     Elapsed(s) =      77.60 Per Step(ms) =       2.59
|         ns/day =      33.40   seconds/ns =    2586.60
|
|     Average timings for all steps:
|     Elapsed(s) =   25954.90 Per Step(ms) =       2.60
|         ns/day =      33.29   seconds/ns =    2595.49
|     -----------------------------------------------------


4090 DPFP:
     time: still running. It has been running for more than two days and
all appears well.

5090 SPFP:
     time: real 5m0.957s
     crashes with this error in stderr:
of length = 42Failed an illegal memory access was encountered
     last step in the out file:
 NSTEP =   110000   TIME(PS) =  167610.000  TEMP(K) =   300.88  PRESS =-45891.4

5090 DPFP:
     time: still running. It has been running for more than two days and
all appears well.
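
To localize the illegal memory access, the crashing SPFP run could be
repeated under NVIDIA's compute-sanitizer (a sketch; memcheck is the
relevant tool here, and the file names are placeholders). It runs much
slower, but it should point at the offending kernel:

compute-sanitizer --tool memcheck pmemd.cuda_SPFP -O -i md.in \
    -p system.prmtop -c system.rst7 -o md_memcheck.out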

It is interesting to point out that, when running the 1st simulation in
single precision, both runs (on the 4090 and on the 5090) showed strange
behavior with mdinfo: it did not show the final data, even though the
".out" file reports that the run ended well. For the second simulation,
however, running on the 4090, mdinfo does report the final data.

Summarizing:
There appears to be a problem when running pmemd.cuda_SPFP on an RTX5090
(Blackwell). I know this kind of problem has been reported for other
software. Could it be the same case here?
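
In the meantime, a possible interim workaround (a sketch, assuming DPFP
keeps running stably, and at the cost of the roughly 20x lower
throughput seen above) is to point the pmemd.cuda symlink at the
double-precision binary:

ln -sf pmemd.cuda_DPFP $AMBERHOME/bin/pmemd.cuda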



These are my system settings:

cat /etc/os-release
PRETTY_NAME="AlmaLinux 8.10 (Cerulean Leopard)"

uname -a
Linux brizard 6.14.7-1.el8.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Sun May
18 11:48:16 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux

lscpu |grep Ryzen
Model name:          AMD Ryzen 9 7950X 16-Core Processor

nvidia-smi
Fri May 30 17:44:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
| 71%   68C    P1            574W /  575W |     512MiB /  32607MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          633167      C   ./gpu_stress_test                       502MiB |
+-----------------------------------------------------------------------------------------+

gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-26)

$ modinfo nvidia | grep license
license:        Dual MIT/GPL


Thank you very much

Oscar C.S.

-- 
Oscar Conchillo Solé
Computational Biology Group
Data Center Manager, Sysadmin and Bioinformatics
Institut de Biotecnologia i Biomedicina (UAB)
Department of Genetics and Microbiology (UAB)
mail:Oscar.Conchillo.uab.cat
telf: 0034 93581 4431
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 15 2025 - 01:00:02 PDT