Dear Amber people
We've just acquired an NVIDIA RTX 5090, which we plan to use mainly with
Amber.
We already have a 3080 and a 4090, both of which work great with Amber.
#### Summary of the following long mail ######
Amber24 crashes when running pmemd.cuda (pmemd.cuda_SPFP) with this
error in stderr:
of length = 42Failed an illegal memory access was encountered
#### END of summary #####
However, on the computer with the NVIDIA RTX 5090:
I could not get Amber23 to run: neither the copy we already had compiled
worked, nor could we recompile it, due to a Miniconda failure with broken
dependencies.
Since I saw that there is a new version of Amber, I went for it.
The first thing I noticed is that after executing:
./update_amber --update
I got this error:
Downloading updates for Amber 24
Downloading Amber 24/update.1 (1.10 KB)
Applying Amber 24/update.1
Downloading Amber 24/update.2 (317.64 KB)
Downloading: [::::::::::::::::::::::::::::::::::::::::::::::::::] 100.0%
Done.
Applying Amber 24/update.2
PatchingError: .patches/Amber24_Unapplied_Patches/update.2 failed to
apply. No changes made from this patch
(All went well for AmberTools24.)
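In case it helps with reproducing this: my plan for digging into the failed
patch is simply to look at what update.2 tries to change and to dry-run it
with the standard patch tool, nothing Amber-specific. A sketch only (the
source-tree path is a placeholder, and -p0 vs -p1 depends on how the paths
inside the patch file are written):

cd /path/to/amber24_src    # placeholder for wherever the Amber24 source tree lives
less .patches/Amber24_Unapplied_Patches/update.2
patch -p0 --dry-run < .patches/Amber24_Unapplied_Patches/update.2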
I decided to keep going to see what would happen, and I managed to compile
the code. This was my cmake command:
cmake ../pmemd24_src \
    -DCMAKE_INSTALL_PREFIX=/sharelab/labapps/AMBER/AMBER25/Amber25-GPU_MPI \
    -DCOMPILER=GNU -DMPI=TRUE -DOPENMP=TRUE -DCUDA=TRUE \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.8 -DINSTALL_TESTS=TRUE \
    -DDOWNLOAD_MINICONDA=FALSE -DBUILD_PYTHON=FALSE -DBUILD_PERL=FALSE \
    -DBUILD_GUI=FALSE -DPMEMD_ONLY=TRUE -DCHECK_UPDATES=FALSE
There were lots of warnings, but it did compile.
(CUDA 12.6 does not support Blackwell GPUs; I tried it anyway and it did
not work.)
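One thing I can check on my side, in case the SPFP problem below turns out
to be an architecture issue: whether the binaries actually embed native
Blackwell (sm_120) device code or only older cubins/PTX. A rough sketch of
what I would run, assuming cuobjdump's --list-elf output names each
embedded cubin with its sm_XX target:

# list the GPU architectures compiled into the SPFP executable (sm_120 = Blackwell)
/usr/local/cuda-12.8/bin/cuobjdump --list-elf $AMBERHOME/bin/pmemd.cuda_SPFP \
    | grep -o 'sm_[0-9]*' | sort -u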
make test.serial and make test.cuda.serial went well, but I realized that
test.cuda.serial uses PREC_MODEL = DPFP, i.e. the bin/pmemd.cuda_DPFP
executable.
Last lines of make test.cuda.serial:
Finished CUDA test suite for Amber 24 at Fri Jul 4 11:56:53 CEST 2025.
273 file comparisons passed
15 file comparisons failed (8 of which can be ignored)
4 tests experienced errors
These were the errors:
==============================================================
cd tip4pew/ && ./Run.tip4pew_box_npt DPFP yes
ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
Your system density has likely changed by a large amount, probably from
starting the simulation from a structure a long way from equilibrium.
[Although this error can also occur if the simulation has blown up
for some reason]
The GPU code does not automatically reorganize grid cells and thus you
will need to restart the calculation from the previous restart file.
This will generate new grid cells and allow the calculation to continue.
It may be necessary to repeat this restarting multiple times if your
system
is a long way from an equilibrated density.
Alternatively you can run with the CPU code until the density has
converged
and then switch back to the GPU code.
./Run.tip4pew_box_npt: Program error
make[1]: [Makefile:198: test.pmemd.cuda.pme] Error 1 (ignored)
cd tip4pew/ && ./Run.tip4pew_oct_nvt DPFP yes
diffing mdout.tip4pew_oct_nvt.GPU_DPFP with mdout.tip4pew_oct_nvt
PASSED
==============================================================
==============================================================
cd gamd/PPIGaMD && ./Run.ppigamd DPFP yes
ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
Your system density has likely changed by a large amount, probably from
starting the simulation from a structure a long way from equilibrium.
[Although this error can also occur if the simulation has blown up
for some reason]
The GPU code does not automatically reorganize grid cells and thus you
will need to restart the calculation from the previous restart file.
This will generate new grid cells and allow the calculation to continue.
It may be necessary to repeat this restarting multiple times if your
system
is a long way from an equilibrated density.
Alternatively you can run with the CPU code until the density has
converged
and then switch back to the GPU code.
./Run.ppigamd: Program error
make[1]: [Makefile:227: test.pmemd.cuda.pme.sgamd] Error 1 (ignored)
#Begin tests
------------------------------------
Running CUDA Virtual Site tests.
cd virtual_sites/tip4p && ./Run.ec DPFP
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
diffing etot.type5_DPFP.save with etot.type5
PASSED
==============================================================
==============================================================
STOP PMEMD Terminated Abnormally!
./Run.npt: Program error
make[1]: [Makefile:424: test.pmemd.cuda.VirtualSites] Error 1 (ignored)
cd virtual_sites/tip5p && ./Run.ec DPFP
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
diffing etot.type8_DPFP.save with etot.type8
PASSED
==============================================================
==============================================================
cd virtual_sites/BromoBenzene && ./Run.npt DPFP
ERROR: Calculation halted. Periodic box dimensions have changed too
much from their initial values.
Your system density has likely changed by a large amount, probably from
starting the simulation from a structure a long way from equilibrium.
[Although this error can also occur if the simulation has blown up
for some reason]
The GPU code does not automatically reorganize grid cells and thus you
will need to restart the calculation from the previous restart file.
This will generate new grid cells and allow the calculation to continue.
It may be necessary to repeat this restarting multiple times if your
system
is a long way from an equilibrated density.
Alternatively you can run with the CPU code until the density has
converged
and then switch back to the GPU code.
./Run.npt: Program error
make[1]: [Makefile:430: test.pmemd.cuda.VirtualSites] Error 1 (ignored)
cd virtual_sites/DimethylEther && ./Run.ec DPFP
Note: The following floating-point exceptions are signalling:
IEEE_INVALID_FLAG IEEE_DENORMAL
diffing etot.type8_DPFP.save with etot.type8
PASSED
==============================================================
But since there are only 4 of them and they are all reported as "PASSED",
I have a strong feeling that I can ignore them.
However, since the pmemd.cuda that gets installed is a link to
bin/pmemd.cuda_SPFP (as in the previous versions), I also executed, inside
the test directory:
./test_amber_cuda_serial.sh SPFP
In this case it reported many errors:
Finished CUDA test suite for Amber 24 at Fri Jul 4 12:11:47 CEST 2025.
146 file comparisons passed
44 file comparisons failed (9 of which can be ignored)
101 tests experienced errors
I can see in the logfile that, starting from the line that says
"Running CUDA GTI free energy tests.", all the remaining tests fail.
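If it is useful I can also rerun any single test against the SPFP binary by
calling its Run script directly with SPFP as the precision argument, the
same way the suite invokes DPFP above (assuming the CUDA test layout under
$AMBERHOME/test/cuda is the same as in previous versions):

cd $AMBERHOME/test/cuda/tip4pew
./Run.tip4pew_box_npt SPFP yes    # same test as above, but with the SPFP executable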
Apart from that, we also ran two MD simulations that we had previously run
on a computer with an NVIDIA RTX 4090 using the same drivers.
* 1st simulation:
It completed without errors on both GPUs, but we observed some things we
think are worth noting:
4090 SPFP:
time: real 14m13.784s    Total wall time: 854 seconds    0.24 hours
the following message appeared in the stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
final mdinfo:
NSTEP = 950000  TIME(PS) = 16450.000  TEMP(K) = 304.24  PRESS = 31.5
Etot = -221832.0665  EKtot = 74523.8516  EPtot = -296355.9181
BOND = 5825.3681  ANGLE = 22477.4361  DIHED = 13736.1496
UB = 0.0000  IMP = 0.0000  CMAP = 401.9350
1-4 NB = 5548.2509  1-4 EEL = -26704.3318  VDWAALS = 11817.9771
EELEC = -329679.5838  EHBOND = 0.0000  RESTRAINT = 220.8807
EAMBER (non-restraint) = -296576.7988
EKCMT = 21371.2401  VIRIAL = 20620.6276  VOLUME = 1103684.8218
SURFTEN = -147.0103
Density = 1.0234
------------------------------------------------------------------------------
| Current Timing Info
| -------------------
| Total steps: 1000000 | Completed: 950000 ( 95.0%) | Remaining: 50000
|
| Average timings for last 75000 steps:
| Elapsed(s) = 63.72 Per Step(ms) = 0.85
| ns/day = 101.69 seconds/ns = 849.65
|
| Average timings for all steps:
| Elapsed(s) = 808.09 Per Step(ms) = 0.85
| ns/day = 101.57 seconds/ns = 850.62
|
|
| Estimated time remaining: 42.5 seconds.
------------------------------------------------------------------------------
4090 DPFP:
time: real 326m8.975s    Total wall time: 19569 seconds    5.44 hours
the following message appeared in the stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
final mdinfo:
NSTEP = 1000000  TIME(PS) = 16500.000  TEMP(K) = 304.20  PRESS = 98.9
Etot = -221528.4593  EKtot = 74512.1291  EPtot = -296040.5883
BOND = 5783.8532  ANGLE = 22242.1897  DIHED = 13737.2298
UB = 0.0000  IMP = 0.0000  CMAP = 389.4306
1-4 NB = 5605.6012  1-4 EEL = -26765.6143  VDWAALS = 11568.7499
EELEC = -328834.0661  EHBOND = 0.0000  RESTRAINT = 232.0378
EAMBER (non-restraint) = -296272.6262
EKCMT = 21224.3892  VIRIAL = 18867.0417  VOLUME = 1103916.7228
SURFTEN = 10.9442
Density = 1.0231
| Final Performance Info:
| -----------------------------------------------------
| Average timings for last 5000 steps:
| Elapsed(s) = 97.90 Per Step(ms) = 19.58
| ns/day = 4.41 seconds/ns = 19579.20
|
| Average timings for all steps:
| Elapsed(s) = 19566.12 Per Step(ms) = 19.57
| ns/day = 4.42 seconds/ns = 19566.12
| -----------------------------------------------------
5090 SPFP:
time: real 12m0.566s    Total wall time: 720 seconds    0.20 hours
the following message appeared in the stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
final mdinfo:
NSTEP = 940000  TIME(PS) = 16440.000  TEMP(K) = 302.39  PRESS = 38.1
Etot = -222309.6997  EKtot = 74070.1875  EPtot = -296379.8872
BOND = 5820.7564  ANGLE = 22301.2271  DIHED = 13729.4669
UB = 0.0000  IMP = 0.0000  CMAP = 404.1852
1-4 NB = 5591.7641  1-4 EEL = -26747.9411  VDWAALS = 11968.3334
EELEC = -329676.0689  EHBOND = 0.0000  RESTRAINT = 228.3895
EAMBER (non-restraint) = -296608.2767
EKCMT = 21188.2013  VIRIAL = 20278.1851  VOLUME = 1105365.1992
SURFTEN = 37.9072
Density = 1.0218
------------------------------------------------------------------------------
| Current Timing Info
| -------------------
| Total steps: 1000000 | Completed: 940000 ( 94.0%) | Remaining: 60000
|
| Average timings for last 85000 steps:
| Elapsed(s) = 61.10 Per Step(ms) = 0.72
| ns/day = 120.19 seconds/ns = 718.83
|
| Average timings for all steps:
| Elapsed(s) = 676.44 Per Step(ms) = 0.72
| ns/day = 120.06 seconds/ns = 719.62
|
|
| Estimated time remaining: 43.2 seconds.
------------------------------------------------------------------------------
5090 DPFP:
time: real 251m50.459s    Total wall time: 15110 seconds    4.20 hours
final mdinfo:
NSTEP = 1000000  TIME(PS) = 16500.000  TEMP(K) = 302.87  PRESS = -34.9
Etot = -221858.7506  EKtot = 74185.9438  EPtot = -296044.6944
BOND = 5945.5673  ANGLE = 22373.4519  DIHED = 13670.1717
UB = 0.0000  IMP = 0.0000  CMAP = 400.3730
1-4 NB = 5552.6117  1-4 EEL = -26732.7382  VDWAALS = 11747.7378
EELEC = -329234.9688  EHBOND = 0.0000  RESTRAINT = 233.0991
EAMBER (non-restraint) = -296277.7935
EKCMT = 21040.2882  VIRIAL = 21873.4599  VOLUME = 1105929.0761
SURFTEN = 58.1391
Density = 1.0213
| Final Performance Info:
| -----------------------------------------------------
| Average timings for last 5000 steps:
| Elapsed(s) = 75.16 Per Step(ms) = 15.03
| ns/day = 5.75 seconds/ns = 15032.95
|
| Average timings for all steps:
| Elapsed(s) = 15109.55 Per Step(ms) = 15.11
| ns/day = 5.72 seconds/ns = 15109.55
| -----------------------------------------------------
* 2nd simulation:
4090 SPFP:
time: real 432m41.851s    Total wall time: 25962 seconds    7.21 hours
the following message appeared in the stderr/stdout:
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
final mdinfo:
NSTEP = 10000000  TIME(PS) = 177500.000  TEMP(K) = 299.26  PRESS = 129.1
Etot = -810728.8221  EKtot = 222621.7188  EPtot = -1033350.5408
BOND = 19292.7860  ANGLE = 74966.4717  DIHED = 45941.9703
UB = 0.0000  IMP = 0.0000  CMAP = 996.5214
1-4 NB = 18251.2200  1-4 EEL = -16867.0356  VDWAALS = 49981.1683
EELEC = -1225913.6430  EHBOND = 0.0000  RESTRAINT = 0.0000
EKCMT = 58240.5720  VIRIAL = 48950.1407  VOLUME = 3334060.8314
SURFTEN = -66.6101
Density = 1.0175
| Final Performance Info:
| -----------------------------------------------------
| Average timings for last 30000 steps:
| Elapsed(s) = 77.60 Per Step(ms) = 2.59
| ns/day = 33.40 seconds/ns = 2586.60
|
| Average timings for all steps:
| Elapsed(s) = 25954.90 Per Step(ms) = 2.60
| ns/day = 33.29 seconds/ns = 2595.49
| -----------------------------------------------------
4090 DPFP:
time: still running. It's been running for more than two days and
all appears well.
5090 SPFP:
time: real 5m0.957s
crashes with this error in stderr:
of length = 42Failed an illegal memory access was encountered
last step in outfile:
NSTEP = 110000  TIME(PS) = 167610.000  TEMP(K) = 300.88  PRESS = -45891.4
5090 DPFP:
time: still running. It's been running for more than two days and
all appears well.
It is interesting to point out that when running the 1st simulation in
single precision, both runs (on the 4090 and on the 5090) showed a strange
behavior with mdinfo: it did not show the final data, even though the
".out" file reports that the run ended well. However, for the second
simulation running on the 4090, mdinfo does report the final data.
Summarizing:
There appears to be a problem when running pmemd.cuda_SPFP on an RTX 5090
(Blackwell).
I know this has been reported for other software. Could it be the same
case here?
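If it would help with debugging, I can rerun the crashing SPFP case under
compute-sanitizer from the CUDA 12.8 toolkit to get a more precise location
for the illegal access. Something along these lines (the input/output file
names are placeholders, not our real production files):

/usr/local/cuda-12.8/bin/compute-sanitizer --tool memcheck \
    $AMBERHOME/bin/pmemd.cuda_SPFP -O -i md.in -p system.prmtop -c system.rst7 \
    -o md_memcheck.out -r md_memcheck.rst7 -x md_memcheck.nc

It will run much more slowly, but it should at least tell us in which
kernel the illegal access happens.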
These are my system settings:
cat /etc/os-release
PRETTY_NAME="AlmaLinux 8.10 (Cerulean Leopard)"
uname -a
Linux brizard 6.14.7-1.el8.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Sun May
18 11:48:16 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
lscpu |grep Ryzen
Model name: AMD Ryzen 9 7950X 16-Core Processor
nvidia-smi
Fri May 30 17:44:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
| 71%   68C    P1            574W /  575W |     512MiB /  32607MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                                    Usage |
|=========================================================================================|
|    0   N/A  N/A          633167      C   ./gpu_stress_test                       502MiB |
+-----------------------------------------------------------------------------------------+
gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-26)
$ modinfo nvidia | grep license
license: Dual MIT/GPL
thank you very much
Oscar C.S.
--
Oscar Conchillo Solé
Computational Biology Group
Data Center Manager, Sysadmin and Bioinformatics
Institut de Biotecnologia i Biomedicina (UAB)
Department of Genetics and Microbiology (UAB)
mail:Oscar.Conchillo.uab.cat
telf: 0034 93581 4431