Re: [AMBER] Accelerating microsecond-scale simulations on GPUs

From: Carlos Simmerling via AMBER <amber.ambermd.org>
Date: Sun, 24 Nov 2024 09:25:46 -0500

I think you just need to consider whether it is better overall to run one job at a
time using all 8 GPUs, or to use fewer GPUs per job and run more jobs at once.
Multi-GPU MD is not 100% efficient, but running multiple MD runs at once is.
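
For the "more at once" option, a SLURM job array is one way to do it. The sketch
below is only illustrative: the partition and gres names are copied from your
script, but the per-replica directory layout (rep01 ... rep15), the input names,
and the starting restart (equil.rst) are assumptions, not your actual setup.

#!/bin/bash -l
#SBATCH --partition=tesla
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:tesla:1
#SBATCH --array=1-15
#SBATCH --time=168:00:00

module purge
module load amber/22

# Each array task gets one GPU and runs one replica in its own directory
# (rep01 ... rep15, assumed layout); with 8 GPUs available, 8 tasks run
# concurrently and the rest wait in the queue.
cd rep$(printf "%02d" $SLURM_ARRAY_TASK_ID)

# Serial GPU engine; no mpirun is needed for a single GPU.
$AMBERHOME/bin/pmemd.cuda -O -i final.in -o final.1.out \
    -p hmr.prmtop -c equil.rst -r final.1.rst -x final.1.nc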

On Sun, Nov 24, 2024, 9:21 AM Maciej Spiegel <maciej.spiegel.umw.edu.pl>
wrote:

> Actually, these are five systems, with three separate replicas for each, and
> each replica running for five consecutive 1-microsecond intervals. That’s why I
> thought about another approach to get around the… 2 weeks (?) of computation.
>
> Best
>
>
> Maciej Spiegel, MPharm PhD
> *assistant professor*
>
> *Department of Organic Chemistry and Pharmaceutical Technology,*
> *Faculty of Pharmacy, Wroclaw Medical University*
> *Borowska 211A, 50-556 Wroclaw, Poland*
>
> Message from Carlos Simmerling <carlos.simmerling.gmail.com> on 24 Nov 2024, at 14:51:
>
> Yes, it's about 4x slower compared to 8 GPUs. But you have 15 jobs to run, so
> it's still about 2x faster overall to run each on 1 GPU than to run one MD
> after another with each using 8 GPUs. Run 8 single-GPU jobs first, then the
> remaining 7 (or 6 using 1 GPU each and 1 using 2 GPUs).
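>
> As a rough check with the timings quoted below (762.52 ns/day on 8 GPUs,
> 194.50 ns/day on a single GPU): one 5-us replica takes about 5000/762.5 ≈ 6.6
> days on all 8 GPUs, so 15 replicas back to back need roughly 98 days of wall
> time. On a single GPU a replica takes about 5000/194.5 ≈ 26 days, but 8 of
> them run at once, so two waves finish in roughly 51 days, i.e. about twice as
> fast overall.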
>
> On Sun, Nov 24, 2024, 8:36 AM Maciej Spiegel <maciej.spiegel.umw.edu.pl>
> wrote:
>
>> Alright, so with just one GPU, the average timings drop drastically to
>>
>> *| Average timings for all steps:*
>> *| Elapsed(s) = 310.95 Per Step(ms) = 1.78*
>> *| ns/day = 194.50 seconds/ns = 444.21*
>>
>> and so:
>>
>> *| Estimated time remaining: 123.3 hours.*
>>
>> Not good, given it's just the first step (1 µs) of five.
>>
>> best,
>> –
>> Maciej Spiegel, MPharm PhD
>> *assistant professor*
>> .GitHub <https://farmaceut.github.io>
>>
>> *Department of Organic Chemistry and Pharmaceutical Technology,*
>> *Faculty of Pharmacy, Wroclaw Medical University*
>> *Borowska 211A, 50-556 Wroclaw, Poland*
>>
>> Message from Carlos Simmerling <carlos.simmerling.gmail.com> on 24 Nov 2024, at 14:03:
>>
>> Try it and see (compare timings with 1 vs 8). Since you have multiple MD runs
>> to perform, it will be faster overall to run 1 per GPU. Also, I don't mix CPU
>> and GPU jobs in the same SLURM script; you hold all of the resources while
>> each step runs. I have the end of one script submit the next (or use SLURM
>> dependencies).
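>>
>> A minimal sketch of the dependency approach (the script names are just
>> placeholders): submit the CPU stage first, then let the GPU stage start only
>> after it finishes successfully:
>>
>> cpu_jid=$(sbatch --parsable equil_cpu.sh)    # --parsable prints only the job ID
>> sbatch --dependency=afterok:$cpu_jid prod_gpu.sh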
>>
>> On Sun, Nov 24, 2024, 7:56 AM Maciej Spiegel <maciej.spiegel.umw.edu.pl>
>> wrote:
>>
>>> Sorry if the snippet was unclear: the minimization, heating, and equilibration
>>> steps are run on the CPU, then I switch to the GPU for production (see
>>> below).
>>> Do you think 1 GPU would really be sufficient? If I achieve 762.52
>>> ns/day with 8 GPUs, wouldn't the performance drop drastically with fewer
>>> GPUs?
>>>
>>> #!/bin/bash -l
>>> #SBATCH --partition=tesla
>>> #SBATCH --nodes=1
>>> #SBATCH --ntasks=32
>>> #SBATCH --cpus-per-task=1
>>> #SBATCH --gres=gpu:tesla:8
>>> #SBATCH --time=168:00:00
>>> #SBATCH --error=%j.err
>>> #SBATCH --output=%j.out
>>>
>>> module purge
>>> module load amber/22
>>>
>>> prmtop=hmr.prmtop
>>> inpcrd=md.inpcrd
>>>
>>> # Run minimization/equilibration steps (CPU, MPI)
>>> for i in {1..4}; do
>>>     if [ $i -gt 1 ]; then
>>>         # Restart from the previous step
>>>         prev_rst="step${prev_run}.rst"
>>>         mpirun -np 32 $AMBERHOME/bin/pmemd.MPI \
>>>             -O \
>>>             -i step${i}.in \
>>>             -o step${i}.out \
>>>             -p $prmtop \
>>>             -c $prev_rst \
>>>             -r step${i}.rst \
>>>             -x step${i}.nc \
>>>             -ref $prev_rst
>>>     else
>>>         mpirun -np 32 $AMBERHOME/bin/pmemd.MPI \
>>>             -O \
>>>             -i step${i}.in \
>>>             -o step${i}.out \
>>>             -p $prmtop \
>>>             -c $inpcrd \
>>>             -r step${i}.rst \
>>>             -x step${i}.nc \
>>>             -ref $inpcrd
>>>     fi
>>>     prev_run=$i  # Update for the next loop iteration
>>> done
>>>
>>> # Sequential production MD (5 consecutive runs)
>>> prev_rst="step${prev_run}.rst"  # First production run starts from the last restart above
>>> for i in {1..5}; do
>>>     if [ $i -gt 1 ]; then
>>>         # Subsequent runs continue from the previous production restart
>>>         prev_rst="final.${prev_run}.rst"
>>>     fi
>>>     # Run production MD (GPU)
>>>     mpirun -np 8 $AMBERHOME/bin/pmemd.cuda.MPI \
>>>         -O \
>>>         -i final.in \
>>>         -o final.${i}.out \
>>>         -p $prmtop \
>>>         -c $prev_rst \
>>>         -r final.${i}.rst \
>>>         -x final.${i}.nc \
>>>         -ref $prev_rst
>>>
>>>     prev_run=$i  # Update for the next loop iteration
>>> done
>>>
>>> –
>>> Maciej Spiegel, MPharm PhD
>>> *assistant professor*
>>> .GitHub <https://farmaceut.github.io/>
>>>
>>> *Department of Organic Chemistry and Pharmaceutical Technology,*
>>> *Faculty of Pharmacy, Wroclaw Medical University*
>>> *Borowska 211A, 50-556 Wroclaw, Poland*
>>>
>>> Message from Carlos Simmerling <carlos.simmerling.gmail.com> on 24 Nov 2024, at 13:32:
>>>
That script submits both a CPU job and a GPU job. Don't do that. I
>>> suggest a GPU job using only 1 GPU per MD run and no MPI.
>>> Use your 8 GPUs for the multiple MD runs, 1 GPU each. It will be much
>>> more efficient.
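>>>
>>> Concretely, a single-GPU production run would drop mpirun and call the serial
>>> GPU engine instead, e.g. (a sketch reusing the file names from the script
>>> above):
>>>
>>> $AMBERHOME/bin/pmemd.cuda -O -i final.in -o final.1.out -p hmr.prmtop \
>>>     -c step4.rst -r final.1.rst -x final.1.nc -ref step4.rst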
>>>
>>> On Sun, Nov 24, 2024, 6:49 AM Maciej Spiegel via AMBER <
>>> amber.ambermd.org> wrote:
>>>
>>>>
>>>> Hello,
>>>> I need to run a 5-microsecond simulation of my system containing 39,391
>>>> atoms.
>>>> I am using eight Tesla V100-SXM2 GPUs, running a job in SLURM with the
>>>> following configuration:
>>>>
>>>> $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
>>>> #SBATCH --nodes=1
>>>> #SBATCH --ntasks=32
>>>> #SBATCH --cpus-per-task=1
>>>> #SBATCH --gres=gpu:tesla:8
>>>> #SBATCH --time=168:00:00
>>>> …
>>>> mpirun -np 32 $AMBERHOME/bin/pmemd.MPI …
>>>> mpirun -np 8 $AMBERHOME/bin/pmemd.cuda.MPI ...
>>>> …
>>>> $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
>>>> Based on the current timing information, the average performance is
>>>> 762.52 ns/day, and the estimated runtime is approximately 160 hours. There
>>>> are 5 systems in total, and I also wish to run 3 replicas for each system.
>>>>
>>>> Is there anything else, aside from the HMR topology (which I have
>>>> already applied), that I can use to further accelerate the job?
>>>>
>>>> Thanks
>>>> ———
>>>> Maciej Spiegel, MPharm PhD
>>>> assistant professor
>>>> .GitHub <https://farmaceut.github.io/>
>>>>
>>>> Department of Organic Chemistry and Pharmaceutical Technology,
>>>> Faculty of Pharmacy, Wroclaw Medical University
>>>> Borowska 211A, 50-556 Wroclaw, Poland
>>>>
>>>>
>>>
>>>
>>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Sun Nov 24 2024 - 06:30:02 PST