Re: [AMBER] gpu %utils vs mem used

From: Meij, Henk <hmeij.wesleyan.edu>
Date: Tue, 25 Jul 2017 17:49:58 +0000

Thank you, that was super helpful. I need an amber cluster, a gromacs cluster, a ....


I'll give the mixing of jobs some thought, but simplicity has its purpose too.


-Henk

________________________________
From: Ross Walker <ross.rosswalker.co.uk>
Sent: Tuesday, July 25, 2017 10:31:16 AM
To: AMBER Mailing List
Subject: Re: [AMBER] gpu %utils vs mem used

Hi Henk,

If only the laws of physics were so accommodating. ;-)

Unfortunately, just because the GPU reports itself as 75% utilized does not mean the other 25% is available for additional computation. The missing 25% is lost to inefficiencies: stalls waiting on memory accesses, waiting for other cores to finish computing, and so on. If the speed of light were infinite one could swap tasks (and their memory) in and out instantly and do what you suggest. Unfortunately the speed of light is far too slow, so the overhead of task switching swamps anything you might stand to gain by attempting this.

In terms of sizing, for AMBER it's super simple, since we designed it to run everything on the GPU and avoid the headache of having to match CPU cores etc. You simply run one calculation per GPU. The code is designed so that GPUs don't interfere with each other, so for cost effectiveness one should just max out the GPUs per node: say 8 GPUs (1080TIs are ideal) in a dual-socket node, and then run 8 jobs on that single node. You can buy CPUs with low core counts and low clock speeds to save money, since you only need one CPU core per GPU.
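As a concrete illustration, a hand-launched version of that layout might look like the sketch below. This is purely illustrative, not a tested recipe: the run0/ .. run7/ directory layout and input file names are hypothetical, and a real cluster would drive this through the scheduler rather than a script.

    #!/bin/bash
    # Sketch: one pmemd.cuda job per GPU on a hypothetical 8-GPU node.
    # Assumes each job's inputs live in run0/ .. run7/ (made-up layout).
    for gpu in 0 1 2 3 4 5 6 7; do
        (
            cd "run${gpu}" || exit 1
            export CUDA_VISIBLE_DEVICES=${gpu}   # pin this job to one GPU
            exec pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out
        ) &
    done
    wait    # block until all eight jobs have finished

Because each process only sees its own GPU, the jobs cannot contend with one another, which is what makes the max-out-the-node strategy work.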

For Lammps and Gromacs it's far more complicated (especially Gromacs), since they try to use the CPUs at the same time. This leads to inefficiencies in utilizing the GPUs, so you max out at about 2 GPUs per node. You also need to buy high clock-speed CPUs, which carry a massive price premium, so the effective performance per dollar is far lower with Gromacs. If you have to run both AMBER and Gromacs together, your best bet is probably either to accept that a chunk of your node will be idle when running Gromacs or, if you have say 8 GPUs and 40 cores in a node, to have a single Gromacs job use 32 cores and 2 GPUs and then also run 6 single-GPU AMBER jobs on the same node (see the sketch below). It needs some complicated scheduler configuration to work, but it's possible. Note that since AMBER sits entirely on a GPU you can run multiple jobs on a node without contention (1 per GPU). This is not true with Gromacs, because all the CPU-to-CPU and CPU-to-GPU communication floods both the communication channels between CPU cores and the PCI-E bus to the GPU. As such you can't reliably run, say, 2 Gromacs jobs on the same node where one uses 20 cores and 2 GPUs and the other uses the remaining 20 cores and 2 GPUs.
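Here is a rough sketch of what a hand-launched version of that mixed layout could look like. Again this is illustrative only: directory names and input files are made up, the -ntmpi/-ntomp/-gpu_id flag spellings should be checked against your Gromacs version, and a real deployment would also pin the two workloads to disjoint CPU cores via the scheduler.

    #!/bin/bash
    # Sketch of the mixed layout on a hypothetical 8-GPU, 40-core node.

    # One Gromacs job: 2 thread-MPI ranks x 16 OpenMP threads on GPUs 0-1.
    (
        cd gromacs_run || exit 1
        export CUDA_VISIBLE_DEVICES=0,1
        gmx mdrun -deffnm md -ntmpi 2 -ntomp 16 -gpu_id 01
    ) &

    # Six single-GPU AMBER jobs on the remaining GPUs, one CPU core each.
    for gpu in 2 3 4 5 6 7; do
        (
            cd "amber_run${gpu}" || exit 1
            export CUDA_VISIBLE_DEVICES=${gpu}
            pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out
        ) &
    done
    wait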

Hope that helps. Unfortunately the conflicting code designs mean there is no single config that is ideal for AMBER, Lammps and Gromacs. :-(

All the best
Ross

> On Jul 25, 2017, at 9:06 AM, Meij, Henk <hmeij.wesleyan.edu> wrote:
>
> Indeed that helps, Ross. My thought experiment went along these lines: if a GPU is 75% utilized and you have two of them, then half a "virtual" GPU is idle; with 20 GPUs, 5 are idle, and so on. If that pattern persists into the 200+ range and beyond, that's a lot of resources. If I could provide virtual GPUs and size them to simulation requirements, that would be ideal. Or buy GPUs better matched to our regular type of jobs, but that is a difficult target.
>
>
> It gets really complicated with Gromacs, where multiple MPI ranks can share one GPU or multiple GPUs. Any pointers on how best to size a GPU environment to software requirements would be appreciated; we run mostly Amber, Lammps and Gromacs.
>
>
> -Henk
>
> ________________________________
> From: Ross Walker <ross.rosswalker.co.uk>
> Sent: Monday, July 24, 2017 3:16:09 PM
> To: AMBER Mailing List
> Subject: Re: [AMBER] gpu %utils vs mem used
>
> Hi Henk,
>
> Why would you assume that it would make sense to run more than one job on a single GPU? The AMBER code (and pretty much every other GPU code) is designed to use as much of a GPU as possible. Sure, you can run 2 jobs on the same GPU, but due to contention each will end up running at half speed or less. The memory consideration is largely unrelated to performance. Memory usage, for AMBER, is a function of the size of the simulation you are running and, to a lesser extent, the choice of simulation options (NVT vs NPT, thermostat, etc.). The number of floating point operations per byte is high in AMBER: each atom takes around 72 bytes to store its coordinates, forces and velocities, but it is involved in a huge number of interactions, bonds, angles, dihedrals, pairwise electrostatic and van der Waals interactions, and all the FFT work making up the PME reciprocal space. The net result is that it is perfectly reasonable for a small simulation using a couple of hundred MB of memory to max out the compute units on the GPU itself.
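> As a back-of-envelope check (the atom count below is made up for illustration), that 72 bytes is just 3 vectors x 3 components x 8 bytes in double precision, so the per-atom state is tiny:
>
>     # Illustrative arithmetic only; 100,000 atoms is a hypothetical system.
>     atoms=100000
>     bytes=$(( atoms * 72 ))             # coords + forces + velocities
>     echo "${bytes} bytes = ~$(( bytes / 1000000 )) MB of per-atom state"
>
> Even a 100,000-atom system carries only ~7 MB of per-atom state; the rest of a footprint like the ~200 MB in your nvidia-smi output is presumably FFT grids, pair lists and other buffers, while the arithmetic per atom is what saturates the compute units.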
>
> Hope that helps,
>
> All the best
> Ross
>
>
>> On Jul 24, 2017, at 2:20 PM, Meij, Henk <hmeij.wesleyan.edu> wrote:
>>
>> Hi All, this is not a pure Amber question (I observe the same with my Lammps users), but I figured there may be GPU expertise on this list that can give me some insights.
>>
>>
>> My K20 environment is running with exclusive and persistence modes enabled. Looking at the size of the jobs, I was wondering about going the disabled route and pushing more jobs through.
>>
>>
>> But how/why do these tiny jobs each push GPU %util above 70% while consuming so little memory? If that's real, can the GPU only handle one such job at a time?
>>
>>
>> -Henk
>>
>>
>> Mon Jul 24 13:51:26 2017
>> +------------------------------------------------------+
>> | NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  Tesla K20m               | 0000:02:00.0     Off |                    0 |
>> | N/A   40C    P0    98W / 225W |   4%  205MB / 4799MB |     77%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>> |   1  Tesla K20m               | 0000:03:00.0     Off |                    0 |
>> | N/A   41C    P0   106W / 225W |   5%  253MB / 4799MB |     72%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>> |   2  Tesla K20m               | 0000:83:00.0     Off |                    0 |
>> | N/A   26C    P8    16W / 225W |   0%   13MB / 4799MB |      0%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>> |   3  Tesla K20m               | 0000:84:00.0     Off |                    0 |
>> | N/A   27C    P8    15W / 225W |   0%   13MB / 4799MB |      0%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Compute processes:                                               GPU Memory |
>> |  GPU       PID  Process name                                     Usage      |
>> |=============================================================================|
>> |    0     16997  pmemd.cuda.MPI                                       190MB  |
>> |    1     16998  pmemd.cuda.MPI                                       238MB  |
>> +-----------------------------------------------------------------------------+
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber
>
>
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber


_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Tue Jul 25 2017 - 11:00:03 PDT