Hi German,
MIG should just work, I believe. I don't think anything would need to be changed in the code. You just enable it in the driver, specify how many 'partial' GPUs you want, and they show up in nvidia-smi. It's pretty simple but kind of annoyingly implemented in that you have to reboot every time you want to change it.
As for whether it will help for things like REMD and TI in AMBER, I very much doubt it would be worth the cost. An A100 costs more than 20x what an RTX 3080 does, and I suspect that if you just run 4 or so REMD replicas per 3080 GPU the performance actually won't be that bad (assuming everything fits in memory). A 20GB 3080 is supposedly in the works, so if that model gets released GPU memory won't be the issue. I haven't tried it myself, but it would certainly save a lot of money.
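For reference, the MIG workflow described above (enable it, choose the slices, see them in nvidia-smi) could be scripted roughly as below. This is only a sketch that I have not run on an A100. The helper names `mig_setup_commands` and `run_commands` are made up for illustration, and the profile ID used in the example is an assumption: check the output of the `-lgip` step on your own card before creating instances. By default nothing is executed; the commands are only printed.

```python
import subprocess

def mig_setup_commands(gpu_index, profile_ids):
    """Build the nvidia-smi invocations to enable MIG on one GPU and carve it into slices.

    profile_ids are GPU instance profile IDs as reported by `nvidia-smi mig -lgip`;
    the right values depend on the card and driver, so treat them as user input.
    """
    gpu = str(gpu_index)
    profiles = ",".join(str(p) for p in profile_ids)
    return [
        ["nvidia-smi", "-i", gpu, "-mig", "1"],                # switch MIG mode on
        ["nvidia-smi", "mig", "-i", gpu, "-lgip"],             # list available instance profiles
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", profiles, "-C"],  # create GPU + compute instances
        ["nvidia-smi", "-L"],                                  # the slices should now be listed
    ]

def run_commands(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False (needs root and a MIG-capable GPU)."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Six slices on GPU 0; profile ID 19 is an assumed example, not a recommendation.
    run_commands(mig_setup_commands(0, [19] * 6))
```

Each slice then shows up as its own device, so in principle one replica can be pinned to each slice without touching the AMBER code.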
The 3080 has a 320 W TDP, so a little less than the 3090 but not by a great deal. Unfortunately I didn't get power numbers on the last test I did to see if it hits the peak. I'll try it again in a couple of days. I expect it will similarly hit the max TDP, though, so you'd be looking at about 1.3 kW for 4 GPUs under load. Still not great. :-(
As for lifespan, I doubt it makes much difference. Time will tell with the 3080s and 3090s, but the 2080 Tis have certainly been far more reliable in my experience than the V100s, so there is a good chance that trend will continue. I've seen around a 30% failure rate with SXM2 and SXM3 V100s: the HBM memory is faulty and you start getting thousands of uncorrectable ECC errors. The design flaw is serious enough that they should probably be recalled, but instead NVIDIA just recommends a convoluted approach of continually checking and rebooting: https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
It will be interesting to see if this flaw has been addressed in the A100s or not.
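To make the 4-GPU power estimate above concrete, here is a small sketch of the arithmetic, plus a way to read the actual draw from nvidia-smi so you can check whether the cards really sit at their limit under load. The function names are just illustrative. The total covers the GPUs only (CPU, fans and PSU losses come on top), and the parser assumes nvidia-smi reports numeric values for every card.

```python
import subprocess

TDP_WATTS_3080 = 320  # TDP quoted above for the RTX 3080

def total_gpu_power_kw(n_gpus, tdp_watts=TDP_WATTS_3080):
    """Worst-case GPU-only draw in kW, assuming every card sits at its TDP."""
    return n_gpus * tdp_watts / 1000.0

def parse_power_csv(text):
    """Parse 'index, power.draw, power.limit' rows into (index, draw_W, limit_W) tuples.

    Assumes numeric fields; a card reporting '[N/A]' would raise ValueError here.
    """
    rows = []
    for line in text.strip().splitlines():
        idx, draw, limit = [field.strip() for field in line.split(",")]
        rows.append((int(idx), float(draw), float(limit)))
    return rows

def query_gpu_power():
    """Ask nvidia-smi for the current draw and power limit of each GPU (needs the driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw,power.limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_power_csv(out)

if __name__ == "__main__":
    print(f"4 x 3080 at TDP: {total_gpu_power_kw(4):.2f} kW")
```

Four cards at 320 W come to 1.28 kW, which is where the roughly 1.3 kW figure comes from. Calling `query_gpu_power()` periodically during a run would show how close each card gets to its limit.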
All the best
Ross
> On Oct 10, 2020, at 15:48, German P. Barletta <pbarletta.gmail.com> wrote:
>
> Ross,
> thanks for all the info. Hope you don't mind answering a few extra
> questions:
>
> 1. I've read that Amber developers plan on taking advantage of the new
> A100's MIG feature. So with 2 A100s one should be able to run REMD with up
> to, say, 12 replicas. This would be great news for those of us without
> access to HPC servers, but I wonder if A100s and the necessary hardware
> will be too expensive. Do you think they will be worth it? Or is it better
> to stick with the 3080s and hope the next generation will feature MIG on
> cheaper cards?
>
> 2. About the 3090s power consumption: do the 3080s have the same issue?
> Do you think some optimization on the code can be done or should I expect
> max power consumption and set up my system for extra cooling? I worry about
> the card lifespan too.
>
>
> Thanks for all the help.
> _______________________________________________
> AMBER mailing list
> AMBER.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber
Received on Sat Oct 10 2020 - 14:30:02 PDT