Re: [AMBER] Monitoring GPU voltage

From: Kenneth Lam <kenneth.lam.zh.gmail.com>
Date: Fri, 5 Dec 2014 16:08:16 -0500

I see. Thanks for the help!

On Thu, Dec 4, 2014 at 10:06 PM, Ross Walker <ross.rosswalker.co.uk> wrote:

> Hi Kenneth,
>
> I doubt the power draw reported by nvidia-smi, even if it did work, would
> be sufficient for this. If you have a failing power supply, or a power
> supply that wasn't up to the task of running multiple cards (e.g. not
> labelled as gold or platinum), then the most likely issue is going to be
> available current on the various 12 V rails. You'd need to monitor the
> voltage on each of the individual 12 V rails and the current as well -
> this will require your soldering iron.
>
> A lot of cheap power supplies tie the 12 V rails together, so you can
> actually end up with two cards both on the same 12 V rail - which is out
> of spec in terms of the amps that rail can provide, so it might brown out
> under load. The best way to check this is probably to pull all the cards
> from the case and test them one at a time, trying all combinations of the
> 12 V PCI-E rails that your power supplies have. That is only worth doing
> if you think this is really the problem and not, for example, heat due to
> fans wearing out.
>
> All the best
> Ross
>
> On 12/4/14, 3:00 PM, "Kenneth Lam" <kenneth.lam.zh.gmail.com> wrote:
>
> >Hi Ross,
> >
> >We're currently using 4 GTX 780s in each of our machines, and some of them
> >are breaking down. We suspect this might be a power issue. We'd like to
> >track the power consumption of each of the cards while running AMBER so
> >that we know whether or not the cards are failing because the PSU is
> >delivering insufficient power. We're trying to find a solution that will
> >allow us to log the power consumption for the GPUs while they're running;
> >an ammeter may not be able to provide us with a running log.
> >Thanks!
> >
> >Kenneth
> >
> >
> >On Thu, Dec 4, 2014 at 5:16 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
> >
> >> Hi Kenneth,
> >>
> >> Can I ask why you need to monitor the power consumption of the GPUs -
> >> it's not really information that is of much use as far as I can tell.
> >> If you want to know the power consumption, it's probably better just to
> >> stick an ammeter on the lead to the power supply.
> >>
> >> All the best
> >> Ross
> >>
> >>
> >> On 12/4/14, 1:24 PM, "Kenneth Lam" <kenneth.lam.zh.gmail.com> wrote:
> >>
> >> >Unfortunately, the fixes mentioned in the linked threads point to the
> >> >same GitHub repository that Tru indicated. Users there have noted that
> >> >this fix does not work past 331.20, as pointed out in this thread:
> >> >https://github.com/CFSworks/nvml_fix/issues/5
> >> >
> >> >We've tried going through nvidia-settings, but it does not report the
> >> >voltage. Are there any other alternatives available? Thanks again!
> >> >
> >> >Kenneth
> >> >
> >> >On Thu, Dec 4, 2014 at 4:14 PM, Jason Swails <jason.swails.gmail.com> wrote:
> >> >
> >> >> > On Dec 4, 2014, at 3:08 PM, Kenneth Lam <kenneth.lam.zh.gmail.com> wrote:
> >> >> >
> >> >> > Hello all,
> >> >> >
> >> >> > We're unable to monitor the voltage going to our GTX 680s and 780s.
> >> >> > We have been trying to use nvidia-smi to do so, but it does not
> >> >> > support any cards past the GTX 500 series. Is there recommended
> >> >> > software that works with current-gen GPUs (GTX 680+) on Linux, or
> >> >> > should this be done at the hardware level? If so, what would you
> >> >> > recommend? Thanks!
> >> >>
> >> >> This has been discussed in the nVidia forums before, with some fixes
> >> >> to NVML being proposed (really more of a band-aid). You can try the
> >> >> fixes discussed there (may be outdated). Alternatively, I think you
> >> >> can also get that information in nvidia-settings.
> >> >>
> >> >> https://devtalk.nvidia.com/default/topic/560248/system-management-and-monitoring-nvml-/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported-/
> >> >>
> >> >> These statistics *are* reported for the Tesla line, so I'm not sure
> >> >> if this is a marketing move that nVidia is using to promote their HPC
> >> >> line or what (but according to the above thread, such reporting _is_
> >> >> supported in hardware for those cards).
> >> >>
> >> >> HTH,
> >> >> Jason
> >> >>
> >> >> --
> >> >> Jason M. Swails
> >> >> BioMaPS,
> >> >> Rutgers University
> >> >> Postdoctoral Researcher
> >> >>
> >> >> _______________________________________________
> >> >> AMBER mailing list
> >> >> AMBER.ambermd.org
> >> >> http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri Dec 05 2014 - 13:30:02 PST
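
For the running log of per-GPU power consumption that Kenneth asks about in this thread, a minimal sketch is given below. It assumes a card and driver combination for which nvidia-smi actually reports power.draw, which the thread indicates was not the case for the GTX 680/780 at the time; on unsupported cards the value comes back as a placeholder string and is logged as None. The helper names parse_power_csv and log_power and the file name gpu_power.log are hypothetical. As Ross points out, a board-level power figure like this would not show the voltage or current on an individual PSU rail.

```python
import subprocess
import time
from datetime import datetime

# nvidia-smi query options; whether power.draw yields a number depends on
# the card and driver (assumption: it is supported on the machine in use).
QUERY = ["nvidia-smi", "--query-gpu=index,power.draw",
         "--format=csv,noheader,nounits"]


def parse_power_csv(text):
    """Parse 'index, watts' lines into {gpu_index: watts or None}."""
    readings = {}
    for line in text.strip().splitlines():
        index, _, value = line.partition(",")
        try:
            watts = float(value)
        except ValueError:
            # Placeholder such as "[Not Supported]" on unsupported cards.
            watts = None
        readings[int(index.strip())] = watts
    return readings


def log_power(path, interval_s=1.0):
    """Append one timestamped line per GPU to `path` until interrupted."""
    with open(path, "a") as log:
        while True:
            out = subprocess.run(QUERY, capture_output=True, text=True,
                                 check=True).stdout
            stamp = datetime.now().isoformat(timespec="seconds")
            for gpu, watts in sorted(parse_power_csv(out).items()):
                log.write(f"{stamp} gpu={gpu} power_w={watts}\n")
            log.flush()
            time.sleep(interval_s)


if __name__ == "__main__":
    log_power("gpu_power.log")  # hypothetical log file name
```

Running the script alongside an AMBER job and stopping it with Ctrl-C leaves one line per GPU per interval in the log file, which can then be compared against the times at which a card failed.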