Re: [AMBER] Monitoring GPU voltage

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 04 Dec 2014 19:06:31 -0800

Hi Kenneth,

I doubt the power draw reported by nvidia-smi, even if it did work, would
be sufficient for this. If you have a failing power supply, or a power
supply that isn't up to the task of running multiple cards (e.g. one not
labelled as gold or platinum), then the most likely issue is going to be
the available current on the various 5 V rails. You'd need to monitor both
the voltage and the current on each of the individual 5 V rails, which
will require your soldering iron.

A lot of cheap power supplies tie the 5 V rails together, so you can
actually end up with two cards on the same 5 V rail. That is out of spec
in terms of the amps that rail can provide, so it might brown out under
load. The best way to check this is probably to pull all the cards from
the case and test them one at a time, trying every combination of the 5 V
PCI-E rails that your power supplies have. That is, if you think this is
really the problem and not, for example, heat due to fans wearing out.
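
As an aside on the logging question: on cards and drivers where nvidia-smi
does report power draw, a running log could be sketched roughly as below.
This is only a sketch under assumptions. It assumes Python 3.7+ and an
nvidia-smi that accepts --query-gpu with the power.draw field; the function
names (parse_power_csv, sample_once, log_power) are made up for
illustration. It records whole-board power only, not per-rail voltage or
current, so it would not settle the rail question above.

```python
import csv
import io
import subprocess
import time

# Assumption: this nvidia-smi build supports --query-gpu and the power.draw field.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,index,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_power_csv(text):
    """Parse nvidia-smi CSV text into (timestamp, gpu_index, watts) tuples.

    watts is None when the power field is not a number, which is how an
    unsupported card is treated here.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        timestamp, index, power = (f.strip() for f in fields)
        try:
            watts = float(power)
        except ValueError:
            watts = None
        rows.append((timestamp, int(index), watts))
    return rows


def sample_once():
    """Run nvidia-smi once and return one parsed reading per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_power_csv(result.stdout)


def log_power(path, interval_s=1.0, samples=60):
    """Append `samples` rounds of per-GPU readings to the CSV file at `path`."""
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for _ in range(samples):
            writer.writerows(sample_once())
            fh.flush()  # keep the log usable if the run is interrupted
            time.sleep(interval_s)
```

Something like log_power("gpu_power.csv", interval_s=1.0, samples=3600)
would then record about an hour of readings while AMBER runs; a card whose
power field is not numeric simply shows up with an empty value.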

All the best
Ross

On 12/4/14, 3:00 PM, "Kenneth Lam" <kenneth.lam.zh.gmail.com> wrote:

>Hi Ross,
>
>We're currently using 4 GTX 780s in each of our machines, and some of them
>are breaking down. We suspect this might be a power issue. We'd like to
>track the power consumption of each of the cards while running AMBER so
>that we know whether or not the cards are failing because of insufficient
>power from the PSU. We're trying to find a solution that will allow
>us to log the power consumption of the GPUs while they're running, and
>an ammeter may not be able to provide us with a running log.
>Thanks!
>
>Kenneth
>
>
>On Thu, Dec 4, 2014 at 5:16 PM, Ross Walker <ross.rosswalker.co.uk> wrote:
>
>> Hi Kenneth,
>>
>> Can I ask why you need to monitor the power consumption of the GPUs? It's
>> not really information that is of much use as far as I can tell. If you
>> want to know the power consumption, it's probably better just to stick an
>> ammeter on the lead to the power supply.
>>
>> All the best
>> Ross
>>
>>
>> On 12/4/14, 1:24 PM, "Kenneth Lam" <kenneth.lam.zh.gmail.com> wrote:
>>
>> >Unfortunately, the fixes mentioned in the linked threads point to the same
>> >GitHub repository that Tru indicated. Users there have noted that this fix
>> >does not work past 331.20, as pointed out here in this thread:
>> >https://github.com/CFSworks/nvml_fix/issues/5
>> >
>> >We've tried going through nvidia-settings, but it does not report the
>> >voltage. Are there any other alternatives available? Thanks again!
>> >
>> >Kenneth
>> >
>> >On Thu, Dec 4, 2014 at 4:14 PM, Jason Swails <jason.swails.gmail.com>
>> >wrote:
>> >
>> >>
>> >> > On Dec 4, 2014, at 3:08 PM, Kenneth Lam <kenneth.lam.zh.gmail.com>
>> >> wrote:
>> >> >
>> >> > Hello all,
>> >> >
>> >> > We're unable to monitor the voltage going to our GTX 680s and 780s. We
>> >> > have been trying to use nvidia-smi to do so, but it does not support any
>> >> > cards past the GTX 500 series. Is there recommended software that works
>> >> > with current-gen GPUs (GTX 680+) on Linux, or should this be done
>> >> > at the hardware level? If so, what would you recommend? Thanks!
>> >>
>> >> This has been discussed in the nVidia forums before, with some fixes to
>> >> NVML being proposed (really more of a band-aid). You can try the fixes
>> >> discussed there (may be outdated). Alternatively, I think you can also get
>> >> that information in nvidia-settings.
>> >>
>> >> https://devtalk.nvidia.com/default/topic/560248/system-management-and-monitoring-nvml-/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported-/
>> >>
>> >> These statistics *are* reported for the Tesla line, so I'm not sure if
>> >> this is a marketing move that nVidia is using to promote their HPC line or
>> >> what (but according to the above thread, such reporting _is_ supported in
>> >> hardware for those cards).
>> >>
>> >> HTH,
>> >> Jason
>> >>
>> >> --
>> >> Jason M. Swails
>> >> BioMaPS,
>> >> Rutgers University
>> >> Postdoctoral Researcher
>> >>
>> >> _______________________________________________
>> >> AMBER mailing list
>> >> AMBER.ambermd.org
>> >> http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Thu Dec 04 2014 - 19:30:03 PST