Re: [AMBER] GPU kernel error

From: Dmitry Mukha <dvmukha.gmail.com>
Date: Sun, 17 Jun 2012 18:56:11 +0300

The system is an HP Z800 workstation that was upgraded with the Tesla card
'by hand'. Certainly, the chassis (minitower) is not engineered to cool a
PCI-E card that needs an additional fan. Could a video card cooler solve the
problem if it is properly installed?
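
To check whether the card really overheats before the machine locks up, its
temperature could be logged while a job runs. Below is a minimal sketch, not
something I have run on this machine. It assumes nvidia-smi is on the PATH and
that the installed driver supports the --query-gpu option (current drivers do,
older ones may not). The 85 C warning threshold and the function names are
placeholders of my own, not NVIDIA specifications.

```python
import subprocess
import time

# Assumed query: supported by current nvidia-smi, may be missing in old drivers.
QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"]

# Hypothetical warning threshold in degrees C; check the card's datasheet.
WARN_C = 85


def parse_temps(text):
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_C}."""
    temps = {}
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            temps[int(parts[0])] = int(parts[1])
        except ValueError:
            # Non-numeric field, e.g. "[Not Supported]"; skip this GPU.
            continue
    return temps


def hot_gpus(temps, limit=WARN_C):
    """Return the sorted indices of GPUs at or above the limit."""
    return sorted(i for i, t in temps.items() if t >= limit)


def watch(interval_s=5):
    """Print a timestamped temperature line every interval_s seconds."""
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        temps = parse_temps(out)
        # flush=True so the last reading reaches the log before a lockup.
        print(time.strftime("%H:%M:%S"), temps, "HOT:", hot_gpus(temps),
              flush=True)
        time.sleep(interval_s)
```

Calling watch() in a second terminal, with the output redirected to a file,
while pmemd.cuda runs would show whether the temperature climbs steadily
before the hang.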

2012/6/17 Ross Walker <ross.rosswalker.co.uk>

> Hi Dmitry,
>
> I can almost guarantee that this is caused by inadequate cooling of your
> M2090 card. The fact that it locks the machine up is key. I have seen
> exactly the same issue with test cards I have in homemade machines. The
> M2090s have to be in a properly ducted chassis designed specifically for
> them. I would suggest both sending the card back under warranty and
> contacting the manufacturer of your machine to find out if their cooling
> system is properly certified for M2090 cards. If it is, then you might want
> to check for malfunctioning fans etc., but most of these 1U or 2U server
> chassis for M2090s have full fan monitoring and alarms, so if there is a fan
> issue it should already be alarming.
>
> Out of interest, what are the exact specs of your system? Who makes it? If
> you let me know the model and the manufacturer I can try to escalate within
> NVIDIA.
>
> All the best
> Ross
>
> > -----Original Message-----
> > From: Dmitry Mukha [mailto:dvmukha.gmail.com]
> > Sent: Sunday, June 17, 2012 5:54 AM
> > To: AMBER Mailing List
> > Subject: Re: [AMBER] GPU kernel error
> >
> > Hi Ross!
> >
> > What do you think, could it be connected with some BIOS settings? I ran
> > into a similar problem with an M2090. In /var/log/messages I found
> > something like this right before the 'stuck' error (this example was
> > taken from a forum, but the text of the messages was the same):
> >
> > Message from syslogd.phoenix at Apr 20 13:05:41 ...
> > kernel:[ 4787.436095] Do you have a strange power saving mode enabled?
> >
> > Message from syslogd.phoenix at Apr 20 13:05:41 ...
> > kernel:[ 4787.436104] Dazed and confused, but trying to continue
> >
> > The system is Fedora 16, kernel 3.3.6-3.fc16.x86_64. The error is
> > reproduced erratically, although all tests ran fine. The PC goes mad and
> > needs to be rebooted manually because 'shutdown -r now' doesn't work.
> >
> > 2012/6/16 Ross Walker <ross.rosswalker.co.uk>
> > >
> > > Hi Fernando,
> > >
> > > Is this the only output you get? Is there anything in the mdout file?
> > > Also, is it reproducible, as in does it always occur at the same step
> > > every time? What about if you try a slightly different system or
> > > simulation parameters?
> > >
> > --
> > Sincerely,
> > Dmitry Mukha
> > Institute of Bioorganic Chemistry, NAS, Minsk, Belarus
> > _______________________________________________
> > AMBER mailing list
> > AMBER.ambermd.org
> > http://lists.ambermd.org/mailman/listinfo/amber
>
>
>



-- 
Sincerely,
Dmitry Mukha
Institute of Bioorganic Chemistry, NAS, Minsk, Belarus
Received on Sun Jun 17 2012 - 09:00:02 PDT