Fabian -
Well, you can never be certain that you have not hit some simulation
condition that fails because of a compilation problem or a bug in the code,
but the longer you run without problems, the less likely that seems, and
bugs in the code or problems with the simulation setup itself are a lot
less likely given that you have successful runs on the Opteron system. Does
your simulation on the problem machine match the simulation on the Opteron
system (the one that runs without problems) for about 300 steps? It
typically should, though depending on the energetics of your starting
system and the options you have chosen, you may get fewer steps with
identical results. Does pmemd pass the Amber 9 test suite on the problem
cluster? Would the factor IX benchmark produce essentially identical
results for about 300 steps? These would be my first checks if I suspected
a compiler problem (I really don't in your case, given that you have run
something like 150,000 steps without problems, but it never hurts to check).
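As a rough illustration only (this script is not part of the Amber
distribution; the file names and the "Etot =" pattern are just assumptions
about the usual pmemd mdout layout), something along these lines can be
used to diff the energies printed by the two runs:

#!/usr/bin/env python
# Sketch: compare the Etot values printed in two mdout files for the first
# few hundred records.  The "Etot =" regex is an assumption about the usual
# sander/pmemd output format; adjust it if your mdout looks different.
import re
import sys

def etot_values(path, max_records=300):
    """Collect up to max_records Etot values from an mdout file."""
    values = []
    with open(path) as f:
        for line in f:
            m = re.search(r"Etot\s*=\s*(-?\d+\.\d+)", line)
            if m:
                values.append(float(m.group(1)))
                if len(values) >= max_records:
                    break
    return values

def main(file_a, file_b):
    a = etot_values(file_a)
    b = etot_values(file_b)
    n = min(len(a), len(b))
    if n == 0:
        print("no Etot records found")
        return
    worst = max(abs(x - y) for x, y in zip(a[:n], b[:n]))
    print("compared %d records, largest |Etot| difference = %g" % (n, worst))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

If the two runs diverge well before 300 steps, that would be the place to
start digging.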
So my next point would be to concentrate on how reliable the interconnect
is. Presumably your system guys have ways to check out the machine in this
regard.
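If it helps to exercise the interconnect directly, here is a minimal sketch
(assuming mpi4py and numpy are available on the cluster; neither is part of
Amber) that just bounces a buffer between rank 0 and every other rank, so a
flaky Infiniband link tends to show up as a hang or an MPI error:

from mpi4py import MPI
import numpy as np

# Ping-pong stress test sketch: rank 0 exchanges an ~8 MB buffer with every
# other rank, many times over.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

buf = np.ones(1 << 20, dtype=np.float64)
iterations = 1000

for it in range(iterations):
    if rank == 0:
        for peer in range(1, size):
            comm.Send(buf, dest=peer, tag=0)
            comm.Recv(buf, source=peer, tag=0)
    else:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
    if rank == 0 and it % 100 == 0:
        print("iteration %d ok" % it)

Launching it with the same mpirun/mpiexec setup you use for pmemd makes
sure it goes over the same Infiniband stack.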
Regards - Bob
----- Original Message -----
From: "Fabian Boes" <fabian.boes.itb.uni-stuttgart.de>
To: <amber.scripps.edu>
Sent: Thursday, June 29, 2006 1:53 PM
Subject: Re: AMBER: Strange problems with PMEMD on Intel Xeons with
Infiniband
> Dear Bob,
>
>> Are you sure you are not exceeding a file system size limit (say 2GB) for
>> mdcrd or mdvel? Possibly file system quotas? Possibly full file
>> systems? It would seem it almost has to be the file system, somehow,
>> though I
>
> The mdcrd should not get bigger than 180 MB within 50k steps, as it is a
> rather small system. Some of the filesystems I used had over 15 TB of free
> space, but I will check for quotas.
>
>becomes unreachable in the middle of a run, everything should freeze up as
>you observe. I have mostly seen this kind of grief with Myrinet, but
>commodity clusters in general are not immune. We had a lot of trouble at
>UNC for a while with our big Infiniband P4 cluster with nodes becoming
>unreachable; I have no idea whether it was hardware burn-in problems,
>config problems, loose
>
> So, if PMEMD runs for some, let's say 5000, steps with reasonable speed
> over Infiniband, then everything should be OK with the compilation?
>
> I will ask the guys at our computing center if they can help me debug
> the Infiniband cards and/or connections.
>
> Thanks for your answer,
>
> Fabian
>
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu