Re: [AMBER] sander.MPI problem

From: Gustavo Seabra <gustavo.seabra.gmail.com>
Date: Fri, 21 May 2010 12:28:27 -0300

Hi Peker,

Thanks a lot. I'll be waiting to hear whether it works for you.

The problem with fixing this bug is that, as happened when I first
reported it on the dev list, people just can't reproduce it (see the
thread here: http://dev-archive.ambermd.org/201005/0002.html). Apart
from a somewhat similar report from Lachele Foley (but with the
stalling always happening at the same point), no one else saw anything
similar. Some people there using the same system didn't see the
errors, so it becomes really hard to track down. And it may really be
an OpenMPI bug rather than Amber's, since something similar has been
reported before by other MPI users, in programs other than Amber...

Cheers,
Gustavo.

On Fri, May 21, 2010 at 9:47 AM, peker milas <pekermilas.gmail.com> wrote:
> Thank you Gustavo,
>
> Actually, this is not a new problem; I ran into it before (about 3-4
> months ago). I thought I had fixed it, but apparently it has shown up
> again. The unfortunate thing is that, after weeks of debugging, my
> collaborator and I concluded that sander.MPI may have a race-condition
> type bug. Of course that is very hard to pin down, and we couldn't
> verify whether it really was a race condition. We sent an email to
> this group and nobody answered. Anyway, I have installed OpenMPI 1.4.1
> and 1.4.2 locally and am currently re-configuring Amber for them. If I
> can find the bug this time, or if I can fix it with those new versions
> of OpenMPI, I will let you know. The very unfortunate and frustrating
> thing is that it seems nobody wants to know about this bug, and nobody
> wants to fix it.
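>
> For reference, the local builds were done roughly as follows (the
> install prefix is just an example, and 1.4.2 stands for whichever
> tarball is being built):
>
>   # build and install OpenMPI under a private prefix
>   tar xjf openmpi-1.4.2.tar.bz2
>   cd openmpi-1.4.2
>   ./configure --prefix=$HOME/opt/openmpi-1.4.2
>   make -j4 && make install
>
> Amber's parallel build is then pointed at the mpif90/mpicc wrappers
> under that prefix instead of the system-wide ones.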
>
> thank you so much again
> peker milas
>
> On Thu, May 20, 2010 at 4:35 PM, Gustavo Seabra
> <gustavo.seabra.gmail.com> wrote:
>> Hi Peker,
>>
>> I have experienced exactly the same symptoms. It appears to be related
>> to a bug involving OpenMPI and Ubuntu 9.10, as described here:
>>
>> https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/504659
>>
>> Apparently, OpenMPI v1.4.1 works, but that's not what's available from
>> apt-get in Ubuntu, I believe. I haven't tried installing OpenMPI 1.4.1
>> yet.
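>>
>> A quick way to check which Open MPI build is actually being picked up
>> (both version and location) is something like:
>>
>>   ompi_info | grep "Open MPI:"    # reports the version in use
>>   which mpirun mpif90             # shows which wrappers are on PATH
>>
>> If those still point at the distribution's package rather than a
>> newer local build, the Launchpad bug above may still apply.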
>>
>> HTH,
>> --
>> Gustavo Seabra
>> Professor Adjunto
>> Departamento de Química Fundamental
>> Universidade Federal de Pernambuco
>> Fone: +55-81-2126-7417
>>
>>
>>
>> On Thu, May 20, 2010 at 5:22 PM, peker milas <pekermilas.gmail.com> wrote:
>>> Thank you so much for your response,
>>>
>>> here is the detailed information;
>>>
>>> Hardware: two Intel Nehalem processors, 8 physical / 16 logical cores in total
>>> OS: Ubuntu 9.10 (Karmic Koala) with gcc 4.4.1 and gfortran
>>> Parallelization: OpenMPI 1.4
>>>
>>> I applied all bugfixes. The AmberTools installation and tests were
>>> fine, with just two minor failures. The Amber serial installation was
>>> also fine, with just four round-off type failures. As already
>>> discussed at different times, I made /bin/sh a symbolic link to
>>> /bin/bash. The parallel installation went fine as well. With 2
>>> processors, the parallel tests gave me the same failures as the
>>> serial tests. After that I tried 4 and 8 processors, and
>>> unfortunately in both cases the tests stalled, each time at a
>>> different test: the results are completely non-reproducible. Once
>>> they stalled at Run.cap, the next time at Run.tip4p_nve, another time
>>> at Run.dip, and so on... I have two locally installed OpenMPI
>>> versions, 1.4 and 1.2.8, whose bin and lib folders I add to PATH and
>>> LD_LIBRARY_PATH manually in my .bashrc file; I tried both, and
>>> nothing changed. I was also using PCGAMESS in parallel mode before,
>>> and even with 8 processors it worked just fine. As a last piece of
>>> information, all of the stalled processes mentioned above belong to
>>> sander.MPI. One last thing: I skipped all the PIMD tests because I
>>> will not be using them; everything I explained above refers to all
>>> the other tests.
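>>>
>>> For completeness, the relevant .bashrc lines look roughly like the
>>> following (the install prefix is just an example, and $AMBERHOME is
>>> assumed to point at the Amber10 tree, with executables in exe/):
>>>
>>>   # select one of the locally installed OpenMPI builds
>>>   export PATH=$HOME/opt/openmpi-1.4/bin:$PATH
>>>   export LD_LIBRARY_PATH=$HOME/opt/openmpi-1.4/lib:$LD_LIBRARY_PATH
>>>   # confirm that sander.MPI and mpirun resolve against that build
>>>   which mpirun
>>>   ldd $AMBERHOME/exe/sander.MPI | grep -i mpi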
>>>
>>> thank you so much
>>> peker milas
>>>
>>> On Thu, May 20, 2010 at 3:56 PM, Jason Swails <jason.swails.gmail.com> wrote:
>>>> Hello,
>>>>
>>>> This doesn't provide much information helpful for debugging.  What are your
>>>> system specs (OS, compilers, etc.)?  What test is it specifically failing on
>>>> (or stalling on)?  Did the serial tests pass?  Have you applied all bug
>>>> fixes?  The more details we have regarding system setup, the better chance
>>>> someone will be able to help.
>>>>
>>>> All the best,
>>>> Jason
>>>>
>>>> On Thu, May 20, 2010 at 3:19 PM, peker milas <pekermilas.gmail.com> wrote:
>>>>
>>>>> Dear Amber user and developers,
>>>>>
>>>>> My parallel Amber10 installation (with OpenMPI 1.4) has a strange
>>>>> problem. Let me try to explain it briefly: if I run the parallel
>>>>> tests with only 2 processors (mpirun -np 2), everything goes fine
>>>>> except for a couple of failures. If I run them with 4 or more
>>>>> processors (e.g. mpirun -np 4), they stall at an arbitrary test. My
>>>>> computer has 8 physical CPUs and shared-memory parallelization. I
>>>>> have used it for other programs and there was no problem. So I
>>>>> really need help with this number-of-processors issue, and any help
>>>>> will be greatly appreciated.
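>>>>>
>>>>> For the record, the parallel tests are run in the usual way, roughly
>>>>> as below (the make target name is from memory; the test Makefile
>>>>> lists the exact ones):
>>>>>
>>>>>   export DO_PARALLEL='mpirun -np 4'
>>>>>   cd $AMBERHOME/test
>>>>>   make test.parallel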
>>>>>
>>>>> thank you so much
>>>>> peker milas
>>



-- 
Gustavo Seabra
Professor Adjunto
Departamento de Química Fundamental
Universidade Federal de Pernambuco
Fone: +55-81-2126-7417
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri May 21 2010 - 08:30:03 PDT