Re: [AMBER] sander.MPI problem

From: peker milas <pekermilas.gmail.com>
Date: Fri, 21 May 2010 15:30:29 -0400

Hello Gustavo,

I have some good news and some bad news. I will start with the bad news.
From my past installation experience and from the results I have just
obtained, with any OpenMPI version from 1.3.8 through 1.4.2 the stall can
appear anywhere among the tests. It looks like it is not only an OpenMPI
issue but also an Ubuntu Karmic Koala issue. Unfortunately, none of the
versions listed above works. Now the good news: I installed MPICH2 from
the Ubuntu repositories and then configured Amber with it, and it works
fine. The only trick in the configuration was that, after installation,
MPICH2 automatically created a ".mpd.conf" file containing an extra
"MPD_USE_ROOT_MPD=1" line (the second line from the top). I commented
that line out and then started mpd. After that I built the parallel
version of Amber.
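
In case it helps anyone who runs into the same thing, the .mpd.conf change
amounts to something like the following (a rough sketch of what I did, not
an exact transcript):

    # comment out the line the mpich2 package adds to ~/.mpd.conf
    sed -i 's/^MPD_USE_ROOT_MPD=1/#MPD_USE_ROOT_MPD=1/' ~/.mpd.conf
    chmod 600 ~/.mpd.conf   # mpd insists this file is not world-readable
    mpd &                   # start the MPD process manager
    mpdtrace                # verify it is up before running sander.MPI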

There is one more thing I need to mention. We are fairly sure we
previously found another race condition in at least one of the PIMD tests
(specifically, cd PIMD/full_cmd_water/equilib && ./Run.full_cmd), so I
skipped those tests when I ran the suite this time; it looks like that one
would fail again. Anyway, the others are fine and reproducible (both the
successes and the failures). I would like to thank you for your help
again...

best
peker milas

On Fri, May 21, 2010 at 11:28 AM, Gustavo Seabra
<gustavo.seabra.gmail.com> wrote:
> Hi Peker,
>
> Thanks a lot. I'll be waiting to know if it works for you.
>
> The problem with fixing this bug is that, as it happened when I first
> reported it on the dev list, people just can't reproduce it (see the
> thread here: http://dev-archive.ambermd.org/201005/0002.html). Apart
> from a somewhat similar report by Lachele Foley (but with the stall
> always happening at the same point), no one else has seen anything
> similar. Some people there using the same system didn't see the
> errors, so it becomes really hard to find. And, it may really be an
> OpenMPI bug, not Amber's, since something similar has been reported
> before by other MPI users, in programs other than Amber...
>
> Cheers,
> Gustavo.
>
> On Fri, May 21, 2010 at 9:47 AM, peker milas <pekermilas.gmail.com> wrote:
>> Thank you Gustavo,
>>
>> Actually this is not a new problem; I ran into it before (about 3-4
>> months ago). I thought I had fixed it, but apparently it has shown up
>> again. The unfortunate thing is that, after weeks of debugging, my
>> collaborator and I decided that sander.MPI may have a race-condition-type
>> bug. Of course that is very hard to pin down, and we could not verify
>> whether it really was a race condition. We sent an email to this group
>> and nobody answered. Anyway, I have installed OpenMPI 1.4.1 and 1.4.2
>> locally and am currently reconfiguring Amber against them. If I can find
>> the bug this time, or fix it with those newer OpenMPI versions, I will
>> let you know. The very unfortunate and frustrating thing is that it looks
>> like nobody wants to know about this bug and nobody wants to fix it.
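>>
>> For reference, this is roughly how I am building the local copies (the
>> tarball name and install prefix are just examples, so adjust to taste):
>>
>>     # build OpenMPI into a private prefix so it does not clash with the
>>     # system packages
>>     tar xjf openmpi-1.4.2.tar.bz2
>>     cd openmpi-1.4.2
>>     ./configure --prefix=$HOME/opt/openmpi-1.4.2
>>     make -j4 all && make install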
>>
>> thank you so much again
>> peker milas
>>
>> On Thu, May 20, 2010 at 4:35 PM, Gustavo Seabra
>> <gustavo.seabra.gmail.com> wrote:
>>> Hi Peker,
>>>
>>> I have experienced exactly the same symptoms. It appears to be related
>>> to a bug involving OpenMPI and Ubuntu 9.10, as described here:
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/504659
>>>
>>> Apparently, OpenMPI v1.4.1 works, but that is not what is available from
>>> apt-get in Ubuntu, I believe. I haven't tried installing OpenMPI 1.4.1
>>> yet.
>>>
>>> HTH,
>>> --
>>> Gustavo Seabra
>>> Professor Adjunto
>>> Departamento de Química Fundamental
>>> Universidade Federal de Pernambuco
>>> Fone: +55-81-2126-7417
>>>
>>>
>>>
>>> On Thu, May 20, 2010 at 5:22 PM, peker milas <pekermilas.gmail.com> wrote:
>>>> Thank you so much for your response,
>>>>
>>>> Here is the detailed information:
>>>>
>>>> Hardware: two Intel Nehalem processors (8 physical / 16 logical cores in total)
>>>> OS: Ubuntu 9.10 (Karmic Koala) with gcc 4.4.1 and gfortran
>>>> Parallelization: OpenMPI 1.4
>>>>
>>>> I applied all bugfixes. The AmberTools installation and tests were fine,
>>>> with just two minor failures. The serial Amber installation was also
>>>> fine, with just 4 round-off-type failures. As already discussed here at
>>>> different times, I made /bin/sh a symbolic link to /bin/bash. The
>>>> parallel installation was fine as well. With 2 processors the parallel
>>>> tests gave me the same failures as the serial tests. After that I tried
>>>> 4 and 8 processors; unfortunately, in both cases the tests stalled, and
>>>> at a different place each time. In other words, the results are not
>>>> reproducible at all: once they stalled at Run.cap, the next time at
>>>> Run.tip4p_nve, another time at Run.dip, and so on... I have two locally
>>>> installed OpenMPI versions, 1.4 and 1.2.8, whose bin and lib directories
>>>> I added to PATH and LD_LIBRARY_PATH manually in my .bashrc file (roughly
>>>> as in the sketch below); I tried both, and nothing changed. Also, I have
>>>> used PCGAMESS in parallel mode before, and even with 8 processors it
>>>> worked just fine. As a last piece of information, in all of the cases
>>>> above the stalled processes belong to sander.MPI. One last thing: I
>>>> cancelled all of the PIMD tests because I will not be using them; what I
>>>> described above applies to all of the other tests.
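>>>>
>>>> For completeness, the relevant pieces look roughly like this (the install
>>>> prefix is just an example from memory):
>>>>
>>>>     # make /bin/sh point to bash instead of dash
>>>>     sudo ln -sf /bin/bash /bin/sh
>>>>
>>>>     # in ~/.bashrc: pick one of the local OpenMPI installs
>>>>     export MPI_HOME=$HOME/opt/openmpi-1.4
>>>>     export PATH=$MPI_HOME/bin:$PATH
>>>>     export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH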
>>>>
>>>> thank you so much
>>>> peker milas
>>>>
>>>> On Thu, May 20, 2010 at 3:56 PM, Jason Swails <jason.swails.gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> This doesn't provide much information helpful for debugging.  What are your
>>>>> system specs (OS, compilers, etc.)?  What test is it specifically failing on
>>>>> (or stalling on)?  Did the serial tests pass?  Have you applied all bug
>>>>> fixes?  The more details we have regarding system setup, the better chance
>>>>> someone will be able to help.
>>>>>
>>>>> All the best,
>>>>> Jason
>>>>>
>>>>> On Thu, May 20, 2010 at 3:19 PM, peker milas <pekermilas.gmail.com> wrote:
>>>>>
>>>>>> Dear Amber users and developers,
>>>>>>
>>>>>> My parallel Amber 10 installation (built with OpenMPI 1.4) has a strange
>>>>>> problem. Let me try to explain it briefly: if I run the parallel tests
>>>>>> with only 2 processors (mpirun -np 2), everything goes fine except for a
>>>>>> couple of failures. If I run them with 4 or more processors (mpirun -np
>>>>>> 4), they stall in an arbitrary test. My computer has 8 physical CPUs
>>>>>> with shared memory, and I have used it for other parallel programs
>>>>>> without any problem. So I really need help with this processor-count
>>>>>> issue, and any help will be greatly appreciated.
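>>>>>>
>>>>>> For reference, I am running the test suite roughly like this (I believe
>>>>>> test.parallel is the right make target, but please correct me if not):
>>>>>>
>>>>>>     # tell the Amber test scripts how to launch parallel jobs
>>>>>>     export DO_PARALLEL="mpirun -np 4"   # also tried -np 2 and -np 8
>>>>>>     cd $AMBERHOME/test
>>>>>>     make test.parallel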
>>>>>>
>>>>>> thank you so much
>>>>>> peker milas
>>>
>>
>
>
>
> --
> Gustavo Seabra
> Professor Adjunto
> Departamento de Química Fundamental
> Universidade Federal de Pernambuco
> Fone: +55-81-2126-7417
>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Fri May 21 2010 - 13:00:03 PDT