Re: AMBER: pmemd source issues

From: Robert Duke <rduke.email.unc.edu>
Date: Wed, 9 May 2007 11:43:36 -0400

Hi Thomas -
Okay, I'll wait for the test case, but I think I can guess what is going on
here. When a process ends up with 0 images that it processes, processing is
prevented by effectively setting my_img_lo to my_img_hi + 1. However,
find_img_range() does not take this into account. It was written/debugged
prior to hitting such a scenario, and it has just not come up previously.
The way this could happen would be if the task with the highest id also has
an FFT slab allocation, and we are running under conditions where its
direct force processing has been pushed to 0. It is interesting that this
has not bitten us before; I'll probably just issue a patch that does image
range finding a bit differently for this scenario, but would appreciate
getting the test case so I can make sure everything is cool. Thanks much
for reporting this...
Regards - Bob
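
[Editor's sketch] The empty-range convention described above can be illustrated with a small standalone Fortran program. This is a minimal sketch under stated assumptions, not the pmemd code and not the eventual patch: only the names my_img_lo, my_img_hi, img_cnt and img_atm_map come from the thread, while the program name, array contents and messages are hypothetical. It shows why any direct probe of img_atm_map(my_img_lo) needs a guard when the owning task has no images.

```fortran
! Hypothetical sketch, not the pmemd source. Only my_img_lo, my_img_hi,
! img_cnt and img_atm_map are names from the thread; the rest is assumed.
program empty_img_range
  implicit none
  integer, parameter :: img_cnt = 8
  integer :: img_atm_map(img_cnt)
  integer :: my_img_lo, my_img_hi

  img_atm_map = (/ 1, -2, 3, 4, -5, 6, 7, 8 /)

  ! Highest-id task with no direct force work: the empty range is
  ! encoded as lo = hi + 1, so lo is one past the end of the array.
  my_img_hi = img_cnt
  my_img_lo = my_img_hi + 1

  if (my_img_lo .gt. my_img_hi) then
    ! Empty range: img_atm_map(my_img_lo) must not be touched.
    print *, 'task owns no images, range search skipped'
  else if (img_atm_map(my_img_lo) .lt. 0) then
    print *, 'first owned image has a negative map entry'
  else
    print *, 'first owned image has a non-negative map entry'
  end if
end program empty_img_range
```

With my_img_hi equal to img_cnt, the empty-range encoding gives my_img_lo = img_cnt + 1, which matches the reported values (my_img_lo=193534, img_cnt=193533). A do loop over such a range runs zero times on its own, so the guard matters only for accesses made outside a loop.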

----- Original Message -----
From: "Thomas Zeiser" <thomas.zeiser.rrze.uni-erlangen.de>
To: <amber.scripps.edu>
Sent: Wednesday, May 09, 2007 11:00 AM
Subject: Re: AMBER: pmemd source issues


> Hi Robert,
>
> On Wed, May 09, 2007 at 09:37:59AM -0400, Robert Duke wrote:
>> Hi Thomas -
>>
>> Issue 1):
>>
>> I recently had the same question in my mind regarding some new code I am
>> developing. None of the various ways I can think of to avoid this is
>> particularly elegant, and they all further obfuscate the code (which is
>> already fairly obfuscated, unfortunately). I tried looking at the
>> standards to determine whether this is allowed - I could not find a
>> definitive statement, and I have never seen a compiler refuse to compile
>> code that does this.
>
> The compiler did not refuse to compile it, but "-check all" raised
> a runtime error reporting that an unallocated variable was referenced.
>
>> So here's the deal. A perfectly good test case on something like this
>> is the allocated() intrinsic. If you can pass stuff in and check its
>> allocation status, that basically says this is okay - and I think you
>> can.
>
>> Something to remember about using extensive checks is that
>> the checking code itself often does not get everything right.
>
> I fully agree on that - and I only activated it to track down some
> other issues we have in the MPI start up (probably not an Amber
> issue but somewhere between [Intel-MPI], mpiexec and torque - but
> that's a different story.)
>
>> Anyway, I will
>> play with this issue some to ensure there is no real problem here, but I
>> think there is not (access to igrp(*) is guarded by the
>> ibelly flag, essentially).
>
> correct, the access itself is guarded, so it's more or less only
> cosmetic.
>
>> Issue 2):
>>
>> There is extensive code in loadbal to guard against just this sort of
>> thing. Did you observe an out-of-range value for my_img_lo in an actual
>> run?
>
> yes, it's real: it occurs in a user's production case
>
>> Please send me your config.h and your test case as well as the output
>> of ifort -V and an exact description of what happened, and I will look
>> into it further.
>
> my last config.h is attached; the MPI version probably does not
> matter (I used mvapich2 in the last tests and Intel-MPI before)
>
> the code was compiled with
> ifort -V
> Intel(R) Fortran Compiler for Intel(R) EM64T-based applications,
> Version 9.1 Build 20070320 Package ID: l_fc_c_9.1.045
> (and 10.0.017beta gives exactly the same result).
>
> If you apply a trivial patch (attached) you do not rely on the
> compiler to do range checking.
>
> I'm not yet sure how much the boundary violation depends on the
> number of MPI processes; at least using
> mpirun -np 128 ./pmemd -i bench_1jv2.in -p box_neutral.top -c bench_1jv2.crd
> results in
> my_img_lo=193534 and img_cnt=193533
> (at least on some processes or calls of the routine). Running with
> 64 processes gives exactly the same result. Running with only 16
> processes does not trigger it.
>
> Concerning the input files for the testcase: I have to check with
> the user who provided them to me. I hope to be able to post a
> download link soon.
>
>> This is in the category of something that is possible, but the
>> code has already taken the issue into consideration and been extensively
>> tested at very high processor count where this sort of thing would be
>> likely to happen (the basic problem is in dividing up the image
>> workload - you have to be sure that it sums exactly to img_cnt, and if
>> it doesn't, all sorts of pandemonium would be expected).
>
> sounds reasonable.
>
>> Regards - Bob Duke
>
> Regards,
>
> thomas
>
>> ----- Original Message -----
>> From: "Thomas Zeiser" <thomas.zeiser.rrze.uni-erlangen.de>
>> To: <amber.scripps.edu>
>> Sent: Wednesday, May 09, 2007 7:33 AM
>> Subject: AMBER: pmemd source issues
>>
>>
>> >Dear All,
>> >
>> >I compiled pmemd9 (including Amber9 patches 1-34) using the latest
>> >Intel EM64T compiler and enabled extensive runtime error checking
>> >(-g -traceback -check all). Two types of issues came up:
>> >
>> >1) constraints.f90 only allocates "atm_igroup" if "ibelly" is set.
>> >degcnt() is called from runmd.f90 and "atm_igroup" is passed in all
>> >cases. The Intel compiler now complains (if ibelly is not set)
>> >that "integer :: igrp(*)" is not allowed as an unallocated variable
>> >is accessed.
>> >An "allocate(atm_igroup(0))" in constraints.f90 solves this issue.
>> >
>> >A similar behaviour is observed for "gbl_loadbal_node_dat" which
>> >gets only allocated on the master process (alltasks_setup.f90).
>> >
>> >I did not check whether the Fortran standard allows using
>> >"type :: var(*)" here (I guess "no" as an unallocated variable does
>> >not have any defined ranges which can be used for the assumed size) -
>> >but passing a valid variable seems to be a good idea anyway.
>> >
>> >
>> >2) The probably more severe issue was detected in find_img_range()
>> >from img.f90. At least for the testcase I got from our chemistry
>> >people, "my_img_lo" is one unit larger than "img_cnt", thus, the
>> >check "img_atm_map(img_i) .lt. 0" causes an array bound violation.
>> >No idea about the implications of that (or a correct fix).
>> >
>> >
>> >Kind regards,
>> >
>> >Thomas Zeiser
>
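
[Editor's sketch] The workaround proposed for issue 1 in the quoted thread, allocating a zero-length array so that a valid variable is always passed, can be sketched as below. This is a hedged illustration, not the pmemd source: atm_igroup, igrp(*) and ibelly are names from the thread, while the module, the count_belly function (a hypothetical stand-in for degcnt()) and the natom value are assumptions made for the example.

```fortran
! Hypothetical sketch, not the pmemd source. atm_igroup, igrp and ibelly
! are names from the thread; count_belly is an assumed stand-in for degcnt().
module degcnt_sketch
  implicit none
contains
  integer function count_belly(ibelly, natom, igrp)
    integer, intent(in) :: ibelly, natom
    integer, intent(in) :: igrp(*)     ! assumed-size dummy argument
    integer :: i
    count_belly = natom
    if (ibelly .ne. 0) then            ! igrp is only read when ibelly is set
      count_belly = 0
      do i = 1, natom
        if (igrp(i) .gt. 0) count_belly = count_belly + 1
      end do
    end if
  end function count_belly
end module degcnt_sketch

program zero_size_alloc
  use degcnt_sketch
  implicit none
  integer, allocatable :: atm_igroup(:)
  integer :: ibelly, natom

  ibelly = 0
  natom = 10
  if (ibelly .ne. 0) then
    allocate(atm_igroup(natom))
    atm_igroup = 1
  else
    allocate(atm_igroup(0))            ! zero-size, but allocated
  end if

  print *, 'allocated: ', allocated(atm_igroup)
  print *, 'count: ', count_belly(ibelly, natom, atm_igroup)
  deallocate(atm_igroup)
end program zero_size_alloc
```

Because the array is always allocated before the call, the actual argument is valid whether or not ibelly is set, so the question of whether an unallocated array may be passed to an assumed-size dummy does not arise.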


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu
Received on Sun May 13 2007 - 06:07:11 PDT