On Wed, May 09, 2007 at 11:43:36AM -0400, Robert Duke wrote:
> Hi Thomas -
> Okay, I'll wait for the test case, but I think I can guess what is going on

You can download the test case (about 8 MB) from
http://www.rrze.de/~unrz143/bench-1jv2.zip

Regards,
thomas

> here. When a process ends up with 0 images that it processes, processing
> is prevented by effectively setting my_img_lo to my_img_hi + 1. However,
> find_img_range() does not take this into account. It was written/debugged
> prior to hitting such a scenario, and it has just not come up previously.
> The way this could happen would be if the task with the highest id also has
> an fft slab allocation, and we are running under conditions where its
> direct force processing has been pushed to 0. It is interesting that this
> has not bitten us before; I'll probably just issue a patch that does image
> range finding a bit differently for this scenario, but would appreciate
> getting the test case so I can make sure everything is cool. Thanks much
> for reporting this...
> Regards - Bob
>
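
The zero-image case described above can be sketched in a few lines of
Fortran. This is a hypothetical illustration, not the pmemd source:
first_img_flagged is an invented routine, and only the names img_atm_map,
my_img_lo and my_img_hi come from the thread. It assumes the bounds
violation comes from a direct lookup at my_img_lo, which is what the
reported check "img_atm_map(img_i) .lt. 0" suggests.

```fortran
module img_range_sketch
  implicit none
contains
  ! Hypothetical helper: .true. if the first image owned by this task is
  ! flagged with a negative map entry. An empty range owns no images.
  logical function first_img_flagged(img_atm_map, my_img_lo, my_img_hi)
    integer, intent(in) :: img_atm_map(:)
    integer, intent(in) :: my_img_lo, my_img_hi

    first_img_flagged = .false.
    ! Test the range in its own statement. Fortran does not guarantee
    ! short-circuit evaluation of .and., so combining the range test and
    ! the array lookup in one expression would not be a safe guard.
    if (my_img_lo > my_img_hi) return
    if (my_img_lo < 1 .or. my_img_lo > size(img_atm_map)) return
    first_img_flagged = img_atm_map(my_img_lo) < 0
  end function first_img_flagged
end module img_range_sketch

program demo_img_range
  use img_range_sketch
  implicit none
  integer :: img_atm_map(5)

  img_atm_map = [-1, 2, 3, -4, 5]

  ! Task owning images 4..5: the lookup at index 4 is in range.
  print *, first_img_flagged(img_atm_map, 4, 5)
  ! Task owning no images: lo = hi + 1 = img_cnt + 1, as in the report.
  ! The guard returns before img_atm_map(6) would be touched.
  print *, first_img_flagged(img_atm_map, 6, 5)
end program demo_img_range
```

A loop "do img_i = my_img_lo, my_img_hi" already runs zero times for an
empty range, so only accesses outside such a loop need the explicit guard.
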
> ----- Original Message -----
> From: "Thomas Zeiser" <thomas.zeiser.rrze.uni-erlangen.de>
> To: <amber.scripps.edu>
> Sent: Wednesday, May 09, 2007 11:00 AM
> Subject: Re: AMBER: pmemd source issues
>
>
> >Hi Robert,
> >
> >On Wed, May 09, 2007 at 09:37:59AM -0400, Robert Duke wrote:
> >>Hi Thomas -
> >>
> >>Issue 1):
> >>
> >>I recently had the same question in my mind regarding some new code I am
> >>developing. None of the various ways I can think of to avoid this is
> >>particularly elegant, and they all further obfuscate the code (which is
> >>already fairly obfuscated, unfortunately). I tried looking at the
> >>standards to determine whether this is allowed - I could not find a
> >>definitive statement, and I have never seen a compiler refuse to compile
> >>code that does this.
> >
> >The compiler did not refuse to compile it, but "-check all" produced
> >a runtime error reporting that an unallocated variable was referenced.
> >
> >>So here's the deal. A perfectly good test case on
> >>something like this is the allocated() intrinsic. If you can pass stuff in
> >>and check its allocation status, that basically says this is okay - and I
> >>think you can.
> >
> >>Something to remember about using extensive checks is that
> >>the checks code itself often does not get everything right.
> >
> >I fully agree on that - and I only activated it to track down some
> >other issues we have in the MPI start up (probably not an Amber
> >issue but somewhere between [Intel-MPI], mpiexec and torque - but
> >that's a different story.)
> >
> >>Anyway, I will
> >>play with this issue some to ensure there is no real issue here, but I
> >>think there is not a real issue here (access to igrp(*) is guarded by the
> >>ibelly flag, essentially).
> >
> >correct, the access itself is guarded, so it's more or less only
> >cosmetic.
> >
> >>Issue 2):
> >>
> >>There is extensive code in loadbal to guard against just this sort of
> >>thing. Did you observe an out-of-range value for my_img_lo in an actual
> >>run?
> >
> >yes, it's real-world and occurs with a production case of a user
> >
> >>Please send me your config.h and your test case as well as the output
> >>of ifort -V and an exact description of what happened, and I will look
> >>into it further.
> >
> >my last config.h is attached; the MPI version probably does not
> >matter (I used mvapich2 in the last tests and Intel-MPI before)
> >
> >the code was compiled with
> >ifort -V
> >Intel(R) Fortran Compiler for Intel(R) EM64T-based applications,
> >Version 9.1 Build 20070320 Package ID: l_fc_c_9.1.045
> >(and 10.0.017beta gives exactly the same result).
> >
> >If you apply a trivial patch (attached) you do not rely on the
> >compiler to do range checking.
> >
> >I'm not yet sure how much the boundary violation depends on the
> >number of MPI processes; at least using
> >mpirun -np 128 ./pmemd -i bench_1jv2.in -p box_neutral.top -c bench_1jv2.crd
> >results in
> >my_img_lo=193534 and img_cnt=193533
> >(at least on some processes or calls of the routine). Running with
> >64 processes gives exactly the same result. Running with only 16
> >processes does not trigger it.
> >
> >Concerning the input files for the testcase: I have to check with
> >the user who provided them to me. I hope to be able to post a
> >download link soon.
> >
> >>This is in the category of something that is possible, but the
> >>code has already taken the issue into consideration and been extensively
> >>tested at very high processor count where this sort of thing would be
> >>likely to happen (the basic problem is in dividing up the image workload -
> >>you have to be sure that it sums exactly to img_cnt, and if it doesn't all
> >>sorts of pandemonium would be expected).
> >
> >sounds reasonable.
> >
> >>Regards - Bob Duke
> >
> >Regards,
> >
> >thomas
> >
> >>----- Original Message -----
> >>From: "Thomas Zeiser" <thomas.zeiser.rrze.uni-erlangen.de>
> >>To: <amber.scripps.edu>
> >>Sent: Wednesday, May 09, 2007 7:33 AM
> >>Subject: AMBER: pmemd source issues
> >>
> >>
> >>>Dear All,
> >>>
> >>>I compiled pmemd9 (including Amber9 patches 1-34) using the latest
> >>>Intel EM64T compiler and enabled extensive runtime error checking
> >>>(-g -traceback -check all). Two types of issues came up:
> >>>
> >>>1) constraints.f90 only allocates "atm_igroup" if "ibelly" is set.
> >>>degcnt() is called from runmd.f90 and "atm_igroup" is passed in all
> >>>cases. The Intel compiler now complains (if ibelly is not set)
> >>>that "integer :: igrp(*)" is not allowed as an unallocated variable
> >>>is accessed.
> >>>An "allocate(atm_igroup(0))" in constraints.f90 solves this issue.
> >>>
> >>>A similar behaviour is observed for "gbl_loadbal_node_dat" which
> >>>gets only allocated on the master process (alltasks_setup.f90).
> >>>
> >>>I did not check the Fortran standard if using "type :: var(*)" is
> >>>allowed or not (I guess "no" as an unallocated variable does not
> >>>have any defined ranges which can be used for the assumed size) -
> >>>but passing a valid variable seems to be a good idea anyway.
> >>>
> >>>
> >>>2) The probably more severe issue was detected in find_img_range()
> >>>from img.f90. At least for the testcase I got from our chemistry
> >>>people, "my_img_lo" is one unit larger than "img_cnt", thus, the
> >>>check "img_atm_map(img_i) .lt. 0" causes an array bound violation.
> >>>No idea about the implication of that (or a correct fix).
> >>>
> >>>
> >>>Kind regards,
> >>>
> >>>Thomas Zeiser
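
The zero-size allocation proposed under issue 1 can be tried in isolation
with the sketch below. It is a minimal, hypothetical example and not the
pmemd code: setup_groups and count_moving are invented stand-ins (the
latter loosely modelled on how degcnt() is described in the thread), and
only atm_igroup, ibelly and igrp(*) are names taken from the messages.

```fortran
module belly_sketch
  implicit none
  integer, allocatable :: atm_igroup(:)
contains
  ! Hypothetical setup routine: always leaves atm_igroup allocated.
  subroutine setup_groups(ibelly, natom)
    integer, intent(in) :: ibelly, natom
    if (ibelly /= 0) then
      allocate(atm_igroup(natom))
      atm_igroup = 1
    else
      ! Zero-size allocation: holds no elements, but the array now has a
      ! defined allocation status and can be passed as an actual argument.
      allocate(atm_igroup(0))
    end if
  end subroutine setup_groups

  ! Hypothetical stand-in for degcnt(): igrp is read only if ibelly is set.
  integer function count_moving(ibelly, natom, igrp)
    integer, intent(in) :: ibelly, natom
    integer, intent(in) :: igrp(*)
    integer :: i
    count_moving = natom
    if (ibelly == 0) return
    count_moving = 0
    do i = 1, natom
      if (igrp(i) > 0) count_moving = count_moving + 1
    end do
  end function count_moving
end module belly_sketch

program demo_belly
  use belly_sketch
  implicit none
  call setup_groups(0, 10)
  ! The array is allocated with size 0 even though ibelly is not set.
  print *, allocated(atm_igroup), size(atm_igroup)
  ! Passing it to the assumed-size dummy no longer trips runtime checks.
  print *, count_moving(0, 10, atm_igroup)
end program demo_belly
```

Building this with the flags from the report (-g -traceback -check all)
should show whether the compiler accepts the zero-size actual argument.
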
Received on Sun May 13 2007 - 06:07:23 PDT