RE: AMBER: Amber9 serial installation -tests- severe (174): SIGSEGV, segmentation fault occurred

From: Ross Walker <ross.rosswalker.co.uk>
Date: Thu, 19 Oct 2006 12:24:43 -0700

Hi Mina,

> I use the intel compilers 9.1038 (for icc and icpc) and ifort
> (9.10.32). My mkl library is version 8.1.

> test.sander.DIVCON
> The Run.wnmr in the water2_qmmmnmr/ dir passes
> but the Run.cnmr produces the following
> forrtl: severe (174): SIGSEGV, segmentation fault occurred

> test.antechamber produces segmentation faults as well. Specifically
> in the tp/ dir the script Run.tp produces:
> Running: $AMBERHOME/exe/divcon
> forrtl: severe (174): SIGSEGV, segmentation fault occurred

This has been noted before and we are aware of the problem. I believe that
it is triggered by a bug in the Intel compiler's vectorization unit that is
generating incorrect machine code. These sorts of bugs (compiler based) are
totally demoralizing if you try to fix them as they often make no logical
sense in the 'fortran space'. Often you find where the code is dying, change
something and it just dies elsewhere for no reason. Or you add a print
statement where you think it crashes and the code suddenly magically works.
That said, THE 'DIVCON' PEOPLE IN FLORIDA should really be looking to find a
workaround for this, but I see no messages from them, so as usual I will
have to do THEIR WORK FOR THEM. :-(

At present I believe the following is the status of the Intel compilers with
Divcon in Amber 9:

8.1.031 Works
8.1.034 Works
9.0.033 Works
9.1.036 Segfault
9.1.037 Segfault
9.1.039 Segfault
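
As a rough sketch, the status table above could be encoded as a small shell
check to run before building. The function name divcon_status is hypothetical,
the version string is passed in by hand (extracting it from the compiler's
own output is left out here), and 9.1.038 is listed as failing only on the
strength of the report quoted in this thread.

```shell
# Hypothetical helper: map an Intel compiler version string to the
# Divcon/Amber 9 status reported in this email.
divcon_status() {
  case "$1" in
    8.1.031|8.1.034|9.0.033)
      echo works ;;
    9.1.036|9.1.037|9.1.038|9.1.039)
      # 9.1.038 comes from the user report in this thread
      echo segfault ;;
    *)
      # Not covered by the list; status not known from this email
      echo unknown ;;
  esac
}

divcon_status 9.0.033
divcon_status 9.1.038
```

Anything reported as unknown simply means the version is not on the list
above; it says nothing about whether that compiler works.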

And I guess we can now also add 9.1.038 to that list. So it doesn't look
like the problem is going to go away anytime soon. So one option would be to
see if you can get hold of one of the above compiler versions that works.
Alternatively, I have tracked down that the problem is related to
vectorization in the Intel compiler. This is enabled with -axNP. Ideally you
don't want to turn this off for all routines as it will really slow things
down. I have tracked this down to the following code:

do j=ibeg+1,inum
   nnmratm = nnmratm + 1
   if ( .not. calcMemRequirements) then
     inmratm(nnmratm) = j
   endif
enddo

Changing this to:

if (calcMemRequirements) then
  do j=ibeg+1,inum
    nnmratm = nnmratm + 1
  enddo
else
  do j=ibeg+1,inum
    nnmratm = nnmratm + 1
    inmratm(nnmratm) = j
  enddo
end if

fixes the problem with the crambin_qmmmnmr test case. Now, leaving aside the
fact that the first version of the code is a really stupid way of coding
with respect to performance, it is still unacceptable that the Intel compiler
can't cope with it. I really wish the Intel people would get their act
together.

Anyway, changing the above only fixes the first problem. The crambin_divcon
test case still segfaults. This one is due to the following piece of code:

  DO 10 II=1,IIMAX
     FDIAG(II) = 0.0D0
     if(donmr) FIDIAG(II) = 0.0d0
10 ENDDO
  DO 20 IJ=1,IJMAX
     FDIAT(IJ) = 0.0D0
     if(donmr) FIDIAT(IJ) = 0.0d0
20 ENDDO

Again, there is nothing illegal about it, although it is also stupid
performance-wise. Note, though, how similar the logic is to the first problem.

Changing this to the more efficient but logically identical:

  if (donmr) then
    DO II=1,IIMAX
      FDIAG(II) = 0.0D0
      FIDIAG(II) = 0.0d0
    ENDDO
    DO IJ=1,IJMAX
      FDIAT(IJ) = 0.0D0
      FIDIAT(IJ) = 0.0d0
    ENDDO
  else
    DO II=1,IIMAX
      FDIAG(II) = 0.0D0
    ENDDO
    DO IJ=1,IJMAX
      FDIAT(IJ) = 0.0D0
    ENDDO
  end if

fixes the problem. This is thus definitely a bug in the Intel compiler. If I
get a chance I shall try to make a simple example and submit it to Intel as
a bug report, but I wouldn't hold your breath.

In the meantime I have replaced all of these loop constructs that I can find
in the divcon code and have prepared a bugfix patch file that is attached to
this email. This passes all the tests on my machine. Can you try it on your
machine and see if it works there too?

Copy the file to $AMBERHOME/
cd $AMBERHOME
patch -p0 <divcon_intel.bugfix
cd src
make clean
make
cd ../test
make

See if this fixes the problems. If it does please let me know and I will put
together a formal bug fix on the amber website.
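
If you would rather confirm that the patch applies cleanly before it touches
the source tree, something along these lines may help. This is only a sketch:
the function name apply_patch_checked is made up for illustration, and it
assumes a patch program that supports --dry-run (GNU patch does).

```shell
# Hypothetical helper: test the patch with --dry-run, apply it only if clean.
# $1 = directory to patch from, $2 = ABSOLUTE path to the patch file
# (absolute because the function changes directory before reading it).
apply_patch_checked() {
  tree=$1
  patchfile=$2
  (
    cd "$tree" || exit 1
    if patch -p0 --dry-run < "$patchfile" >/dev/null; then
      # Dry run succeeded, so apply the patch for real
      patch -p0 < "$patchfile"
    else
      echo "patch does not apply cleanly; nothing was changed" >&2
      exit 1
    fi
  )
}

# Usage, with the file copied to $AMBERHOME as described above:
#   apply_patch_checked "$AMBERHOME" "$AMBERHOME/divcon_intel.bugfix"
```

The rebuild and test steps (make clean, make, then make in ../test) are
unchanged and still need to be run afterwards.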

> PS: I build the config.h by running ./configure -nopar
> -bintraj -p4 ifort_ia32 and then replace manually gcc -m32
> with icc and g++ with icpc.

It is not related to your problem, but you don't need to manually replace gcc
with icc and g++ with icpc when using the Intel compilers. gcc and g++
should work fine, and none of the performance-critical code is written in C,
so while it does no harm to use the Intel compilers it will make little
performance difference. I would typically only change it if I encountered
problems using gcc.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross.rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.


-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu

Received on Sun Oct 22 2006 - 06:07:19 PDT