Hello,
I have been working with Karl on this, and it appears that part of the
issue is with how we are using MPICH. On this cluster, MPICH
(mpich2-1.0.5) runs as a daemon ring across all the nodes, started up
as root. When we use mpiexec we get the waiting state mentioned before;
when we use mpirun the job appears to work as expected. I am posting to
the MPICH list to find out more about this.
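In case it helps anyone check the same thing, here is a minimal sketch of
how the daemon ring can be inspected and used for launching, assuming the
standard MPD tools that ship with mpich2-1.0.5 (the mpd.hosts file name is
just an example):

    # list the MPD ring and the hosts it spans (should show every node)
    mpdtrace -l

    # or start a per-user ring instead of relying on the root-owned one;
    # mpd.hosts is an example hostfile with one node name per line
    mpdboot -n 4 -f ~/mpd.hosts

    # launch through the ring with the same options used in the run file below
    mpiexec -machinefile $MACHINEFILE -np 16 /usr/local/Dist/amber9/exe/sander.MPI -O ...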
Anyhow, we did manage to grab the bounce program to test out our MPICH
setup. Running tests with increasing numbers of processors, Karl
tabulated the following results:
Nproc      Time    Latency    Bandwidth
    2   13.5051      17.4    592.368756
    4   19.8262     104.5    403.5065196
    8   18.9825     398.1    421.439878
   16   18.9609    2682.1    421.9199917
   32   19.1978      ****    416.7134284
   64   18.6346      ****    429.3087954
  128   20.4123      ****    391.9196034
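For anyone who wants to repeat this, something along these lines reproduces
the sweep; a rough csh sketch, where ./bounce is a placeholder for wherever
the compiled benchmark ended up and $MACHINEFILE is the same machinefile
used in the Torque run file below:

    # run the bounce benchmark at each process count in turn (csh sketch;
    # ./bounce is a placeholder path for the compiled benchmark)
    foreach np ( 2 4 8 16 32 64 128 )
        mpiexec -machinefile $MACHINEFILE -np $np ./bounce
    end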
This cluster has gig-e interconnects to a BayStack 5510 switch.
So does this indicate expected behavior and the limitations of gig-e, or
would you expect there are things we could tune on our side to reduce the
latency as we use more processors? This cluster was bought strictly to
run Amber. Any advice on making it run optimally for Amber with the
configuration we have would be greatly appreciated. Thanks,
-Steve
On Mon, 2007-06-04 at 13:13 -0400, kkirschn.hamilton.edu wrote:
> Hi Amber Community,
>
> My group is having some problems with Amber9 on an x86_64 cluster
> running RedHat Enterprise 4. Each node has two dual-core Opterons, for
> a total of 4 processors per node. We are using MPICH2 for message
> passing and Torque (PBS) for resource management. Amber serial and
> parallel seem to compile without error, and the test suite passes. We
> try to run the job in the following four ways:
>
> When we submit a 16-processor job using the command in our Torque
> run file (as shown below), "mpiexec -machinefile $MACHINEFILE -np 16
> /usr/local/Dist/amber9/exe/sander.MPI -O ...", each node shows four
> sander processes at 0 or 0.1% CPU each.
>
> When we submit a 16-processor job using the command in our Torque
> run file, "mpiexec -machinefile $MACHINEFILE -np 16
> /usr/local/Dist/amber9/exe/sander -O ...", each node shows four
> sander processes running at 100% each.
>
> Furthermore, without Torque (PBS), submitting from the command line
> with "mpiexec -np 16 /usr/local/Dist/amber9/exe/sander.MPI -O -i ...",
> we get 16 sander processes spawned, 4 per node across a total of 4
> nodes. However, each process runs at only ~10%, which doesn't seem
> efficient.
>
> Without Torque (PBS), submitting from the command line with
> "mpiexec -np 16 /usr/local/Dist/amber9/exe/sander -O -i ...", we get
> 16 sander processes spawned, 4 per node across a total of 4 nodes,
> with each process running at 100%. Does this mean we have 16 serial
> jobs running, each overwriting the same output, 16 times?
>
> Does anybody have any insight into what is going on? How do we get
> sander.MPI to run in parallel at maximum CPU efficiency? Below is our
> Torque run file:
>
> Thanks in advance for your input,
> Karl
>
> Torque (PBS) run file:
> ------------------------------------------------------------------------
> #PBS -l nodes=4:ppn=4
> #PBS -l walltime=999:00:00
> #PBS -q qname
> #PBS -m ae
> #PBS -j oe
>
> cd $PBS_O_WORKDIR
>
> set MACHINEFILE=$PBS_O_WORKDIR/machinefile
>
> if ( -f $MACHINEFILE ) then
>     rm $MACHINEFILE
>     touch $MACHINEFILE
> else
>     touch $MACHINEFILE
> endif
>
>
> if ( $?PBS_NODEFILE ) then
>     # debug
>     echo "nodefile: $PBS_NODEFILE"
>     # write each unique node to the machinefile with 4 slots
>     foreach node ( `cat $PBS_NODEFILE | sort | uniq` )
>         echo $node":4" >> $MACHINEFILE
>         # debug
>         echo $node
>     end
> endif
> echo "machinefile is: $MACHINEFILE"
>
> mpiexec -machinefile $MACHINEFILE -np 16 /usr/local/Dist/amber9/exe/sander.MPI -O \
> -i /home/me/Sander_Test/HIV/md_heating_rest.in \
> -o /home/me/Sander_Test/HIV/1ZPA_leap_md_heat.out \
> -p /home/me/Sander_Test/HIV/1ZPA_leap.top \
> -c /home/me/Sander_Test/HIV/1ZPA_min.rst \
> -r /home/me/Sander_Test/HIV/1ZPA_leap_md_heat.rst \
> -x /home/me/Sander_Test/HIV/1ZPA_leap_md_heat.crd \
> -ref /home/me/Sander_Test/HIV/1ZPA_min.rst
>
> ____________________________________
> Karl N. Kirschner, Ph.D.
> Center for Molecular Design, Co-Director
> Department of Chemistry
> Hamilton College, Clinton NY 13323
> ____________________________________
>
>
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber.scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo.scripps.edu