Re: [AMBER] JAC benchmark tests on K20X

From: Ross Walker <ross.rosswalker.co.uk>
Date: Mon, 20 May 2013 12:21:49 -0700

Hi Shan-ho,

The JAC test is very short, so a lot of the variation you are seeing
may simply be because the benchmark doesn't run long enough. Try
increasing NSTLIM to a value large enough that the benchmark runs for
at least 10 minutes or so. Anything under 60 seconds total duration is
not going to give valid statistics, especially in the mdinfo timing
file.
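
For example, something along these lines (just a sketch - the file
names and the rest of the &cntrl namelist are whatever your copy of
the benchmark already uses, and the ntpr value is only a suggestion):

  # In the JAC mdin, raise the run length, e.g.
  #   nstlim = 250000,  (was 10000; ~0.5 ns assuming the usual dt = 0.002)
  #   ntpr   = 5000,    (less frequent mdout/mdinfo output)
  # then rerun exactly as before, e.g.
  pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt -x mdcrd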

ECC is definitely the explanation for the run2 difference from the
value on the web page. I would suggest turning ECC off. Note that it
tends to have a slightly bigger impact on smaller systems than on
larger ones.
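
If you want to try that, a sketch of the command (needs root; GPU
index 0 is just an example, and the new ECC mode only takes effect
after a reboot or GPU reset):

  nvidia-smi -i 0 -e 0    # disable ECC on GPU 0
  # reboot (or reset the GPU) for the ECC change to take effect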

File system - ALMOST certainly this is the cause of your problem
(assuming, of course, that you have set the K20 cards to persistence
and compute-exclusive modes - see http://ambermd.org/gpus/#Running -
to guarantee nobody else is using the same GPU that you are). The JAC
benchmark writes less data to mdcrd in aggregate than the FactorIX or
Cellulose benchmarks, but it writes much more frequently. Thus if you
have a remote filesystem with poor latency you will see terrible
performance. This also happens on a lot of supercomputers such as
BlueWaters and ORNL-Titan and is particularly acute for parallel
filesystems like Lustre. I would advise always writing to a local
filesystem if you can.

You can test this a little further by setting ntwx=0, which disables
trajectory writes, and setting ntpr to a higher value to reduce mdout
writes. If you see consistent performance in that situation then your
filesystem is the problem and just isn't up to the job. If you still
see such poor performance then I would check to make sure the GPU is
not being oversubscribed with other jobs, and if you still see issues
after that then please let us know the exact machine specs etc.
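
As a concrete sketch of both checks (GPU index 0 and the ntpr value
are just examples; compute mode 3 is EXCLUSIVE_PROCESS):

  # GPU settings (run as root; see http://ambermd.org/gpus/#Running)
  nvidia-smi -i 0 -pm 1    # persistence mode on
  nvidia-smi -i 0 -c 3     # compute-exclusive mode

  # I/O test: in the mdin &cntrl namelist set, e.g.
  #   ntwx = 0,        (no trajectory writes to mdcrd)
  #   ntpr = 10000,    (much less frequent mdout/mdinfo output)
  # and rerun the JAC benchmark from the remote filesystem as before.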

In terms of the mdinfo and mdout timings being different: they
shouldn't be, except that mdinfo is rewritten during the run - each
time an mdout write is triggered for which more than 60 seconds have
elapsed since the last mdinfo write - while the final timing info in
mdout is written at the conclusion of the job. Ultimately they should
be very similar, but only if your calculation runs for a reasonable
amount of time. In your example mdinfo was written after only 1000
steps, so it represents a snapshot of the calculation speed at the
very beginning, while the mdout numbers are averaged over all the
steps. So really the sampling error in the mdinfo figure is very high
here. Again, both sets of timings cover WAY less than 60 seconds, so
increase nstlim by 25x or so, repeat things, and you should get much
more reliable results.
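
As a rough check on the numbers (assuming the usual 2 fs time step for
JAC and the ~85 ns/day your mdout reports once the GPU has warmed up):

  250,000 steps (25 x 10,000) x 0.002 ps/step = 500 ps = 0.5 ns of MD
  0.5 ns at ~85 ns/day -> 0.5/85 x 24 x 60 ~= 8.5 minutes of wall time

which is comfortably past the 60 second threshold.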

I plan to update the benchmarks shortly to use a much larger nstlim
value. The current value is mostly historical: when the suite was put
together nothing took less than 2 or 3 minutes to complete the run.
Now Moore's law has caught up, so things need to be adjusted.

All the best
Ross




On 5/20/13 12:00 PM, "Shan-ho Tsai" <tsai.hal.physast.uga.edu> wrote:

>
>Dear All,
>
>We have Amber12 with bugfixes 1 to 15 installed
>with GPU support (gcc 4.4.7 and CUDA toolkit 4.2.9)
>on our Linux cluster.
>
>We ran the GPU benchmarks available at
>http://ambermd.org/gpus/benchmarks.htm
>on our K20X GPU cards and got the following
>observations (tests run on 1 K20X card):
>
>1. The 2 Cellulose tests and the 2 Factor_IX tests
>had performance comparable to the values reported
>at the URL above. However, for a few days, the JAC
>tests had very poor performance (one such run is
>called run1 below). E.g. (ns/day values):
>
>                     run1    run2   value_from_URL_above
>JAC_PRODUCTION_NVE  12.64   81.19   89.13
>JAC_PRODUCTION_NPT  60.35   67.93   71.80
>
>These tests were run from a mounted file system and
>our GPU cards have ECC turned on. That might
>account for the slower timings of our run2, but
>run1 had much poorer performance.
>
>2. Then we repeated the benchmark tests from a local
>file system (a hard disk on the host). The results of
>all the tests were comparable with those reported
>at the URL above.
>
>Questions:
>=================
>
>1. Can a slow file system affect the JAC tests so much
>more than the Cellulose and the Factor_IX tests?
>
>2. Why is the timing reported by mdinfo and mdout
>different?
>
>For example, for run1 of the JAC_PRODUCTION_NVE test
>mdinfo shows:
>
>| Average timings for last 1000 steps:
>| Elapsed(s) = 13.67 Per Step(ms) = 13.67
>| ns/day = 12.64 seconds/ns = 6833.13
>|
>| Average timings for all steps:
>| Elapsed(s) = 13.67 Per Step(ms) = 13.67
>| ns/day = 12.64 seconds/ns = 6833.13
>
>
>
>And mdout shows:
>
>| Final Performance Info:
>| -----------------------------------------------------
>| Average timings for last 9000 steps:
>| Elapsed(s) = 18.13 Per Step(ms) = 2.01
>| ns/day = 85.77 seconds/ns = 1007.29
>|
>| Average timings for all steps:
>| Elapsed(s) = 31.80 Per Step(ms) = 3.18
>| ns/day = 54.34 seconds/ns = 1589.87
>| -----------------------------------------------------
>
>| Setup CPU time: 3.53 seconds
>| NonSetup CPU time: 19.93 seconds
>| Total CPU time: 23.46 seconds 0.01 hours
>
>| Setup wall time: 18 seconds
>| NonSetup wall time: 32 seconds
>| Total wall time: 50 seconds 0.01 hours
>
>Why are these two sets of timings so different for the same
>run?
>
>Thank you very much for any suggestions.
>
>Regards,
>Shan-Ho
>
>-----------------------------
>Shan-Ho Tsai
>GACRC/EITS, University of Georgia, Athens GA
>
>
>_______________________________________________
>AMBER mailing list
>AMBER.ambermd.org
>http://lists.ambermd.org/mailman/listinfo/amber



_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 20 2013 - 12:30:03 PDT