Re: [AMBER] JAC benchmark tests on K20X

From: Shan-ho Tsai <tsai.hal.physast.uga.edu>
Date: Mon, 20 May 2013 16:11:13 -0400 (EDT)

Hi Ross,

Thank you so much for your prompt and detailed response.
It all makes perfect sense. We had enabled Persistence
Mode and set the Compute Mode to Exclusive_Process, but we
have been having occasional storage latency issues on one
mounted file system. Following your suggestion, I just ran
a few tests with a larger NSTLIM, and the results are
consistent with the values reported at the URL.

I really appreciate your detailed explanation and kind
suggestions.

Thank you so much!
Best regards,
Shan-Ho

-----------------------------
Shan-Ho Tsai
GACRC/EITS, University of Georgia, Athens GA


On Mon, 20 May 2013, Ross Walker wrote:

> Hi Shan-ho,
>
> The JAC test is very short, so a lot of the variation you are seeing may
> simply be that the benchmark isn't long enough. Try increasing NSTLIM to a
> value large enough that the benchmark runs for at least 10 minutes or so.
> Anything under 60 seconds of total duration is not going to give valid
> statistics, especially in the mdinfo timing file.
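>
> As a rough sketch (the exact input distributed with the benchmarks may
> differ slightly), the only change needed in the &cntrl namelist of the JAC
> mdin is nstlim, e.g.:
>
>   &cntrl
>     ntx=5, irest=1,          ! restart from the supplied coordinates
>     ntc=2, ntf=2,            ! SHAKE on bonds involving hydrogen
>     nstlim=250000,           ! was 10000; roughly 25x longer, as suggested above
>     ntpr=1000, ntwx=1000,    ! illustrative output frequencies
>     dt=0.002, cut=8.0,
>     ntt=0, ntb=1, ntp=0,     ! NVE: no thermostat, constant volume
>   /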
>
> ECC is definitely the explanation for the run2 differences from the values
> at that URL. I would suggest turning ECC off. Note that it tends to have a
> slightly bigger impact on smaller systems than on larger ones.
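>
> If you do decide to disable it, something along these lines should work
> (run as root; the device index 0 is just an example, and the change only
> takes effect after a reboot or GPU reset):
>
>   nvidia-smi -i 0 -e 0    # disable ECC on GPU 0
>   nvidia-smi -i 0 -r      # reset that GPU (it must be idle), or reboot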
>
> File system - ALMOST certainly this is the cause of your problem (assuming
> of course you have set the K20 cards to persistence and compute-exclusive
> modes - see http://ambermd.org/gpus/#Running - to guarantee nobody else is
> using the same GPU that you are). The JAC benchmark writes less data in
> aggregate to mdcrd than the FactorIX or Cellulose benchmarks, but it writes
> much more frequently. Thus if you have a remote filesystem with poor
> latency, you will see terrible performance. This also happens on a lot of
> supercomputers such as BlueWaters and ORNL-Titan and is particularly
> acute for parallel filesystems like Lustre. I would advise always writing
> to a local filesystem if you can. You can test this a little further by
> setting ntwx=0, which disables trajectory writes, and setting ntpr to a
> higher value to reduce mdout writes (see the sketch below). If performance
> is consistent in that situation, it suggests that your filesystem is the
> problem and just isn't up to the job. If you still see such poor
> performance, I would check that the GPU is not being oversubscribed with
> other jobs, and if you still see issues, please let us know the exact
> machine specs etc.
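>
> A minimal sketch of both checks (the device index is just an example):
>
>   nvidia-smi -i 0 -pm 1                  # persistence mode on
>   nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   # one process per GPU
>
> and, in the benchmark mdin, to take the filesystem largely out of the
> picture:
>
>   ntwx=0,       ! disable trajectory (mdcrd) writes
>   ntpr=10000,   ! write energies to mdout far less often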
>
> In terms of the mdinfo and mdout timings being different: they shouldn't
> be, except that mdinfo is rewritten each time an mdout write is triggered
> for which more than 60 seconds have elapsed since the last write, whereas
> the final timing info in mdout is written at the conclusion of the job.
> Ultimately they should be very similar, but only when your calculation
> runs for a reasonable amount of time. In your example mdinfo was written
> only after 1000 steps and so represents a snapshot of the calculation
> speed at the very beginning, while the mdout numbers are averaged over all
> the steps. So the sampling error on the mdinfo value is very high here.
> Again, both runs are WAY less than 60 seconds, so increase nstlim by 25x
> or so, repeat things, and you should get much more reliable results.
>
> I plan to update the benchmarks shortly to use a much larger nstlim value.
> The current value is mostly historical: when the suite was put together,
> nothing took less than 2 or 3 minutes to complete the run. Now Moore's law
> has caught up, so things need to be adjusted.
>
> All the best
> Ross
>
>
>
>
> On 5/20/13 12:00 PM, "Shan-ho Tsai" <tsai.hal.physast.uga.edu> wrote:
>
>>
>> Dear All,
>>
>> We have Amber12 with bugfixes 1 to 15 installed
>> with GPU support (gcc 4.4.7 and CUDA toolkit 4.2.9)
>> on our Linux cluster.
>>
>> We ran the GPU benchmarks available at
>> http://ambermd.org/gpus/benchmarks.htm
>> on our K20X GPU cards and made the following
>> observations (all tests run on a single K20X card):
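>>
>> For reference, each benchmark was run roughly as follows (the file names
>> are illustrative rather than the exact ones in the benchmark tarball):
>>
>>   $AMBERHOME/bin/pmemd.cuda -O -i mdin -o mdout \
>>       -p prmtop -c inpcrd -r restrt -x mdcrd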
>>
>> 1. The two Cellulose tests and the two Factor_IX tests
>> had performance comparable to the values reported
>> at the URL above. However, for a few days, the JAC
>> tests had very poor performance (one such run
>> is called run1 below). E.g. (ns/day values):
>>
>>                         run1    run2   value_from_URL_above
>> JAC_PRODUCTION_NVE     12.64   81.19   89.13
>> JAC_PRODUCTION_NPT     60.35   67.93   71.80
>>
>> These tests were run from a mounted file system, and
>> our GPU cards have ECC turned on. That might
>> account for the slower timings in run2, but
>> run1 had much poorer performance.
>>
>> 2. We then repeated the benchmark tests from a local
>> file system (a hard disk on the host). The results of
>> all the tests were consistent with the results reported
>> at the URL above.
>>
>> Questions:
>> =================
>>
>> 1. Can a slow file system affect the JAC tests so much
>> more than the Cellulose and the Factor_IX tests?
>>
>> 2. Why are the timings reported by mdinfo and mdout
>> different?
>>
>> For example, for run1 of the JAC_PRODUCTION_NVE test
>> mdinfo shows:
>>
>> | Average timings for last 1000 steps:
>> | Elapsed(s) = 13.67 Per Step(ms) = 13.67
>> | ns/day = 12.64 seconds/ns = 6833.13
>> |
>> | Average timings for all steps:
>> | Elapsed(s) = 13.67 Per Step(ms) = 13.67
>> | ns/day = 12.64 seconds/ns = 6833.13
>>
>>
>>
>> And mdout shows:
>>
>> | Final Performance Info:
>> | -----------------------------------------------------
>> | Average timings for last 9000 steps:
>> | Elapsed(s) = 18.13 Per Step(ms) = 2.01
>> | ns/day = 85.77 seconds/ns = 1007.29
>> |
>> | Average timings for all steps:
>> | Elapsed(s) = 31.80 Per Step(ms) = 3.18
>> | ns/day = 54.34 seconds/ns = 1589.87
>> | -----------------------------------------------------
>>
>> | Setup CPU time: 3.53 seconds
>> | NonSetup CPU time: 19.93 seconds
>> | Total CPU time: 23.46 seconds 0.01 hours
>>
>> | Setup wall time: 18 seconds
>> | NonSetup wall time: 32 seconds
>> | Total wall time: 50 seconds 0.01 hours
>>
>> Why are these two sets of timings so different for the same
>> run?
>>
>> Thank you very much for any suggestions.
>>
>> Regards,
>> Shan-Ho
>>
>> -----------------------------
>> Shan-Ho Tsai
>> GACRC/EITS, University of Georgia, Athens GA
>>
>>

_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 20 2013 - 13:30:03 PDT