Hi Ross,
Thank you so much for your prompt and detailed response.
It all makes perfect sense. We had enabled Persistence Mode
and set the Compute Mode to Exclusive_Process, but we have
been having occasional storage latency issues on one mounted
file system. Following your suggestion, I just ran a few
tests with a larger NSTLIM, and the results are consistent
with the values reported at the URL.
I really appreciate your detailed explanation and kind
suggestions.
Thank you so much!
Best regards,
Shan-Ho
-----------------------------
Shan-Ho Tsai
GACRC/EITS, University of Georgia, Athens GA
On Mon, 20 May 2013, Ross Walker wrote:
> Hi Shan-ho,
>
> The JAC test is very short, so a lot of the variation you are seeing may
> simply be because the benchmark isn't long enough. Try increasing NSTLIM
> to a value large enough that the benchmark runs for at least 10 minutes
> or so. Anything under 60 seconds of total duration is not going to give
> valid statistics, especially in the mdinfo timing file.
>
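> As a rough guide (just a sketch in Python; the 89 ns/day is the JAC NVE
> number from the benchmark page and the 2 fs timestep is the usual
> production value - substitute what you actually use and measure):
>
>   # Estimate the nstlim needed for roughly 10 minutes of wall time.
>   ns_per_day = 89.0        # observed (or expected) throughput
>   dt_fs = 2.0              # timestep in fs (assuming the usual 2 fs)
>   target_seconds = 600.0   # aim for at least ~10 minutes
>
>   steps_per_second = (ns_per_day / 86400.0) / (dt_fs * 1.0e-6)
>   nstlim = int(round(steps_per_second * target_seconds, -3))
>   print("steps/second ~ %.0f, suggested nstlim ~ %d"
>         % (steps_per_second, nstlim))
>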
> ECC is definitely the explanation for the run2 differences from the URL
> values. I would suggest turning ECC off. Note that it tends to have a
> slightly bigger impact on smaller systems than on larger ones.
>
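> You can check the current and pending ECC state with nvidia-smi, e.g.
> along these lines (a sketch only; the query field names can be listed
> with "nvidia-smi --help-query-gpu" if they differ on your driver version):
>
>   # Print ECC mode per GPU; changing it (nvidia-smi -e 0, run as root)
>   # only takes effect after a reboot, so watch the pending value.
>   import subprocess
>   out = subprocess.check_output(
>       ["nvidia-smi",
>        "--query-gpu=index,name,ecc.mode.current,ecc.mode.pending",
>        "--format=csv,noheader"])
>   print(out.decode())
>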
> File system - ALMOST certainly this is the cause of your problem (assuming
> of course you have set the K20 cards to Persistence and compute-exclusive
> modes - see http://ambermd.org/gpus/#Running - to guarantee nobody else is
> using the same GPU that you are). The JAC benchmark writes less data to
> mdcrd in aggregate than the FactorIX or Cellulose benchmarks, but it
> writes much more frequently. Thus if you have a remote filesystem with
> poor latency you will see terrible performance. This also happens on a lot
> of supercomputers such as BlueWaters and ORNL-Titan and is particularly
> acute for parallel filesystems like Lustre. I would advise always writing
> to a local filesystem if you can. You can test this a little further by
> setting ntwx=0, which disables trajectory writes, and setting ntpr to a
> higher value to reduce mdout writes. If you see consistent performance in
> that situation, it suggests your filesystem is the problem and just isn't
> up to the job. If you still see poor performance, I would check that the
> GPU is not being oversubscribed with other jobs, and if you still see
> issues then please let us know the exact machine specs etc.
>
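> A quick way to check both the modes and whether anything else is running
> on the card (again just a sketch; adjust the query fields to your
> nvidia-smi version, or simply read the plain "nvidia-smi -q" output):
>
>   # Report persistence/compute mode per GPU and any compute processes.
>   import subprocess
>
>   def smi(args):
>       return subprocess.check_output(["nvidia-smi"] + args).decode()
>
>   print(smi(["--query-gpu=index,name,persistence_mode,compute_mode",
>              "--format=csv,noheader"]))
>   print(smi(["--query-compute-apps=pid,process_name,used_memory",
>              "--format=csv,noheader"]))
>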
> In terms of the mdinfo and mdout timings being different: they shouldn't
> be, except that mdinfo is written every time an mdout write is triggered
> for which more than 60 seconds have elapsed since the last write, while
> the timing info in mdout is written at the conclusion of the job.
> Ultimately they should be very similar, but only when your calculation
> runs for a reasonable amount of time. In your example mdinfo was written
> after only 1000 steps and so represents a snapshot of the calculation
> speed at the very beginning, while the mdout file averages over all the
> steps. So really the sampling error on the mdinfo numbers is very high
> here. Again, both are WAY less than 60 seconds, so increase nstlim by 25x
> or so, repeat things, and you should get much more reliable results.
>
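> Once the longer runs finish, it is easy to pull the ns/day figures out of
> mdinfo and mdout for a side-by-side comparison - for example (a sketch
> that just matches the "|  ns/day = ..." lines quoted below):
>
>   # Usage (hypothetical script name): python nsday.py mdinfo mdout
>   import re, sys
>   pattern = re.compile(r"ns/day\s*=\s*([0-9.]+)")
>   for path in sys.argv[1:]:
>       with open(path) as fh:
>           values = pattern.findall(fh.read())
>       print("%s: ns/day values: %s" % (path, ", ".join(values) or "none"))
>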
> I plan to update the benchmarks shortly to use a much larger nstlim value.
> The current value is mostly historical: when the suite was put together,
> nothing took less than 2 or 3 minutes to complete the run. Now Moore's law
> has caught up, so things need to be adjusted.
>
> All the best
> Ross
>
>
>
>
> On 5/20/13 12:00 PM, "Shan-ho Tsai" <tsai.hal.physast.uga.edu> wrote:
>
>>
>> Dear All,
>>
>> We have Amber12 with bugfixes 1 to 15 installed
>> with GPU support (gcc 4.4.7 and CUDA toolkit 4.2.9)
>> on our Linux cluster.
>>
>> We ran the GPU benchmarks available at
>> http://ambermd.org/gpus/benchmarks.htm
>> on our K20X GPU cards and made the following
>> observations (tests run on a single K20X card):
>>
>> 1. The 2 Cellulose tests and the 2 Factor_IX tests
>> had performance comparable to the values reported
>> at the URL above. However, for a few days, the JAC
>> tests had very poor performance (one such run is
>> called run1 below). E.g. (ns/day values):
>>
>>                        run1    run2   value from URL above
>>  JAC_PRODUCTION_NVE   12.64   81.19   89.13
>>  JAC_PRODUCTION_NPT   60.35   67.93   71.80
>>
>> These tests were run from a mounted file system, and
>> our GPU cards have ECC turned on. That might account
>> for the slower timings of our run2, but run1 had much
>> poorer performance.
>>
>> 2. Then we repeated the benchmark tests from a local
>> file system (hard disk on the host). The results of
>> all tests were comparable to the results reported
>> at the URL above.
>>
>> Questions:
>> =================
>>
>> 1. Can a slow file system affect the JAC tests so much
>> more than the Cellulose and the Factor_IX tests?
>>
>> 2. Why is the timing reported by mdinfo and mdout
>> different?
>>
>> For example, for run1 of the JAC_PRODUCTION_NVE test
>> mdinfo shows:
>>
>> | Average timings for last 1000 steps:
>> | Elapsed(s) = 13.67 Per Step(ms) = 13.67
>> | ns/day = 12.64 seconds/ns = 6833.13
>> |
>> | Average timings for all steps:
>> | Elapsed(s) = 13.67 Per Step(ms) = 13.67
>> | ns/day = 12.64 seconds/ns = 6833.13
>>
>>
>>
>> And mdout shows:
>>
>> | Final Performance Info:
>> | -----------------------------------------------------
>> | Average timings for last 9000 steps:
>> | Elapsed(s) = 18.13 Per Step(ms) = 2.01
>> | ns/day = 85.77 seconds/ns = 1007.29
>> |
>> | Average timings for all steps:
>> | Elapsed(s) = 31.80 Per Step(ms) = 3.18
>> | ns/day = 54.34 seconds/ns = 1589.87
>> | -----------------------------------------------------
>>
>> | Setup CPU time: 3.53 seconds
>> | NonSetup CPU time: 19.93 seconds
>> | Total CPU time: 23.46 seconds 0.01 hours
>>
>> | Setup wall time: 18 seconds
>> | NonSetup wall time: 32 seconds
>> | Total wall time: 50 seconds 0.01 hours
>>
>> Why are these two sets of timings so different for the same
>> run?
>>
>> Thank you very much for any suggestions.
>>
>> Regards,
>> Shan-Ho
>>
>> -----------------------------
>> Shan-Ho Tsai
>> GACRC/EITS, University of Georgia, Athens GA
>>
>
_______________________________________________
AMBER mailing list
AMBER.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber
Received on Mon May 20 2013 - 13:30:03 PDT