Re: [AMBER] Duplicate random seeds with ig=-1 Code enhancement to think about, perhaps from Chris Moth on 2017-06-20 (Amber Archive Jun 2017)

From: Chris Moth <cmoth08.gmail.com>
Date: Tue, 20 Jun 2017 14:27:08 -0500

Thanks for thinking about it.

To geek out a notch further.....

Aside from the fast cluster scheduler, the microseconds _are_
uniformly distributed :) At least, if you repeatedly call gettimeofday()
(and nothing else), and stuff the results in an array, I seem to get all
the microseconds back on fast hardware :) (which contradicts the
reports google returns from a few years ago).
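
For anyone who wants to repeat that check, here is a minimal sketch of the idea. The function name count_distinct_usec is made up for illustration (it is not from pmemd_clib.c), and it simply counts how many different tv_usec values show up over a run of back-to-back calls:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

#define USEC_PER_SEC 1000000

/* Hypothetical helper: call gettimeofday() `samples` times back to back
   and return how many distinct tv_usec values (0..999999) were seen. */
long count_distinct_usec(long samples)
{
    static unsigned char seen[USEC_PER_SEC];  /* one flag per microsecond */
    struct timeval tv;
    long distinct = 0;

    assert(samples >= 0);
    memset(seen, 0, sizeof seen);

    for (long i = 0; i < samples; i++) {
        gettimeofday(&tv, NULL);
        if (!seen[tv.tv_usec]) {
            seen[tv.tv_usec] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

If the clock really ticks in microseconds, a long enough run (tens of millions of calls) should push the count toward 1000000; a coarser clock would leave it stuck well below that.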

Following the math at this site - which seems right:

https://betterexplained.com/articles/understanding-the-birthday-paradox/

Birthdays first:

With 23 people there are 253 pairs: (23 x 22 / 2) = 253

Chance of 2 people having different birthdays: (1 - 1/365) = (364/365)

Making 253 comparisons and having them all be different is (364 / 365) ^
253 - which has probability of 0.4995

Jobs on the cluster next:

Extending the same math to 100 jobs with 1000000 randomly distributed
microsecond start times :)

With 100 jobs there are (100 * 99 / 2) = 4950 pairs

Chance of 2 jobs having different microsecond assignments = (1-1/1000000) =
(999999/1000000)

Making 4950 comparisons and having them all be different is
(999999/1000000) ^ 4950 = .9951

OK - oops - it is one in 200 submissions of (100 jobs) that should see a
duplicate microsecond assignment :) But still.......

And, if the scheduler launches them all in 1/100 of a second (which ours
does not... yet) the picture gets worse for duplicates.
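
The arithmetic above is easy to replay in a few lines of C. This is only a sketch under the same assumption as the math (start times spread uniformly over the window); both function names are invented here. One function uses the pair-counting estimate from the birthday article, the other the exact product, so the two can be compared:

```c
#include <assert.h>

/* Pair-counting estimate: probability that all n(n-1)/2 pairs differ,
   treating the pairs as independent: (1 - 1/slots)^(n(n-1)/2). */
double p_all_distinct_pairs(int n, double slots)
{
    long pairs = (long)n * (n - 1) / 2;
    double p = 1.0;

    assert(n >= 0 && slots > 0.0);
    for (long i = 0; i < pairs; i++) {
        p *= 1.0 - 1.0 / slots;
    }
    return p;
}

/* Exact probability that n uniform draws from `slots` values are all
   distinct: product over k = 0..n-1 of (1 - k/slots). */
double p_all_distinct_exact(int n, double slots)
{
    double p = 1.0;

    assert(n >= 0 && slots > 0.0);
    for (int k = 0; k < n; k++) {
        if ((double)k >= slots) {
            return 0.0;  /* more draws than slots: a repeat is certain */
        }
        p *= 1.0 - (double)k / slots;
    }
    return p;
}
```

p_all_distinct_pairs(23, 365) comes out near 0.4995 and p_all_distinct_pairs(100, 1000000) near 0.9951, matching the numbers above. Shrinking the window to 10000 microseconds (the 1/100 second case) leaves only about a 0.61 chance of no duplicates, i.e. roughly a 39% chance of at least one duplicate per batch of 100 jobs.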

With the current algorithm, the problem should only get "worse" as
hardware improves, more parallel runs are requested, and their start
times cluster closer and closer together on clusters.
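
For comparison, the "ig = -3" idea from my original message below (fill all 31 bits from /dev/urandom) could look roughly like this. It is a sketch only: seed31_from_urandom is a hypothetical name, nothing here is existing Amber code, and the fallback to the microsecond clock is my own assumption about what a portable version might want:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of Amber: return a seed in 0..2147483647.
   Reads 32 bits from /dev/urandom when it is available; otherwise falls
   back to the microsecond field of gettimeofday() (0..999999). */
int seed31_from_urandom(void)
{
    uint32_t bits = 0;
    int fd = open("/dev/urandom", O_RDONLY);

    if (fd != -1) {
        ssize_t got = read(fd, &bits, sizeof bits);
        close(fd);
        if (got == (ssize_t)sizeof bits) {
            int seed = (int)(bits & 0x7fffffffu);  /* drop the sign bit */
            assert(seed >= 0);
            return seed;
        }
    }

    /* Fallback (assumption): same microsecond idea as the current scheme. */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int)tv.tv_usec;
}
```

With 2^31 possible seeds instead of 1000000, the same pair-counting estimate puts the chance of a duplicate among 100 jobs at a few in a million, and that no longer depends on how tightly the scheduler packs the start times.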

OK - back to real work :)

On 06/20/2017 01:45 PM, Adrian Roitberg wrote:
> Well, that is interesting...
>
> I always thought there was a small chance of this happening. However,
> since this requires that many jobs get into the nodes at either the
> SAME microsecond count or that the gettimeofday() from different nodes
> be synchronized 'just so' to get you in trouble. I do not believe this
> is the same as people's birthdays, since those are uniformly
> distributed, etc etc.
>
> The fact that the gettimeofday() is not accurate to the microseconds is
> important for timing processes, but should not be as important for what we
> use it for.
>
> Anyways, of your options below, I would not go with a or b, since those
> are bound to be heavily machine dependent.
>
> c looks like a good idea, but you need to create patches for sander,
> pmemd and pmemd.cuda
>
> Thanks for looking into this !
>
> Adrian
>
> On 6/20/17 2:29 PM, Chris Moth wrote:
>> I have just had the "interesting" experience that "ig=-1" does not
>> always generate unique random seeds, and I thought I should share that
>> experience.....
>>
>> I have used "ig = -1" to randomize seeds for some time. I used it
>> without hesitation as I worked through the ASMD tutorial here:
>>
>> http://ambermd.org/tutorials/advanced/tutorial26/
>>
>> However, intriguingly, on our high performance cluster at Vanderbilt,
>> when I submit 100 jobs-at-a-time (an ASMD "stage"), I am seeing
>> duplicate ig values returned every few hundred runs.
>>
>> This could be imaginably attributable to a combination of factors:
>>
>> 1) Just as in a room of 23 people, there is a 50% chance that 2 will
>> share the same birthday... in a collection of 100 MD jobs, there is
>> roughly a 1% chance that 2 will share the same microsecond start
>> time (Apologies in advance if I butchered some math.)
>> 2) A high performance cluster, launching multiple simultaneous jobs "at
>> once", could imaginably turn a 1% chance into a 10% chance on highly
>> synchronized nodes.
>> 3) The resolution of the gettimeofday() function (called from
>> pmemd_clib.c) could be significantly lower than one microsecond in
>> practice (if google is to be believed)
>>
>> https://www.google.com/#q=resolution+of+gettimeofday
>>
>> It is admittedly a nuisance issue. The choices are:
>>
>> a) Ignore the issue entirely. Statistically, it's likely not too
>> important if only one in every 200 md runs is a duplicate run.
>>
>> b) Set random seeds with environment variables available in the high
>> performance cluster or ASMD job launch ecosystem *instead* of
>> "trusting" ig=-1 (examples: task IDs, (ASMD stage*10000 + ASMD_run),
>> etc) (in which case we should update the ASMD tutorial - so that ig=-1
>> at least has a caution around it)
>>
>> c) Modify the pmemd code to enhance randomness of ig=-1, by adding
>> entropy. (The current 0 to 999999 range is only using 20 bits of the 31
>> that could be used).
>>
>> In case "c" is interesting.... read on....
>>
>> Below is a sketch of code that honors the current microsecond concept,
>> but adds another 1000 possibilities based on the contents of
>> /dev/urandom on a linux system. Portability issues are rightly of great
>> concern to the community. You could activate code like this in response
>> to a new "ig = -2" possibility, or in response to install-time's
>> "./configure"'s reporting that /dev/urandom is available. The code
>> below does not require any new third party libraries (which a "better"
>> entropy generation scheme, or a guid generator, might require)... and I
>> think it will work on any linux system I am aware of from the last decade.
>>
>> Again, this code below is not intended to be _the_ solution - just some
>> food-for-thought if the team should consider enhancing randomness beyond
>> the current 0-999999 limited sys clock. You might want "ig = -3" (say)
>> to init all 31 bits of the seed from /dev/urandom.........
>>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/time.h>
>> #include <assert.h>
>>
>> int main(void)
>> {
>>     struct timeval my_tv;
>>     ssize_t entropyRead;
>>     int entropyBits;
>>
>>     int entropyFile = open("/dev/urandom", O_RDONLY);
>>     assert(entropyFile != -1);
>>
>>     entropyRead = read(entropyFile, &entropyBits, sizeof(entropyBits));
>>     assert(entropyRead == (ssize_t)sizeof(entropyBits));
>>     close(entropyFile);
>>     entropyBits &= 0x7fffffff; // Mask off sign bit
>>     entropyBits %= 1000;       // Keep 0..999 as the extra leading digits
>>
>>     // What you do today in pmemd_clib.c
>>     gettimeofday(&my_tv, NULL);
>>
>>     // Largest possible value is 999*1000000 + 999999 = 999999999,
>>     // which still fits in a 31-bit positive int
>>     printf("Today's random seed: %09d\n", (int)my_tv.tv_usec);
>>     printf("Enhanced random seed: %09d\n",
>>            (int)my_tv.tv_usec + entropyBits * 1000000);
>>     return 0;
>> }
>>
>>
>> _______________________________________________
>> AMBER mailing list
>> AMBER.ambermd.org
>> http://lists.ambermd.org/mailman/listinfo/amber

Received on Tue Jun 20 2017 - 12:30:03 PDT