Re: [AMBER] Duplicate random seeds with ig=-1 Code enhancement to think about, perhaps from Bill Ross on 2017-06-20 (Amber Archive Jun 2017)

From: Bill Ross <ross.cgl.ucsf.edu>
Date: Tue, 20 Jun 2017 12:34:45 -0700

Can't folks just put a sleep in for e.g. node_num * 1 sec before getting
the seed?
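
A minimal sketch of that idea, assuming a hypothetical node_num supplied by the job launcher (for example a task index). The helper name is made up for illustration. The seed is still only the microsecond field, so this spreads out start times but does not by itself rule out two jobs landing on the same value.

```c
#include <assert.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: sleep node_num * 1 sec, then return the microsecond
 * field of the current time (0..999999) as the seed. */
int staggered_usec_seed(int node_num)
{
    struct timeval tv;

    assert(node_num >= 0);
    sleep((unsigned int)node_num);   /* node 0 does not wait at all */
    gettimeofday(&tv, NULL);
    return (int)tv.tv_usec;
}
```

Each job would call staggered_usec_seed() once with its own node_num and pass the result as an explicit ig value.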

Worst case collect seeds over a few days before you start the runs. :-)

Bill

On 6/20/17 12:27 PM, Chris Moth wrote:
> Thanks for thinking about it.
>
> To geek out a notch further.....
>
> Aside from the fast cluster scheduler, the microseconds _are_
> uniformly distributed :) At least, if you repeatedly call gettimeofday()
> (and nothing else), and stuff the results in an array, I seem to get all
> the microseconds back on fast hardware :) (which contradicts the
> reports google returns from a few years ago).
>
> Following the math at this site - which seems right:
>
> https://betterexplained.com/articles/understanding-the-birthday-paradox/
>
> Birthdays first:
>
> With 23 people there are 253 pairs: (23 x 22 / 2) = 253
>
> Chance of 2 people having different birthdays: (1 - 1/365) = (364/365)
>
> Making 253 comparisons and having them all be different is (364 / 365) ^
> 253 - which has probability of 0.4995
>
>
> Jobs on the cluster next:
>
> Extending the same math to 100 jobs with 1000000 randomly distributed
> microsecond start times :)
>
> With 100 jobs there are (100 * 99 / 2) = 4950 pairs
>
> Chance of 2 jobs having different microsecond assignments = (1-1/1000000) =
> (999999/1000000)
>
> Making 4950 comparisons and have them all be different is
> (999999/1000000) ^ 4950 = .9951
>
> OK - oops - it is one in 200 submissions of (100 jobs) that should see a
> duplicate microsecond assignment :) But still.......
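
The arithmetic above can be checked with a short sketch. The function names are made up for illustration. The first function is the pairwise approximation used in this message; the second is the exact product for independent, uniformly distributed draws, which is an assumption about how the start times behave.

```c
#include <assert.h>

/* Pairwise approximation: (1 - 1/slots) raised to the number of pairs,
 * where pairs = jobs * (jobs - 1) / 2. */
double p_all_distinct_pairwise(long jobs, double slots)
{
    assert(jobs >= 0 && slots >= 1.0);
    long pairs = jobs * (jobs - 1) / 2;
    double p = 1.0;
    for (long i = 0; i < pairs; i++)
        p *= 1.0 - 1.0 / slots;
    return p;
}

/* Exact value for uniform, independent draws:
 * (1 - 0/slots) * (1 - 1/slots) * ... * (1 - (jobs-1)/slots). */
double p_all_distinct_exact(long jobs, double slots)
{
    assert(jobs >= 0 && slots >= 1.0);
    double p = 1.0;
    for (long i = 0; i < jobs; i++)
        p *= 1.0 - (double)i / slots;
    return p;
}
```

With (23, 365) the pairwise form gives about 0.4995, and with (100, 1000000) about 0.9951, matching the numbers above. The exact product is slightly lower for the birthday case (about 0.4927) and nearly identical for the 100-job case.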
>
> And, if the scheduler launches them all in 1/100 of a second (which ours
> does not... yet) the picture gets worse for duplicates.
>
> With the current algorithm, the problem should only get "worse" as
> hardware improves, more parallel runs are requested, and their start
> times cluster closer and closer together on clusters.
>
> OK - back to real work :)
>
>
>
> On 06/20/2017 01:45 PM, Adrian Roitberg wrote:
>> Well, that is interesting...
>>
>> I always thought there was a small chance of this happening. However,
>> it requires either that many jobs get onto the nodes at the SAME
>> microsecond count, or that gettimeofday() on different nodes be
>> synchronized 'just so', to get you in trouble. I do not believe this
>> is the same as people's birthdays, since those are uniformly
>> distributed, etc etc.
>>
>> The fact that gettimeofday() is not accurate to the microsecond is
>> important for timing processes, but should not be as important for what
>> we use it for.
>>
>> Anyways, of your options below, I would not go with a or b, since those
>> are bound to be heavily machine dependent.
>>
>> c looks like a good idea, but you need to create patches for sander,
>> pmemd and pmemd.cuda
>>
>> Thanks for looking into this !
>>
>> Adrian
>>
>> On 6/20/17 2:29 PM, Chris Moth wrote:
>>> I have just had the "interesting" experience that "ig=-1" does not
>>> always generate unique random seeds, and I thought I should share that
>>> experience.....
>>>
>>> I have used "ig = -1" to randomize seeds for some time. I used it
>>> without hesitation as I worked through the ASMD tutorial here:
>>>
>>> http://ambermd.org/tutorials/advanced/tutorial26/
>>>
>>> However, intriguingly, on our high performance cluster at Vanderbilt,
>>> when I submit 100 jobs-at-a-time (an ASMD "stage"), I am seeing
>>> duplicate ig values returned every few hundred runs.
>>>
>>> This could be imaginably attributable to a combination of factors:
>>>
>>> 1) Just as in a room of 23 people, there is a 50% chance that 2 will
>>> share the same birthday... in a collection of 100 MD jobs, there is
>>> roughly a 1% chance that 2 will share the same microsecond start
>>> time. (Apologies in advance if I butchered some math.)
>>> 2) A high performance cluster, launching multiple simultaneous jobs "at
>>> once", could imaginably turn a 1% chance into a 10% chance on highly
>>> synchronized nodes.
>>> 3) The resolution of the gettimeofday() function (called from
>>> pmemd_clib.c) could be significantly coarser than one microsecond in
>>> practice (if Google is to be believed)
>>>
>>> https://www.google.com/#q=resolution+of+gettimeofday
>>>
>>> It is admittedly a nuisance issue. The choices are:
>>>
>>> a) Ignore the issue entirely. Statistically, it's likely not too
>>> important if only one in every 200 md runs is a duplicate run.
>>>
>>> b) Set random seeds with environment variables available in the high
>>> performance cluster or ASMD job launch ecosystem *instead* of
>>> "trusting" ig=-1 (examples: task IDs, (ASMD stage*10000 + ASMD_run),
>>> etc.), in which case we should update the ASMD tutorial so that ig=-1
>>> at least has a caution around it.
>>>
>>> c) Modify the pmemd code to enhance randomness of ig=-1, by adding
>>> entropy. (The current 0 to 999999 range is only using 20 bits of the 31
>>> that could be used).
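
Option (b) could be sketched as follows. The environment variable names in the usage note are hypothetical (real schedulers and ASMD scripts use their own), and both helper names are made up for illustration. The formula is the stage*10000 + run example from the list above, which assumes fewer than 10000 runs per stage.

```c
#include <assert.h>
#include <stdlib.h>

/* Deterministic seed from the ASMD stage and run numbers. */
int seed_from_stage_and_run(int stage, int run)
{
    assert(stage >= 0 && stage <= 200000);  /* keeps the result inside 31 bits */
    assert(run >= 0 && run < 10000);        /* keeps runs from colliding across stages */
    return stage * 10000 + run;
}

/* Reads a non-negative integer from an environment variable and returns
 * fallback if the variable is unset, empty, or not a plain number. */
int env_int_or(const char *name, int fallback)
{
    const char *text = getenv(name);
    char *end;
    long value;

    if (text == NULL || *text == '\0')
        return fallback;
    value = strtol(text, &end, 10);
    if (*end != '\0' || value < 0 || value > 2147483647L)
        return fallback;
    return (int)value;
}
```

A launcher could then compute seed_from_stage_and_run(env_int_or("ASMD_STAGE", 1), env_int_or("ASMD_RUN", 1)) and write the result as an explicit ig value into each input file.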
>>>
>>> In case "c" is interesting.... read on....
>>>
>>> Below is a sketch of code that honors the current microsecond concept,
>>> but adds another 1000 possibilities based on the contents of
>>> /dev/urandom on a linux system. Portability issues are rightly of great
>>> concern to the community. You could activate code like this in response
>>> to a new "ig = -2" possibility, or in response to the install-time
>>> "./configure" reporting that /dev/urandom is available. The code
>>> below does not require any new third party libraries (like a "better"
>>> entropy generation scheme, or a guid generator - might require)... and I
>>> think it will work on any linux system I am aware of from the last decade.
>>>
>>> Again, this code below is not intended to be _the_ solution - just some
>>> food-for-thought if the team should consider enhancing randomness beyond
>>> the current 0-999999 limited sys clock. You might want "ig = -3" (say)
>>> to init all 31 bits of the seed from /dev/urandom.........
>>>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <sys/time.h>
>>> #include <assert.h>
>>>
>>> int main(void)
>>> {
>>>     struct timeval my_tv;
>>>     ssize_t entropyRead;
>>>     int entropyBits;
>>>
>>>     int entropyFile = open("/dev/urandom", O_RDONLY);
>>>     assert(entropyFile != -1);
>>>
>>>     entropyRead = read(entropyFile, &entropyBits, sizeof(entropyBits));
>>>     assert(entropyRead == (ssize_t)sizeof(entropyBits));
>>>     close(entropyFile);
>>>     entropyBits &= 0x7fffffff; // Mask off sign bit
>>>     entropyBits %= 1000;       // Keep 0..999 so the seed stays below 10^9
>>>
>>>     // What you do today in pmemd_clib.c
>>>     gettimeofday(&my_tv, NULL);
>>>
>>>     printf("Today's random seed: %09d\n", (int)my_tv.tv_usec);
>>>     printf("Enhanced random seed: %09d\n",
>>>            (int)my_tv.tv_usec + entropyBits * 1000000);
>>>     return 0;
>>> }
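
The "ig = -3" idea mentioned above (all 31 bits of the seed from /dev/urandom) might look like the sketch below. The function name and the -1 failure value are assumptions for illustration, not anything in pmemd.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Returns a seed in [0, 2^31 - 1] read from /dev/urandom,
 * or -1 if the device cannot be opened or read. */
int seed31_from_urandom(void)
{
    uint32_t bits;
    ssize_t got;
    int fd;

    assert(sizeof(int) >= 4);        /* the 31-bit mask assumes a 32-bit int */
    fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1)
        return -1;
    got = read(fd, &bits, sizeof bits);
    close(fd);
    if (got != (ssize_t)sizeof bits)
        return -1;
    return (int)(bits & 0x7fffffffu);  /* clear the top bit so the value is non-negative */
}
```

A caller could fall back to the existing microsecond seed whenever this returns -1, which would keep the behaviour unchanged on systems without /dev/urandom.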
>>>
>>>
>>> _______________________________________________
>>> AMBER mailing list
>>> AMBER.ambermd.org
>>> http://lists.ambermd.org/mailman/listinfo/amber
>

Received on Tue Jun 20 2017 - 13:00:02 PDT