Re: [AMBER] Duplicate random seeds with ig=-1 Code enhancement to think about, perhaps

From: Adrian Roitberg <roitberg.ufl.edu>
Date: Tue, 20 Jun 2017 15:38:41 -0400

One can, but again, that is system dependent, so we do not want to
depend on that.


On 6/20/17 3:34 PM, Bill Ross wrote:
> Can't folks just put a sleep in for e.g. node_num * 1 sec before getting
> the seed?
>
> Worst case collect seeds over a few days before you start the runs. :-)
>
> Bill
>
>
> On 6/20/17 12:27 PM, Chris Moth wrote:
>> Thanks for thinking about it.
>>
>> To geek out a notch further.....
>>
>> Aside from the fast cluster scheduler, the microseconds _are_
>> uniformly distributed :) At least, if you repeatedly call gettimeofday()
>> (and nothing else), and stuff the results in an array, I seem to get all
>> the microseconds back on fast hardware :) (which contradicts the
>> reports google returns from a few years ago).
>>
>> Following the math at this site - which seems right:
>>
>> https://betterexplained.com/articles/understanding-the-birthday-paradox/
>>
>> Birthdays first:
>>
>> With 23 people there are 253 pairs: (23 x 22 / 2) = 253
>>
>> Chance of 2 people having different birthdays: (1 - 1/365) = (364/365)
>>
>> Making 253 comparisons and having them all be different is (364 / 365) ^
>> 253 - which has probability of 0.4995
>>
>>
>> Jobs on the cluster next:
>>
>> Extending the same math to 100 jobs with 1000000 randomly distributed
>> microsecond start times :)
>>
>> With 100 jobs there are (100 * 99 / 2) = 4950 pairs
>>
>> Chance of 2 jobs having different microsecond assignments = (1-1/1000000) =
>> (999999/1000000)
>>
>> Making 4950 comparisons and having them all be different is
>> (999999/1000000) ^ 4950 = .9951
>>
>> OK - oops - it is one in 200 submissions of (100 jobs) that should see a
>> duplicate microsecond assignment :) But still.......
>>
>> And, if the scheduler launches them all in 1/100 of a second (which ours
>> does not... yet) the picture gets worse for duplicates.
>>
>> With the current algorithm, the problem should only get "worse" as
>> hardware improves, more parallel runs are requested, and their start
>> times cluster closer and closer together on clusters.
>>
>> OK - back to real work :)
>>
>>
>>
>> On 06/20/2017 01:45 PM, Adrian Roitberg wrote:
>>> Well, that is interesting...
>>>
>>> I always thought there was a small chance of this happening. However,
>>> this requires that many jobs get onto the nodes at the
>>> SAME microsecond count, or that the gettimeofday() from different nodes
>>> be synchronized 'just so' to get you in trouble. I do not believe this
>>> is the same as people's birthdays, since those are uniformly
>>> distributed, etc etc.
>>>
>>> The fact that the gettimeofday() is not accurate to the microseconds is
>>> important for timing processes, but should not be as important for what we
>>> use it for.
>>>
>>> Anyways, of your options below, I would not go with a or b, since those
>>> are bound to be heavily machine dependent.
>>>
>>> c looks like a good idea, but you need to create patches for sander,
>>> pmemd and pmemd.cuda
>>>
>>> Thanks for looking into this !
>>>
>>> Adrian
>>>
>>> On 6/20/17 2:29 PM, Chris Moth wrote:
>>>> I have just had the "interesting" experience that "ig=-1" does not
>>>> always generate unique random seeds, and I thought I should share that
>>>> experience.....
>>>>
>>>> I have used "ig = -1" to randomize seeds for some time. I used it
>>>> without hesitation as I worked through the ASMD tutorial here:
>>>>
>>>> http://ambermd.org/tutorials/advanced/tutorial26/
>>>>
>>>> However, intriguingly, on our high performance cluster at Vanderbilt,
>>>> when I submit 100 jobs-at-a-time (an ASMD "stage"), I am seeing
>>>> duplicate ig values returned every few hundred runs.
>>>>
>>>> This could be imaginably attributable to a combination of factors:
>>>>
>>>> 1) Just as in a room of 23 people, there is a 50% chance that 2 will
>>>> share the same birthday... in a collection of 100 MD jobs, there is
>>>> approximately a 1% chance that 2 will share the same microsecond start
>>>> time (Apologies in advance if I butchered some math.)
>>>> 2) A high performance cluster, launching multiple simultaneous jobs "at
>>>> once", could imaginably turn a 1% chance into a 10% chance on highly
>>>> synchronized nodes.
>>>> 3) The resolution of the gettimeofday() function (called from
>>>> pmemd_clib.c) could be significantly coarser than one microsecond in
>>>> practice (if google is to be believed)
>>>>
>>>> https://www.google.com/#q=resolution+of+gettimeofday
>>>>
>>>> It is admittedly a nuisance issue. The choices are:
>>>>
>>>> a) Ignore the issue entirely. Statistically, it's likely not too
>>>> important if only one in every 200 md runs is a duplicate run.
>>>>
>>>> b) Set random seeds with environment variables available in the high
>>>> performance cluster or ASMD job launch ecosystem *instead* of
>>>> "trusting" ig=-1 (examples: task IDs, (ASMD stage*10000 + ASMD_run),
>>>> etc) (in which case we should update the ASMD tutorial - so that ig=-1
>>>> at least has a caution around it)
>>>>
>>>> c) Modify the pmemd code to enhance randomness of ig=-1, by adding
>>>> entropy. (The current 0 to 999999 range is only using 20 bits of the 31
>>>> that could be used).
>>>>
>>>> In case "c" is interesting.... read on....
>>>>
>>>> Below is a sketch of code that honors the current microsecond concept,
>>>> but adds another 1000 possibilities based on the contents of
>>>> /dev/urandom on a linux system. Portability issues are rightly of great
>>>> concern to the community. You could activate code like this in response
>>>> to a new "ig = -2" possibility, or in response to the install-time
>>>> "./configure" reporting that /dev/urandom is available. The code
>>>> below does not require any new third party libraries (as a "better"
>>>> entropy generation scheme, or a guid generator, might)... and I
>>>> think it will work on any linux system I am aware of from the last decade.
>>>>
>>>> Again, this code below is not intended to be _the_ solution - just some
>>>> food-for-thought if the team should consider enhancing randomness beyond
>>>> the current 0-999999 limited sys clock. You might want "ig = -3" (say)
>>>> to init all 31 bits of the seed from /dev/urandom.........
>>>>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include <sys/types.h>
>>>> #include <sys/stat.h>
>>>> #include <fcntl.h>
>>>> #include <sys/time.h>
>>>> #include <assert.h>
>>>>
>>>> int main(void)
>>>> {
>>>>     struct timeval my_tv;
>>>>     int entropyRead;
>>>>     int entropyBits;
>>>>
>>>>     int entropyFile = open("/dev/urandom",O_RDONLY);
>>>>     assert(entropyFile != -1);
>>>>
>>>>     entropyRead = read(entropyFile,&entropyBits,sizeof(entropyBits));
>>>>     assert (entropyRead == sizeof(entropyBits));
>>>>     close(entropyFile);
>>>>     entropyBits &= 0x7fffffff; // Mask off sign bit
>>>>     entropyBits %= 1000;       // Keep 0..999 so the combined seed fits in 31 bits
>>>>
>>>>     // What you do today in pmemd_clib.c
>>>>     gettimeofday(&my_tv,NULL);
>>>>
>>>>     printf("Today's random seed: %09d\n",(int)my_tv.tv_usec);
>>>>     printf("Enhanced random seed: %09d\n",(int)my_tv.tv_usec +
>>>>            entropyBits * 1000000);
>>>>     return 0;
>>>> }
>>>>
>>>>
>>>> _______________________________________________
>>>> AMBER mailing list
>>>> AMBER.ambermd.org
>>>> http://lists.ambermd.org/mailman/listinfo/amber

-- 
Dr. Adrian E. Roitberg
University of Florida Research Foundation Professor
Department of Chemistry
University of Florida
roitberg.ufl.edu
352-392-6972
Received on Tue Jun 20 2017 - 13:00:03 PDT