-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
uniform_real_distribution<double> produces identical values too often #61
Comments
On Nov 26, 2019, at 12:12, danieltuzes ***@***.***> wrote:
Sometimes I get the same random number from a uniform distribution using a Mersenne Twister engine even I properly used the engine and iterated it. I know that the number of possible states of the engine is finite and number of possible generated values is also finite, but this is not the case now.
Using boost's implementation, 1e6 number of uniformly distributed random values are generated on the range [0; 1e7). That means that there way more possible values than required number of random values. However, I get quite often the same values, sometimes more than 100 times in this range. How is it possible?
Congratulations! You have rediscovered that most of the available generators (both in boost and std::random) are not capable of full double precision.
The specific generator, mt19937, natively produces 32-bit numbers, so when they are promoted to doubles via (x)/2^32 * 10^7 there are only
a few billion distinct values. The number of collisions that your test produces, about 100,
for a 10^6 samples seems about right based on the birthday paradox - but would be nice to check vs the exact formula.
If you need a true 64-bit generator which passes all statistical tests, use my MIXMAX, it is a drop-in addition to [boost][random]
at this point in time, at
https://github.com/kotika/random
Best Regards,
Kostas Savvidis
============================================================================================
Institute of Nuclear and Particle Physics
NCSR Demokritos
http://inspirehep.net/author/profile/K.G.Savvidy.1
https://github.com/kotika/random
https://mixmax.hepforge.org
|
How would you than describe the cases I mentioned at It is the distribution generator? I.e., if I use std's MT engine and pass it to std's uniform distribution, I get the proper distribution, but if I pass it to boost's uniform distribution, I get the issue? Statistics are available on SO. Could you please suggest a way (a complete example with the same features I posted above), using boost, to generate uniformly distributed double values on a range? Bc it seems ATM that it is not possible with boost's (random and) uniform distribution, but with std's (random and) uniform distribution. Or did you mean that it is impossible with boost ATM, and I should use copy your code code and forget about boost's implementation? |
On Nov 26, 2019, at 16:16, danieltuzes ***@***.***> wrote:
Could you please suggest a way (a complete example with the same features I posted above), using boost, to generate uniformly distributed double values on a range?
I dont think there can possibly be a problem with boosts distribution generation. It is a result of limited resolution of 32-bit, 31-bit, and 24-bit generators.
Here is a faster performing program which shows that there is no problem with mt19937_64 and mixmax.
Cheers,
Kostas
===================================================================================================================
#include <iostream>
#include <chrono>
#include <algorithm> // std::sort
#include <vector> // std::vector
#include <array>
#include <boost/random/mersenne_twister.hpp> // random number generator
#include <boost/random/ranlux.hpp>
#include <boost/random/mixmax.hpp>
#include <boost/random/uniform_real_distribution.hpp> // uniform distribution generator
using namespace std;
int main()
{
size_t seed = static_cast<int> (std::chrono::system_clock::now().time_since_epoch().count());
cout << "seed = " << seed << endl;
boost::random::mixmax engine(seed); // the random number generator engine
// boost::random::mt19937_64 engine(seed); // the random number generator engine
boost::random::uniform_real_distribution<double> u(0, 1e7); // uniformly distributed double values on the range [0; 1e7)
cout.precision(20);
const unsigned int p=1000000;
array<double,p> history; // stores the generated values for comparison
for (size_t i = 0; i < p; ++i)
{
history[i] = (u(engine));
}
std::sort (history.begin(), history.end());
int k=0;
for (size_t j = 0; j < p-1; ++j)
if (history[j] == history[j+1])
{cout << "Equal values ("<< history[j] <<") at ID = " << j << endl; k++; }
cout << "Found k=" << k << " equal values."<< endl;
}
|
I cannot test your code because my boost doesn't come with mixmax, so Your suggestion provides another way to generate numbers from the random number generator engine but uses the same distribution. As I pointed out, there is possibly no error with the engine but with the distribution. |
@kotika You are right about that it is only an effect of the too discrete values produced by the distribution, as it can be deduced from the figure. The figure shows that std's distribution (using boost's MT engine) follows well the expected cumulative distribution while boost's solution, the solid green line doesn't. The blue line is obtained by shifting the green line by the units q = 2.32831e-10 in x, supposing that an inner floor was applied somewhere, therefore the y points measured at x = x0, x1, ... correspond to x = x0+q, x1+q, ... But as I measured, the same boost MT engine with std's distribution produces a much better distribution. Both with the std from MSVS and g++. |
On Nov 26, 2019, at 19:24, danieltuzes ***@***.***> wrote:
@kotika <https://github.com/kotika> You are right about that it is only an effect of the too discrete values produced by the distribution, as it can be deduced from the figure <https://imgur.com/a/OjmFff0>.
The figure shows that std's distribution (using boost's MT engine) follows well the expected cumulative distribution while boost's solution, the solid green line doesn't. The blue line is obtained by shifting the green line by the units q = 2.32831e-10 in x, supposing that an inner floor was applied somewhere, therefore the y points measured at x = x0, x1, ... correspond to x = x0+q, x1+q, ...
I have not looked at GCC or MSVC implementations, but my first guess is that they are using generate_canonical
which is capable to produce good quality distributions from inferior generators. (at a huge performance cost)
My suggestion is to use a better generator, either MT64 or my MIXMAX.
MIXMAX for Boost is at https://github.com/kotika/random <https://github.com/kotika/random>
MIXMAX is also available for GSL, for ROOT, and inside specialized simulation packages Geant4 and PYTHIA.
Best Regards,
Kostas
|
It's boost's distribution, but I don't see off hand where the issue is. I tried the following code:
On MSVC, and there are thousands of collisions, but strangely the collision count on any given value never rises above 2. If I replace I have scanned through the 2 codebases - |
@peteroupc Have you tried your suggestion? Was it working? |
@kotika Once again, the MT engine is just fine. Boost's MT engine is good enough bc std's distribution produces good results with it. Std's MT engine is also good for the same reason. Boost's distribution is bad, bc on both good engine it produces bad results, bc the produced small numbers are not evenly distributed even for a 32bit float. You can still use a 64bit engine that hides this problem, but one should make it clear than that uniform_real_distribution doesn't produce even float-uniformly distributed values from a 32bit engine. |
I took it back, sorry. |
On Nov 26, 2019, at 23:09, danieltuzes ***@***.***> wrote:
one should make it clear than that uniform_real_distribution doesn't produce even float-uniformly distributed values from a 32bit engine.
There is no way to produce one random floating point value with double precision (53bits) from one random 32bit value.
The way GCC and MSVC chosen to do it is use two 32bit values and paste them together with generate_canonical.
This is the reason you were seeing those small differences between std and boost,
and the every-other-number-is-close phenomenon:
(std) (boost) (boost)
3354420.9766563926823 == 3354420.976247638464 + 1/2^32 * 1755586.0406719148159
8773544.1042856499553 == 8773544.1024415194988 + 1/2^32 * 7920481.4555123448372
So the problem is "solved" but at the cost of drawing from the generator twice to produce one double.
The true and permanent solution
is to deprecate old generators and use better high-resolution generators, which like I said, means at the moment: MT64 and MIXMAX.
Cheers,
Kostas
… On Nov 26, 2019, at 12:12, danieltuzes ***@***.***> wrote:
Note
A strange behavior can be seen if one compares the generated values with the one std::random generates. Example for values from random::boost for seed 4561565448989 is
1755586.0406719148159
3354420.976247638464 <--
3630764.0071026980877
3488445.2889673411846 <--
7920481.4555123448372
8773544.1024415194988 <--
while standard library generates
3354420.9766563926823 <--
3488445.2898126943037 <--
8773544.1042856499553 <--
|
Could you please give this as an answer on my SO question so I can accept it? Or do you mind if I copy it instead of you? |
I want to point out that For example, if
This is a suggestion that I have not tried, but I believe it may help reduce the risk of using random engines with 32-bit or smaller outputs.
The quality of a PRNG does not depend on the size of random numbers it outputs (e.g., 32- or 64-bit numbers), but rather it depends on its state size, its period, and whether the numbers behave like independent uniform random numbers. |
On Nov 27, 2019, at 17:06, Peter Occil ***@***.***> wrote:
This is a suggestion that I have not tried, but I believe it may help reduce the risk of using random engines with 32-bit outputs.
This "suggestion" is what gcc and MSVC implement, as I pointed out.
It costs more in performance than using a better generator.
Using a better generator is free, so do it. For example, my generator, MIXMAX
gives 61 bit integers, which is enough to produce 53bit
double precision numbers, is as fast to give you those 61 bits as MT gives a 32bit int,
is smaller in memory than MT and passes all statistical tests unlike MT.
(MT64 is also good enough, but at least on my machine, it is 3 times slower)
Cheers,
Kostas
|
Maybe this has been discovered already, but I got the same result using boost's Once I made sure |
If I generate uniformly distributed numbers with
boost::random::uniform_real_distribution<double>
it produces identical values too often. For further information, please read the original issue on SO and an example code on wandbox. I post my problem here as somebody suggested that it could be a bug. I am still not convinced about that but I am also not sure if there is any bug in my code that produces this strange behavior.I copy the question from SO here but follow the link for further updates and suggestions.
Problem description
Sometimes I get the same random number from a uniform distribution using a Mersenne Twister engine even I properly used the engine and iterated it. I know that the number of possible states of the engine is finite and number of possible generated values is also finite, but this is not the case now.
Using boost's implementation, 1e6 number of uniformly distributed random values are generated on the range [0; 1e7). That means that there way more possible values than required number of random values. However, I get quite often the same values, sometimes more than 100 times in this range. How is it possible?
Code
A simple code is provided to reproduce the situation. On both platforms I get the same problem:
Question
Is there a bug in the code that produces the same values? Or is it a bug in boost?
For my task it is important to generate numbers with uniform distribution. Finding identical values is one of the easiest tests but there are many more and I am sure I don't want to do quality analysis on a well-known library like Boost. I didn't want to use the standard library, because it is not guaranteed that two different compilers will give the same sequence for the same seed values, but it was a requirement for the task. What kind of a solution can you suggest?
Note
A strange behavior can be seen if one compares the generated values with the one
std::random generates
. Example for values fromrandom::boost
for seed 4561565448989 iswhile standard library generates
That is, every second generated value in the boost's sequence is very close to a corresponding value in the standard library's implementation. When two values in the boost-sequence are equal, the values in the standard-library-sequence are not equal, but close to each other. The similarity holds for MSVS and g++ compilers too, which have the right to have different implementation for Mersenne Twister and distributions.
It is the distribution generator
If the standard library's MT engine is used, but boost's distribution, the problem persists. But if the engine is the one from boost and the distribution is standard, the problem disappears. The problem is, as Peter pointed out, that the uniform distribution is platform depend for which I use boost.
The text was updated successfully, but these errors were encountered: