Updated 2012-06-10 13:18:22 by RLE

Many applications require random numbers.

This includes -

  • Games of chance: You want simulated dice to behave as unpredictable as their real counterpart.
  • Monte Carlo simulations: Algorithms are tested with random input.
  • Cryptography: Random numbers make "one time pads" perfectly secure.

The deterministic nature of computers makes it impossible for computer software to come up with truly random numbers.

Hardware devices exist that produce random numbers from unpredictable, or unpredictably chaotic, physical effects such as atmospheric noise, radioactive decay, or even the movement of liquids in a common lava lamp. Some operating systems (e.g., Linux) derive reasonably random numbers from user behavior, such as the time between keyboard inputs (i.e., something like "nanoseconds between hitting two keys"), or mouse movement.

True Random Numbers downloads true random numbers from a free internet service and touches the subject of using FM radio noise on a soundcard, which is really pretty uncorrelated and random and can generate quite some data per second.

Without access to such a device, the best a software can do is to use pseudo-random numbers (PRNs) which are produced by mathematical algorithms, called pseudo-random number generators (PRNGs). PRNGs generate an infinite sequence of pseudo-random numbers. There is generally the option to initialize a PRNG with a known "seed" value, so that, if initialized with the same seed, the same sequence of numbers is generated.

Because of their importance in various fields, there is a large body of literature on the subject. The canonical reference is Donald E. Knuth's The Art of Computer Programming, Volume 2: Seminumerical Algorithms.

Two of the most important properties that sequences of random numbers must, or should, satisfy are unpredictability and unbiased. The sequence of integer numbers {1, 2, 3, 4, ...} is perfectly unbiased but predictable. The sequence of daily outside temperature readings is unpredictable but biased, as the range of numbers is narrow, and the readings between two days are usually close. Because of such subtleties, the (probabilistic) proof that a sequence of numbers is reasonably random is very hard.

SG: The preceding paragraph requires some clarifications:

  • Monte Carlo simulations are worthless if they're run with true random numbers. All scientific processes (including Monte Carlos) must be repeatable. It must be possible to change some parameter in a simulation and re-run the simulation with the exact same random numbers. Or to run the simulation with known sets of random numbers to find what distinguishes those sets that yield certain particular results. Which is why the pinnacle of MC is an extremely good PRNG, not some kind of "true RNG".
  • For cryptographic purposes, a good, modern PRNG is practically almost always better than any "true random numbers". That's because true random numbers are very difficult to unbias. Also PRNGs are good to arbitrarily many digits, while real measured atmospheric noise is always limited by the measurement equipment. And great care has to be taken to make sure the individual measurements are indeed independent of each other. Which is why many advanced tests for randomness find good, modern PRNGs to be "more random" than so-called "true random numbers".

Also see randomizing a list

Random numbers allow for perfectly secure, unbreakable cryptography, using "one-time pads." In use for decades, first on paper, now on computers, a one-time pad contains a sequence of random numbers. Only two copies of a one-time pad exist, one is with the sender of a message, and one with the recipient.

In order to encrypt a message, the sender takes each letter of the plaintext message, and transposes it with one random number. In the most simple case, assuming only letters, a random number between 0 and 25 is added to the ordinal of the letter, modulo 26, to produce one letter of the encrypted message.

The receiver uses the same pad, applying the inverse transformation (by subtracting the random number, modulo 26), to decrypt the message.

If the numbers are truly random, and if each pad is only used once, this scheme is unbreakable. If a pad is used more than once, an attacker can leverage frequency analysis to determine the contents of the pad and gain access to the messages. If the numbers are not perfectly random, then but produced by a predictable source such as a pseudo-random number generator, decryption of the message is equivalent to breaking the generator's algorithm and parameters.

The problem with one-time pads is the amount of logistics involved, as the pads must be produced, and then distributed to sender and recipient via secure channels. And whoever gets his hand on the pads can produce or read messages -- like the clerk who makes a copy before forwarding the pads.

A common substitute for true one-time pads is to use a pseudo-random number generator, exploiting the property that a PRNG generates the same sequence of pseudo-random numbers when initialized with the same seed value. If sender and receiver agree on a PRNG algorithm, then they only need to share the (small) seed value to generate arbitrary-length "pseudo-one-time pads" using a secret channel.

Eavesdroppers then have two vectors of attack. First, they can exploit weaknesses (predictabilities) in the PRNG algorithm, and second they can reverse-engineer the seed value.

Tcl provides a pseudo-random number generator (called rand) as part of the expr command:
 expr rand()

returns a random number between 0 and 1.

Tcl's pseudo-random number generator highlights a weakness that applies to all pseudo-random number generators: it has to be initialized, or "seeded." The initial seed value fully determines the sequence of numbers that is generated. If the generator is seeded with the same value over and over, it will keep producing the same random numbers. Also, by design, it never produces the same number twice in a row -- unlike dice, which can come up with the same number many times in a row without being biased.

Tcl's random number generator is as bad as many, and worse than some, in that it only accepts a single 31 bit integer as a seed value. In other words, there are only 2^31 possible random number sequences, and they repeat after a period of 2^31 (in fact, numbers never repeat in less than this period).

Applying the lessons learned from the discussion of one-time pads above, Alice and Bob can choose Tcl's pseudo-random number generator. When Alice sends a message, she can then encrypt it using the seed value "42", and sends the encrypted message to Bob. She then communicates the seed value that she used over a secure channel. The message could be on a CD sent by mail, and the "42" is exchanged in a whisper during an inconspicious meeting.

However, if an attacker gets his hands on the message, and the attacker knows that Tcl's PRNG is being used, he could make a simple brute-force attempt over all 2^31 seed values, checking if the decryption yields legible output (easily done by computer).

Even when used outside the context of one-time pads, it is highly desirable that pseudo-random numbers are unpredictable. For example, if you are running an online Casino operation, you don't want your human players to have the ability to predict the cards that your computer players hold.

This leaves us with two problems to solve for better pseudo-random number generators:

  • First, we need a better algorithm; one that allows for seed values greater than 2^32, so that it becomes impractical to brute-force the entire seed value space.
  • Second, we need unpredictable seed values.

Obviously, there is a certain chicken and egg problem: in order to have unpredictable pseudo-random numbers, a pseudo-random number generator needs to be initialized with an unpredictable seed.

In fact, that is a good way to think about PRNGs: they provide a large amount of good pseudo-random numbers from a small amount of (pseudo-) random numbers. Coming full circle, PRNGs are frequently used even when a true random number generator is available. Many true random number generators are relatively slow (think of the lava lamp example, it takes a lot of time to "collect entropy"), so only a few random numbers are collected, and then used as a seed value to generate millions or billions of pseudo-random numbers.

Candidates for a better algorithm are

2004-Dec-10 drh: I recommend rc4. It is well-known and well studied. And it is easy to stir in new randomness obtain (for example) from %x and %y of mouse bindings or from the low-order bits of clock clicks taken after various events.

2004-Dec-13 rwm: A lot of systems just defer to the FIPS approved PRNG [1] algorithms. The digital signature standard mentioned uses sha-1 as a mixing function.

Thoughts on generating reasonably good pseudo-random seed values

Many applications cheat by using something like the time of day (e.g., [clock seconds]) as a seed value. However, that is about the worst thing that can be done. Because [clock seconds] returns a number in a very narrow range, an attacker can make a good prediction and narrow down the value space that needs to be tested by a brute-force search. Using a higher resolution timer, such as [clock clicks], improves the issue by a cryptographically insignificant margin only.

KPV In trying to get a truly random number (for a seed, etc.) a good way to think is in terms of random bits. There are many traits of a computer or procedures the user can perform that can get you some degree of good randomness. If you combine enough of these, you have a good start. For example, [clock seconds] is guessable to within a few seconds thus supplying one or two bits of randomness. Other traits, such as process id, parent process id, disk free space, etc., and user tasks, such as typing or moving the mouse, all can supply few bits of good randomness. The trick is combining enough of these random bits can produce a good random number.

The scheme that FPX currently uses to compute a seed value goes as follows:
 set seed [clock clicks]
 # the user has to complete a dialog box here
 append seed [clock clicks]
 append seed [clock seconds] [pid]
 append seed [winfo id .] [winfo geometry .] [winfo pointerxy .]
 set hashedseed [binary format H* [sha1::sha1 $seed]]

The $hashseed value is then used to initialize the PRNG. (Note that the implementations of the Mersenne Twister and ISAAC algorithms mentioned above accept an arbitrary-length binary string as seed value.)

I did not fully analyze the above code for randomness yet. Each contributing value should be examined for its randomness properties. I.e., the lower and upper bounds that an attacker (local vs. remote) can estimate on each value should be analyzed, or in other words, the number of "unpredictably random" bits that each value contributes to the seed. Ideally, the sum of random bits in the seed should be at least 128, a number that is clearly not achieved by the above code.

One idea that I was tossing around was to display a sketchpad canvas, and to ask the user to draw some random figures. The pixels' coordinates could then be used to produce a seed. However, I wanted the above to be non-intrusive.

Discussion? Better ideas?

LES Maybe I am very naïve, but I really believe that a programmer can come up with a very good level of randomness. For example: if you prompt the user for six letters, assign an arbitrary (secret) numeric value to each letter, add the three first values and multiply the other three, then add and/or multiply everything by [clock seconds], then add or multiply it by [clock format [clock sec] -format %d], then apply some really arcane operation on the result so that only two or three digits are left, boy, isn't that random enough? Note the extreme idiosyncrasy of my formula. I can come up with ten other similarly bizarre formulas in five or ten minutes. I really don't believe that someone could crack something like that.

FPX Do not mistake obscurity for security. Your code above depends on the secrecy of the algorithm, and that is impossible to achieve. Any attacker can examine the code, so your assumption of "secret numeric values" does not hold. All the remaining transformations are reversible, and the current value of [clock seconds] is not hard to guess. So your numbers are definitely not secure. Whether they provide a reasonable source of randomness (remember, unpredictable and unbiased) should be subject of analysis. Do not assume randomness without in-depth analysis.

2004/12/10 sheila My latest issue of Science News has an article on random number generation [2] (link goes to a pay-wall on 2012-06-10) and a visualization of a good random number generator

vs a bad one

(image link is 404 on 2012-06-10)

Good PRNGs are used for other purposes than just cryptography, e.g. in high energy physics simulations.

see where random numbers flop for a source example of the predictability of random numbers in tcl, even when properly srand()'d

Initially written by FPX.

diehard

EKB Pure-tcl shuffled generators are available at Sequential correlations and the output of rand. These have a longer (but, IIRC, not calculable) period than the linear congruential generator on which they are based.

Lars H: Actually, one shouldn't expect those shuffled generators to have a longer period that the generator they shuffle (even though it is not unreasonable that the period might come out as a small multiple of the base period). Instead the shuffling should be seen as a way to destroy patterns (sequential correlaions and such) that turn up in the common low-quality PRNGs (that tend to be built into programming languages).

EKB I stand corrected! I should point out to others that Lars H has made a detailed argument on the Sequential correlations and the output of rand. Well worth reading.

EKB FYI, TIP #310 is "Add a New Pseudo-Random Number Generator".

APN For Windows, TWAPI 2.1.4 includes an interface to the Win32 CryptGenRandom function - (supposedly) a cryptographically secure generator.