Proper scaling of randoms

Mon, 6 May 2002 16:49:55 -0700 (PDT)

This message derives an integer interval mapping function that is not
only provably within the given range but also "uniform". This is the
function to use when mapping random integers from one interval to a
smaller one.

Problem: given an integer n within [0..maxn], design a scaling
function sc(n) that returns an integer within [s..t], t>=s.
The function sc(n) must be 'monotonic': 
	0<=a < b ==> sc(a) <= sc(b)
and it must map the ends of the interval:
	sc(0) -> s and sc(maxn) -> t.

One such function has been designed by Simon Thompson in his book:
	sct(n) = n `div` ((maxn+1) `div` (t-s+1)) + s

Another function has been derived in the previous message on this
list:
	sc1(n) = (n*(t-s)) `div` maxn + s

The sct() function has a serious drawback: it is not guaranteed that
sct(n) <= t (more about it below). The sc1() function is provably
within the given range, and monotonic. Alas, it is not "uniform."

When it comes to mapping of random integers, the mapping function must
possess an extra property: it must "map uniformly." That is, given
random numbers within [0..maxn] such that Prob(n==i) - const, we must
aim at Prob(sc(n)==i') - const forall i' in [s..t]. The mapping
function sc1(n) does not have this property.

Any mapping function sc(n) from [0..maxn] to [s..t] for (t-s+1) <
(maxn+1) maps several integers from the source interval to one integer
of the target interval. An integer of the target interval therefore denotes an
equivalence class of the integers from the source interval.
The uniformity property will be satisfied only if the equivalence
classes have the same size. This is only possible if (maxn+1) `rem`
(t-s+1) == 0.

Given this condition, the original Thompson's algorithm works.  If
(t-s+1) divides (maxn+1) exactly (i.e., exists denom
integer. denom*(t-s+1) == maxn+1), we have
    j*denom <= n <= j*denom+denom-1 ==> sct(n) = j+s, all j=0..(t-s) 
Thus each equivalence class has the size denom, and mapping is
perfectly uniform. As a consequence, s <= sct(n) <= t, given our
assumption (maxn+1) `rem` (t-s+1) == 0.

If this assumption does not hold, sct(n) may be greater than t (which
Dimitre Novatchev has demonstrated). Alas, in this case there is no
perfectly uniform mapping. The Thompson's algorithm misbehaves. The
worst case for the algorithm is when t = s + (maxn+1)/2. For example,
this occurs when we try to map [0..65535] to [0..32768]. In this case,
t-s+1 = 32769, and denom = 65536 `div` 32769 == 1, thus the mapping
becomes sct(n) = n. In almost half of the time, this mapping exceeds
the upper limit 32768.

OTH, the mapping sc1(n) = n*(maxn+1)/2 `div` maxn + s behaves in this
pathological case rather well. Not only sc1(n) <= t (this is always
guaranteed) but also the mapping is uniform: all equivalence classes
of the source interval [0..65535] except the last two classes have the
size of two.

Thus the best solution should be to "average" sc1(n) and sct(n).
This "averaging" is done by the formula
	sca = (n*(t-s) + (n*(denom-1) `div` denom)) `div` maxn + s
		where denom = (maxn+1) `div` (t-s+1)

It is easy to see that sca(0) -> s.
sca(maxn) = (maxn*(t-s) + (maxn*(denom-1) `div` denom)) `div` maxn + s
          = (maxn*(t-s) + p) `div` maxn + s
	    {where 0<= p = (maxn*(denom-1) `div` denom) <= maxn-1 }
	  = t-s+s
	  = t
It is easy to show that sca(n) is monotonic (as defined above).

If maxn+1 is evenly divisible by (t-s+1) (that is, denom*(t-s+1) ==
maxn+1), we have
    j*denom <= n <= j*denom+denom-1 ==> sca(n) = j+s, all j=0..(t-s) 

This requires a proof.
We have:
	sca(j*denom) 
      = (j*denom*(t-s) + (j*denom*(denom-1) `div` denom)) `div` maxn + s
      = (j*denom*(t-s) + j*(denom-1)) `div` maxn + s
      = j*(denom*(t-s) + denom-1) `div` maxn + s
      = j*maxn `div` maxn + s
      = j+s

	sca(j*denom+denom-1) 
      = ((j*denom+denom-1)*(t-s) + ((j*denom+denom-1)*(denom-1) `div` denom)) `div` maxn + s
      = ((j*denom+denom-1)*(t-s) + (j+1)*(denom-1)-1) `div` maxn + s
      = (j*maxn + (denom-1)*(t-s+1)) `div` maxn + s
      = j+s
The required result then follows by the monotonicity property.

In the most pathological case of maxn= 65535, s=0, t= 32768.
We have denom = 1 and sca(n) becomes sc1(n), which, in this case,
behaves very well.

Thus the sca mapping is the optimal one. It is guaranteed to keep the
result within the specified bounds at all times. When the perfect
uniformity of mapping is possible, it is always achieved.

The implementation isn't too complex either.

> scaleSequence :: Int -> Int -> [Int] -> [Int]
> scaleSequence s t
>   = map scale
>     where
>       scale n = (n*range + ((n*(denom-1)) `div` denom)) `div` maxn + s
>       range   = t - s
>       maxn    = modulus - 1
>       modulus = 65536
>	denom   = (maxn+1) `div` (t-s+1)

Main> any(>966) (scaleSequence 1 966 [0..65535])
False
Main> any(==966) (scaleSequence 1 966 [0..65535])
True 
Main> any(>32768) (scaleSequence 0 32768 [0..65535])
False 

> test = let sq = [32768-1..32768+1600] in
>   let result =
>        map (\x -> let t = length x in (t `seq` t)) $! groupBy (\(a,_) (b,_) -> a == b) $!
>        sortBy (\(a,_) (b,_) -> compare a b) $! 
>        zip (scaleSequence 0 32768 sq) $! sq
>   in (maximum $! result, minimum $! result)
==> (2,1) -- e.g., the mapping is roughly uniform