Claude Heiland-Allen claude at goto10.org
Wed Dec 22 15:09:29 CET 2010

```Hi everyone,

I've been working on  Haskell bindings for  libqd for 
double-double and quad-double arithmetic, and have been struggling to
implement  RealFloat, in particular  decodeFloat, mostly because
of its postcondition but also some issues relating to variable precision:

---8<---
If decodeFloat x yields (m,n), then x is equal in value to m*b^^n, where
b is the floating-point radix, and furthermore, either m and n are both
zero or else b^(d-1) <= m < b^d, where d is the value of floatDigits x.
In particular, decodeFloat 0 = (0,0).
---8<---

(BTW: that should probably really be "... <= abs m < ..."; perhaps this
code could be added to the report and/or documentation, if it is correct:

validDecode :: RealFloat f => f -> Bool
validDecode f = case decodeFloat f of
(0,0) -> True
(m, e) -> b^(d-1) <= abs m && abs m < b^d
where
d = floatDigits f

)

The double-double and quad-double types do not have a fixed precision,
though they do have a fixed minimum precision: this means that
decodeFloat will (in general) lose some precision.  This is because for
example in:

data DoubleDouble = DoubleDouble !CDouble !CDouble
dd = DoubleDouble a b

the semantics are that "dd = a + b", with |a|>>|b|; coupled with the
IEEE implicit leading 1 bit, this means that there may be large gaps
between exponents: for example: "1 + 0.5**100 :: DoubleDouble".

So far I've got a "mostly working" implementation thus:

decodeFloat !(DoubleDouble a b) =
case (decodeFloat a, decodeFloat b) of
((0, 0), (0, 0)) -> (0, 0)
((0, 0), (m, e)) -> (m `shiftL` f, e - f)
((m, e), (0, 0)) -> (m `shiftL` f, e - f)
((m1, e1), (m2, e2)) ->
let fixup m e =
if m < mMin
then fixup (m `shiftL` 1) (e - 1)
else if m >= mMax
then fixup (m `shiftR` 1) (e + 1)
else (m, e)
mMin = 1 `shiftL` (ff - 1)
mMax = 1 `shiftL` ff
ff = floatDigits (0 :: DoubleDouble)
g = e1 - e2 - f
in  fixup ((m1 `shiftL` f) + (m2 `shiftR` g)) (e1 - f)
where
f = floatDigits (0 :: CDouble)

This does meet the postcondition as specified (which leads to breakage
in other RealFloat methods), but has a recursion with no termination
proof (so far), and is lossy in general:

> let check f = uncurry encodeFloat (decodeFloat f) == f
> check (1 + 0.5 ** 120 :: DoubleDouble)
False

It does however seem to meet a weaker condition:

> let check2 f = (decodeFloat . (`asTypeOf` f) . uncurry encodeFloat
. decodeFloat \$ f) == decodeFloat f
> check2 (1 + 0.5 ** 120 :: DoubleDouble)
True

Questions:

1. Is this weaker condition likely to be good enough in practice?
2. Can my implementation of decodeFloat be simplified?

Thanks for any insights,

Claude

 http://crd.lbl.gov/~dhbailey/mpdist/
 http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf