[Haskell-cafe] Re: sort and lazyness (?)

Fri Dec 19 09:58:52 EST 2008

Bayley, Alistair wrote:
>> Currently, I'm experiencing what I would call "strange behaviour":
>>
>> I've got a data-type
>>
>> data Fraction = Fraction Int Int
>>
>> to hold rational numbers (maybe there's already some built-in 
>> type for this in Haskell,
> 
> http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Ratio.html

Thanks for the pointer, I knew there would be something already there :)

>> This list has up to 3 million elements.  If I do
>>
>> main = print $ length $ points
>>
>> main = print $ length $ map (\(x, _) -> x == Fraction 1 2) points
>>
>> main = print $ length $ reverse points
>>
>> However, trying to do
>>
>> import List
>> main = print $ length $ sort points
>>
>> makes memory usage go up and the program does not finish in 15m, also 
>> spending most time waiting for swapped out memory.  What am I doing 
>> wrong, why is sort this expensive in this case?  I would assume that 
>> computing and holding the whole list does not take too much memory, 
>> given its size and data type; doing the very same calculation in C 
>> should be straight forward.  And sort should be O(n * log n) for time 
>> and also not much more expensive in memory, right?
> 
> Not having looked at your code, I think you are benefiting from
> fusion/deforestation in the first three main functions. In this case,
> although you may appear to be evaluating the entire list, in fact the
> list elements can be discarded as they are generated, so functions like
> length and reverse can run using constant space, rather than O(n) space.

How does reverse work in constant space?  At the moment I can't imagine 
it doing so; that's why I tried it, but of course you could be right.

> The sort function, however, requires that the entire list is retained,
> hence greater memory usage. I also think you are optimistic in the
> memory requirements of your 3 million element list. A list of Ints will
> take a lot more than 4 bytes per element (on 32-bit machines) because
> there's overhead for the list pointers, plus possibly the boxes for the
> Ints themselves. I think there are 3 machine words for each list entry
> (pointer to this element, pointer to next element, info-table pointer),
> and 2 words for each Int, but I'm just guessing:
> http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObje
> cts

Of course that's the case, but the list being 3 million elements, and 
not, say 100 (which would still fit into memory for a simple C array of 
ints) I thought would make it possible.  Otherwise, how can one handle 
such amounts in data anyway?  Only using arrays?

> You might get some mileage by suggesting to GHC that your Fraction type
> is strict e.g.
> 
>> data Fraction = Fraction !Int !Int
> 
> which might persuade it to unbox the Ints, giving some space savings.

I already tried so, but this doesn't change anything to the performance. 
  I will however try now to use the provided rational type, maybe this 
helps.

Thanks for the answers,
Daniel