[Haskell-cafe] advice on space efficient data structure with
efficient snoc operation
Edward Kmett
ekmett at gmail.com
Tue Apr 7 10:45:07 EDT 2009
I'm in the process of adding a Data.Sequence.Unboxed to unboxed-containers.
I hope to have it in hackage today or tomorrow, should its performance work
out as well as Data.Set.Unboxed.
Be warned the API will likely be shifting on you for a little while, while I
figure out a better way to manage all of the instances.
-Edward Kmett
On Tue, Apr 7, 2009 at 7:49 AM, Manlio Perillo <manlio_perillo at libero.it>wrote:
> Hi.
>
> I'm still working on my Netflix Prize project.
>
> For a function I wrote, I really need a data structure that is both space
> efficient (unboxed elements) and with an efficient snoc operation.
>
> I have pasted a self contained module with the definition of the function
> I'm using:
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3453
>
>
> The movie ratings are loaded from serialized data, and the result is
> serialized again, using the binary package:
>
> transcodeIO :: IO ()
> transcodeIO = do
> input <- L.hGetContents stdin
> let output = encodeZ $ transcode $ decodeZ input
> L.hPut stdout output
>
> (here encodeZ and decodeZ are wrappers around Data.Binary.encode and
> Data.Binary.decode, with support to gzip compression/decompression)
>
>
> This function (transcodeIO, not transcode) takes, on my
> Core2 CPU T7200 @ 2.00GHz:
>
> real 30m8.794s
> user 29m30.659s
> sys 0m10.313s
>
> 1068 Mb total memory in use
>
>
> The problem here is with snocU, that requires a lot of copying.
>
> I rewrote the transcode function so that the input data set is split in N
> parts:
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=3453#a3456
>
> The mapReduce function is the one defined in the Real World Haskell.
>
>
> The new function takes (using only one thread):
>
> real 18m48.039s
> user 18m30.901s
> sys 0m6.520s
>
> 1351 Mb total memory in use
>
>
> The additional required memory is probably caused by unionsWith, that is
> not strict.
> The function takes less time, since array copying is optimized.
> I still use snocU, but on small arrays.
>
> GC time is very high: 54.4%
>
>
> Unfortunately I can not test with more then one thread, since I get
> segmentation faults (probably a bug with uvector packages).
>
> I also got two strange errors (but this may be just the result of the
> segmentation fault, I'm not able to reproduce them):
>
> tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 ||
> (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)'
> failed.
>
>
> internal error: removeThreadFromQueue: not found
> (GHC version 6.8.2 for i386_unknown_linux)
> Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
>
>
>
> Now the question is: what data structure should I use to optimize the
> transcode function?
>
> IMHO there are two solutions:
>
> 1) Use a "lazy" array.
> Something like ByteString.Lazy, and what is available in
> storablevector package.
>
> Using this data structure, I can avoid the use of appendU.
>
> 2) Use an "unboxed" list.
>
> Something like:
> http://mdounin.ru/hg/nginx-vendor-current/file/tip/src/core/ngx_list.h
>
> That is: a linked list of unboxed arrays, but unlike the lazy array
> solution, a snoc operation avoid copying if there is space in the
> current array.
>
> I don't know if this is easy/efficient to implement in Haskell.
>
>
> Any other suggestions?
>
>
> Thanks Manlio Perillo
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20090407/0e1f0142/attachment.htm
More information about the Haskell-Cafe
mailing list