[Haskell-beginners] Using Cassava with medium sized files (~50MB)

Antoine Genton genton.antoine at gmail.com
Sun Mar 20 15:21:11 UTC 2016


Hello,
I tried to load a ~50MB csv file in memory with cassava but found that my
program was incredibly slow. After doing some profiling, I realized that it
used an enormous amount of memory in the heap:

24,626,540,552 bytes allocated in the heap
6,946,460,688 bytes copied during GC
2,000,644,712 bytes maximum residency (14 sample(s))
319,728,944 bytes maximum slop
3718 MB total memory in use (0MB lost due to fragmentation)
...
%GC time 84.0% (94.3% elapsed)

Seeing that, I have the feeling that my program lacks strictness and
accumulates thunks in memory. I tried two versions, one using Data.Csv and
one using Data.Csv.Streaming. Both give the same result. What am I
doing wrong?
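
My understanding is that a bang pattern forces only weak head normal form,
so the elements inside the structure can stay as unevaluated thunks. A
minimal illustration of what I mean (hypothetical, not my actual program):

```haskell
{-# LANGUAGE BangPatterns #-}

main :: IO ()
main = do
  let xs = [1 + 2, error "never forced"] :: [Int]
      !ys = xs            -- forces only the outermost (:) constructor
  print (head ys)         -- only the first element's thunk is evaluated
```

This prints 3 and never touches the error thunk, which is why I suspect my
bang patterns below are not enough to force the whole vector.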

Here are the two sources:
1/
{-# LANGUAGE BangPatterns #-}
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Lazy (ByteString)
import qualified Data.Vector as V

main :: IO ()
main = do
   csv <- BL.readFile "tt.csv"
   let !res = case decode NoHeader csv :: Either String (V.Vector (V.Vector ByteString)) of
                Right q -> q
                Left e  -> error e
   print $ res V.! 0


--------------------------------
2/
{-# LANGUAGE BangPatterns #-}
import Data.Csv.Streaming
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Lazy (ByteString)
import qualified Data.Vector as V
import Data.Foldable (foldr')

main :: IO ()
main = do
   csv <- BL.readFile "tt.csv"
   let !a = decode NoHeader csv :: Records (V.Vector ByteString)
   let !xx = V.singleton V.empty :: V.Vector (V.Vector ByteString)
   let !res = foldr' V.cons xx a
   print $ res V.! 0


The goal of the program is ultimately to have the csv loaded in memory as a
Vector of Vector of ByteString for further processing later on.
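
One rework I am considering (an untested sketch): V.cons copies the whole
vector on every step, so the fold in version 2 is quadratic. Assuming the
Foldable instance of Records lets me stream out the parsed rows (I believe
it skips rows that fail to parse), I could build an ordinary list first and
call V.fromList once at the end:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Csv (HasHeader (NoHeader))
import Data.Csv.Streaming (Records, decode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
import Data.Foldable (toList)

-- Build one plain list from the streamed records, then materialise a
-- single Vector; V.fromList copies each record exactly once.
loadCsv :: BL.ByteString -> V.Vector (V.Vector BL.ByteString)
loadCsv bs = V.fromList (toList records)
  where
    records :: Records (V.Vector BL.ByteString)
    records = decode NoHeader bs

main :: IO ()
main = print (loadCsv "a,b\nc,d" V.! 0)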


Thank you for your help,

Antoine