[Haskell-cafe] Why is this so inefficient?

Tue Feb 5 13:20:54 EST 2008

I've switched to this, which gets rid of the ByteString instances
fairly quickly.  None of them make it into the final map.  I'm still
not getting any faster response out of it, and it also complains that
my stack size is too small for anything about 128K records or more.

import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L

transposeRow cols row = zipWith (:) (map (read . S.unpack) $ row) cols

transposeCSV :: [[S.ByteString]] -> M.Map String [Float]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [Float])
dataFromFile filename = do
    f <- S.readFile filename
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f

On Tue, Feb 5, 2008 at 1:15 PM, Don Stewart <dons at galois.com> wrote:
> If the  strings are relatively short, there can be a bottleneck
>  in the current Ord instance for Bytestrings. When lots of them
>  are in a map, the ffi calls to memcmp dominate.
>
>  I've a fix for this (do it all in Haskell for small strings), and
>  can push it if someone complains some more.
>
>  jefferson.r.heard:
>
>
> > I thought this was fairly straightforward, but where the marked line
>  > finishes in 0.31 seconds on my machine, the actual transpose takes
>  > more than 5 minutes.  I know it must be possible to read data in
>  > haskell faster than this.  I'm trying to read a 100MB comma delimited
>  > file.  I've tried both CSV modules, and these take even longer to read
>  > the file.  Is there some general best-practice for reading and parsing
>  > large amounts of data that I'm not aware of?
>  >
>  > I have tried, by the way, a couple of things, including putting a bang
>  > (!) before row in transposeRow and using foldr instead of foldl', but
>  > that didn't change anything other than force me to increase the stack
>  > size to 100M on the command line.
>  >
>  > I'm running in the profiler now, and I'll update this, but I thought I
>  > would check and see if my head was on remotely straight to begin with.
>  >
>  > -- Jeff
>  >
>  > ---
>  > module ColumnMajorCSV where
>  >
>  > import qualified Data.ByteString.Char8 as S
>  > import qualified Data.Map as M
>  > import qualified Data.List as L
>  >
>  > transposeRow cols row = zipWith (:) row cols
>  >
>  > transposeCSV :: [[S.ByteString]] -> M.Map String [S.ByteString]
>  > transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
>  >     where spreadsheet = L.foldl' transposeRow emptySheet rows
>  >           emptySheet = take (length header) $ repeat []
>  >
>  > dataFromFile :: String -> IO (M.Map String [S.ByteString])
>  > dataFromFile filename = do
>  >     f <- S.readFile filename
>  >     print . length . map (S.split ',' $!) . S.lines $ f
>  >  -- finishes in 0.31 seconds
>  >     return . transposeCSV . map (S.split ',' $!) . S.lines $ f  --
>  > this takes 5 minutes and 6 seconds
>  > _______________________________________________
>  > Haskell-Cafe mailing list
>  > Haskell-Cafe at haskell.org
>  > http://www.haskell.org/mailman/listinfo/haskell-cafe
>

-- 
I try to take things like a crow; war and chaos don't always ruin a
picnic, they just mean you have to be careful what you swallow.

-- Jessica Edwards