[Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

Jim Burton jim at sdf-eu.org
Fri Jun 15 12:13:03 EDT 2007

I need to remove newlines from csv files (within columns, not at the end of
entire lines). This is prior to importing into a database and was being done
at my workplace by a java class for quite a while until the files processed
got bigger and it proved to be too slow. (The files are up to ~250MB at the
moment) It was rewritten in PL/SQL, to run after the import, which was an
improvement, but it still has our creaky db server thrashing away. (You may
have lots of helpful suggestions in mind, but we can't clean the data at
source and AFAIK we can't do it incrementally because there is no timestamp
or anything on the last change to a row from the legacy db.) 

We don't need a general solution - if a line ends with a delimiter we can be
sure it's the end of the entire line because that's the way the csv files
are generated.

I had a quick go with ByteString (with no attempt at robustness etc) and
although I haven't compared it properly it seems faster than what we have
now. But you can easily make it faster, surely! Hints for improvement please
(e.g. can I unbox anything, make anything strict, or is that handled by
ByteString, is there a more efficient library function to replace the

module Main
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

--remove newlines in the middle of 'columns'
clean :: Char -> [B.ByteString] -> [B.ByteString]
clean d = foldr (\x ys -> if B.null x || B.last x == d then x:ys else
(B.append x $ head ys):(tail ys)) [] 

main = do args <- getArgs
          if length args < 2
           then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
           else do bs <- B.readFile (args!!0)
                   let d = if length args == 3 then head (args!!2) else '"'
                   B.writeFile (args!!1) $ (B.unlines . clean d . B.lines)


View this message in context: http://www.nabble.com/Sneaking-haskell-in-the-workplace----cleaning-csv-files-tf3928931.html#a11142895
Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.

More information about the Haskell-Cafe mailing list