<br><br><div><span class="gmail_quote">On 15/06/07, <b class="gmail_sendername">Jim Burton</b> &lt;<a href="mailto:firstname.lastname@example.org">email@example.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>I need to remove newlines from CSV files (within columns, not at the end of<br>entire lines). This is prior to importing into a database and was being done<br>at my workplace by a Java class for quite a while, until the files processed
<br>got bigger and it proved to be too slow. (The files are up to ~250MB at the<br>moment.) It was rewritten in PL/SQL, to run after the import, which was an<br>improvement, but it still has our creaky database server thrashing away. (You may
<br>have lots of helpful suggestions in mind, but we can't clean the data at<br>source and AFAIK we can't do it incrementally, because there is no timestamp<br>or anything else recording the last change to a row in the legacy database.)
<br><br>We don't need a general solution: if a line ends with a delimiter we can be<br>sure it's the end of the entire line, because that's the way the CSV files<br>are generated.<br><br>I had a quick go with ByteString (with no attempt at robustness, etc.) and,
<br>although I haven't compared it properly, it seems faster than what we have<br>now. But you can easily make it faster, surely! Hints for improvement please<br>(e.g. can I unbox anything, make anything strict, or is that handled by
<br>ByteString, is there a more efficient library function to replace the<br>fold...?).<br><br>module Main<br> where<br>import System.Environment (getArgs)<br>import qualified Data.ByteString.Char8 as B<br><br>--remove newlines in the middle of 'columns'
<br>clean :: Char -> [B.ByteString] -> [B.ByteString]<br>clean d = foldr (\x ys -> if B.null x || B.last x == d || null ys then x:ys else (B.append x $ head ys):(tail ys)) []<br><br>main = do args <- getArgs<br> if length args < 2
<br> then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"<br> else do bs <- B.readFile (args!!0)<br> let d = if length args == 3 then head (args!!2) else '"'
<br> B.writeFile (args!!1) $ (B.unlines . clean d . B.lines)<br>bs<br><br></blockquote></div><br>Hi,<br>I haven't compiled this, but you get the general idea:<br><br>import qualified Data.ByteString.Lazy.Char8
as B<br>-- takes a bytestring representing the file, concatenates the lines,<br>-- then splits the result into "real" lines using the delimiter<br>clean :: Char -> B.ByteString -> [B.ByteString]<br>clean d = B.split
d . B.concat . B.lines<br><br><br clear="all"><br>-- <br>Sebastian Sylvan<br>+44(0)7857-300802<br>UIN: 44640862
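<br><br>P.S. A minimal sketch of the rule stated in the original message: a physical line that is empty or ends with the delimiter closes a row, and any other newline sits inside a column and is dropped. It is written against the lazy ByteString API so that a large file can be streamed in chunks instead of being read in whole. The names joinRows and cleanFile are hypothetical, the final unterminated line is kept as-is by assumption, and nothing here has been benchmarked on real files.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- Hypothetical helper: glue physical lines together until one is empty
-- or ends with the delimiter, which marks the end of a real row.
joinRows :: Char -> [B.ByteString] -> [B.ByteString]
joinRows d = go []
  where
    -- acc holds the pieces of the current row, most recent first.
    go acc []
      | null acc  = []
      | otherwise = [B.concat (reverse acc)]   -- assumption: keep a trailing partial row
    go acc (l:ls)
      | endsRow l = B.concat (reverse (l : acc)) : go [] ls
      | otherwise = go (l : acc) ls
    endsRow l = B.null l || B.last l == d

-- Hypothetical driver: lazily read the input, clean it, write the output.
cleanFile :: Char -> FilePath -> FilePath -> IO ()
cleanFile d inFile outFile = do
  bs <- B.readFile inFile
  B.writeFile outFile (B.unlines (joinRows d (B.lines bs)))
```

Rows are produced one at a time and only the pieces of the current row are held in the accumulator. Unlike splitting on every occurrence of the delimiter, this keeps the delimiter characters in the output.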