[Haskell-cafe] Parse CSV / TSV file in Haskell - Unicode Characters
Volker Strobel
volker.strobel87 at gmail.com
Fri Oct 24 01:35:43 UTC 2014
Hi,
I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell.
However, I get problems if there are "strange" (Unicode) characters in my
CSV file. I'll get a parse error (endOfInput) then.
According to the command-line tool "file", my file has a "UTF-8 Unicode
text" decoding. My Haskell code looks like this:
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L
import Data.Text.Encoding as E
-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V
import Data.Char -- ord
csvFile :: FilePath
csvFile = "myFile.txt"
-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
decDelimiter = fromIntegral (ord '\t')
}
main :: IO ()
main = do
csvData <- L.readFile csvFile
case EL.decodeUtf8' csvData of
Left err -> print err
Right dat ->
case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
Left err -> putStrLn err
Right v -> V.forM_ v $ \ (category :: String ,
user :: String ,
date :: String,
time :: String,
message :: String) -> do
print message
I tried using decodingUtf8', preprocessing (filtering) the input with
predicates from Data.Char, and much more. However the endOfFile error
persists.
My CSV-file looks like this:
a - - - RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a - - - Uhm .. wat dan ook ????!!!! 👋
Or more literally:
a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋
The problem chars are the 👋 and • (and in my complete file, there are many
more of similar characters). What can I do, so that cassava / Data.Csv can
read my file properly?
I've also posted this question at StackOverflow a few days ago:
http://stackoverflow.com/questions/26499831/parse-csv-tsv-file-in-haskell-unicode-characters
Best,
Volker
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20141024/16100a01/attachment.html>
More information about the Haskell-Cafe
mailing list