[Haskell-cafe] Parse CSV / TSV file in Haskell - Unicode Characters

Johan Tibell johan.tibell at gmail.com
Fri Oct 24 05:59:07 UTC 2014


I just replied to your question on SO.

On Fri, Oct 24, 2014 at 3:35 AM, Volker Strobel <volker.strobel87 at gmail.com>
wrote:

> Hi,
>
> I'm trying to parse a tab-delimited file using cassava/Data.Csv in
> Haskell. However, I get problems if there are "strange" (Unicode)
> characters in my CSV file. I'll get a parse error (endOfInput) then.
>
> According to the command-line tool "file", my file has a "UTF-8 Unicode
> text" decoding. My Haskell code looks like this:
>
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE OverloadedStrings #-}
>
> import qualified Data.ByteString as C
> import qualified System.IO.UTF8 as U
> import qualified Data.ByteString.UTF8 as UB
> import qualified Data.ByteString.Lazy.Char8 as DL
> import qualified Codec.Binary.UTF8.String as US
> import qualified Data.Text.Lazy.Encoding as EL
> import qualified Data.ByteString.Lazy as L
>
> import Data.Text.Encoding as E
>
> -- Handle CSV / TSV files with ...
> import Data.Csv
> import qualified Data.Vector as V
>
> import Data.Char -- ord
>
> csvFile :: FilePath
> csvFile = "myFile.txt"
>
> -- Set delimiter to \t (tabulator)
> myOptions = defaultDecodeOptions {
>               decDelimiter = fromIntegral (ord '\t')
>             }
>
> main :: IO ()
> main = do
>   csvData <- L.readFile csvFile
>   case EL.decodeUtf8' csvData of
>    Left err -> print err
>    Right dat ->
>      case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
>        Left err -> putStrLn err
>        Right v -> V.forM_ v $ \ (category :: String ,
>                                user :: String ,
>                                date :: String,
>                                time :: String,
>                                message :: String) -> do
>          print message
>
> I tried using decodingUtf8', preprocessing (filtering) the input with
> predicates from Data.Char, and much more. However the endOfFile error
> persists.
>
> My CSV-file looks like this:
>
> a   -   -   -   RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
> a   -   -   -   Uhm .. wat dan ook ????!!!! 👋
>
> Or more literally:
>
> a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
> a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋
>
> The problem chars are the 👋 and • (and in my complete file, there are
> many more of similar characters). What can I do, so that cassava / Data.Csv
> can read my file properly?
>
> I've also posted this question at StackOverflow a few days ago:
> http://stackoverflow.com/questions/26499831/parse-csv-tsv-file-in-haskell-unicode-characters
>
> Best,
> Volker
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20141024/b8445cea/attachment.html>


More information about the Haskell-Cafe mailing list