[Haskell-cafe] Text.JSON and utf8

Martin Hilbig lists at mhilbig.de
Mon Feb 11 14:56:04 CET 2013


hi,

tl;dr: i propose this patch to Text/JSON/String.hs and would like to
know why it is needed:

@@ -375,7 +375,7 @@
    where
    go s1 =
      case s1 of
-       (x   :xs) | x < '\x20' -> '\\' : encControl x (go xs)
+       (x   :xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
        ('"' :xs)              -> '\\' : '"'  : go xs
        ('\\':xs)              -> '\\' : '\\' : go xs
        (x   :xs)              -> x    : go xs
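
to make the effect of the guard `x < '\x20' || x > '\x7e'` concrete, here is a minimal stand-alone sketch. `escapeJSON` and `pad` are hypothetical names, and this is not the code from Text/JSON/String.hs, only an approximation of its shape. with this guard every character outside printable ASCII is written as a \uXXXX escape, so the output is pure ASCII no matter how the bytes are later put on the wire.

```haskell
import Numeric (showHex)

-- Hypothetical stand-alone escaper, modelled on the guard discussed above.
-- It only approximates the shape of encJSString; it is not the library code.
escapeJSON :: String -> String
escapeJSON = go
  where
    go s = case s of
      (x:xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
      ('"' :xs)                         -> '\\' : '"'  : go xs
      ('\\':xs)                         -> '\\' : '\\' : go xs
      (x:xs)                            -> x : go xs
      []                                -> []

    -- Short escapes for the common control characters, \uXXXX otherwise.
    -- Code points above U+FFFF would need a surrogate pair, which this
    -- sketch does not produce.
    encControl x rest = case x of
      '\b' -> 'b' : rest
      '\f' -> 'f' : rest
      '\n' -> 'n' : rest
      '\r' -> 'r' : rest
      '\t' -> 't' : rest
      _    -> 'u' : pad (showHex (fromEnum x) "") ++ rest

    -- Left-pad the hex digits to four places.
    pad h = replicate (4 - length h) '0' ++ h
```

for 'ö' (code point 0xf6) the padding step is what matters: the hex digits "f6" become "00f6", so the string is emitted as the six ASCII characters \u00f6.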


i recently stumbled upon CouchDB telling me i'm sending invalid json.

i basically read lines from a utf8 file with german umlauts and send
them to CouchDB using Text.JSON and Database.CouchDB.

   $ file lines.txt
   lines.txt: UTF-8 Unicode text

let's take 'ö' as an example. i use LANG=de_DE.utf8.

ghci tells me:

 > 'ö'
'\246'

 > putChar '\246'
ö

 > putChar 'ö'
ö

 > :m + Text.JSON Database.CouchDB
 > runCouchDB' $ newNamedDoc (db "foo") (doc "bar") (showJSON $ toJSObject [("test","ö")])
*** Exception: HTTP/1.1 400 Bad Request
Server: CouchDB/1.2.1 (Erlang OTP/R15B03)
Date: Mon, 11 Feb 2013 13:24:49 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 48
Cache-Control: must-revalidate

couchdb log says:

   Invalid JSON: {{error,{10,"lexical error: invalid bytes in UTF8 string.\n"}},<<"{\"test\":\"<F6>\"}">>}

<F6> is indeed the code point of ö in hex:

 > :m + Numeric
 > putChar $ toEnum $ fst $ head $ readHex "f6"
ö
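
for comparison, a small sketch of what the UTF-8 encoding of a single Char looks like. `utf8Bytes` is a hypothetical helper written only against base, not something from Text.JSON or Database.CouchDB. it shows that ö should go over the wire as the two bytes 0xc3 0xb6, whereas a lone 0xf6 byte is not a valid UTF-8 sequence, which would match the complaint in the couchdb log.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: UTF-8 encode one Char by hand.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [ fromIntegral n ]
  | n < 0x800   = [ fromIntegral (0xC0 .|. (n `shiftR` 6))
                  , cont n ]
  | n < 0x10000 = [ fromIntegral (0xE0 .|. (n `shiftR` 12))
                  , cont (n `shiftR` 6)
                  , cont n ]
  | otherwise   = [ fromIntegral (0xF0 .|. (n `shiftR` 18))
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n ]
  where
    n = ord c
    -- Continuation byte: bit pattern 10xxxxxx carrying the low six bits.
    cont m = fromIntegral (0x80 .|. (m .&. 0x3F))
```

in ghci, `utf8Bytes '\246'` gives `[195,182]`, i.e. 0xc3 0xb6.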

if i apply the above patch and reinstall JSON and CouchDB, the doc
creation works:

 > runCouchDB' $ newNamedDoc (db "db") (doc "foo") (showJSON $ toJSObject [("test", "ö")])
Right someRev

but i don't get back the ö i expected:

 > Just (_,_,x) <- runCouchDB' $ getDoc (db "foo") (doc "bar") :: IO (Maybe (Doc,Rev,JSObject String))
 > let Ok y = valFromObj "test" =<< readJSON x :: Result String
 > y
"\195\188"
 > putStrLn y
ü
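
one possible reading of a result like "\195\188" is that the response body was never UTF-8 decoded, so each byte of a two-byte sequence ended up as its own Char. under that assumption, the following sketch turns such a string back into proper Unicode. `decodeUtf8Chars` is a hypothetical helper using only base; it is not part of Text.JSON or Database.CouchDB, and it is lenient rather than strict.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- Hypothetical helper: treat every Char as one byte and decode UTF-8.
-- Malformed or truncated sequences are passed through unchanged; a strict
-- decoder would also reject overlong forms and surrogates.
decodeUtf8Chars :: String -> String
decodeUtf8Chars [] = []
decodeUtf8Chars (c:cs)
  | b < 0x80 || b > 0xFF = c : decodeUtf8Chars cs   -- ASCII, or not a byte
  | b .&. 0xE0 == 0xC0   = multi 1 (b .&. 0x1F)     -- two-byte sequence
  | b .&. 0xF0 == 0xE0   = multi 2 (b .&. 0x0F)     -- three-byte sequence
  | b .&. 0xF8 == 0xF0   = multi 3 (b .&. 0x07)     -- four-byte sequence
  | otherwise            = c : decodeUtf8Chars cs   -- stray continuation
  where
    b = ord c
    -- Consume k continuation bytes and fold their payload bits together.
    multi k start =
      let (conts, rest) = splitAt k cs
          v = foldl step start conts
      in if length conts == k && all isCont conts && v <= 0x10FFFF
           then chr v : decodeUtf8Chars rest
           else c : decodeUtf8Chars cs
    isCont x = ord x .&. 0xC0 == 0x80
    step acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)
```

with this, `decodeUtf8Chars "\195\182"` is `"\246"` (ö) and `decodeUtf8Chars "\195\188"` is `"\252"` (ü). the utf8-string package exports `decodeString` from Codec.Binary.UTF8.String, which should do the same job if adding a dependency is acceptable.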

apparently with curl everything works fine:

$ curl localhost:5984/db/foo -XPUT -d '{"test": "ö"}'
{"ok":true,"id":"foo","rev":"someOtherRev"}
$ curl localhost:5984/db/foo
{"_id":"bars","_rev":"someOtherRev","test":"ö"}

so how can i get my precious ö back? what am i doing wrong, or does
Text.JSON need another patch?

another question: why does encControl in Text/JSON/String.hs handle the
cases x < '\x100' and x < '\x1000' even though they can never be
reached with the old predicate in encJSString (x < '\x20')?

finally: is '\x7e' the right literal for the job?

thanks for reading

have fun
martin


