[Haskell-cafe] Text.JSON and utf8

Iavor Diatchki iavor.diatchki at gmail.com
Sat Feb 16 18:59:15 CET 2013


Hello Martin,

The change that you propose seems to already be in json-0.7.  Perhaps you
just need to 'cabal update' and install the most recent version?

About your other question: I have not used CouchDB, but a common mistake is
to mix up strings and bytes.  Perhaps the `getDoc` function does not do
UTF-8 decoding, and so it is giving you back a list of bytes (as a String)?
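
For instance, the "\195\188" you got back is exactly a UTF-8 byte pair
(0xC3 0xBC, which encodes 'ü'), so decoding those bytes recovers the
character.  A quick GHCi check, using decodeString from the utf8-string
package:

  > :m + Codec.Binary.UTF8.String
  > decodeString "\195\188"
  "\252"
  > putStrLn (decodeString "\195\188")
  ü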

In general, the JSON package only converts between JSON and String, and is
agnostic about the encoding used to represent the strings.  There are
other packages that convert Strings to and from bytes (e.g.,
http://hackage.haskell.org/package/utf8-string), so typically you want to
encode the string to bytes just before you export it (say, to CouchDB), and
decode it back into a String just after you've imported it.
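
For example, something along these lines (an untested sketch, using the
Codec.Binary.UTF8.String interface from utf8-string):

  import Text.JSON                (JSObject, Result, encode, decode)
  import Codec.Binary.UTF8.String (encodeString, decodeString)

  -- Render the JSON, then UTF-8 encode it, so every Char of the result is
  -- a single byte ready to be handed to the transport (e.g. to CouchDB).
  toWire :: JSObject String -> String
  toWire = encodeString . encode

  -- Undo the UTF-8 encoding right after receiving, then parse the JSON.
  fromWire :: String -> Result (JSObject String)
  fromWire = decode . decodeString

The same idea applies around getDoc/newNamedDoc: decode just after reading,
encode just before writing.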

-Iavor





On Mon, Feb 11, 2013 at 5:56 AM, Martin Hilbig <lists at mhilbig.de> wrote:

> Hi,
>
> tl;dr: I propose this patch to Text/JSON/String.hs and would like to
> know why it is needed:
>
> @@ -375,7 +375,7 @@
>    where
>    go s1 =
>      case s1 of
> -      (x   :xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
> +      (x   :xs) | x < '\x20' -> '\\' : encControl x (go xs)
>        ('"' :xs)              -> '\\' : '"'  : go xs
>        ('\\':xs)              -> '\\' : '\\' : go xs
>        (x   :xs)              -> x    : go xs
>
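> To make concrete what the guard change does, here is a simplified,
> self-contained sketch (not the real encControl; hexEsc is just a stand-in
> for the \uXXXX escaping):
>
>   import Data.Char (ord)
>   import Numeric   (showHex)
>
>   -- Escape a character as \uXXXX, zero-padded to four hex digits.
>   hexEsc :: Char -> String
>   hexEsc c = "\\u" ++ replicate (4 - length h) '0' ++ h
>     where h = showHex (ord c) ""
>
>   -- Old guard: escapes control characters *and* everything above '\x7e'.
>   escOld :: String -> String
>   escOld = concatMap esc
>     where esc x | x < '\x20' || x > '\x7e' = hexEsc x
>                 | otherwise                = [x]
>
>   -- Patched guard: escapes only control characters; 'ö' passes through.
>   escNew :: String -> String
>   escNew = concatMap esc
>     where esc x | x < '\x20' = hexEsc x
>                 | otherwise  = [x]
>
>   -- escOld "ö" == "\\u00f6",  but  escNew "ö" == "ö"
>
> (The real encJSString also escapes '"' and '\\', as in the hunk above.)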
>
> I recently stumbled upon CouchDB telling me I'm sending invalid JSON.
>
> I basically read lines from a UTF-8 file with German umlauts and send
> them to CouchDB using Text.JSON and Database.CouchDB.
>
>   $ file lines.txt
>   lines.txt: UTF-8 Unicode text
>
> Let's take 'ö' as an example; I use LANG=de_DE.utf8
>
> GHCi tells me:
>
> > 'ö'
> '\246'
>
> > putChar '\246'
> ö
>
> > putChar 'ö'
> ö
>
> > :m + Text.JSON Database.CouchDB
> > runCouchDB' $ newNamedDoc (db "foo") (doc "bar") (showJSON $ toJSObject
> [("test","ö")])
> *** Exception: HTTP/1.1 400 Bad Request
> Server: CouchDB/1.2.1 (Erlang OTP/R15B03)
> Date: Mon, 11 Feb 2013 13:24:49 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: 48
> Cache-Control: must-revalidate
>
> The CouchDB log says:
>
>   Invalid JSON: {{error,{10,"lexical error: invalid bytes in UTF8
> string.\n"}},<<"{\"test\":\"<F6>\"}">>}
>
> This is indeed the hex code of 'ö':
>
> > :m + Numeric
> > putChar $ toEnum $ fst $ head $ readHex "f6"
> ö
>
> If I apply the above patch and reinstall the json and CouchDB packages,
> the doc creation works:
>
> > runCouchDB' $ newNamedDoc (db "db") (doc "foo") (showJSON $ toJSObject
> [("test", "ö")])
> Right someRev
>
> But I don't get back the 'ö' I expected:
>
> > Just (_,_,x) <- runCouchDB' $ getDoc (db "foo") (doc "bar") :: IO (Maybe
> (Doc,Rev,JSObject String))
> > let Ok y = valFromObj "test" =<< readJSON x :: Result String
> > y
> "\195\188"
> > putStrLn y
> ü
>
> Apparently with curl everything works fine:
>
> $ curl localhost:5984/db/foo -XPUT -d '{"test": "ö"}'
> {"ok":true,"id":"foo","rev":"**someOtherRev"}
> $ curl localhost:5984/db/foo
> {"_id":"bars","_rev":"**someOtherRev","test":"ö"}
>
> So how can I get my precious 'ö' back? What am I doing wrong, or does
> Text.JSON need another patch?
>
> Another question: why does encControl in Text/JSON/String.hs handle the
> cases x < '\x100' and x < '\x1000', even though they can never be
> reached with the old predicate in encJSString (x < '\x20')?
>
> Finally: is '\x7e' the right literal for the job?
>
> Thanks for reading.
>
> Have fun,
> Martin
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>