[Haskell-cafe] Is it possible to have constant-space JSON decoding?
oleg at okmij.org
oleg at okmij.org
Sat Dec 15 16:02:01 CET 2012
Johan Tibell posed an interesting problem of incremental XML parsing
while still detecting and reporting ill-formedness errors.
> What you can't have (I think) is a function:
>
> decode :: FromJSON a => ByteString -> Maybe a
>
> and constant-memory parsing at the same time. The return type here
> says that we will return Nothing if parsing fails. We can only do so
> after looking at the whole input (otherwise how would we know if it's
> malformed).
The problem is very common: suppose we receive an XML document over a
network (e.g., in an HTTP stream). Network connections are inherently
unreliable and can be dropped at any time (e.g., because someone
tripped over a power cord). The XML document therefore can come
truncated, for example, missing the end tag for the root
element. According to the XML Recommendations, such document is
ill-formed, and hence does not have an Infoset (In contrast, invalid
but well-formed documents do have the Infoset). Strictly speaking, we
should not be processing an XML document until we verified that it is
well-formed, that is, until we parsed it at all and have checked that
all end tags are in place. It seems we can't do the incremental XML
processing in principle.
I should mention first that sometimes people avoid such a strict
interpretation. If we process telemetry data encoded in XML, we may
consider a document meaningful even if the root end tag is missing. We
process as far as we can.
Even if we take the strict interpretation, it is still possible
to handle a document incrementally so long as the processing is
functional or the side effects can be backed out (e.g., in a
transaction). This message illustrates exactly such an incremental
processing that always detects ill-formed XML, and, optionally,
invalid XML. It is possible after all to detect ill-formedness
errors and process the document without loading it all in memory
first -- using as little memory as needed to hold the state of the
processor -- just a short string in our example.
Our running example is an XML document representing a finite map:
a collection of key-value pairs where key is an integer:
<map>
<kv><key>1</key><value>v1</value></kv>
<kv><key>2</key><value>v2</value></kv>
<kv><key>bad</key><value>v3</value></kv>
The above document is both ill-formed (missing the end tag)
and invalid (one key is bad: non-read-able). We would like to write
a lookup function for a key in such an encoded map. We should report
an ill-formedness error always. The reporting of validation
errors may vary. The function
xml_lookup :: Monad m => Key -> Iteratee Char m (Maybe Value)
xml_lookup key = id .| xml_enum default_handlers .| xml_normalize
.| kv_enum (lkup key)
always reports the validation errors. The function is built
by composition from smaller, separately developed and tested
pipeline components: parsing of a
document to the XMLStream, normalization, converting the XMLStream to
a stream of (Key,Value) pairs and finally searching the stream.
A similar function that replaces kv_enum with kv_enum_pterm
terminates the (Key,Value) conversion as soon as its client iteratee
finished. In that case if the lookup succeeds before we encounter the bad
key, no error is reported. Ill-formedness errors are raised always.
The code
http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs
shows the examples of both methods of validation error reporting.
This code also illustrates the library of parsing combinators, which
represent the element structure (`DTD').
More information about the Haskell-Cafe
mailing list