[Haskell-cafe] Parse HTML that contains JavaScript

Chris Smith cdsmith at gmail.com
Tue Dec 24 22:26:33 UTC 2013


Another option is the xmlhtml package, which I wrote and which Heist uses.

An important factor in this decision will be what range of input you need
to accept, and what you want as a result.  A fully compliant HTML5 parser
will parse most input, but the resulting data will be somewhat complex.  On
the other hand, xmlhtml will accept a smaller subset of HTML5 (but will
handle your sample input here just fine) and produce a much simpler
output.  TagSoup, which someone else recommended, will accept even more,
and produce flatter output, but I don't know how it would perform on this
input.
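
To make the "flatter output" versus "simpler, structured output" distinction concrete, here is a minimal sketch using only base. The Tag and Node types, buildForest, toForest, and sampleTags are simplified, hypothetical stand-ins written for this illustration; they are not the real TagSoup or xmlhtml types or APIs. The sample stream is a hand-written, abbreviated tokenization of the sample input, not actual parser output.

```haskell
-- Simplified stand-ins for illustration only; NOT the real TagSoup or
-- xmlhtml data types.

-- Flat token stream in the style of a tag-soup parser: nesting is not recorded.
data Tag
  = TagOpen String
  | TagClose String
  | TagText String
  deriving (Eq, Show)

-- Tree-shaped output in the style of a DOM-building parser.
data Node
  = Element String [Node]
  | TextNode String
  deriving (Eq, Show)

-- Parse nodes until the input ends or a close tag is reached.
-- Returns the nodes built so far and the unconsumed tags.
-- Naive on purpose: it knows nothing about void elements such as <br>.
buildForest :: [Tag] -> ([Node], [Tag])
buildForest [] = ([], [])
buildForest tags@(TagClose _ : _) = ([], tags)
buildForest (TagText t : rest) =
  let (nodes, rest') = buildForest rest
  in (TextNode t : nodes, rest')
buildForest (TagOpen name : rest) =
  let (children, afterChildren) = buildForest rest
      afterClose = case afterChildren of
        (TagClose n : more) | n == name -> more
        other -> other -- missing or mismatched close tag: tolerate it
      (siblings, rest') = buildForest afterClose
  in (Element name children : siblings, rest')

-- Top-level wrapper: skip stray close tags instead of failing.
toForest :: [Tag] -> [Node]
toForest tags = case buildForest tags of
  (nodes, [])       -> nodes
  (nodes, _ : rest) -> nodes ++ toForest rest

-- Hand-written, abbreviated token stream for the sample document.
sampleTags :: [Tag]
sampleTags =
  [ TagOpen "html"
  , TagOpen "p", TagText " hogehoge ", TagClose "p"
  , TagOpen "script", TagText "if(window.mw){ /* ... */ }", TagClose "script"
  , TagClose "html"
  ]

main :: IO ()
main = do
  mapM_ print sampleTags            -- the flat view: one token per line
  mapM_ print (toForest sampleTags) -- the tree view: nested elements
```

With the flat stream you do the matching of open and close tags yourself, which is what makes it forgiving of bad input; with the tree you get structure for free, but the parser has to decide how to recover when tags do not balance.
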

On Dec 24, 2013 2:58 PM, "Patrick Hurst" <lightquake at amateurtopologist.com>
wrote:

>
>
> On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery <allbery.b at gmail.com> wrote:
>
>> On Tue, Dec 24, 2013 at 2:20 PM, akira kawata <a.kawashiro at gmail.com> wrote:
>>>
>>> Did you mean HaXml?
>>>
>>
>> Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should*
>> be XML compatible, although it's very rare to find proper well-formed HTML
>> these days....
>>
>>
> This is actually not true; for example, not closing your <br> tags is
> perfectly valid HTML5 but invalid XML, and you can use > literals in script
> tags. The CDATA-inside-comments hack isn't necessary and hasn't been for
> years. You should try to parse HTML as HTML.
>
> That being said, if html-conduit works for you, use it; if not, try
> TagSoup, which doesn't try to structure your data into a DOM.
>
>> <html>
>> <p> hogehoge </p>
>> <script>if(window.mw){
>> mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
>> }
>> </script>
>> </html>
>
>
> It's worth noting that the browser will probably interpret the quoted
> </script> as the end-of-script marker; Chrome did when I copied this into
> an HTML file and saved it. You need to replace it with "</scr" + "ipt>" or
> something similar. I'm a little surprised html-conduit doesn't interpret
> </script> as end-of-script.