[Haskell-cafe] Parse HTML that contains JavaScript

Patrick Hurst lightquake at amateurtopologist.com
Tue Dec 24 19:58:16 UTC 2013


On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery <allbery.b at gmail.com> wrote:

> On Tue, Dec 24, 2013 at 2:20 PM, akira kawata <a.kawashiro at gmail.com> wrote:
>>
>> Did you mean HaXmL?
>>
>
> Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should*
> be XML compatible, although it's very rare to find proper well-formed HTML
> these days....
>
>
This is actually not true: valid HTML5 is not necessarily well-formed XML.
For example, leaving your <br> tags unclosed is perfectly valid HTML5 but
invalid XML, and you can use unescaped < and > inside script elements. The
CDATA-inside-comments hack isn't necessary and hasn't been for years. You
should try to parse HTML as HTML.

That being said, if html-conduit works for you, use it; if not, try
TagSoup, which doesn't try to structure your data into a DOM.
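
As a rough illustration of the TagSoup approach, here is a minimal sketch that
pulls the text out of every script element by walking the flat tag list. It
assumes the tagsoup package is installed; scriptBodies is a made-up helper
name, not part of the library, and the input string is just an example.

```haskell
import Text.HTML.TagSoup (Tag, parseTags, isTagOpenName, isTagCloseName, innerText)

-- | Hypothetical helper: collect the text found between each
-- <script> open tag and the next </script> close tag.
scriptBodies :: String -> [String]
scriptBodies = go . parseTags
  where
    go :: [Tag String] -> [String]
    go tags = case dropWhile (not . isTagOpenName "script") tags of
      []         -> []
      (_ : rest) ->
        -- Everything up to the next </script> belongs to this element.
        let (body, after) = break (isTagCloseName "script") rest
        in innerText body : go after

main :: IO ()
main = mapM_ putStrLn
  (scriptBodies "<html><p> hogehoge </p><script>var x = 1;</script></html>")
```

Because TagSoup hands back a plain list of tags, there is no tree to get
wrong; you decide yourself where an element starts and ends.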

> <html>
> <p> hogehoge </p>
> <script>if(window.mw){
> mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
> }
> </script>
> </html>


It's worth noting that the browser will probably interpret the </script>
inside the string literal as the end-of-script marker; Chrome did when I
copied this into an HTML file and saved it. You need to replace it with
"</scr" + "ipt>" or something similar. I'm a little surprised html-conduit
doesn't interpret </script> as end-of-script.
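
If the page is being generated from Haskell, one "something similar" is to
rewrite every "</" as "<\/" before the text is placed inside the script
element. This is only a sketch: escapeForScript is a made-up name, and the
trick relies on the text ending up inside JavaScript string literals, where
"\/" means the same as "/".

```haskell
import Data.List (isPrefixOf)

-- | Hypothetical helper: replace each "</" with "<\/" so the result
-- cannot contain a literal "</script" sequence.
escapeForScript :: String -> String
escapeForScript [] = []
escapeForScript s@(c : cs)
  | "</" `isPrefixOf` s = "<\\/" ++ escapeForScript (drop 2 s)
  | otherwise           = c : escapeForScript cs

main :: IO ()
main = putStrLn (escapeForScript "{\"<script>\":\"</script>\"}")
-- prints {"<script>":"<\/script>"}
```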