[Haskell-cafe] Parse HTML that contains JavaScript
Andras Slemmer
0slemi0 at gmail.com
Tue Dec 24 20:02:36 UTC 2013
> I'm a little surprised html-conduit doesn't interpret </script> as
end-of-script.
It does interpret it as end-of-script. As far as I know, that is the correct
behaviour.
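
To illustrate why the quoted </script> ends the script, here is a minimal
sketch in plain Haskell (base only). It is not html-conduit's actual code, and
it simplifies the HTML tokenizer rules a lot: it only looks for the first
case-insensitive "</script" and ignores the comment-like escape states. The
names splitAtScriptEnd, original and workaround are made up for this example,
and the script text is an abbreviated copy of the one quoted below.

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf)

-- | Split the text that follows an opening <script> tag at the first
-- case-insensitive "</script". This is a simplified stand-in for what an
-- HTML tokenizer does; it knows nothing about JavaScript string literals.
-- Returns the script body and, if an end tag was found, the remainder
-- starting at that end tag.
splitAtScriptEnd :: String -> (String, Maybe String)
splitAtScriptEnd = go []
  where
    endTag = "</script"
    go acc rest
      | endTag `isPrefixOf` map toLower (take (length endTag) rest) =
          (reverse acc, Just rest)
      | otherwise =
          case rest of
            []       -> (reverse acc, Nothing)
            (c : cs) -> go (c : acc) cs

-- Abbreviated version of the script from the original question.
original :: String
original =
  "mw.loader.state({\"<script>\":\"</script>\",\"user\":\"ready\"});\n</script>"

-- Same script with the end tag split up inside the string literal.
workaround :: String
workaround =
  "mw.loader.state({\"<script>\":\"</scr\" + \"ipt>\",\"user\":\"ready\"});\n</script>"

main :: IO ()
main = do
  -- The body is cut off inside the string literal.
  print (fst (splitAtScriptEnd original))
  -- The body runs up to the real end tag.
  print (fst (splitAtScriptEnd workaround))
```

With the original text the body stops right after the opening quote of the
"</script>" literal, which matches what Chrome did. With the split literal the
scan only stops at the real end tag.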
On 24 December 2013 19:58, Patrick Hurst
<lightquake at amateurtopologist.com> wrote:
>
>
> On Tue, Dec 24, 2013 at 1:42 PM, Brandon Allbery <allbery.b at gmail.com> wrote:
>
>> On Tue, Dec 24, 2013 at 2:20 PM, akira kawata <a.kawashiro at gmail.com> wrote:
>>>
>>> Did you mean HaXmL?
>>>
>>
>> Pick an XML parser. CDATA is an XML construct. Well-formed HTML *should*
>> be XML compatible, although it's very rare to find proper well-formed HTML
>> these days....
>>
>>
> This is actually not true; for example, not closing your <br> tags is
> perfectly valid HTML5 but invalid XML, and you can use > literals in script
> tags. The CDATA-inside-comments hack isn't necessary and hasn't been for
> years. You should try to parse HTML as HTML.
>
> That being said, if html-conduit works for you, use it; if not, try
> TagSoup, which doesn't try to structure your data into a DOM.
>
>> <html>
>> <p> hogehoge </p>
>> <script>if(window.mw){
>> mw.loader.state({"<script>":"</script>","user":"ready","user.groups":"ready"});
>> }
>> </script>
>> </html>
>
>
> It's worth noting that the browser will probably interpret the quoted
> </script> as the end-of-script marker; Chrome did when I copied this into
> an HTML file and saved it. You need to replace it with "</scr" + "ipt>" or
> something similar. I'm a little surprised html-conduit doesn't interpret
> </script> as end-of-script.