[Haskell-cafe] ANN: islink check if an HTML element is a link (useful for web scraping)

Marios Titas redneb8888 at gmail.com
Tue Oct 7 18:59:50 UTC 2014

Hello everybody,

I'd like to announce the first public release of islink. It's a library
that provides a list of combinations of HTML tag names and attributes
that correspond to links to external resources. This includes things
like ("a", "href"), ("img", "src"), ("script", "src"), etc. It also
comes with a convenience function to check whether a particular
(tag, attribute) pair corresponds to a link. This can be useful for web
scraping.
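
As a quick illustration of that check, here is a minimal sketch. It
assumes the islink package is installed and uses isLinkAttr with two
String arguments returning a Bool, as in the hxt example in this
message. The helper describe is only for illustration and is not part
of the library, and the ("p", "class") pair is merely assumed to be
absent from the list.

```haskell
import Text.Html.IsLink (isLinkAttr)

-- Render the verdict for one (tag, attribute) pair.
-- 'describe' is a hypothetical helper, not part of islink.
describe :: String -> String -> String
describe tag attrb
    | isLinkAttr tag attrb = tag ++ "/" ++ attrb ++ ": link"
    | otherwise            = tag ++ "/" ++ attrb ++ ": not a link"

main :: IO ()
main = mapM_ putStrLn
    [ describe "a" "href"     -- pair named in the announcement
    , describe "img" "src"    -- pair named in the announcement
    , describe "p" "class"    -- assumed not to be a link attribute
    ]
```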

Here's an example of how to use it to extract all (external) links from
an HTML document (with the help of hxt):

{-# LANGUAGE Arrows #-}

import Text.Html.IsLink
import Text.XML.HXT.Core

-- returns a list of tuples containing the tag name, attribute name,
-- attribute value of all links
getAllLinks :: FilePath -> IO [(String, String, String)]
getAllLinks path = runX $ doc >>> multi getLink
  where
    doc = readDocument [withParseHTML yes, withWarnings no] path

getLink :: ArrowXml a => a XmlTree (String, String, String)
getLink = proc node -> do
    tag <- getName -< node
    attrbNode <- getAttrl -< node
    attrb <- getName -< attrbNode
    val <- xshow getChildren -< attrbNode
    isLinkA -< (tag, attrb, val)
  where
    -- pass the triple through only if (tag, attribute) is a link pair
    isLinkA = isLink `guardsP` this
    isLink (tag, attrb, _) = isLinkAttr tag attrb
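
The triples returned by getAllLinks can be post-processed with ordinary
list functions. As a rough sketch, the following keeps only links whose
value looks like an absolute URL. The names Link, isAbsoluteUrl and
absoluteLinks are hypothetical helpers (not part of islink or hxt), the
prefix test is a simplistic, case-sensitive assumption, and the sample
list stands in for the output of getAllLinks.

```haskell
import Data.List (isPrefixOf)

-- (tag name, attribute name, attribute value), as returned by getAllLinks
type Link = (String, String, String)

-- Crude check: treat http://, https:// and protocol-relative values as absolute.
isAbsoluteUrl :: String -> Bool
isAbsoluteUrl url = any (`isPrefixOf` url) ["http://", "https://", "//"]

-- Keep only the links whose attribute value passes isAbsoluteUrl.
absoluteLinks :: [Link] -> [Link]
absoluteLinks = filter (\(_, _, url) -> isAbsoluteUrl url)

main :: IO ()
main = mapM_ print (absoluteLinks sample)
  where
    -- made-up sample data for illustration
    sample =
        [ ("a", "href", "https://www.haskell.org/")
        , ("img", "src", "logo.png")
        , ("script", "src", "//example.org/app.js")
        ]
```

With this sample, the relative "logo.png" entry is dropped and the other
two are printed.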
