[Haskell-cafe] ANN: islink 0.1.0.0: check if an HTML element is a link (useful for web scraping)

Marios Titas redneb8888 at gmail.com
Tue Oct 7 18:59:50 UTC 2014


Hello everybody,

I'd like to announce the first public release of islink. It's a library
that basically provides a list of combinations of HTML tag names and
attributes that correspond to links to external resources. This includes
things like ("a", "href"), ("img", "src"), ("script", "src") etc. It
also comes with a convenience function to check if a particular pair
(tag, attribute) corresponds to a link. This can be useful for web
scraping.
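
To make the idea concrete, here is a minimal, self-contained sketch of
that kind of lookup. It does not use islink itself: the names linkAttrs,
isLinkAttrSketch and linkValues are hypothetical and only illustrate the
concept, and the table holds just the three pairs mentioned above, whereas
the library's list is more complete.

```haskell
-- Hypothetical stand-in for the kind of table islink provides; only the
-- three pairs named in this announcement are listed here.
linkAttrs :: [(String, String)]
linkAttrs =
  [ ("a", "href")
  , ("img", "src")
  , ("script", "src")
  ]

-- Hypothetical analogue of the convenience check: True when the
-- (tag, attribute) pair appears in the table above.
isLinkAttrSketch :: String -> String -> Bool
isLinkAttrSketch tag attrb = (tag, attrb) `elem` linkAttrs

-- Given a tag name and its (attribute, value) pairs, keep only the
-- values of attributes that correspond to links.
linkValues :: String -> [(String, String)] -> [String]
linkValues tag attrs =
  [ val | (attrb, val) <- attrs, isLinkAttrSketch tag attrb ]
```

For instance, linkValues "img" [("src", "logo.png"), ("alt", "Logo")]
keeps only "logo.png", because ("img", "alt") is not in the table.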

Here's an example of how to use it to extract all (external) links from
an HTML document (with the help of hxt):

{-# LANGUAGE Arrows #-}

import Text.Html.IsLink
import Text.XML.HXT.Core

-- Returns a list of tuples containing the tag name, attribute name,
-- and attribute value of every link in the document.
getAllLinks :: FilePath -> IO [(String, String, String)]
getAllLinks path = runX $ doc >>> multi getLink
  where
    doc = readDocument [withParseHTML yes, withWarnings no] path

getLink :: ArrowXml a => a XmlTree (String, String, String)
getLink = proc node -> do
    tag <- getName -< node
    attrbNode <- getAttrl -< node
    attrb <- getName -< attrbNode
    val <- xshow getChildren -< attrbNode
    isLinkA -< (tag, attrb, val)
  where
    isLinkA = isLink `guardsP` this
    isLink (tag, attrb, _) = isLinkAttr tag attrb

