[Haskell-cafe] ANN: islink 0.1.0.0: check if an HTML element is a link (useful for web scraping)
Marios Titas
redneb8888 at gmail.com
Tue Oct 7 18:59:50 UTC 2014
Hello everybody,
I'd like to announce the first public release of islink. It's a library
that basically provides a list of combinations of HTML tag names and
attributes that correspond to links to external resources. This includes
things like ("a", "href"), ("img", "src"), ("script", "src") etc. It
also comes with a convenience function to check if a particular pair
(tag, attribute) corresponds to a link. This can be useful for web
scraping.
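For a quick feel of the convenience function on its own, here is a minimal sketch. It assumes that isLinkAttr (from Text.Html.IsLink) takes a tag name and an attribute name as Strings and returns a Bool, which is how the hxt example further down uses it. The names candidates and keepLinks are made up for illustration and are not part of the library.

```haskell
import Text.Html.IsLink (isLinkAttr)

-- Hypothetical sample input: (tag, attribute) pairs as a scraper might
-- encounter them while walking a document.
candidates :: [(String, String)]
candidates =
  [ ("a", "href")
  , ("img", "src")
  , ("script", "src")
  , ("p", "class")
  ]

-- Keep only the pairs that islink classifies as links.
-- Assumes isLinkAttr :: String -> String -> Bool.
keepLinks :: [(String, String)] -> [(String, String)]
keepLinks = filter (uncurry isLinkAttr)

main :: IO ()
main = mapM_ print (keepLinks candidates)
```

The first three pairs are the ones named above as links, so they should survive the filter.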
Here's an example of how to use it to extract all (external) links from an
HTML document (with the help of hxt):
{-# LANGUAGE Arrows #-}

import Text.Html.IsLink
import Text.XML.HXT.Core

-- returns a list of tuples containing the tag name, attribute name,
-- attribute value of all links
getAllLinks :: FilePath -> IO [(String, String, String)]
getAllLinks path = runX $ doc >>> multi getLink
  where
    doc = readDocument [withParseHTML yes, withWarnings no] path

getLink :: ArrowXml a => a XmlTree (String, String, String)
getLink = proc node -> do
    tag <- getName -< node
    attrbNode <- getAttrl -< node
    attrb <- getName -< attrbNode
    val <- xshow getChildren -< attrbNode
    isLinkA -< (tag, attrb, val)
  where
    isLinkA = isLink `guardsP` this
    isLink (tag, attrb, _) = isLinkAttr tag attrb