[Haskell-cafe] HXT: how to get sibling element

Никитин Лев leon.v.nikitin at pravmail.ru
Thu Mar 15 11:28:08 CET 2012


Hello, haskellers.

Suppose we have this XML document (perhaps a slightly silly one):

<div>
  <span>Some story</span>
  <span>Description</span>: This story about...
  <span>Author</span>: Tom Smith
</div>

In the end I want to get the list [("Title", "Some story"), ("Description","This story about..."), ("Author", "Tom Smith")],
or maybe this: Book "Some story" [("description","This story about..."), ("Author", "Tom Smith")] (where data Book = Book String [(String, String)]).

The first span is a special case compared to the others, and I understand how to process it:

===============

{-# LANGUAGE TupleSections #-}  -- needed for the ("Title",) section below

import Text.XML.HXT.Core
import Text.XML.HXT.Curl
import Text.XML.HXT.HTTP

pageURL = "http://localhost/test.xml"

main = do
    r <- runX (configSysVars [withCanonicalize no, withValidate no, withTrace 0, withParseHTML no] >>>
               readDocument [withErrors no, withWarnings no, withHTTP []] pageURL >>>
               getChildren >>> isElem >>> hasName "div" >>>
               -- collect all <span> children of the <div> into one list
               listA (getChildren >>> hasName "span") >>>
               (getTitle <+> getSections))
    putStrLn "Articles:"
    putStr "<"
    mapM_ putStr $ map (\i -> (fst i) ++ ": " ++ (snd i) ++ "| ") r
    putStrLn ">"

getTitle = arr head >>> getChildren >>> getText >>> arr trim >>> arr ("Title",)

getSections = arr tail >>> unlistA >>> ((getChildren >>> getText >>> arr trim) &&& (getChildren >>> getText >>> arr trim))

ltrim [] = []
ltrim (' ':x) = ltrim x
ltrim ('\n':x) = ltrim x
ltrim ('\r':x) = ltrim x
ltrim ('\t':x) = ltrim x
ltrim x = x

rtrim = reverse . ltrim . reverse

trim = ltrim . rtrim

===================

And I get the list: [("Title", "Some story"), ("Description","Description"), ("Author", "Author")]

(Maybe there is a better way to get this list?)

But I cannot find a way to get the text that follows each span.

I suppose I have to use functions from Data.Tree.NavigatableTree.XPathAxis, but I cannot work out how to do it.
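
The pairing I am after can be sketched without HXT at all, assuming the children of <div> have already been collected in document order (for example with listA getChildren). The Node type and the names pairWithFollowing, clean and titled are hypothetical, made up only for illustration, and turning an XmlTree into a Node is left out:

```haskell
import Data.Char (isSpace)

-- Hypothetical stand-in for the children of <div>; this is not an HXT type.
data Node
  = Span String   -- text inside a <span> element
  | Text String   -- bare text node between elements
  deriving (Show, Eq)

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- Drop surrounding whitespace and the leading ':' separator.
clean :: String -> String
clean = trim . dropWhile (== ':') . trim

-- Pair each span with the text node that immediately follows it.
-- A span with no following text node gets an empty value.
pairWithFollowing :: [Node] -> [(String, String)]
pairWithFollowing (Span k : Text v : rest) = (trim k, clean v) : pairWithFollowing rest
pairWithFollowing (Span k : rest)          = (trim k, "") : pairWithFollowing rest
pairWithFollowing (_ : rest)               = pairWithFollowing rest
pairWithFollowing []                       = []

-- Treat the first span as the title, as in the desired result.
titled :: [(String, String)] -> [(String, String)]
titled ((t, _) : rest) = ("Title", t) : rest
titled []              = []
```

This only shows the shape of the result I want; it says nothing about which HXT arrows would produce the sibling text.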

Please help me.



