[Haskell-cafe] CYK-style parsing and laziness
Daniel Fischer
daniel.is.fischer at web.de
Wed May 23 17:56:15 EDT 2007
Am Mittwoch, 23. Mai 2007 17:55 schrieb Steffen Mazanek:
> Hello,
>
> I have two questions regarding a Cocke, Younger, Kasami parser.
>
> Consider this program:
>
> type NT = Char -- Nonterminal
> type T = Char -- Terminal
> -- a Chomsky production has either two nonterminals or one terminal on its
> right-hand side
> type ChomskyProd = (NT, Either T (NT, NT))
> -- a grammar consists of a startsymbol, nonterminal symbols, terminal
> symbols and productions
> type Grammar = (NT, [NT], [T], [ChomskyProd])
>
> parse::Grammar->[T]->Bool
> parse (s, nts, ts, prods) w = s `elem` gs n 1
> where
> n = length w
> table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]]
> gs 1 j = [nt|p<-prods,termProd p,
> let (nt, Left t)=p, w!!(j-1)==t]
> gs i j = [nt|k<-[1..i-1],p<-prods,
> not (termProd p),
> let (nt, Right (a, b))=p,
> a `elem` table!!(k-1)!!(j-1), b `elem`
> table!!(i-k-1)!!(j+k-1)]
>
> The sets gs i j contain all nonterminal symbols from which the substring of
> w starting at index
> j of length i can be derived.
>
> Please have a look at the last line of the algorithm. In my first attempt I
> just referred to
> gs k j and gs (i-k) (j+k) what looks a lot more intuitive. However I noted
> that this way the
> sets gs i j are evaluated multiple times. Is there a better and more
> readable way to prevent
> multiple evaluation of these sets?
I'm not sure about readability, but you get once-only evaluation by
memoizing them (as you did), using an array is faster, and an array of sets is
faster than an array of lists. For inputs of approximately 300 characters,
your version takes about 180s here to parse the stupid grammar, my version
below about 6.5s, using an array of lists for memoisation takes about 20s
(all still horribly slow, I'm afraid). For a harder grammar (
S -> a | A E | B F
A -> b | A H | S B
B -> b | D I | A B
C -> b
D -> a
E -> C C
F -> C G
G -> D D
H -> A C
I -> B A
), an array of lists is hardly better than a list of lists, an array of sets
*far* better (the longer the input, the larger the gain, apparently).
setparse :: Grammar -> [T] -> Bool
setparse (s, nts, ts, prods) w
= s `member` (table!(n,1))
where
n = length w
sortFst = sortBy (\(a,_) (b,_) -> compare a b)
(tps, ntps) = partition termProd prods
termMap :: Map T (Set NT)
termMap = Map.fromAscListWith Set.union $
sortFst [(t, singleton nt) | (nt, Left t) <- tps]
findTerm :: T -> Set NT
findTerm = flip (Map.findWithDefault Set.empty) termMap
table :: Array (Int,Int) (Set NT)
table = array ((1,1),(n,n)) $
zip [(1,j) | j <- [1 .. ]] (map findTerm w)
++ [((i,j), Set.fromList [nt | (nt, Right (a, b)) <- ntps
, k <- [1 .. i-1]
, a `member` (table!(k,j))
, b `member` (table!(i-k,j+k))])
| i <- [2 .. n], j <- [1 .. n-i+1]]
>
> The second question regards lazy evaluation. Consider the stupid grammar
> S->S S
> S->A A
> A->a
>
> that generates a^(2n).
> The performance of the algorithm drops very fast even for small n, probably
> because the gs i j
> are getting very large.
They don't for the 'stupid' grammar above, all the gs have at most one
element.
Rather, there are so many to be checked (at least if the parse fails, all
possibilities must be considered, for successful parses, it's conceivable
that not all need be checked):
to determine gs i j, for each (nt)production, we must check (i-1) pairs, for
each i there are n-i+1 j's, altogether
n*(n^2-1)/2*(number of (nt)productions)
checks to do. So by the sheer number it's not surprising that the faster array
lookup and Set-membership test give a massive speedup, but the algorithm
remains O(n^3) or worse.
HTH,
Daniel
> Is there a trick to get lazy evaluation into play here? It is sufficient to
> find only one occurence
> of the start symbol in gs n 1.
>
> Best regards,
>
> Steffen
More information about the Haskell-Cafe
mailing list