[Haskell-cafe] CYK-style parsing and laziness
Steffen Mazanek
haskell at steffen-mazanek.de
Wed May 23 11:55:11 EDT 2007
Hello,
I have two questions regarding a Cocke, Younger, Kasami parser.
Consider this program:
type NT = Char -- Nonterminal
type T = Char -- Terminal
-- a Chomsky production has either two nonterminals or one terminal on its
right-hand side
type ChomskyProd = (NT, Either T (NT, NT))
-- a grammar consists of a startsymbol, nonterminal symbols, terminal
symbols and productions
type Grammar = (NT, [NT], [T], [ChomskyProd])
parse::Grammar->[T]->Bool
parse (s, nts, ts, prods) w = s `elem` gs n 1
where
n = length w
table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]]
gs 1 j = [nt|p<-prods,termProd p,
let (nt, Left t)=p, w!!(j-1)==t]
gs i j = [nt|k<-[1..i-1],p<-prods,
not (termProd p),
let (nt, Right (a, b))=p,
a `elem` table!!(k-1)!!(j-1), b `elem`
table!!(i-k-1)!!(j+k-1)]
The sets gs i j contain all nonterminal symbols from which the substring of
w starting at index
j of length i can be derived.
Please have a look at the last line of the algorithm. In my first attempt I
just referred to
gs k j and gs (i-k) (j+k) what looks a lot more intuitive. However I noted
that this way the
sets gs i j are evaluated multiple times. Is there a better and more
readable way to prevent
multiple evaluation of these sets?
The second question regards lazy evaluation. Consider the stupid grammar
S->S S
S->A A
A->a
that generates a^(2n).
The performance of the algorithm drops very fast even for small n, probably
because the gs i j
are getting very large.
Is there a trick to get lazy evaluation into play here? It is sufficient to
find only one occurence
of the start symbol in gs n 1.
Best regards,
Steffen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20070523/1045e7a0/attachment.htm
More information about the Haskell-Cafe
mailing list