[Haskell-cafe] Re: CYK-style parsing and laziness
apfelmus
apfelmus at quantentunnel.de
Wed May 23 13:32:23 EDT 2007
Steffen Mazanek wrote:
> I have two questions regarding a Cocke, Younger, Kasami parser.
>
> type NT = Char -- Nonterminal
> type T = Char -- Terminal
> -- a Chomsky production has either two nonterminals or one terminal on its
> right-hand side
> type ChomskyProd = (NT, Either T (NT, NT))
> -- a grammar consists of a startsymbol, nonterminal symbols, terminal
> symbols and productions
> type Grammar = (NT, [NT], [T], [ChomskyProd])
>
> parse::Grammar->[T]->Bool
> parse (s, nts, ts, prods) w = s `elem` gs n 1
> where
> n = length w
> table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]]
> gs 1 j = [nt|p<-prods,termProd p,
> let (nt, Left t)=p, w!!(j-1)==t]
> gs i j = [nt|k<-[1..i-1],p<-prods,
> not (termProd p),
> let (nt, Right (a, b))=p,
> a `elem` table!!(k-1)!!(j-1), b `elem`
> table!!(i-k-1)!!(j+k-1)]
>
> The sets gs i j contain all nonterminal symbols from which the substring of
> w starting at index j of length i can be derived.
>
> Please have a look at the last line of the algorithm. In my first attempt I
> just referred to gs k j and gs (i-k) (j+k) what looks a lot more intuitive.
> However I noted that this way the sets gs i j are evaluated multiple
times.
> Is there a better and more readable way to prevent multiple evaluation of
> these sets?
The key point of the dynamic programming algorithm is indeed to memoize
the results gs i j for all pairs of i and j. In other words, the insight
that yields a fast algorithm is that for solving the subproblems gs i j
(of which there are n^2), solution to smaller subproblems of the same
form are sufficient. Tabulation is a must.
Of course, you can still choose how to represent the table. There's a
nice higher order way to do that
tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)
gs = tabulate gs'
where
gs' 1 j = ... uses gs x y for some x y ...
gs' i j = ... ditto ...
The function tabulate takes a function on an N x N grid and well,
tabulates it. Actually, it doesn't return the table but already the
function that indexes into the table.
Your current code uses lists to represent the table, i.e.
tabulate f = f'
where
table = [[f i j | j<-[1..n-i+1]] | i<-[1..n]]
f' = \i j -> table !! i !! j
Note that this depends on sharing the table over multiple invocations
of f'. In other words,
tabulate f i j = table !! i !! j
where table = ...
would defeat it's purpose.
Of course, accessing the (i,j)-th element of the table takes O(i+j)
time. Another options is to use (boxed, immutable) arrays for the table
which offer O(1) access. There's an example for this on the wiki page
http://haskell.org/haskellwiki/Edit_distance
> The second question regards lazy evaluation. Consider the stupid grammar
> S->S S
> S->A A
> A->a
>
> that generates a^(2n).
>
> The performance of the algorithm drops very fast even for small n, probably
> because the gs i j are getting very large.
That would be strange. I mean, no gs i j may have more than two
elements, namely S or A. The other key point of the CYK algorithm is
that the sets gs i j are indeed sets and may only contain as many
elements as there are nonterminals.
In your case however, nonterminals get chosen multiple times (for each
production and, worse, for each k !). Be sure to keep every nonterminal
only once. Again, you can choose whether to represent the sets of
nonterminals by lists or with Data.Set or whatever.
Here's a one (suboptimal) way to do it with lists
gs = tabulate gs'
where
gs' 1 j = [nt | p<-prods, termProd p,
let (nt, Left t)=p, w !! (j-1)==t ]
gs' i j = [nt | p<-prods, not (termProd p)
let (nt, Right (a, b))=p,
not $ null
[k | k<-[1..i-1],
a `elem` gs k j,
b `elem` gs (i-k) (j+k)]]
The gs i j are still not sets, some nonterminals may appear multiple
times. But they have at most length prods elements instead of
previously length prods * n elements.
> Is there a trick to get lazy evaluation into play here?
> It is sufficient to find only one occurence of the start
> symbol in gs n 1.
The fact that the gs i j have to be sets is not related to laziness.
(Although the code above exploits that (not $ null [k | ...]) returns
True as soon as possible thanks to lazy evaluation).
Regards,
apfelmus
More information about the Haskell-Cafe
mailing list