[Haskell-cafe] Re: CYK-style parsing and laziness

Sat May 26 04:57:45 EDT 2007

Steffen Mazanek wrote:
> apfelmus wrote
>> The key point of the dynamic programming algorithm is indeed to memoize
>> the results gs i j for all pairs of i and j. In other words, the insight
>> that yields a fast algorithm is that for solving the subproblems gs i j
>> (of which there are n^2), solution to smaller subproblems of the same
>> form are sufficient. Tabulation is a must.
>
> I underst and this, however I thought it would be possible to use the
> automatic collapsing of the termgraphs in some way.

Well, some things have to be left to the programmer :), especially the
choice of trading space for time.

Note that there are very systematic and natural ways to derive dynamic
programming algorithms in functional languages. In a sense, much of the
work of R. Bird centers this topic. The book "Algebra of Programming"

http://web.comlab.ox.ac.uk/oucl/research/pdt/ap/pubs.html#Bird-deMoor96:Algebra

is one of the cornerstones.

The systematic derivation of dynamic programming algorithms has been
rediscovered in a more direct but less general fashion

   http://bibiserv.techfak.uni-bielefeld.de/adp/

> Of course, you can still choose how to represent the table. There's a
>> nice higher order way to do that
>>
>>   tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)
>>
>>   gs = tabulate gs'
>>      where
>>      gs' 1 j = ... uses  gs x y  for some  x y ...
>>      gs' i j = ... ditto ...
> 
> 
> Thank you for this explanation. Your approach is not very concise either
> but it does not pollute the algorithm so much.

Oh? It's concise enough for me :) The nice thing about an explicit
'tabulate' is that you can separate the table and the entry calculations
completely.

> That would be strange. I mean, no gs i j may have more than two
>> elements, namely S or A. The other key point of the CYK algorithm is
>> that the sets gs i j are indeed sets and may only contain as many
>> elements as there are nonterminals.
> 
> 
> ....  You are right, of course. I have tried a nub before the list
> comprehension however this is evaluated too late.

Yes, the nub has to eliminate storing a nonterminal "for every k". But
this can be done in advance by noting that we're only interested in
whether there exists at least one k

  [nt | ..., not $ null [k | k<-[1..i-1],
                             a `elem` gs k j,
                             b `elem` gs (i-k) (j+k)]]

and don't want to emit a nonterminal for which k

  [nt | ..., k <- [1..i-1],
             a `elem` gs k j,
             b `elem` gs (i-k) (j+k)]

> I should really use sets, however, I would miss the list comprehension
> syntactic sugar sooo much. Is there something similar for real Data.Set?

Not that I knew of, but you can always use

  fromList [nt | ...]

Note that the Data.Set can be thought of as being part of the
tabulation. In effect, you really deal with a 3-dimensional truth table

  gs i j nt = substring starting at j of length i can be derived
              by nonterminal nt

Here's an implementation of the tabulation with Data.Set

  tabulate :: (Int -> Int -> NT -> Bool) -> (Int -> Int -> NT -> Bool)
  tabulate gs' = \i j nt -> nt `member` table ! (i,j)
     where
     table = array bnds [(ij, mkSet $ gs' i j) | ij@(i,j) <- range bnds]
     mkSet chi = fromList [nt | nt <- nts, chi nt]

And here's the memoized function

  gs = tabulate gs'
     where
     gs' 1 j nt = any [True | Left t <- productions nt,
                                  w !! (j-1) == t]
     gs' i j nt = any [True | Right (a,b) <- productions nt,
                                  gs k     j     a,
                                  gs (i-k) (j+k) b]

It assumes a function (dependent on the grammer at hand)

   productions :: NT -> [Either T (NT, NT)]

that returns the terminal and nonterminal productions for a given
nonterminal.

Regards,
apfelmus