[Haskell-cafe] Parsing a bunch of strings without backtracking
Olaf Klinke
olf at aatal-apotheke.de
Thu Mar 9 20:51:28 UTC 2017
The tree with internal nodes of type Char and leaves of type v is a representation of a finite state machine, a common tool in parsing certain grammars. I'm not sure your type Compact v captures exactly this, since a value v may be followed by more Chars down in the tree. You'd want
type Compact v = [Either (More v) v] -- akin to Data.Tree.Forest
data More v = More Char (Compact v) -- akin to Data.Tree.Tree
What you may want to look at is a so-called trie [1,2].
In textbooks, it is common to assume that there is a charcter '$' not element of your alphabet. Then '$' is appended to each string before compressing all into a tree structure. This way, one can put both a string and a prefix of it into the same tree.
A special kind of trie is the suffix trie. It holds all suffixes of a string and facilitates fast full-text search. More efficient variants are extensively used in bioinformatics [3].
Olaf
[1] http://hackage.haskell.org/package/word-trie-0.3.0/docs/Data-Trie.html
[2] https://en.wikipedia.org/wiki/Trie
[3] http://mummer.sourceforge.net
More information about the Haskell-Cafe
mailing list