[Haskell-cafe] Finding longest common prefixes in a list
Twan van Laarhoven
twanvl at gmail.com
Fri Jan 20 19:57:27 CET 2012
On 20/01/12 18:45, Gwern Branwen wrote:
> Recently I wanted to sort through a large folder of varied files and
> figure out what is a 'natural' folder to split out, where natural
> means something like>4 files with the same prefix.
My idea for an algorithm would be: build a trie for the input strings,
and then look for the deepest subtries with more than one child.
For example, a trie containing the strings
chorus-kiminoshiranaimonogatari.ogg
chorus-mrmusic.ogg
choucho-lastnightgoodnight.ogg
looks like:
<root> (3 items)
c (3 items)
h (3 items)
o (3 items)
r (2 items)
u (2 items)
s (2 items)
- (2 items)
k (1 item)
i (1 item)
minoshiranaimonogatari.ogg
m (1 item)
r (1 item)
music.ogg
u (1 item)
c (1 item)
ho-lastnightgoodnight.ogg
Where actually the lines with more than one character are also subtrees
of subtrees of subtrees.
Here is some example code (untested):
import qualified Data.Map as Map
-- A trie datatype
data Trie a = Trie { numLeafs, numDescendant :: !Int
, children :: Map.Map a (Trie a) }
-- The empty trie
empty :: Trie a
empty = Trie 0 0 Map.empty
-- A trie that contains a single string
singleton :: Ord a => [a] -> Trie a
singleton [] = Trie 1 1 Map.empty
singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs)
-- Merge two tries
merge :: Ord a => Trie a -> Trie a -> Trie a
merge (Trie l d c) (Trie l' d' c')
= Trie (l+l') (d+d') (Map.unionWith merge c c')
fromList :: Ord a => [[a]] -> Trie a
fromList = foldr merge empty . map singleton
toList :: Ord a => Trie a -> [[a]]
toList (Trie l _ c)
= replicate l []
++ [ x:xs | (x,t) <- Map.toList c, xs <- toList t ]
data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a }
atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a]
atLeastThisManyDescendants minD trie@(Trie l d t)
| d < minD = []
| null forChildren = [Prefix [] trie]
| otherwise = forChildren
where
forChildren = [ Prefix (x:pfx) names
| (x,t) <- Map.toList c
, Prefix pfx names <- atLeastThisManyDescendants n t ]
Twan
More information about the Haskell-Cafe
mailing list