[Haskell-cafe] Finding longest common prefixes in a list

Gwern Branwen gwern0 at gmail.com
Tue Jan 24 02:12:20 CET 2012


On Sat, Jan 21, 2012 at 8:18 AM, Twan van Laarhoven <twanvl at gmail.com> wrote:
> Notice that there are lots of "miku-X" prefixes found. This is probably not
> what you want. What exactly do you want the algorithm to do? For example, ""
> is obviously a prefix of every string, but it is not very long. On the other
> hand, each string is a prefix of itself, but that prefix is shared by only
> one string (usually).
>
> By the way, the sort and compare adjacent pairs approach corresponds to
> "atLeastThisManyDescendants 2".

Ah, now the code makes sense to me. It's longer, but it is a heck of a
lot more principled and readable, so I'm happy to replace my version
with yours. It's not too hard to convert it into a CLI filter with
optional depth (default of 2, replicating original behavior):

import qualified Data.Map as Map
import System.Environment (getArgs)
import Data.List (sortBy)
import Data.Ord (comparing)

main :: IO ()
main = do arg <- getArgs
          let n = if null arg then 2 else read (head arg) :: Int
          interact (unlines .  chunk n . lines)

chunk :: Int -> [String] -> [String]
chunk n = map prefix . sortByLength . atLeastThisManyDescendants n . fromList
          where sortByLength :: [CommonPrefix Char] -> [CommonPrefix Char]
                sortByLength = sortBy (comparing (numDescendant . names))
.....

And the results seem kosher (printing just the prefixes is probably
the best idea, but wouldn't be too hard to switch to printing full
filenames - just filter the original file list with the extracted
prefix from each CommonPrefix):

$ ls music/vocaloid/| runhaskell lcp.hs 5
miku-s
miku-t
miku-r
rin-
miku-a
gumi-
luka-
$ ls music/vocaloid/| runhaskell lcp.hs 4
miku-h
miku-m
miku-n
miku-p
miku-s
miku-t
miku-r
rin-
miku-a
gumi-
luka-
$ ls music/vocaloid/| runhaskell lcp.hs # with 2
chorus-
gumi-mo
gumi-s
kaito-
luka-emon
luka-t
miku-acolorlinkingworld-
miku-akayaka
miku-cleantears-remind2011natsu-
miku-dan
miku-ele
miku-galaxyodyssey-
miku-ha
miku-inn
miku-jemappelle-motion-
miku-kz-
miku-lo
miku-m at rk-
miku-plustellia-壁の彩度-
miku-ro
miku-se
miku-ta
miku-the
miku-tinyparadise-
miku-ジラートP-birthdayofeden-
miku-杯本選
miku-般若心経
niconicochorus-
yuki-
len-
luka-di
miku-re:package-
miku-n
rin-

-- 
gwern
http://www.gwern.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lcp.hs
Type: text/x-haskell
Size: 7735 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120123/046a59bb/attachment.hs>


More information about the Haskell-Cafe mailing list