[Haskell-cafe] Re: Slower with ByteStrings?

Tue May 29 10:26:31 EDT 2007

Mirko Rahn wrote:
>>> from the letters of that word.  A letter can be used at most as many
>>> times as it appears in the input word.  So, "letter" can only match
>>> words with 0, 1, or 2 t's in them.
> 
>>    frequencies = map (\x -> (head x, length x)) . group . sort
>>    superset xs = \ys -> let y = frequencies ys in
>>         length y == lx &&
>>         and (zipWith (\(c,i) (d,j) -> c == d && i >= j) x y)
>>       where
>>       x  = frequencies xs
>>       lx = length x
> 
> As far as I understand the spec, this algorithm is not correct:
> 
> superset "ubuntu" "tun" == False
> 
> Is at least one 'b' necessary, yes or no?

Oops, you are indeed right, the answer should be "no". I thought I'd
came away without primitive recursion, but here's a correct version

  superset xs = superset' x . sort ys
    where
    x = sort xs

    _      `superset`  []     = True
    []     `superset`  _      = False
    (x:xs) `superset'` (y:ys)
        | x == y    = xs `superset` ys
        | x <  y    = xs `superset` (y:ys)
        | otherwise = False

> If the answer is no, the
> following algorithm solves the problem and is faster then the one above:
> 
> del y = del_acc []
>     where del_acc _ []              = mzero
>       del_acc v (x:xs) | x == y = return (v++xs)
>       del_acc v (x:xs)          = del_acc (x:v) xs
> 
> super u = not . null . foldM (flip del) u
> 
> main = interact $ unlines . filter ("ubuntu" `super`) . lines

The algorithm is correct but it's not faster, xs `super` ys  takes
O(n*m) time whereas superset takes O(n * log n + m * log m) time given a
proper sorting algorithm. Here, n = length xs and m = length ys.

Actually, both algorithms are essentially the same except for the
sorting that allows to drop some equality tests.

(Note that memoizing x = sort xs over different ys speeds things up a
bit for the intended application. This way, (sort "ubuntu") is only
computed once and the running time over many ys approaches O(n + m*log m).)

Regards,
apfelmus

PS: Some exercises for the interested reader:
1) Still, the algorithm super has an advantage over superset. Which one?
2) Put xs into a good data structure and achieve a O(m * log n) time for
multiple ys.
3) Is this running time always better than the aforementioned O(n +
m*log m)? What about very large m > n?