Efficient way to code "(symbol,multiplicity)-counting" - how?

Alastair Reid reid@cs.utah.edu
28 May 2002 17:34:40 +0100


>  > > type Multiplicity a = [(a,Int)]
> 
> Can we say "Multiplicity a" *is* "[(a,Int)]", or do we say
> "Multiplicity a" *is_a_distinct_yet_identical_copy_of* "[(a,Int)]"?

Type synonyms are like typedef or #define in C: they create a fresh
name for an already existing type.  Use newtype if you want to create
fresh types.

> Does it make a difference whether you write "filter (==a) as" or
> "filter (a==) as"?

The two versions of the predicate translate to:

  (== a) ~~~> flip (==) a  ==  \ x -> (x == a)
  (a ==) ~~~> (==) a       ==  \ x -> (a == x)

These are equivalent since any sensible instance of == is reflexive.

There's probably a marginal gain in efficiency from using (a ==) when
the compiler doesn't inline the definition of filter and no gain at
all when it does.

>> counts :: Eq a => [a] -> Multiplicity a
>> counts as = [ (a, length (filter (==a) as)) | a <- List.nub as ]

> What do you think of the following as an alternative definition of
> counts?
> 
>    counts [] = []
>    counts (a:as) = (a,b+1) : counts cs where (b,cs)=strip a as
> 
>    strip :: Eq a => a -> [a] -> (Int,[a])
>    strip a [] = (0,[])
>    strip a (b:bs) = if (a==b) then (c+1,ds) else (c,b:ds)
>        where (c,ds) = strip a bs
> 
> I am trying to work out how to code fast and memory efficient
> haskell.  Is the above a good approach?

Note that you're writing a slightly different function - the functions
give different results for:

  counts ['a','b','a','a']

but that difference probably won't affect your code.


As for coding style, if you care about performance this much, you
should use GHC.  GHC should do a pretty good job at optimizing code
like this (which is a slightly more concise version of yours).

  counts [] = []
  counts (a:as) = (a, length xs):counts ys
   where
    (xs,ys) = takeWhile (a==) as

>  > Symbols are elements of a methematical structure that have a first
>  > element, a final element and an increment function.
>  >
>  > [I'm making this a 1st class structure because I want to be able to
>  > share the symbols structure between multiple invocations of next.  I
>  > will use this structure a lot like the way I would a typeclass -
>  > except that I will explicitly create my own instance.]
> 
> I'm a little unsure about what you are saying here.  Am I right in
> thinking a 1st class structure is one that may be thought of as data?

Yes, that's what I meant.

> What is the alternative here?  Are you saying that by defining such a
> structure, you can calculate the concepts once, and then pass them
> around, rather than calculating them at each step of the process?

Yes.  

I could have searched for the last element in the list each time round
the loop (erm, I mean 'on each recursive call to incF') but that would
have been horribly inefficient.

Or I could have searched for the last element in the list once for
each time incF was non-recursively invoked:

  incF ms x = incF' x
   where
    first = ...
    last = ...

    incF' x = ... incF'...

but that would be a bit inefficient too.

>  > The nxt function is a bit inefficient.  We're hampered here by
>  > polymorphism: if all you can do is an equality test, you can't do
>  > better than a linear time lookup.  A binary tree could be used instead
>  > of the zip if we had an Ord instance; an array if we had an Ix
>  > instance.
> 
> But we can assume that the "digits" are ordered, this ordering given
> by the order in which they occur in the multiplicity.  Is there a way
> of using this to make the digits an Ord instance?  And if so, how do you
> do the binary tree?


-- http://www.haskell.org/ghc/docs/latest/set/finitemap.html

import FiniteMap

mkTree :: Ord a => [(a,b)] -> FiniteMap a b
mkTree abs = listToFM abs

getTree :: Ord a => a -> FiniteMap a b -> -> Maybe b
getTree a t = lookupFM t a



Arrays (http://www.haskell.org/onlinelibrary/array.html) would work if
you can _efficiently_ turn your index values into Ints but if you can
do that, you probably have an Enum instance

  http://www.haskell.org/onlinereport/basic.html
  http://www.haskell.org/onlinereport/standard-prelude.html#$tEnum

>  > > testF f = putStr $ showFigures (take 110 f)
> 
> What does the "$" do in the above?

In

  http://www.haskell.org/onlinereport/standard-prelude.html#$v$D

you'll se it has this pointless-looking definition:
 
  f $ x = f x

Inlining this in testF gives:

> testF f = putStr (showFigures (take 110 f))

which shows that I'm using it to avoid having too many parentheses.

>  > We've also ignored the importance of the multiplicity constraint.
>  > We can enforce this by discarding any result of incF2 which fails
>  > the constraint.
>  >
>  > > incF3 :: Eq a => Multiplicity a -> Symbols a -> Figure a -> Figure a
>  > > incF3 m s f = head (filter (countok m) (tail (iterate (incF2 m s) f)))
> 
> The filter combined with a check that the multiplicity constraint is
> satisfied will work, but how efficient is it?  I am guessing that it
> will depend how many are rejected.  If most are rejected then it's
> probably inefficient, but if only a few are, then it's the best
> way.  Are there any other pros and cons with this approach?

I think that's the only con.  The pro is that it's easy to write.

> I am thinking that maybe a more efficient algorithm, in the case where
> lots are expected to be rejected in the above, would be one involving
> a dynamically changing multiplicity.  Ie, when a symbol is chosen, the
> multiplicity is modified to reduce the corresponding multiplicity by
> 1.  Of course, maybe I'm just thinking too much in the imperative
> framework still --- where the multiplicity would be represented as an
> array of values that could be reassigned.  The problem seems to be
> that lazy lists are not good when you want to do "random access
> updates", which is roughly what we want to do with a multiplicity
> list.  Are there well known Haskell solutions to this kind of issue?

Yup, that's the algorithm I was thinking of.  The structure of the
recursion would be something like:

  incF3 m s f = ... incF3 m' s f'
   where
    m' = remove x m

If the multiplicity is small an association list [(a,b)] or binary
tree will do - update it by copying.  For medium sized multiplicities,
an array is probably a good bet - again, update by copying.  I think
there's an implementation of arrays with efficient update kicking around
in hslibs

If the multiplicity is large, you can use mutable arrays

  http://www.haskell.org/ghc/docs/latest/set/sec-marray.html

at the cost of having to learn what monads are.

It should also be possible to provide immutable arrays with constant
time update and lookup (so you would not have to learn about monads)
but I don't see such a beast in the HS libraries
(http://www.haskell.org/haddock/libraries/index.html)

> By the way, the reason I wanted my counting to wrap back to "[]" after
> getting to the maximum figure, is so the function "next" would always
> work.  But your email suggested to me an alternative solution.  I could
> just use the "Maybe" data type!  Ie, trying to do a next on the maximum
> figure just gives you the Maybe "Nothing".

I was wondering why you did the wraparound...

In fact, I was wondering why [] was in the list - it didn't seem to
belong either.

-- 
Alastair Reid        reid@cs.utah.edu        http://www.cs.utah.edu/~reid/