[Haskell-cafe] Efficient object identity (aka symbols as data)

Thu May 26 16:14:57 CEST 2011

2011/5/26 Jacek Generowicz <jacek.generowicz at cern.ch>:
>
> On 2011 May 26, at 11:16, Christopher Done wrote:
>
>> On 26 May 2011 10:45, Jacek Generowicz <jacek.generowicz at cern.ch> wrote:
>>>
>>> What is the Haskell approach to efficient comparison and lookup of
>>> objects
>>> by their identity?
>>
>> Often you just provide your own and implement Eq.
>>
>>> I should be able to run the program on data that becomes available at run
>>> time.
>>
>> Typically you define an id generator and zip anything coming from the
>> input stream up with that generator.
>
> Makes sense.
>
>>> Whatever algorithm I choose to use for the optimization, will have to do
>>> lots of comparisons of Groups and Persons where their *identity* is all
>>> that
>>> matters: you don't need to look inside the objects.
>>
>> To achieve this abstraction the usual way is just implementing Eq:
>>
>> instance Eq Person where
>>  Person{personId=id1} == Person{personId=id2} = id1 == id2
>
> Any comments on the relative efficiency of the above as compared to
>
> A == B in the context of
>
> data Foo = A | B | C | D | ... lots more ...
>
> ?
>
> (I imagine that a Sufficiently Smart Compiler could reduce (==) :: Person
> Person to just integer comparison.)

I'm pretty sure GHC will do that for you.

An approach similar to the one by Chris Done is to use a
bi-directional map between Persons and Ints along the lines of

data IdMap a = IdMap (IntMap a) (Map a Int)

You can then associate a unique Int with each of your Persons and use
this during your comparison. For associating Ints to the Persons, a
simple fold or a State monad computation suffice. For the lookups on
the further properties of Persons, an additional argument or the
Reader monad will do. If you use a State monad and a single operation
that associates an Int to a Person, then you additionally get the
guarantee (inside a monadic computation) that no two Persons will be
associated with the same Int.

Efficiency-wise, you'll have O(log(n)) association,  O(min(n,W))
access time, and O(1) comparison time with a very low constant factor.
See the IntMap documentation for the O(min(n,W)) explanation.
Additionally, the code is pure with all the nice properties that come
with it.

By the way this problem is very similar to the one of observable
sharing. See this thread:
http://www.haskell.org/pipermail/haskell-cafe/2008-February/039639.html

best regards,
Simon