set representation question

Tom Pledger Tom.Pledger at peace.com
Wed Nov 12 15:00:38 EST 2003


Hal Daume III writes:
 :
 | *all* i care about is being able to quickly calculate the size of
 | the intersection of two sets.  these sets are, in general, very
 | sparse, which means that the intersections tend to be small.
 | 
 | for example, i might have two sets reprsented by the arrays:
 | 
 |  {0,1,10,346,398,1039,3289,3853,9811,89231,50913}
 |  {0,3,98,183,398,1038,5319,7642,9811,13893,93123}
 | 
 | and all i need to be able to do is respond with "3" very very
 | quickly.  using sorted arrays, this takes O(n+m) time (where n and
 | m are the sizes of the arrays).
 | 
 | i'm willing to pay some up front cost for creating a data structure
 | that could do this more quickly (i.e., in logarithmic time).
 | 
 | is such a thing possible?  "does anyone have a haskell
 | implementation of" such a thing?  :P

Hi.

The total time (including the up front time for building the data
structure) can't go below O(n+m), because if it did, you'd be
neglecting to look at some of the elements at all.

To move some of the effort up front, I suspect you're looking at a
compression problem in disguise.  For example, if your data weren't so
sparse, you might try something like this:

    IntSet = [(Int, Int)]    -- ordered list of non-overlapping ranges

    setToList s = concat [[a..b] | (a, b) <- s]

    intersectionSize :: IntSet -> IntSet -> Int
    intersectionSize s1@((a, b):s1') s2@((c, d):s2')
        = rangeSize (max a c, min b d) +
          case compare b d of
              LT -> intersectionSize s1' s2
              EQ -> intersectionSize s1' s2'
              GT -> intersectionSize s1  s2'
    intersectionSize _ _ = 0

So... how *compressible* are your data?

- Tom



More information about the Haskell mailing list