set representation question

ajb at spamcop.net ajb at spamcop.net
Tue Nov 11 20:56:40 EST 2003


G'day all.

Quoting Hal Daume III <hdaume at ISI.EDU>:

> i'm looking for a representation for a set of natural numbers.  right now,
> my representation is sorted array, which works well.  *all* i care about
> is being able to quickly calculate the size of the intersection of two
> sets.  these sets are, in general, very sparse, which means that the
> intersections tend to be small.

I haven't had time to analyse the complexity, but you might want to try
this representation, which uses a digital trie on the low-order bits of
the number.

If you're using Ints rather than Integers, this representation is quite
speedy, and took fewer reductions in Hugs after pre-evaluating the trees
than the equivalent list merge algorithm (which is not the same as an
array merge, and is also not necessarily an indicator of anything) on
your sample data.  If you're using Integers, you're out of luck, because
this code does arithmetic during the search, which allocates a lot of
transient heap.

Patricia tries may make the representation a bit more compact, but I
doubt they would speed up the intersection operation.  Using unboxed Ints
might help, and so might converting countIntersection to accumulator form.

Cheers,
Andrew Bromage

data DigitalTrie
    = Node DigitalTrie DigitalTrie
    | Leaf Int
    | Empty
	deriving (Show)

dtInsert :: Int -> DigitalTrie -> DigitalTrie
dtInsert x Empty
    = Leaf x
dtInsert x t@(Leaf i)
    | x == i
	= t
    | x `mod` 2 == 0
	= if i `mod` 2 == 0
	  then Node (dtInsert (x `div` 2) (Leaf (i `div` 2))) Empty
	  else Node (Leaf (x `div` 2)) (Leaf (i `div` 2))
    | otherwise
	= if i `mod` 2 == 0
	  then Node (Leaf (i `div` 2)) (Leaf (x `div` 2))
	  else Node Empty (dtInsert (x `div` 2) (Leaf (i `div` 2)))
dtInsert x (Node t1 t2)
    | x `mod` 2 == 0
	= Node (dtInsert (x `div` 2) t1) t2
    | otherwise
	= Node t1 (dtInsert (x `div` 2) t2)

flatten :: DigitalTrie -> [Int]
flatten t
    = flatten' 0 1 t []
    where
	flatten' p m Empty = id
	flatten' p m (Leaf i) = (:) (i*m+p)
	flatten' p m (Node t1 t2)
	    = flatten' p (m*2) t1 . flatten' (p+m) (m*2) t2

countIntersection :: DigitalTrie -> DigitalTrie -> Int
countIntersection Empty _ = 0
countIntersection (Leaf i) t = findIn i t
countIntersection _ Empty = 0
countIntersection t (Leaf i) = findIn i t
countIntersection (Node l1 l2) (Node r1 r2)
    = countIntersection l1 r1 + countIntersection l2 r2

findIn :: Int -> DigitalTrie -> Int
findIn x Empty = 0
findIn x (Leaf i) = if x == i then 1 else 0
findIn x (Node t1 t2)
    | x `mod` 2 == 0
	= findIn (x `div` 2) t1
    | otherwise
	= findIn (x `div` 2) t2

t1 = foldr dtInsert Empty [0,1,10,346,398,1039,3289,3853,9811,89231,50913]
t2 = foldr dtInsert Empty [0,3,98,183,398,1038,5319,7642,9811,13893,93123]

test :: Int
test = countIntersection t1 t2


More information about the Haskell mailing list