happstack-ixset internals/performance (was Re: [Haskell-cafe] Inverse of HaskellDB)

Sat Oct 2 18:49:54 EDT 2010

Hi, Thomas.

> Thanks Jeremy, I just wrote up my own little analysis (below) while you
> were responding.  I'll look for the kd-tree work; if I see discussion
> (and am stupid enough to heap more work onto my plate) then I might get
> involved.

You can find the repository for the dynamic kd-tree implementation here [1].
I'm currently rewriting large parts of the core algorithms (balancing and multi
key traversal) and the implementation is far from being complete/usable.  Once
I'm done with these changes it's time for some serious benchmarking.  The
kd-tree implementation does seem to scale, as the last working version
outperformed Data.Map w.r.t. space and time when considering large data sets
(~1000000 elements). For single-key queries on small data sets, however, IxSet
is currently still faster while memory consumption is about the same.

I think the main advantage of using a kd-tree vs multiple Data.Maps, is that a
query involving multiple keys can still happen in O(log n) time, as the tree
needs to be traversed only once. Also, when an element is modified, most of the
m Data.Maps need to be rebuilt (i.e. O(m*n*log n)) because the indexing
information might be out of date. (This might have been optimized in recent
versions of happstack-ixset.) For the kd-tree we can get away with rebalancing a
subtree of some size k which takes O(k*log k) time.

  Peter

[1] http://darcs.monoid.at/kdtree/

> -----
>
> So another glance tells me there is a list of maps (one element for each
> index method) and it uses Data.Map under the hood.  So I have O(m lg n)
> where m is the number of index methods and n is the number of elements.
>
> Space wise, I think Data.Map takes up 6 words per Bin constructor (1 for
> the constructor, 1 for the 'Size' and one for the pointer indirection
> for each additional field), so the space is "6 * n * m * w" where w is
> the word size.  This means indexing by 5 methods for 1M entries takes
> about 256MB, assuming 28B per entry that's 28MB of data + 256 indexing ~
> 282MB needed.
>
> Indexing my imaginary 1B points by user,date,lat,lon is 6 * 2^30 * 4 * 8
> - or about 192 GB of indexing + 28GB of data for 220GB total.  Obviously
> I shouldn't be talking about keeping a live data set of 28GB in memory
> let alone indexing it all, but I was just curious about the ratio (220MB
> for 1M points, which is just one heavy user).
>
>
>
> On Sat, 2010-10-02 at 14:09 -0500, Jeremy Shaw wrote:
>> In the current version of IxSet, the performance of querying on just
>> the Lon would be essentially the same as if you just had a "Data.Map
>> Lon Point". But the queries on the second index are current not so
>> great. There is work in progress to rewrite the internals of IxSet to
>> be based on a kd-tree, in which case your query should be pretty
>> efficient.
>>
>> So, that answer is pretty vague :) I am in the process of wrapping up
>> happstack 0.6 which has focused on fixing some performance issues with
>> happstack-server, and refactoring the code so that user API and
>> internals are more clearly separated and better documented.
>>
>> happstack 0.7 is all about happstack-state. A key aspect will be
>> nailing down some solid performance benchmarks instead of vague hand
>> waving :)
>>
>> The numbers you give are certainly within the scope of what we would
>> like 0.7 to be able to handle. Also, I should note that
>> happstack-state and happstack-ixset are independent from each other.
>> You can easily use something other than IxSet to store your points and
>> still use happstack-state.
>>
>> - jeremy
>>
>> On Fri, Oct 1, 2010 at 1:53 PM, Thomas M. DuBuisson
>> <thomas.dubuisson at gmail.com> wrote:
>> >> That is pretty close to how it would look using happstack-state. Here
>> >> is a complete, runnable example which defines the types, a query,
>> >> creates/initializes the database, performs the query, and prints the
>> >> results.
>> > [snip]
>> >
>> > How is data stored in Happstack.State?  I see the "Component" instance
>> > uses "fromList" from happstack-ixset but can't find any information on
>> > the algorithm used or its efficiency (computationally or wrt space).
>> >
>> > If making this more concrete helps then here is a possible use:
>> >
>> > == GPS Points ==
>> > I have a GPS logger that logs every 10 seconds when I jog.  Jogging for
>> > an hour a day for the past 180 days has resulted in 64k points.
>> > Pretending I hosted a site for joggers (and all points were in the same
>> > DB) I could easily result in a billion points (< 20K users).  Would
>> > happstack-ixset code in the form "points @< (Lon -120) @> (Lon -125) @>
>> > (Lat 45) @< (Lat 50)" perform reasonably?
>> >
>> > Cheers,
>> > Thomas
>> >
>> >
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>