happstack-ixset internals/performance (was Re: [Haskell-cafe] Inverse of HaskellDB)

Jeremy Shaw jeremy at n-heptane.com
Sat Oct 2 19:22:33 EDT 2010


Hello,

Check out these threads:

http://groups.google.com/group/happs/browse_thread/thread/23d92e45c99f88b1
http://groups.google.com/group/happs/browse_thread/thread/0b0d0a9158c3ad73

There is nothing inherently wrong with keeping 28GB of real data in
memory. It depends largely on what you are trying to optimize for.

For personal hobby projects, the cost of hosting seems to be the
biggest expense people are concerned with (which is quite reasonable).
As a result, people are willing to increase development time and
sacrifice performance in order to lower hosting costs (also
reasonable).

For commercial projects, the costs look different. For example, if
buying a server with 1TB of RAM (around $80,000 these days) means you
can hire one less developer (at $150,000/year when you look at the
total cost of salary, benefits, etc.), that is a savings of $70,000.
Additionally, in-RAM storage provides much more predictable
performance. If you have ever tried to use a site like reddit after
its memcached servers have been reset, it is clear how bad the penalty
for hitting disk is. RAM is significantly faster than disk and has no
seek time; it can achieve 100-1000x lower latency and higher
throughput.

Facebook reportedly keeps as much as 80-90% of their working dataset
in memcached. In 2008 they had 28TB of memcached servers. No idea what
they are up to now
(http://www.facebook.com/note.php?note_id=39391378919).

Of course, there are datasets which are simply too big to be stored in
RAM, such as Google's search index. I am not that familiar with their
approach, but I believe they go to the opposite extreme: they likely
assume that every query is going to hit disk, and optimize the system
so that it can provide acceptable response times even when nothing is
cached. (I am totally guessing here.)

The focus of the new IxSet internals is to help bring the ratio of
index overhead to data size down (and increase speed at the same
time). Using the figures from Thomas's analysis below, that ratio is
currently on the order of 192 bytes of Data.Map index per 28-byte
value, i.e. roughly 7:1.

There are also some ideas, but no code yet, for how to build an
IxSet-like structure which stores the keys in RAM but can store the
values on disk. That would, hopefully, give you the ease of use of
IxSet, but with lower memory requirements if your keys are
significantly smaller than the total value payload. The trade-off is
that you get back into the ugly work of disk seeks :)
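
Just to make that idea concrete, here is a rough sketch of what such a
structure might look like. This is not IxSet or happstack code, and all
the names are made up for illustration: a Data.Map from keys to offsets
into an append-only value file, paying a seek and a read per lookup.

  -- (illustrative sketch only; not actual IxSet/happstack code)
  import qualified Data.Map as Map
  import qualified Data.ByteString as BS
  import System.IO

  -- Keys (and the index over them) live in RAM; the serialized
  -- values live in an append-only file on disk.
  data DiskValueMap k = DiskValueMap
    { dvmIndex  :: Map.Map k (Integer, Int)  -- key -> (file offset, length)
    , dvmHandle :: Handle                    -- handle to the value file
    }

  -- Append the serialized value and remember where it landed.
  insertValue :: Ord k => k -> BS.ByteString -> DiskValueMap k -> IO (DiskValueMap k)
  insertValue k bytes m = do
    hSeek (dvmHandle m) SeekFromEnd 0
    off <- hTell (dvmHandle m)
    BS.hPut (dvmHandle m) bytes
    return m { dvmIndex = Map.insert k (off, BS.length bytes) (dvmIndex m) }

  -- The key lookup itself is an ordinary in-RAM Data.Map lookup;
  -- fetching the value costs a disk seek and a read.
  lookupValue :: Ord k => k -> DiskValueMap k -> IO (Maybe BS.ByteString)
  lookupValue k m =
    case Map.lookup k (dvmIndex m) of
      Nothing         -> return Nothing
      Just (off, len) -> do
        hSeek (dvmHandle m) AbsoluteSeek off
        bytes <- BS.hGet (dvmHandle m) len
        return (Just bytes)

The memory cost is then proportional to the number of keys rather than
the total value payload, which is exactly the win described above.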

There was also some experimental work done where the values were
stored in RAM, but in their serialized byte format, under the
assumption that the serialized format is a lot more compact than the
normal representation. The trade-off is that the values must be
deserialized every time they are used, which requires more CPU. A more
complex version could be imagined, where the deserialized value is
cached for some period of time. Whether that is actually beneficial
can only be determined empirically, I think...
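
A stripped-down sketch of that experiment, using Data.Binary for the
serialization (again, the names here are made up, and happstack's real
serialization layer is different):

  -- (illustrative sketch only)
  import qualified Data.Map as Map
  import qualified Data.ByteString.Lazy as L
  import Data.Binary (Binary, encode, decode)

  -- Values stay in RAM, but only as their compact serialized bytes.
  newtype CompactMap k v = CompactMap (Map.Map k L.ByteString)

  emptyCompact :: CompactMap k v
  emptyCompact = CompactMap Map.empty

  insertCompact :: (Ord k, Binary v) => k -> v -> CompactMap k v -> CompactMap k v
  insertCompact k v (CompactMap m) = CompactMap (Map.insert k (encode v) m)

  -- Every read pays the CPU cost of decoding the value again.
  lookupCompact :: (Ord k, Binary v) => k -> CompactMap k v -> Maybe v
  lookupCompact k (CompactMap m) = fmap decode (Map.lookup k m)

The caching variant mentioned above would sit on top of this: memoize
the result of decoding per key and evict it after some period of time.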

So, clearly happstack-state is not the optimal choice for all Haskell
web applications. That is why it is completely independent from
happstack-server: we want to allow people to pick the best choice for
their application. Of course, if we want to provide off-the-shelf
components like a user account system, then eventually we have to nail
down some specifics, such as the persistence layer. But those choices
would only apply to code that uses those optional components.

At the same time, happstack-state is a very interesting and
challenging library to work on. The recent rise in popularity of
things like redis seems to validate happstack-state somewhat.

- jeremy


On Sat, Oct 2, 2010 at 3:09 PM, Thomas M. DuBuisson
<thomas.dubuisson at gmail.com> wrote:
> Thanks Jeremy, I just wrote up my own little analysis (below) while you
> were responding.  I'll look for the kd-tree work; if I see discussion
> (and am stupid enough to heap more work onto my plate) then I might get
> involved.
>
> Oops, didn't send...
>
> Cheers,
> Thomas
>
> -----
>
> So another glance tells me there is a list of maps (one element for each
> index method) and it uses Data.Map under the hood.  So I have O(m lg n)
> where m is the number of index methods and n is the number of elements.
>
> Space-wise, I think Data.Map takes up 6 words per Bin constructor (1
> for the constructor header, 1 for the 'Size', and one pointer for each
> of the four remaining fields), so the space is "6 * n * m * w" where w
> is the word size.  This means indexing 1M entries by 5 methods takes
> about 240MB; assuming 28B per entry, that's 28MB of data + 240MB of
> indexing ~ 268MB needed.
>
> Indexing my imaginary 1B points by user,date,lat,lon is 6 * 2^30 * 4 * 8
> - or about 192 GB of indexing + 28GB of data for 220GB total.  Obviously
> I shouldn't be talking about keeping a live data set of 28GB in memory
> let alone indexing it all, but I was just curious about the ratio (220MB
> for 1M points, which is just one heavy user).
>
>
>
> On Sat, 2010-10-02 at 14:09 -0500, Jeremy Shaw wrote:
>> In the current version of IxSet, the performance of querying on just
>> the Lon would be essentially the same as if you just had a "Data.Map
>> Lon Point". But the queries on the second index are current not so
>> great. There is work in progress to rewrite the internals of IxSet to
>> be based on a kd-tree, in which case your query should be pretty
>> efficient.
>>
>> So, that answer is pretty vague :) I am in the process of wrapping up
>> happstack 0.6, which has focused on fixing some performance issues with
>> happstack-server and refactoring the code so that the user API and
>> internals are more clearly separated and better documented.
>>
>> happstack 0.7 is all about happstack-state. A key aspect will be
>> nailing down some solid performance benchmarks instead of vague hand
>> waving :)
>>
>> The numbers you give are certainly within the scope of what we would
>> like 0.7 to be able to handle. Also, I should note that
>> happstack-state and happstack-ixset are independent from each other.
>> You can easily use something other than IxSet to store your points and
>> still use happstack-state.
>>
>> - jeremy
>>
>> On Fri, Oct 1, 2010 at 1:53 PM, Thomas M. DuBuisson
>> <thomas.dubuisson at gmail.com> wrote:
>> >> That is pretty close to how it would look using happstack-state. Here
>> >> is a complete, runnable example which defines the types, a query,
>> >> creates/initializes the database, performs the query, and prints the
>> >> results.
>> > [snip]
>> >
>> > How is data stored in Happstack.State?  I see the "Component" instance
>> > uses "fromList" from happstack-ixset but can't find any information on
>> > the algorithm used or its efficiency (computationally or wrt space).
>> >
>> > If making this more concrete helps then here is a possible use:
>> >
>> > == GPS Points ==
>> > I have a GPS logger that logs every 10 seconds when I jog.  Jogging for
>> > an hour a day for the past 180 days has resulted in 64k points.
>> > Pretending I hosted a site for joggers (and all points were in the same
>> > DB), that could easily result in a billion points (< 20K users).  Would
>> > happstack-ixset code in the form "points @< (Lon -120) @> (Lon -125) @>
>> > (Lat 45) @< (Lat 50)" perform reasonably?
>> >
>> > Cheers,
>> > Thomas
>> >
>> >
>
>
>

