[Haskell-cafe] Fwd: Efficient records with arbitrarily many fields [was: Extensible states]

Ben Franksen ben.franksen at online.de
Mon Jul 6 18:08:43 UTC 2015


Alexander V Vershilov wrote:
> On 6 July 2015 at 10:51, Frank Staals <frank at fstaals.net> wrote:
>> Alexander V Vershilov <alexander.vershilov at gmail.com> writes:
>>
>>> Not sure that it will make sense at the cafe level, but here is my
>>> understanding of the current state. For the problem of "anonymous
>>> records", i.e. some datatype whose fields and field types may vary,
>>> we have the following approaches:
>>>
>>> 1. vinyl/hlist/etc., i.e. GADT-based heterogeneous lists. These have
>>> O(N) random access and may not be a good fit for the "anonymous
>>> record" use case.
>>>
>>> 2. records: polymorphic data structures with type-level information
>>> about field names. Here we have a lot of data structures, one per
>>> number of fields, times two (lazy and strict variants). This one has
>>> O(1) random access.
>>>
>>> 3. fixed-vector-hetero/"wheel above": array-based structures with
>>> O(1) access. I call the code above a "wheel" because there have been
>>> a number of similar implementations here and there.
>>
>> Somehow I feel the most promising approach is missing from this list:
>> the 'Just do it while compiling!' paper[1] describes how to get O(log n)
>> query and update by essentially using heterogeneous binomial trees. I
>> think I also saw a Haskell implementation in a GitHub gist somewhere,
>> but I could not find the link at the moment.
> 
> Yes, I tend to agree that this approach should be noted; I was just
> unaware of that article. It seems a better fit for the functional
> paradigm than the "array"-based ones and may have better properties
> when there are many updates and no fusion is implemented.
> 
>> Having said that, I think the Big-O notation used here is misleading,
>> since in 99.999% of all cases we have constant-sized records. Therefore,
>> all approaches give O(1) access/update in these cases. If one really
>> wants to know what is fastest, benchmarking is the only way to go here.
> 
> Benchmarking in the current case is always the answer, as for some
> use cases memory overhead and locality may matter, and
> vector/data-structure approaches will give better results.

I have read the paper. Of course, whatever we can do at compile time we 
should do. They can't perform any magic, though, and they don't promise 
any. If you want O(1) access, you have to buy it with O(n) extension.
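
To make that trade-off concrete, here is a minimal, untyped sketch (my
own illustration, not code from the paper or from any of the libraries
mentioned; the names ARec/LRec and the functions are made up): an
array-backed record in the spirit of approach 3, where access is one
O(1) index but adding a field copies all n slots, next to a list-backed
one in the spirit of approach 1, where it is exactly the other way
around. A real library would hide the unsafeCoerce behind a type-level
field index.

import qualified Data.Vector as V
import           GHC.Exts (Any)
import           Unsafe.Coerce (unsafeCoerce)

-- Array-backed record (approach 3): all fields in one boxed array.
newtype ARec = ARec (V.Vector Any)

-- O(1): a single index into the array. The caller must know the
-- field's type; a real library derives it from a type-level index.
aIndex :: ARec -> Int -> a
aIndex (ARec v) i = unsafeCoerce (v V.! i)

-- O(n): V.cons has to copy the whole array to add one field.
aExtend :: a -> ARec -> ARec
aExtend x (ARec v) = ARec (V.cons (unsafeCoerce x) v)

-- List-backed record (approach 1): a spine of cons cells.
newtype LRec = LRec [Any]

-- O(n): lookup walks the spine one cell at a time.
lIndex :: LRec -> Int -> a
lIndex (LRec xs) i = unsafeCoerce (xs !! i)

-- O(1): extension is a single cons; nothing is copied.
lExtend :: a -> LRec -> LRec
lExtend x (LRec xs) = LRec (unsafeCoerce x : xs)

Whichever operation you make O(1), the other one pays; the compile-time
machinery only decides how that cost is hidden from the types.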

Anyway, asymptotic behavior is really not the issue here. I guess we can 
safely assume that the large majority of use cases will have fewer than 10 
members and perhaps 99.9% will have fewer than 100. Penalizing those 99.9% 
of usages with high constant factors only so that a record with, say, 
100000 members has reasonable performance would be a mistake, IMO. This is 
a situation quite different from the one for finite maps, which we expect 
to work reasonably fast with large numbers of keys.

My guess is that the O(log n) tree solution in the paper is a bit too 
complex to have really good constant factors, but that's no more than a gut 
feeling. I agree that benchmarking is the only way to be sure.
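
For what it's worth, a criterion harness along the following lines (my
own sketch, not something from the thread) would settle the
constant-factor question. Here a plain list and a Data.Vector merely
stand in for the spine-based and array-based record representations
actually under test:

import           Criterion.Main
import qualified Data.Vector as V

-- Hypothetical micro-benchmark: read the last of ten fields, a
-- plausible worst case for the small records discussed above.
main :: IO ()
main = defaultMain
  [ bgroup "get-10th-field"
      [ bench "list spine" $ whnf (!! 9) xs
      , bench "flat array" $ whnf (V.! 9) vec
      ]
  ]
  where
    xs  = [0 .. 9] :: [Int]
    vec = V.fromList xs

Compiled with -O2, this should make the difference between a pointer
chase per field and a single indexed read directly visible.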

So, instead of talking about O(1), let me restate and say that I would prefer 
to have records with "array-like" performance for the basic operations. I 
could live with "tree-like" if the constant factors are small enough.

Cheers
Ben
-- 
"Make it so they have to reboot after every typo." ― Scott Adams



