[Haskell-cafe] Data.Map - visiting tree nodes withi a given key range ?

Mon Mar 16 09:51:05 UTC 2020

Maybe off topic, my work environment deals with datasets sized 20~200 
Giga Bytes, consisting of small time series arrays mostly, I offload the 
compression (and dedup) work to ZFS (by deploying a SmartOS storage 
server managing a dozen of spinning disks with several TB capacity.

Many computing nodes run a local FUSE mount viewing those data files 
over network, as if being part of a virtual large data file in local 
filesystem, and access (mostly reads, small fraction of writes) the data 
via mmap. This way, parallel processes run on multi CPU cores of a 
single computing node share the OS' kernel page for cache of the 
dataset, a program just assumes random access to the whole dataset as 
available at somewhere within its virtual address space.

Giving just enough physical RAM (in order to prevent thrashing) to the 
storage server and computing nodes (my env currently have a typical size 
of 128GB per node), this achieves both simplicity of programming and 
efficient use of processor/memory/storage resources.

This architecture should scale well to datasets of a few TBs.

On 2020/3/16 上午3:46, Olaf Klinke wrote:
> By the way, there are tools to retrieve a certain range from compressed data, which IMHO is a very cool feature of gzip.
>
> https://www.htslib.org/doc/bgzip.html
> https://www.htslib.org/doc/tabix.html
>
> Bioinformaticians use it (among other things) for fast retrieval of genomic annotation from data sets with on the order of 10^9 keys (in case of human genome). Would be nice if someone wrote a Haskell binding.
>
> Olaf
>
>> Hello,
>>
>> I have a question regarding 'Data.Map' api, filed an issue
>> https://github.com/haskell/containers/issues/708
>>
>> And may be I can ask here at the same time?
>>
>> I'm not sure why|Data.Map|doesn't have a key range based visiting API, I
>> figured out I can do it this way:
>>
>> |indexKeyRange :: IndexKey -> IndexKey -> Map IndexKey Object ->
>> [(IndexKey, Object)] indexKeyRange !minKey !maxKey = toList .
>> takeWhileAntitone (<= maxKey) . dropWhileAntitone (< minKey) |
>>
>> But wouldn't it save the computation needed to re-balance the
>> intermediate tree generated ? Or that re-balancing can be optimized out
>> actually ?
>>
>> I am creating an in-memory graph database, using|Data.Map.Strict.Map|as
>> business object indices with specified object attributes. The typical
>> scenario will be querying a small number of entries by key range, out of
>> possibly all business objects of a certain class globally, so the
>> implementation above would work, but not reasonable by far as it seems.
>>
>> I think a lazy list returned by mere node visiting (i.e. no new node
>> creation) would satisfy my needs, or I missed something ?
>>
>>
>> Thanks,
>>
>> Compl