[jhc] Re: Ids in Jhc.

Lemmih lemmih at gmail.com
Thu Feb 14 05:08:08 EST 2008


On Thu, Feb 14, 2008 at 4:36 AM, John Meacham <john at repetae.net> wrote:
> On Thu, Feb 14, 2008 at 01:23:08AM +0100, Lemmih wrote:
>  > I'd like to fix it in another way, if you have no objections.
>  >
>  > By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms
>  > are each stored ~2344 times (281338 atoms in total).
>  > Compressing the files with gzip reduces the negative effect on file
>  > size, but the overhead of parsing 281k atoms is still there.
>  > On my system, testing with HelloWorld.hs, parsing those atoms took 16%
>  > of the runtime and 27% of the memory usage.
>  >
>  > I propose scrapping the current system of random atom ids in favor of
>  > a per-module map of symbols.
>  > Using an ADT for the ids would make the system more transparent. I'll
>  > have to investigate the performance impact of this.
>
>  What do you mean? Just for saving binary files? It already converts them
>  to an ADT on the way out and back in, since atom numbers can be different
>  between different runs of the program.

I mean something like: data Id = Phantom | Unnamed Int | Named Atom
Giving special meaning to numbers seems like a hack. Optimizations
should not come at the expense of readability.
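
For concreteness, here is a rough sketch of what I have in mind. The
constructor names are the ones above; the deriving clause and the
comments are only guesses at what we would want:

  -- Sketch only; assumes Atom already has Eq and Ord instances.
  data Id
      = Phantom         -- placeholder id with no name attached
      | Unnamed !Int    -- compiler-generated unique id
      | Named Atom      -- id carrying its source-level name
      deriving (Eq, Ord)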

Indexing atoms by a random number is also something that can and
should be avoided.

>  If you mean using an ADT internally for ids,
>  stopping doing that is what brought memory usage from gigabytes to
>  megabytes. The text of the atoms only needs to be stored in a single
>  place. Also, the fact that they are Ints means they can be unboxed in
>  datatypes, meaning the garbage collector does not even have to care
>  about them, another huge gain.

Only keeping a single instance of each unique string in memory is
obviously the way to go. I'm not arguing against that.
However, given that there are only ~9000 unique strings in total,
unboxing is unlikely to make a measurable difference.
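
To be clear about what the unboxing buys: with a bare Int a strict
field can be unpacked into its constructor, which is not possible once
the id is a multi-constructor ADT. A tiny illustration (the record and
field names are made up):

  -- The id is stored inline in the constructor and the garbage
  -- collector never has to follow a pointer for it.
  data Occ = Occ
      { occId    :: {-# UNPACK #-} !Int
      , occCount :: {-# UNPACK #-} !Int
      }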

>  Also, IntSet/IntMap are orders of magnitude faster than Set and Map, and
>  sets/maps of ids are the bread and butter of optimization passes.

IntSet and IntMap are certainly very fast. However, how that translates
into overall CPU time and memory usage is not known. Readability may
have been sacrificed for no more than a few percent of improvement.
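
If we do go the ADT route, this is roughly the comparison that would
need measuring (the type synonyms are made up, Id as sketched above):

  import qualified Data.IntSet as IntSet
  import qualified Data.Set    as Set

  -- Ids as raw Ints: a patricia tree keyed directly on the bits.
  type FreeVarsInt = IntSet.IntSet

  -- Ids as the proposed ADT: Data.Set calls compare on Id at every node.
  type FreeVarsAdt = Set.Set Id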

>  my only issue with them is that it is
>
>  type Id = Int
>  instead of
>  newtype Id = Id Int
>
>  I have been working on the binary format with my recent changes; it is
>  now much less of an issue than it used to be in terms of speed and
>  memory usage. I modified it to use bytestrings and split the file into
>  various 'chunks' that can be lazily loaded independently, so when it
>  only needs the dependency information, that is all it has to pull out,
>  and so forth. I also put a hard limit of 256 bytes on the length of
>  atoms, which means I can chop 7 bytes off the storage of each one.

Shaving 7 bytes off seems weird when each atom is stored 100 times on
disk. Also, using a single byte to store the length seems an awful lot
like only storing the last two digits of the year. I'm more interested
in algorithmic optimizations: optimizations that make the code easier
to use and read, and thereby avoid silly things like storing duplicate
atoms on disk.
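
To sketch what I mean by a per-module map of symbols (every name below
is made up): each unique atom is written to the file exactly once, and
every occurrence is serialised as a small index into that table.

  import qualified Data.Map as Map

  -- One table per module: the unique names in order, plus a map for
  -- interning. References are written as Ints, not repeated strings.
  data SymbolTable = SymbolTable
      { symNames :: [String]            -- each unique name, stored once
      , symIndex :: Map.Map String Int  -- name -> position in symNames
      }

  intern :: String -> SymbolTable -> (Int, SymbolTable)
  intern name tbl@(SymbolTable names idx) =
      case Map.lookup name idx of
          Just i  -> (i, tbl)
          Nothing -> let i = Map.size idx
                     in (i, SymbolTable (names ++ [name]) (Map.insert name i idx))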

>  By far the biggest win was in Info/Info, which used to store the type of
>  each field alongside it as a string; now it stores a hash of the type as
>  a number, which reduced the file size (after compression) by about 30%.

Info seems a bit scary to me. It is not garbage collected, so it is up
to the user to ensure that references are released as early as
possible.
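
For what it's worth, the type-hashing idea mentioned above is easy to
sketch independently of Info; this just hashes the printed TypeRep with
a made-up djb2-style string hash:

  import Data.Char     (ord)
  import Data.List     (foldl')
  import Data.Typeable (Typeable, typeOf)
  import Data.Word     (Word32)

  -- Instead of serialising show (typeOf x) next to every field, store
  -- a small fixed-size hash of that string.
  typeHash :: Typeable a => a -> Word32
  typeHash = foldl' step 5381 . show . typeOf
    where step h c = h * 33 + fromIntegral (ord c)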


Btw, I'm trying to offer constructive criticism. I'm interested in
actually writing code, not just pointing out weak spots.

-- 
Cheers,
 Lemmih

