[jhc] Atoms, Infos and unique ids.

Thu Feb 21 17:23:56 EST 2008

On Thu, Feb 21, 2008 at 10:03:08PM +0100, Lemmih wrote:
> Atoms in general are fine but they're not the right tool for this
> particular job. We rarely use atoms for keys (I have the profiling
> information to back this up) and, when we do, using (hash,string) is
> ridiculously fast. The costs of using atoms include: broken properties
> (get . put = id), inefficiencies and obfuscated code. Given that I
> have significantly improved the performance of Jhc, I hope it carries
> some weight when I say that there are NO performance reasons for using
> atoms, neither in CPU time or memory usage.

Well, the main win is the unboxed Int# inside of TVr and being able to
use IdMaps and being able to locally generate unique names based on
current ones without refering to and keeping a global table up to date.
(see docs/conventions.txt for information on some of the ways unique
names are generate from existing ones)

Plus, Atoms feel like the aethetic correct choice to me. Uninterpreted
unique strings with fast identity, exactly what I want out of an
identifier. Even if they were just newtype Atom = Atom String. (A
perfectly valid, if inefficient, implementation)

> >  Things got a lot better in terms of space once I made them a custom
> >  binary implementation, that saved 7 bytes an atom which is often as long
> >  as the string itself.
> 
> That's like giving ice cream to a kid with tuberculosis: It's a a nice
> thing to do but curing the disease would be better.
> 
> Those 7 bytes would be completely irrelevant if we only saved unique
> strings once. There are about 9000 unique strings. A 7 byte overhead
> would be 63k. We currently waste 10,000k by saving ~100 copies of each
> unique string.

but this seems completely orthogonal to using atoms in general, any
system allowing string ids will need to store them somehow.

> >  In any case, I would want any solution to be completely independent of
> >  the fact that atoms are being used as identifiers in a programming
> >  intermediate langauge.
> 
> I couldn't agree more. I say, let's take it one step further and
> assign unique ids to named variables in the same way we do with
> unnamed variables. Each variable would have a 'Anonymous | Named Name'
> tag that would be used for pretty-printing.

We rarely have a list of all names in scope though. or piping one around
would be unwieldy. the 'name's' of named variables have semantic meaning
as well (as described in docs/conventions.txt). Plus, I am wary of
non-abstract types in general after some bad growing pains.

> >  My Atom type is a generally useful library I use
> >  other places. I would think this would involve a custom Binary monad
> >  that distributed and collected an 'atom environment' of sorts that would
> >  then be stored in a different chunk in the file, then whenever one
> >  wanted to store/retrieve an atom, they would just add an index into the
> >  table.
> 
> The only other usage I've found was in Stats.hs. And that's definitely
> not justified.

Again, this is more for aethetics and clear code intent (and incidental
performance gains, no matter how small, are always good). Atoms are what
I mean so they are what I use.

> I was referring to cases like DataConstructors.hs and LambdaLift.hs.
> In DataConstructor.hs it is trivial to assign a new id since it only
> has to be locally unique.

Yes. If you are refering to the using [2,4 ..] there, this is one of the
last places that the Id abstraction escapes. once I clear that up I can
finally fully abstract Id. Yay.

> In LambdaLift.hs, on the other hand, I can't tell whether shadowing is
> OK, whether reusing the old id is OK, or where to get a set free
> variables if a unique id is required. Some documentation about what
> the code tries to do would be very helpful.

I believe the current lambda lifting algorithm requires fully globally
unique names for everything. shadowing would not be okay as definitons
might be lifted to the same level. it is tricky. I would have to reread
it to find out for sure. As with most optimization passes in jhc, it is
not my first implementation and I don't expect it to be the last. (for a
while I had 4 different strictness analyzers to choose from :) )

At the moment my development tree can't compile anything (without
-fno-rules)  because I am in the middle of completely rewriting the
rules mechanism. (they will no longer be carried in the Info nodes at
all) and supercombinators will be logically distinct from let bound
values. treating the whole program as a simple mutally recursive let
binding is quite elegant, but i have been holding onto that ideal for
way to long and have had to go through too many crazy contortions to
keep up that unrealistic world view. I already had to have special cases
in the type system (top level bindings can have superkind ##, local ones
cannot) among other places. So I'd like to hold off on any non localized
changes for the time being.

In any case, I don't want to do anything with Ids at the moment other
than making them more abstract, I am trying to get 0.5 out the door and
I am really uncomfortable making multiple fundamental changes at once,
especially to something as so very tricky as id naming. 

If you want crazy obfuscated code in pursuit of performance, look at
ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta
is actually what I target). avoiding things like explicit unboxed types
and try to keep all strangeness encapsulated behind abstract types with
well defined interfaces. ghc lobs around Int#'s like candy :)

BTW. I also have been working on a full honest to goodness manual for
jhc. my patches implementing it are just starting to appear in the
repository.  The rough idea is you can create comments anywhere (or in
separate files) that start with

{- at Section

and contain markdown formatted text and they will be stitched together
to form the manual. It is filling out nicely.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈