[Haskell-cafe] Batteries included (Was: GHC is a monopoly compiler)

Mon Oct 3 06:39:22 UTC 2016

On 03.10.2016 at 01:20, Richard A. O'Keefe wrote:
>
> The Java *compiler* prefers StringBuilder:  when you write a string
> concatenation expression in Java the compiler creates a StringBuilder
> behind the scenes.  I'm counting a class as "preferred" if the
> compiler *has* to know about it and generates code involving it
> without the programmer explicitly mentioning it.

Then Haskell's preferred representation of additive types would be the 
updatable record.
Or machine integers are preferably stored in registers because that's 
where every new integer is created, RAM is second class...
I think that's stretching things too far.

There are more indicators against your theory:
1) During the lifetime of a program, the vast majority of textual data 
is stored in String objects. StringBuilders are just temporary and are 
discarded once the String object is built. (That's quantitative, not 
qualitative.)
2) The compiler does NOT have to know. Straight from the Java spec:
 > 15.18.1. [...] To increase the performance of repeated string
 > concatenation, a Java compiler may use the StringBuffer class or a
 > similar technique to reduce the number of intermediate String objects
 > that are created by evaluation of an expression.
Moreover, that entire paragraph is a non-normative remark in the spec.
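
To make the "may use" concrete, here is a minimal sketch of the two forms.
The class and method names (ConcatDemo, viaOperator, viaBuilder) are made
up for illustration, and the builder version is only one lowering a
compiler might choose; the spec does not require it.

```java
final class ConcatDemo {
    // What the programmer writes: plain concatenation on String values.
    static String viaOperator(String name, int count) {
        return "user=" + name + ", count=" + count;
    }

    // One possible lowering (an assumption, not mandated by the spec):
    // a temporary builder that is discarded once toString() has run.
    static String viaBuilder(String name, int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("user=").append(name).append(", count=").append(count);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaOperator("ann", 3));
        System.out.println(viaOperator("ann", 3).equals(viaBuilder("ann", 3)));
    }
}
```

Either way, the value that the rest of the program sees and stores is a
String; the builder never escapes the expression.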

>> Even then, Java has its preferred string representation nailed down
>> pretty strongly: a hidden array of 16-bit Unicode code points,
>> referenced by a descriptor object (the actual String), immutable.
>
> As already noted, that representation changed internally.

Yes, Java 7 changed that so that a substring no longer pins its parent's 
array, which prevents memory leaks from happening.

> And that change is actually relevant to this thread.

I have been thinking about that argument and do not think it is valid in 
a Java context. Java programmers are used to unexpected performance 
changes, mostly due to changes in the garbage collector.

It's also just a single function that changed behaviour, and definitely 
not the most common one even if it's pretty important.

> The representation that _used_ to be used was
>     (char[] array, offset, length, hash)
> Amongst other things,

Not really...

> this meant that taking a substring cost
> O(1) time and O(1) space, because you just had to allocate and
> initialise a new "descriptor object" sharing the underlying
> array.

"You" never had to. This all happened behind the scenes, as an 
implementation detail.
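
For readers who have not seen that layout, here is a rough sketch of a
descriptor-over-shared-array string. SharedString and its methods are
hypothetical names, not the JDK's actual code, and the cached hash field
is left out to keep it short.

```java
import java.util.Arrays;

// Hypothetical model of the (char[] array, offset, length) layout.
final class SharedString {
    private final char[] value;  // backing array, possibly shared
    private final int offset;    // start of this string within value
    private final int length;    // number of chars that belong to it

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int length) {
        this.value = value;
        this.offset = offset;
        this.length = length;
    }

    // O(1) time and space: a new descriptor over the same array.
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > length || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    // O(n) copy that detaches this string from the original array.
    SharedString compact() {
        char[] copy = Arrays.copyOfRange(value, offset, offset + length);
        return new SharedString(copy, 0, length);
    }

    // Size of the array this descriptor keeps reachable.
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, length);
    }
}
```

In this model, a five-character substring of a twelve-character string
still keeps all twelve characters reachable until compact() is called.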

> If you are working in a loop like
>    while (there is more input) {
>        read a chunk of input
>        split it into substrings
>        process some of the substrings
>    }
> the pre-Java-1.7 representation is perfect.
> If you *retain* some of the substrings, however, you
> retain the whole chunk.  That was easy to fix by
> doing
>        retain(new String(someSubstring))
> instead of
>        retain(someSubstring)
> but you had to *know* to do it.

Okay, now I get the point.
It's a pretty specialized kind of code though. Usually you don't care 
much about how much of some input you retain, because more than 50% of 
the input strings are retained anyway (if you even do retain strings).

It did have the potential for a memory leak, but now we're getting into 
a pretty special corner case here.
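
For reference, the corner case and its workaround look roughly like this.
RetainDemo and keepErrors are made-up names, and the "ERR" prefix is just
a stand-in for whatever decides that a substring is worth keeping.

```java
import java.util.ArrayList;
import java.util.List;

final class RetainDemo {
    // Split each chunk and keep only a few of the pieces.
    static List<String> keepErrors(List<String> chunks) {
        List<String> retained = new ArrayList<>();
        for (String chunk : chunks) {
            for (String field : chunk.split(",")) {
                if (field.startsWith("ERR")) {
                    // Defensive copy: with the old shared-array layout,
                    // this stopped the retained field from keeping the
                    // whole chunk alive.
                    retained.add(new String(field));
                }
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        chunks.add("ok,ERR1,ok");
        chunks.add("ERR2");
        System.out.println(keepErrors(chunks));
    }
}
```

On a JVM where substring already copies, the extra new String(...) is
redundant but harmless.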

Plus, it still does not change the fact that String is the standard 
representation in Java, not StringBuffer or byte[]. The programmer(!) 
isn't confused about which one to select, and that was the point 
originally made.

Diving into implementation details just to prove that wrong isn't going 
to change this: the impression that Java's string representations are 
confusing came from first impressions, not from actual practice.

> (Another solution would be to have a smarter
> garbage collector that knew about string sharing and
> could compact strings.  I wrote such a collector for
> XPL many years ago.  It's quite easy to do a stop-and-
> copy garbage collector that does that.  But that's not
> the state of the art in Java garbage collection,

Agreed.

> and
> I'm not sure how well string compaction would fit into
> a more advanced collector.)

Since Java's standard use case is long-running server programs, most if 
not all Java GCs are copying collectors nowadays. So, this would be a 
good fit in principle.
It might have unfavorable trade-offs with other use cases though. It's 
quite possible that they implemented this, benchmarked it, and found 
they couldn't get it up to competitive speed.

> The point is that there is no one-size-fits-all string
> representation; being given only one forces you to either
> write your own additional representation(s) or to use a
> representation which is not really suited to your
> particular purpose.

I haven't seen anybody complain about Java's string representation yet.
That does not mean that nobody does (I'm pretty sure that there are 
complaints); it just doesn't concern people much in practice. Most Java 
programmers don't deal with this: they use a library like JAXML or 
Jackson for parsing (XML and JSON, respectively), get good-enough 
performance, and move on.
Some people used to complain that 16-bit characters are a waste of 
memory, but even that isn't considered a big problem - essentially, the 
alternatives are out of sight and out of mind.
(It would be interesting to see what happened in a language where the 
standard string representation is UTF-8. Given that Unicode code points 
now range up to U+10FFFF, which needs 21 bits and therefore a surrogate 
pair in UTF-16, the UTF-16 advantage of "character count = storage cell 
count" has vanished anyway.)
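
A small sketch of that last point, using only standard String methods.
CodeUnitDemo and its helpers are illustrative names; the sample character
is U+1F600, written as an escaped surrogate pair.

```java
import java.nio.charset.StandardCharsets;

final class CodeUnitDemo {
    // Number of 16-bit storage cells (UTF-16 code units).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the same text takes when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String face = "\uD83D\uDE00";  // U+1F600, outside the BMP
        System.out.println(codeUnits(face));   // 2
        System.out.println(codePoints(face));  // 1
        System.out.println(utf8Bytes(face));   // 4
    }
}
```

So length() counts storage cells, not characters, as soon as the text
leaves the Basic Multilingual Plane.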