[Haskell-cafe] Batteries included (Was: GHC is a monopoly compiler)
Joachim Durchholz
jo at durchholz.org
Mon Oct 3 06:39:22 UTC 2016
On 03.10.2016 at 01:20, Richard A. O'Keefe wrote:
>
> The Java *compiler* prefers StringBuilder: when you write a string
> concatenation expression in Java the compiler creates a StringBuilder
> behind the scenes. I'm counting a class as "preferred" if the
> compiler *has* to know about it and generates code involving it
> without the programmer explicitly mentioning it.
Then Haskell's preferred representation of additive types would be the
updatable record.
Or machine integers are preferably stored in registers because that's
where every new integer is created, RAM is second class...
I think that's stretching things too far.
There are more indicators against your theory:
1) During the lifetime of a program, the vast majority of textual data
is stored in String objects. StringBuilders are just temporary and are
discarded once the String object is built. (That's quantitative, not
qualitative.)
2) The compiler does NOT have to know. Straight from the Java spec:
> 15.18.1. [...] To increase the performance of repeated string
> concatenation, a Java compiler may use the StringBuffer class or a
> similar technique to reduce the number of intermediate String objects
> that are created by evaluation of an expression.
Moreover, the entire paragraph is a non-authoritative remark.
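To illustrate the "may" in that paragraph, here is a minimal sketch (the class and method names are made up for this example). Both methods return the same string; the second spells out roughly what javac releases up to Java 8 generate for the first, which the spec permits but does not require.

```java
// Sketch only: ConcatDemo, viaPlus and viaBuilder are hypothetical names.
class ConcatDemo {
    // Source-level concatenation; how this is compiled is left to the compiler.
    static String viaPlus(String name, int count) {
        return "item " + name + " x" + count;
    }

    // Roughly the hand-written equivalent of what javac up to Java 8 emits
    // for viaPlus: one temporary builder, one final String.
    static String viaBuilder(String name, int count) {
        return new StringBuilder()
                .append("item ")
                .append(name)
                .append(" x")
                .append(count)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(viaPlus("bolt", 3));
        System.out.println(viaBuilder("bolt", 3));
    }
}
```

Either way, the programmer only ever sees String values; the builder is a temporary.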
>> Even then, Java has its preferred string representation nailed down
>> pretty strongly: a hidden array of 16-bit Unicode code points,
>> referenced by a descriptor object (the actual String), immutable.
>
> As already noted, that representation changed internally.
Yes, Java 7 changed that to prevent memory leaks from happening.
> And that change is actually relevant to this thread.
I have been thinking about that argument and do not think it is valid in
a Java context. Java programmers are used to unexpected performance
changes, mostly due to changes in the garbage collector.
It's also just a single function that changed behaviour, and definitely
not the most common one even if it's pretty important.
> The representation that _used_ to be used was
> (char[] array, offset, length, hash)
> Amongst other things,
Not really...
> this meant that taking a substring cost
> O(1) time and O(1) space, because you just had to allocate and
> initialise a new "descriptor object" sharing the underlying
> array.
"You" never had to. This all happened behind the scenes, as an
implementation detail.
> If you are working in a loop like
> while (there is more input) {
> read a chunk of input
> split it into substrings
> process some of the substrings
> }
> the pre-Java-1.7 representation is perfect.
> If you *retain* some of the substrings, however, you
> retain the whole chunk. That was easy to fix by
> doing
> retain(new String(someSubstring))
> instead of
> retain(someSubstring)
> but you had to *know* to do it.
Okay, now I get the point.
It's a pretty specialized kind of code though. Usually you don't care
much about how much of some input you retain, because more than 50% of
the input strings are retained anyway (if you even do retain strings).
It did have the potential for a memory leak, but now we're getting into
a pretty special corner case here.
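For readers who have not seen the old layout, the retention effect can be sketched with a small stand-in class. This is an assumption-laden model, not java.lang.String: SharedSlice, substring and compact are hypothetical names that only mimic the (char[] array, offset, length) descriptor described in the quote.

```java
import java.util.Arrays;

// Hypothetical model of the old (char[] array, offset, length) layout.
// This is NOT java.lang.String; every name here is made up for illustration.
final class SharedSlice {
    final char[] array;  // backing storage, possibly shared with other slices
    final int offset;    // start of this slice within the array
    final int length;    // number of chars in this slice

    SharedSlice(char[] array, int offset, int length) {
        this.array = array;
        this.offset = offset;
        this.length = length;
    }

    // O(1): allocates only a new descriptor and shares the backing array.
    SharedSlice substring(int begin, int end) {
        if (begin < 0 || end > length || begin > end) {
            throw new IndexOutOfBoundsException();
        }
        return new SharedSlice(array, offset + begin, end - begin);
    }

    // O(length): copies only the chars in use, so the large array can be
    // collected. Plays the role of new String(someSubstring) in the quote.
    SharedSlice compact() {
        char[] copy = Arrays.copyOfRange(array, offset, offset + length);
        return new SharedSlice(copy, 0, length);
    }

    @Override
    public String toString() {
        return new String(array, offset, length);
    }

    public static void main(String[] args) {
        SharedSlice chunk = new SharedSlice(new char[1_000_000], 0, 1_000_000);
        SharedSlice key = chunk.substring(10, 14);
        // Size of the array kept alive by the shared slice, then by the copy.
        System.out.println(key.array.length);
        System.out.println(key.compact().array.length);
    }
}
```

Retaining key keeps the whole chunk reachable; retaining key.compact() keeps only the chars actually used.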
Plus, it does not change the fact that String is the standard
representation in Java, not StringBuffer or byte[]. The programmer(!)
isn't confused about which one to select, and that was the point
originally made.
Diving into implementation details to prove that wrong doesn't change
that the claim that Java's string representations are confusing came
from first impressions without actual practice.
> (Another solution would be to have a smarter
> garbage collector that knew about string sharing and
> could compact strings. I wrote such a collector for
> XPL many years ago. It's quite easy to do a stop-and-
> copy garbage collector that does that. But that's not
> the state of the art in Java garbage collection,
Agreed.
> and
> I'm not sure how well string compaction would fit into
> a more advanced collector.)
Since Java's standard use case is long-running server programs, most if
not all Java GCs are copying collectors nowadays. So, this would be a
good fit in principle.
It might have unfavorable trade-offs with other use cases though. It's
quite possible that they implemented this, benchmarked it, and found
they couldn't get it up to competitive speed.
> The point is that there is no one-size-fits-all string
> representation; being given only one forces you to either
> write your own additional representation(s) or to use a
> representation which is not really suited to your
> particular purpose.
I haven't read anybody complain about Java's string representation yet.
That does not mean that nobody does (I'm pretty sure that there are
complaints); it just doesn't concern people much in practice. Most Java
programmers don't deal with this: they use a library like JAXB or
Jackson for parsing (XML and JSON, respectively), get good-enough
performance, and move on.
Some people used to complain that 16-bit characters are a waste of
memory, but even that isn't considered a big problem - essentially, the
alternatives are out of sight and out of mind.
(It would be interesting to see what happens in a language where the
standard string representation is UTF-8. Given that Unicode code points
now need up to 21 bits, which is more than one 16-bit cell can hold, the
UTF-16 advantage of "character count = storage cell count" has vanished
anyway.)
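The mismatch between storage cells and characters is easy to observe in Java itself. A minimal sketch, with a made-up class name and helper names, comparing UTF-16 code units, code points, and UTF-8 bytes:

```java
import java.nio.charset.StandardCharsets;

// Sketch only: CodeUnitDemo and its helpers are hypothetical names.
class CodeUnitDemo {
    // Number of 16-bit storage cells (UTF-16 code units).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the same text would occupy in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String bmp = "caf\u00e9";                              // fits in 16-bit cells
        String astral = new String(Character.toChars(0x1F600)); // needs a surrogate pair
        for (String s : new String[] { bmp, astral }) {
            System.out.println(codeUnits(s) + " units, "
                    + codePoints(s) + " code points, "
                    + utf8Bytes(s) + " UTF-8 bytes");
        }
    }
}
```

For text inside the Basic Multilingual Plane the two counts agree; for anything beyond it, length() counts cells rather than characters.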