[Haskell-cafe] Batteries included (Was: GHC is a monopoly compiler)
Richard A. O'Keefe
ok at cs.otago.ac.nz
Sun Oct 2 23:20:30 UTC 2016
On 30/09/16 7:17 PM, Joachim Durchholz wrote:
> There is a single standard representation.
[for strings in Java]
> I'm not even aware of a second one, and I've been programming Java for
> quite a while now
> Unless you mean StringBuilder/StringBuffer (that would be three String
> types then).
StringBuffer is just a synchronized version of StringBuilder.
> However, these classes are by no means "preferred" in
> practice: the vast majority of APIs demands and returns String objects.
The Java *compiler* prefers StringBuilder: when you write a string
concatenation expression in Java the compiler creates a StringBuilder
behind the scenes. I'm counting a class as "preferred" if the
compiler *has* to know about it and generates code involving it
without the programmer explicitly mentioning it.
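
To make that concrete, here is a rough sketch of what a javac of that
era effectively did with a "+" expression (later compilers, from Java 9
on, may use a different strategy). The class and method names are made
up for illustration, and the real output is bytecode, not source.

```java
class ConcatSketch {
    // What the programmer writes: no StringBuilder in sight.
    static String written(String name, int count) {
        return "Hello, " + name + " (" + count + ")";
    }

    // Roughly what an older javac generates for the expression above:
    // a StringBuilder the programmer never mentioned.
    static String desugared(String name, int count) {
        return new StringBuilder()
                .append("Hello, ")
                .append(name)
                .append(" (")
                .append(count)
                .append(")")
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(written("world", 3));
        System.out.println(desugared("world", 3));
    }
}
```

Both methods return the same string; the only difference is who wrote
the StringBuilder calls.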
>
> Even then, Java has its preferred string representation nailed down
> pretty strongly: a hidden array of 16-bit Unicode code points,
> referenced by a descriptor object (the actual String), immutable.
As already noted, that representation changed internally.
And that change is actually relevant to this thread.
The representation that _used_ to be used was
    (char[] array, offset, length, hash)
Amongst other things, this meant that taking a substring cost
O(1) time and O(1) space, because you just had to allocate and
initialise a new "descriptor object" sharing the underlying
array.
Since Java 1.7 (update 6, to be precise) the representation is
    (char[] array, hash)
Amongst other things, this means that taking a substring n
characters long now costs O(n) time and O(n) space.
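
The difference in substring cost can be illustrated with a minimal
sketch of the two layouts. The class names (SharedStr, CopiedStr) are
hypothetical and the cached hash field is left out; this is not the JDK
source, only a model of the two descriptors.

```java
import java.util.Arrays;

class StringLayouts {

    // Older layout: (char[] array, offset, length); hash omitted here.
    static final class SharedStr {
        final char[] array;
        final int offset;
        final int length;

        SharedStr(char[] array, int offset, int length) {
            this.array = array;
            this.offset = offset;
            this.length = length;
        }

        // O(1) time and space: a new descriptor over the same array.
        SharedStr substring(int begin, int end) {
            return new SharedStr(array, offset + begin, end - begin);
        }

        @Override
        public String toString() {
            return new String(array, offset, length);
        }
    }

    // Newer layout: (char[] array); every string owns its own array.
    static final class CopiedStr {
        final char[] array;

        CopiedStr(char[] array) {
            this.array = array;
        }

        // O(n) time and space: copies the n = end - begin characters.
        CopiedStr substring(int begin, int end) {
            return new CopiedStr(Arrays.copyOfRange(array, begin, end));
        }

        @Override
        public String toString() {
            return new String(array);
        }
    }

    public static void main(String[] args) {
        char[] chunk = "key=value".toCharArray();
        SharedStr shared = new SharedStr(chunk, 0, chunk.length).substring(4, 9);
        CopiedStr copied = new CopiedStr(chunk).substring(4, 9);
        System.out.println(shared + " shares the chunk: " + (shared.array == chunk));
        System.out.println(copied + " shares the chunk: " + (copied.array == chunk));
    }
}
```

The shared variant keeps the whole chunk reachable for as long as any
substring is alive; the copying variant pays for the copy up front but
lets the chunk be collected.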
If you are working in a loop like
    while (there is more input) {
        read a chunk of input
        split it into substrings
        process some of the substrings
    }
the pre-Java-1.7 representation is perfect.
If you *retain* some of the substrings, however, you
retain the whole chunk. That was easy to fix by
doing
    retain(new String(someSubstring))
instead of
    retain(someSubstring)
but you had to *know* to do it.
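
A small sketch of that idiom in context, assuming a hypothetical
long-lived list called retained and a made-up "id=" filter. On a JVM
with the old sharing representation the new String(...) call copies only
the characters of the substring; on later JVMs it is harmless but no
longer needed.

```java
import java.util.ArrayList;
import java.util.List;

class RetainSketch {
    // Hypothetical store that outlives the input chunks.
    static final List<String> retained = new ArrayList<>();

    static void retain(String s) {
        retained.add(s);
    }

    // Split one chunk and keep only the fields that look interesting.
    static void processChunk(String chunk) {
        for (String field : chunk.split(",")) {
            if (field.startsWith("id=")) {
                // Defensive copy, so that the retained string does not
                // pin the whole chunk under the old representation.
                retain(new String(field));
            }
        }
    }

    public static void main(String[] args) {
        processChunk("id=42,name=x,id=7");
        System.out.println(retained);
    }
}
```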
(Another solution would be to have a smarter
garbage collector that knew about string sharing and
could compact strings. I wrote such a collector for
XPL many years ago. It's quite easy to do a stop-and-
copy garbage collector that does that. But that's not
the state of the art in Java garbage collection, and
I'm not sure how well string compaction would fit into
a more advanced collector.)
The Java 1.7-and-later representation is *safer*.
Depending on your usage, it may either save a lot of
memory or bloat your memory use.
The point is that there is no one-size-fits-all string
representation; being given only one forces you to either
write your own additional representation(s) or to use a
representation which is not really suited to your
particular purpose.