[Haskell-cafe] Batteries included (Was: GHC is a monopoly compiler)

Richard A. O'Keefe ok at cs.otago.ac.nz
Sun Oct 2 23:20:30 UTC 2016



On 30/09/16 7:17 PM, Joachim Durchholz wrote:
> There is a single standard representation.
[for strings in Java]
> I'm not even aware of a second one, and I've been programming Java for
> quite a while now
> Unless you mean StringBuilder/StringBuffer (that would be three String
> types then).

StringBuffer is just a synchronized version of StringBuilder.

> However, these classes are by no means "preferred" in
> practice: the vast majority of APIs demands and returns String objects.

The Java *compiler* prefers StringBuilder: when you write a string
concatenation expression in Java the compiler creates a StringBuilder
behind the scenes.  I'm counting a class as "preferred" if the
compiler *has* to know about it and generates code involving it
without the programmer explicitly mentioning it.
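
To make the "behind the scenes" point concrete, here is a minimal
sketch. The class and method names are made up for illustration, and
the second method is only roughly what a compiler of this era emits
for the first; the exact generated code varies by compiler version.

```java
// Illustrative only: ConcatDemo and its methods are hypothetical names.
final class ConcatDemo {
    // What the programmer writes: no StringBuilder mentioned anywhere.
    static String withPlus(String a, String b, int n) {
        return a + b + n;
    }

    // Roughly what the compiler generates for the expression above.
    static String withBuilder(String a, String b, int n) {
        return new StringBuilder().append(a).append(b).append(n).toString();
    }

    public static void main(String[] args) {
        System.out.println(withPlus("x=", "y", 3));     // prints x=y3
        System.out.println(withBuilder("x=", "y", 3));  // prints x=y3
    }
}
```

Both methods return the same string; the only difference is who
wrote the StringBuilder calls.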

>
> Even then, Java has its preferred string representation nailed down
> pretty strongly: a hidden array of 16-bit Unicode code points,
> referenced by a descriptor object (the actual String), immutable.

As already noted, that representation changed internally.
And that change is actually relevant to this thread.

The representation that _used_ to be used was
     (char[] array, offset, length, hash)
Amongst other things, this meant that taking a substring cost
O(1) time and O(1) space, because you just had to allocate and
initialise a new "descriptor object" sharing the underlying
array.

Since Java 1.7 (update 6, to be precise) the representation is
     (char[] array, hash)
Amongst other things, this means that taking a substring n
characters long now costs O(n) time and O(n) space.
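
The difference between the two layouts can be sketched with two
toy classes. These are hypothetical stand-ins, not the real
java.lang.String source: the names SharedStr and CopiedStr are
invented here, the cached hash field is left out, and bounds
checking is omitted for brevity.

```java
import java.util.Arrays;

final class LayoutDemo {
    // Stand-in for the old layout: (char[] array, offset, length).
    static final class SharedStr {
        final char[] array;   // possibly shared with other SharedStr values
        final int offset;
        final int length;

        SharedStr(char[] array, int offset, int length) {
            this.array = array;
            this.offset = offset;
            this.length = length;
        }

        // O(1) time and space: a new descriptor, the same array.
        SharedStr substring(int begin, int end) {
            return new SharedStr(array, offset + begin, end - begin);
        }

        @Override public String toString() {
            return new String(array, offset, length);
        }
    }

    // Stand-in for the newer layout: just the char[] array.
    static final class CopiedStr {
        final char[] array;   // exactly the characters of this string

        CopiedStr(char[] array) {
            this.array = array;
        }

        // O(n) time and space: the n characters go into a fresh array.
        CopiedStr substring(int begin, int end) {
            return new CopiedStr(Arrays.copyOfRange(array, begin, end));
        }

        @Override public String toString() {
            return new String(array);
        }
    }

    public static void main(String[] args) {
        char[] chunk = "hello, world".toCharArray();
        SharedStr s = new SharedStr(chunk, 0, chunk.length).substring(7, 12);
        CopiedStr c = new CopiedStr(chunk).substring(7, 12);
        System.out.println(s + " shares array: " + (s.array == chunk)); // world shares array: true
        System.out.println(c + " shares array: " + (c.array == chunk)); // world shares array: false
    }
}
```

The shared version keeps the whole 12-character array reachable
for as long as the 5-character substring lives; the copying
version does not, at the price of the copy.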

If you are working in a loop like
    while (there is more input) {
        read a chunk of input
        split it into substrings
        process some of the substrings
    }
the pre-Java-1.7 representation is perfect.
If you *retain* some of the substrings, however, you
retain the whole chunk.  That was easy to fix by
doing
        retain(new String(someSubstring))
instead of
        retain(someSubstring)
but you had to *know* to do it.
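
A minimal sketch of that loop in real Java, assuming the chunks
are comma-separated lines and that only the first field of each
is retained. RetainDemo and firstFields are hypothetical names
chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RetainDemo {
    // Retain the first comma-separated field of every chunk.
    static List<String> firstFields(List<String> chunks) {
        List<String> retained = new ArrayList<>();
        for (String chunk : chunks) {
            int comma = chunk.indexOf(',');
            String field = (comma < 0) ? chunk : chunk.substring(0, comma);
            // The defensive copy: under the old sharing representation
            // this detaches the field from the chunk's array, so the
            // chunk can be collected.  Under the newer representation
            // substring has already copied, so it is redundant but harmless.
            retained.add(new String(field));
        }
        return retained;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("a,b,c", "key=1,rest", "solo");
        System.out.println(firstFields(chunks)); // prints [a, key=1, solo]
    }
}
```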

(Another solution would be to have a smarter
garbage collector that knew about string sharing and
could compact strings.  I wrote such a collector for
XPL many years ago.  It's quite easy to do a stop-and-
copy garbage collector that does that.  But that's not
the state of the art in Java garbage collection, and
I'm not sure how well string compaction would fit into
a more advanced collector.)

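The Java 1.7-and-later representation is *safer*: a retained
substring can no longer keep a whole chunk alive by accident.
Depending on your usage, it may either save a lot of memory
(small substrings retained from large chunks) or bloat your
memory use (many substrings that could have shared one array).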


The point is that there is no one-size-fits-all string
representation; being given only one forces you either to
write your own additional representation(s) or to use a
representation which is not really suited to your
particular purpose.



