[Haskell-cafe] stream interface vs string interface: references

Tue Sep 3 08:36:37 CEST 2013

On 3/09/2013, at 5:17 PM, damodar kulkarni wrote:
> I didn't want to clutter that thread so I am asking a question here.
> Where do I find foundational and/or other good references on the topic of "stream interface vs string interface to convert objects to text"? I tried google but failed. 

It has to do with the cost of string concatenation.

Let's take a simple thing.
A Set of Strings.

Smalltalk has

    String
      printOn: aStream
        aStream nextPut: $'.
        self do: [:each |
          each = $' ifTrue: [aStream nextPut: $'].
          aStream nextPut: each].
        aStream nextPut: $'.

(Smalltalk uses '' for strings with quote doubling for embedded
quotes and has no character escapes.  s nextPut: c writes character
c to stream s.  do: [:...] is a loop.)

    Set
      printOn: aStream
        |started|
        started := false.
        aStream nextPutAll: 'a Set ('.
        self do: [:each |
          started ifTrue: [aStream space] ifFalse: [started := true].
          each printOn: aStream].
        aStream nextPut: $'.

    Object
      printString
        |stream|
        stream := WriteStream on: (String new: 40).
        self printOn: stream.
        ^stream contents

(A WriteStream is an internal stream.  It starts with the
array-like object you give it and grows it, typically by
doubling, as needed.  `contents' returns the part actually
written to.)

Let's actually do something subtly different.
Each Set will contain the printString of a number
and also another set, so
   a Set('3' a Set('2' a Set('1' a Set())))

    s := Set new.
    1 to: 1000*1000 do: [:i |
        s := Set with: s with: i printString].
    s printOn: Transcript.

The set having been created, how much allocation is done
by the output phase?   *NONE*.  And the time is *LINEAR*
in the size of the output.

To summarise: Smalltalk makes "append print version to output stream"
the basic form and "obtain print version as a string" a derived form.
The result is that printing (acyclic) objects of any size takes time
linear in the size of the output.

Now consider the Java version.
Java makes "obtain print version as a string" the basic form and
"append print version to output stream" a derived form.

There's a nasty little wrinkle which is that "foo".toString() is
"foo" instead of the expected "\"foo\"" in Java.  So the output
will be [3, [2, [1, []]] or something like that.

    String {
        String toString() {
            return this;
        }
    }

    HashSet {
        String toString() {
	    StringBuilder b = new StringBuilder();
	    boolean started = false;
	    b.append("[");
	    for (Object o: this) {
		if (started) b.append(", "); else started = true;
		b.append(o.toString());
	    }
	    b.append("]");
	    return b.toString();
	}
    }

This takes *QUADRATIC* time, but it's annoyingly hard to demonstrate
because it keeps crashing with a stack overflow for even quite small n.
Thankfully, the -Xss option comes to the rescue.

    n		 time (seconds)
  100		 0.16
  200		 0.18
  500		 0.22
 1000		 0.30
 2000		 0.62
 5000		 2.58
10000           12.08
20000		51.54

Not only does it take an obscene amount of time to print a large
object, it turns over a totally unwarranted amount of space.

Now you might object that sets like this are not realistic.
If you are trying to write large circuits or abstract syntax
trees or the like to a file, or using this *style* even if
not this specific *interface* to write XML documents, the
example errs by being unrealistically *easy* for Java.
It's easy to understand what's going wrong.

Consider [2, [1, []]]
When visiting {}, we create "[]".  So far so good.
When visiting {1, {}}, we *copy* the "[]" into "[1, []]".
When visiting {2, {1, {}}}, we *copy* the "[1, []]" into
"[2, [1, []]]".
And so it goes.  This does an awful lot of copying and
*none* of it is needed given a sane interface.

What is the take of Haskell on this?
There is a *reason* 'shows' exists.

	newtype Date = Date (Int,Int,Int)
	instance Show Date
	  where showsPrec _ (Date (y,m,d)) rest =
	     "Date (" ++ shows y ("," ++ shows m ("," ++ shows d (")" ++ rest)))

The only uses of ++ here are where the left operand is a short literal.
Using this approach, Haskell programs can produce strings in linear time
and linear space.

For lazy I/O, using shows in Haskell is a good analogue of using
#printOn: in Smalltalk.  The basic form is "include this as PART of
a stream", with "convert this to a whole string" as a derived form.

What the equivalent of this would be for Iteratees I don't yet
understand.