[Haskell-cafe] Need help - my haskell code is over 50 times slower than equivalent perl implementation

Tue Jun 24 17:13:07 UTC 2014

On Tue, 24 Jun 2014 21:55:21 +0530, C K Kashyap <ckkashyap at gmail.com> wrote:
> Again, I doubt about String since even if I change the number of files to
> 12000 - the perl program finishes in less than a second.

There are just so many things wrong with String and the String API,
these are all colliding:

1. String is based on Char, which is a horribly inefficient
representation of characters (a tagged machine word, possibly 64 bit,
representing the UCS code point)

2. String is based on [], which are lazy linked lists - For each single
character, you have a heap-allocated pointed to a heap-allocated Char
and the heap-allocated rest of the list; traversing the String to
compute its length involves going through the entire heap, dereferencing
after every single step, and pattern matching on the (:) constructors
(which also requires a branch), garbage collecting the unused
intermediate structures along the way.

3. readFile is based on lazy IO, which means it opens the Handle and
leaves it dangling, then returns some values that, when forced, will
dynamically cause it to read some more bytes from the disk, allocate
heap objects for them, write them in there, and leave the Handle
dangling again. Only when you reach the end of the file does the Handle
eventually automatically get garbage collected and the file therefore
closed. (In theory; no practical guarantees on when it actually gets
closed)

4. All the while, your program is churning through the files like mad,
creating more and more dangling pointers that are just precariously
hanging around all over the place.

Contrast this to what the simple change to B.readFile and B.length does:

ByteString is basically a pointer to some array of bytes in memory,
together with its length. all B.readFile has to do is read in the bytes
1:1 without allocating anything other than a single fixed sized buffer
to hold them all, the size of which is given by the file size.
All B.length has to do is return the already-known size of the buffer.
After the handle is opened and the contents read in, it's immediately
closed. No garbage collection necessary other than eventually freeing up
the buffer used to store its contents - a single operation. (In addition
to the pointer itself, which is stored in the heap)

As you can see, there is a complete world of difference between the two
approaches on many levels - runtimes differing by one or even many
orders of magnitude are no surprise. Some would go as far as to say that
the String based file I/O APIs should be removed completely.

Either way, you could make your operation even more efficient by not
opening the files at all and just consulting their length through
filesystem APIs.