[Haskell-cafe] A round of golf

Thu Sep 18 16:33:37 EDT 2008

On Thu, Sep 18, 2008 at 12:31 PM, Creighton Hogg <wchogg at gmail.com> wrote:
> On Thu, Sep 18, 2008 at 1:55 PM, Don Stewart <dons at galois.com> wrote:
>> wchogg:
>>> On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart <dons at galois.com> wrote:
> <snip>
>>> > This makes me cry.
>>> >
>>> >    import System.Environment
>>> >    import qualified Data.ByteString.Lazy.Char8 as B
>>> >
>>> >    main = do
>>> >        [f] <- getArgs
>>> >        s   <- B.readFile f
>>> >        print (B.count '\n' s)
>>> >
>>> > Compile it.
>>> >
>>> >    $ ghc -O2 --make A.hs
>>> >
>>> >    $ time ./A /usr/share/dict/words
>>> >    52848
>>> >    ./A /usr/share/dict/words 0.00s user 0.00s system 93% cpu 0.007 total
>>> >
>>> > Against standard tools:
>>> >
>>> >    $ time wc -l /usr/share/dict/words
>>> >    52848 /usr/share/dict/words
>>> >    wc -l /usr/share/dict/words 0.01s user 0.00s system 88% cpu 0.008 total
>>>
>>> So both you & Bryan do essentially the same thing and of course both
>>> versions are far better than mine.  So the purpose of using the Lazy
>>> version of ByteString was so that the file is only incrementally
>>> loaded by readFile as count is processing?
>>
>> Yep, that's right
>>
>> The streaming nature is implicit in the lazy bytestring. It's kind of
>> the dual of explicit chunkwise control -- chunk processing reified into
>> the data structure.
>
> To ask an overly general question, if lazy bytestring makes a nice
> provider for incremental processing are there reasons to _not_ reach
> for that as my default when processing large files?

Yes.  The main time is when you "accidentally" force the whole file
(or at least large parts of it) into memory at the same time.
Profiling and careful programming seem to be the workarounds, but in a
large application the "careful programming" part can become
prohibitively expensive.  This is due to the sometimes subtle nature
of how strictness composes with laziness.  This is a the result of a
more general issue that it is non-obvious how your program is
evaluated at run-time thanks to lazy evaluation, thus making lazy
evaluation act as a double edged sword at times.  I'm not saying get
rid of lazy eval, but occasionally it presents problems for efficiency
and diagnosing efficiency problems.

The rule seems to be:  Write correct code first, fix the problems
(usually just inefficiencies) later.

Using lazy bytestrings makes it easier to write concise code that is
more easily inspected for correctness.  Perhaps it is even easier to
test such code, but I'm skeptical of that.  Thus, I think most people
here would agree that reaching first for lazy byte string is preferred
over other techniques.  Plus, the one of the most common fixes to
inefficient haskell programs is to make them lazy in the right places
and strict in key places and using lazy bytestring will get you part
of the way to that refactoring usually.

Jason