[Haskell-cafe] Re: Processing of large files

Peter Simons simons at cryp.to
Wed Nov 3 07:51:38 EST 2004


John Goerzen writes:

 >> Given that the block-oriented approach has constant space
 >> requirements, I am fairly confident it would save memory.

 > Perhaps a bit, but not a significant amount.

I see.


 >> > [read/processing blocks] would likely just make the
 >> > code a lot more complex. [...]

 >> Either your algorithm can process the input in blocks or
 >> it cannot. If it can, it doesn't make one bit of
 >> difference whether you do I/O in blocks, because your
 >> algorithm processes blocks anyway.

 > Yes it does. If you don't set block buffering, GHC will
 > call read() separately for *every* single character.

I was referring to the alleged complication of the code, not
to whether the handle's 'BufferMode' influences performance.
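
For what it's worth, picking a buffer mode is a one-line
affair that leaves the algorithm itself untouched; something
like

  hSetBuffering h (BlockBuffering (Just 4096))

where the 4096-byte buffer size is an arbitrary choice.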


 > (I've straced stuff!)

How many read(2) calls does this code need?

  import System.IO
  import Control.Monad         ( when )
  import Foreign.Marshal.Array ( allocaArray, peekArray )
  import Data.Word             ( Word8 )

  main :: IO ()
  main = do
    h <- openBinaryFile "/etc/profile" ReadMode
    hSetBuffering h NoBuffering
    n <- fmap cast (hFileSize h)          -- file size in bytes
    buf <- allocaArray n $ \ptr -> do
      rc <- hGetBuf h ptr n               -- read n bytes into ptr
      when (rc /= n) (fail "huh?")
      buf' <- peekArray n ptr :: IO [Word8]
      return (map cast buf')
    putStr buf
    hClose h

  cast :: (Enum a, Enum b) => a -> b
  cast = toEnum . fromEnum
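
(To check the answer empirically, do what John did: run the
compiled program under strace, e.g. "strace -e trace=read
./a.out", substituting whatever the binary is called, and
count the read(2) calls.)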


 > It's a lot more efficient if you set block buffering in
 > your input, even if you are using interact and lines or
 > words to process it.

Of course it is. Which is why an I/O-bound algorithm should
process blocks: it's more efficient, and it uses slightly
less memory, too. Although I have been told that's not a
significant amount.
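
To make "processing in blocks" concrete, here is a minimal
sketch of the kind of loop I mean; it counts the newlines in
a file using one fixed 4 KB buffer, so its memory use is
constant no matter how large the input is. (The chunk size
and the helper names are arbitrary, of course.)

  import System.IO
  import Foreign.Marshal.Alloc ( allocaBytes )
  import Foreign.Ptr           ( Ptr )
  import Foreign.Storable      ( peekElemOff )
  import Data.Word             ( Word8 )

  -- Read the handle chunk by chunk, folding a newline count
  -- over each chunk as it arrives.
  countLines :: Handle -> IO Int
  countLines h = allocaBytes bufSize (loop 0)
    where
      bufSize = 4096                     -- arbitrary chunk size
      loop acc ptr = do
        rc <- hGetBuf h ptr bufSize      -- rc == 0 means EOF
        if rc == 0
          then return acc
          else do
            n <- countNl ptr rc 0 0
            loop (acc + n) ptr
      countNl :: Ptr Word8 -> Int -> Int -> Int -> IO Int
      countNl ptr len i n
        | i >= len  = return n
        | otherwise = do
            w <- peekElemOff ptr i
            countNl ptr len (i + 1) (if w == 10 then n + 1 else n)

  main :: IO ()
  main = do
    h <- openBinaryFile "/etc/profile" ReadMode
    n <- countLines h
    print n
    hClose h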

Peter


