[Haskell-beginners] Space leak while reading from a file?

Jan Snajder jan.snajder at fer.hr
Sun May 11 11:24:39 UTC 2014


Dear all,

I'm trying to implement a simple file-based database. I apparently have
a space leak, but I have no clue where it comes from.

Here's the file-based database implementation:
http://pastebin.com/QqiqcXFw

The idea is to have a database table in a single text file. One line
equals one table row. The fields within a row are whitespace-separated,
and the first field is the key. Because I'd like to work with large
files, I don't want to load the whole file into memory. Instead, I'd
like to be able to fetch rows on demand, by key. To that end, I first
build an index that maps keys to file offsets, and I use ReaderT to
carry the index on top of the IO monad.
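For concreteness, the index construction I have in mind is roughly the
following (just a sketch, not my actual code, which is in the pastebin
below; buildIndex is a hypothetical name, and I assume a strict
Data.Map from keys to byte offsets):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Map.Strict as M
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO

-- Build a map from each row's key (its first field) to the byte
-- offset at which the row starts, without retaining the rows themselves.
buildIndex :: FilePath -> IO (M.Map T.Text Integer)
buildIndex f = withFile f ReadMode (go M.empty)
  where
    go !m h = do
      eof <- hIsEOF h
      if eof
        then return m
        else do
          pos  <- hTell h            -- offset of the line about to be read
          line <- TIO.hGetLine h
          case T.words line of
            (key:_) -> go (M.insert key pos m) h
            []      -> go m h        -- skip blank lines
```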

For testing, I use a dummy table produced as follows:

import System.IO
import Text.Printf
import Control.Monad

row = unwords [printf "field%03d" (i::Int) | i <- [1..999]]

main = do
  forM_ [1..250000] $ \i ->
     putStrLn $ printf "row%06d %s" (i::Int) row

This generates a 2.1 GB text file, which I store on disk.
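A quick back-of-envelope check of that size (each "fieldNNN" field is 8
characters, and the "rowNNNNNN " prefix is 10):

```haskell
-- Back-of-envelope check of the generated file size.
rowBytes, fileBytes :: Int
rowBytes  = 10 + 999 * 8 + 998 + 1  -- prefix + 999 fields + 998 spaces + newline
fileBytes = 250000 * rowBytes       -- 2,250,250,000 bytes, i.e. about 2.1 GiB
```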

The testing code:

import FileDB
import qualified Data.Text as T
import Text.Printf
import Control.Applicative
import Control.Monad
import Control.Monad.Trans
import System.IO
import System.Environment

main = do
  (f:_) <- getArgs
  t <- openTable f
  runDB t $ do
    ks <- getKeys
    liftIO $ do
      putStrLn . printf "%d keys read" $ length ks
      putStrLn "Press any key to continue..."
      getChar
    forM_ ks $ \k -> do
      Just r <- getRow k
      liftIO . putStrLn $ printf "Row \"%s\" has %d fields"
        (T.unpack k) (length r)

When I run the test on the 2.1 GB file, the program ends up consuming
10 GB of memory.

About 6 GB seem to be allocated after the index is built (just before
entering the forM_ loop). The remaining 4 GB are allocated while
fetching the rows.

I find both things difficult to explain.

6GB seems too much for the index. Each key is 9 characters (stored as
Data.Text), and I have 250K such keys in a Data.Map. Should this really
add up to 6GB?
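My rough estimate of what the index should cost (ballpark per-object
sizes on a 64-bit GHC, not measured, so take the constants with a grain
of salt):

```haskell
-- Ballpark per-entry cost of the index on a 64-bit GHC:
-- one 9-character Data.Text key plus one Data.Map interior node.
textBytes, nodeBytes, entryBytes, indexBytes :: Int
textBytes  = 6 * 8 + 2 * 9       -- Text constructor + array header + payload, ~66 bytes
nodeBytes  = 6 * 8               -- Bin node: header, size, key, value, two children
entryBytes = textBytes + nodeBytes
indexBytes = 250000 * entryBytes -- ~28 MB, orders of magnitude below 6 GB
```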

Also, I have no idea why fetching the rows, one by one, should consume
any additional memory. Each row is fetched, its length is computed and
printed, and the row is never used again. I see no reason for the rows
to be retained in memory.
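For reference, the per-key fetch I would expect to run in constant
space looks roughly like this (a sketch with a hypothetical fetchRow;
the real getRow is in the pastebin):

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Control.Applicative
import System.IO

-- Seek to a row's recorded byte offset and read just that one line;
-- nothing else from the file should be forced or retained.
fetchRow :: Handle -> Integer -> IO [T.Text]
fetchRow h off = do
  hSeek h AbsoluteSeek off
  T.words <$> TIO.hGetLine h
```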

Here's the memory allocation summary:

> 1,093,931,338,632 bytes allocated in the heap
>    2,225,144,704 bytes copied during GC
>    4,533,898,000 bytes maximum residency (26 sample(s))
>    3,080,926,336 bytes maximum slop
>            10004 MB total memory in use (0 MB lost due to fragmentation)
>
>                                     Tot time (elapsed)  Avg pause  Max pause
>   Gen  0     2171739 colls,     0 par   45.29s   45.26s     0.0000s    0.0030s
>   Gen  1          26 colls,     0 par    1.50s    1.53s     0.0589s    0.7087s
>
>   INIT    time    0.00s  (  0.00s elapsed)
>   MUT     time  279.92s  (284.85s elapsed)
>   GC      time   46.80s  ( 46.79s elapsed)
>   EXIT    time    0.68s  (  0.71s elapsed)
>   Total   time  327.40s  (332.35s elapsed)
>
>   %GC     time      14.3%  (14.1% elapsed)
>
>   Alloc rate    3,908,073,170 bytes per MUT second
>
>   Productivity  85.7% of total user, 84.4% of total elapsed


Btw., I don't get the "bytes allocated in the heap" figure, which is
approx. 1000 GB (?).

I'm obviously doing something wrong here. I'd be thankful for any help.

Best,
Jan

