[Haskell-beginners] Memory usage problem

Patrick LeBoutillier patrick.leboutillier at gmail.com
Sun Mar 21 20:28:23 EDT 2010


Sami,

I'm no profiling expert, but I have a few questions though:

- What is the size (in bytes) of your input file?
- Also, does memory usage improve if you remove the "sort"?


Patrick

On Sun, Mar 21, 2010 at 7:23 PM, Sami Liedes <sliedes at cc.hut.fi> wrote:
> Hi!
>
> I wrote a simple program to parse Debian package list files (a simple
> text record format), sort the list of packages and output the list. I
> first wrote the program in C (which I speak fluently), but wanting to
> learn Haskell, have been trying to reimplement it in Haskell.
>
> The problem is that my 60-line Haskell program, which doesn't yet do
> everything the C program does, uses huge amounts of memory, 1.5
> gigabytes when parsing a single file. The C implementation can parse
> and sort more than twice that data in less than 10 megabytes. I can't
> figure out where the memory goes or how to fix it, but my guess would
> be it's some lazy computation thunk.
>
> (See below after the code for description of the input and output
> formats.)
>
> With a package file of 28000 records (packages), the memory usage is
> roughly 1.5 gigabytes when compiled with ghc - under hugs it just
> bails out quickly after stack overflow - and I'd appreciate help in
> figuring out
>
> 1) Where the memory is actually spent?
> 2) How would one usually go about figuring this out?
>
> and
>
> 3) How to fix it?
> [4) I'd also love style hints and other ideas to make my code more idiomatic]
>
> I've tried to randomly add some $! operators to the code (that doesn't
> feel right :-), but so far to no avail.
>
> Here's the code:
>
> ------------------------------------------------------------
> module Main where
> import List (sort)
> import Maybe (fromJust, mapMaybe)
>
> -- Parse one line of the format "Field: value".
> -- Ignore those that start with a space.
> readField :: String -> Maybe (String,String)
> readField line =
>  if line == "" || head line == ' ' then Nothing else
>    let (name,':':' ':val) = break (== ':') line in
>    Just (name,val)
>
> data Package = Package { name, version :: !String }
>             deriving (Show, Eq, Ord)
>
> -- Use sprintPackage for formatted output, show for unformatted
> sprintPackage :: Package -> String
> sprintPackage (Package name ver) =
>  name ++ replicate (50 - length name) ' ' ++ ver
>
> -- Read one record worth of lines (separated by blank lines)
> getOneRecordLines [] = ([],[])
> getOneRecordLines lines = (hd, if tl == [] then [] else tail tl) where
>  (hd,tl) = break (== "") lines
>
> -- Read one record worth of lines and parse, returning (Field, value)
> -- tuples and the rest of the lines (less the just read record)
> readRecordFields :: [String] -> ([(String,String)], [String])
> readRecordFields lines = (mapMaybe readField rl, rest) where
>  (rl,rest) = getOneRecordLines lines
>
> -- Convert a list of (Field,value) tuples to a Package
> recordFieldsToPackage :: [(String,String)] -> Package
> recordFieldsToPackage fields =
>  let names = [y | (x,y) <- fields, x=="Package"]
>      vers = [y | (x,y) <- fields, x=="Version"]
>  in
>   case (names,vers) of
>     ([], _) -> error "Package has no name."
>     (a:b:rest, _) -> error ("Package has two names ("++a++","++b++").")
>     ([n], []) -> error ("Package "++n++" has no version.")
>     ([n], a:b:rest) -> error ("Package "++n++" has multiple versions.")
>     ([n], [v]) -> Package n v
>
> -- Read one record, returning Just Package if it contained a valid
> -- package and version, Nothing otherwise.
> readRecord :: [String] -> (Maybe Package,[String])
> readRecord lines =
>  if fields == [] then (Nothing, rest)
>  else (Just (recordFieldsToPackage fields), rest) where
>    (fields,rest) = readRecordFields lines
>
> -- Converts the list (stream) of lines to a list of packages
> readRecords :: [String] -> [Package]
> readRecords [] = []
> readRecords lines =
>  let (rec,rest) = readRecord lines in
>  case rec of
>    Just pkg -> pkg : readRecords rest
>    Nothing -> readRecords rest
>
> processFile = unlines . (map sprintPackage) . sort . readRecords . lines
>
> main :: IO ()
> main = interact processFile
> ------------------------------------------------------------
>
> The input (stdin) for the program is of the format:
>
> ------------------------------------------------------------
> Package: somepackagename
> Version: someversionstring
> Other-Fields: ...
> Whatever: ...
>  ... (lines that begin with space are just ignored)
>
> Package: pkg2
> Version: otherversion
> ...
> ------------------------------------------------------------
>
> That is, the file has records separated by blank lines. Only lines
> that begin with "Package: " or "Version: " are considered in a record,
> the others are just ignored.
>
> When finished, it outputs a formatted list of packages to stdout:
>
> ------------------------------------------------------------
> a2ps                                              1:4.14-1
> a2ps-perl-ja                                      1.45-5
> a56                                               1.3-5
> a7xpg                                             0.11.dfsg1-4
> a7xpg-data                                        0.11.dfsg1-4
> aa3d                                              1.0-8
> aajm                                              0.4-3
> aap                                               1.091-1
> aap-doc                                           1.091-1
> ...
> ------------------------------------------------------------
>
> GHC profiling hints that the memory is spent in processFile and main,
> but no amount of adding $! to those functions seems to help:
>
> ------------------------------------------------------------
>        Thu Mar 18 22:37 2010 Time and Allocation Profiling Report  (Final)
>
>           aptlistsh +RTS -p -RTS
>
>        total time  =        3.18 secs   (159 ticks @ 20 ms)
>        total alloc = 1,873,922,000 bytes  (excludes profiling overheads)
>
> COST CENTRE                    MODULE               %time %alloc
>
> processFile                    Main                  44.0   61.9
> main                           Main                  32.1   20.7
> readField                      Main                  11.9    9.0
> recordFieldsToPackage          Main                   4.4    2.0
> newPkg                         Main                   3.1    0.8
> sprintPackage                  Main                   1.3    2.1
> readRecordFields               Main                   1.3    1.3
> CAF                            Main                   1.3    0.4
> getOneRecordLines              Main                   0.6    1.4
> ------------------------------------------------------------
>
>        Sami
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
>
> iQIcBAEBCAAGBQJLpqppAAoJEKLT589SE0a08FoQALtZf+eO7u+YdVJuvZ6xxg/y
> DVHTnRdemPtlNuBJazV1RlTQovRNyQpX1CHX70MdnHLP4g9CzQhBsIeGTdgA9am+
> pd9NjNFc1+ZDJJWAY24Kyx0pEfvyfLuUvF8KNInZGXeg+BriiFjFFU3RHqY0F3XD
> LIjAq/TaABmMGgahvG/y0EMRhduC+N9VskA4ivd7OCxvA7bGuzZEFxFbz2qqXrcE
> VXzE4ioha8C+WeX6djm/+CSoqN8o7bN1DS+fkEYr+jlU2jYG/xvnnfC8785r+9iP
> wZo1muIP4FVExZ3w5VSjLWkhMjgYDHLiuEs1S5dQdrgAHz+ea+PZqFyOoDEMyYvG
> vhnaTrPev2O4BZLAky5F13tj7fm1tV5CxJ9oq7fuurZoB7OdEpSguOyR3xYgoIYv
> spg7aqRlLuruX7VHbSu6pTtqWCEvxHxQbd+R2f5jqaDOsw+3n1enH6PFqTkiLpHc
> ReTLli4ploUGmiQRmbWpe2urECQ6QAXhqdx93vHwT2TUYVqL9wH84XsNvGBSt8Kq
> 5OowrJOlneeWre0SS1dn96ASE1oKKiCpc9rkGvD3tMxTONHFTPziyMtiC675ReH0
> 9XHvRkn9+8RV2gVnKeOMyfPT3ef/q2aP8uESf4nW6AAHNa41QWiB9/OYS32ljx/9
> Cv/IvQVVIzSYQjNTO7a2
> =yn0D
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://www.haskell.org/mailman/listinfo/beginners
>
>



-- 
=====================
Patrick LeBoutillier
Rosemère, Québec, Canada


More information about the Beginners mailing list