Strings are slow
Lauri Alanko
la@iki.fi
Tue, 19 Nov 2002 12:11:12 +0200
Hello. Here's a grep program:
module Main where
import Text.Regex
import System.IO
import System.Environment
import Control.Monad
import Data.Maybe
main = do [re] <- getArgs
let rx = mkRegex re
let loop = do line <- getLine
when (isJust (matchRegex rx line)) (putStrLn line)
eof <- isEOF
unless eof loop
loop
It turned out that this is remarkably slow. The first problem was with
inlining. If this is compiled with ghc-5.04 -O -ddump-simpl, I get:
case GHC.IOBase.unsafePerformIO
@ (Data.Maybe.Maybe
(GHC.Base.String,
GHC.Base.String,
GHC.Base.String,
[GHC.Base.String]))
(Text.Regex.Posix.regexec (Text.Regex.mkRegex re) a731)
of wild4 {
Ie. the regex is compiled anew every time a string is matched. A bug?
Anyway, without optimization the code produced is reasonable, but still
horrendously slow. Testing with a simple word as a pattern from a 7.3MB,
800kline file, the running time was 37.5 seconds. For comparison, a similar
program in mzscheme (interpreted!) took 7.3 seconds while the system
grep, of course, took 0.4 seconds.
I did some profiling by creating new top-level bindings for matchRegex
and getLine (is there a better way?):
total time = 53.34 secs (2667 ticks @ 20 ms)
total alloc = 1,172,482,496 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
match Main 69.7 56.8
getl Main 23.9 40.2
main Main 6.3 3.0
So it seems like all the time is spent just converting ByteArrays to
char lists to C arrays. This makes me wonder how sensible it really is
to represent strings by char lists. Yes, it's nice and uniform and lazy,
but...
How can I get this faster, then? PackedStrings are not very useful
because they just don't support enough operations (getline, matchregex)
alone, and having to convert them to Strings sort of defeats their
purpose. _Any_ operation that provides only a String interface condemns
us to a gazillion allocations. And by default all char*-based foreign
interfaces are represented with Strings on the Haskell side.
Maybe a generic Textual class with at least String and PackedString (and
ByteArray?) as instances would help? Then the common string-based
operations could all have separate implementations for separate
representations. With heavy specialization, of course. This would be
especially useful if the FFI (especially withCString) supported it.
Or alternatively, maybe the foldr/build rewriting trick could be used to
eliminate some redundant conversions between representations?
Just throwing ideas in the air here.
Lauri Alanko
la@iki.fi