[Haskell-cafe] help with Haskell performance

Sat Nov 7 16:08:41 EST 2009

Hi all,

The task I'm trying to accomplish:

Given a log file containing several lines of white space delimited entries like this:

[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist

filter lines that match the string " 08:", extract the 6th, 7th and 8th words from that line, group all lines that have the the same resulting string, do a count on them and sort the result in descending order of counts and print it out. So in the example above we'd end up with an output like this:

("GET /url1 HTTP/1.1]:", 2)
("GET /url2 HTTP/1.0]:", 1)

Seems pretty straightforward, so I wrote a simple perl script to achieve this task (see the bottom of this email).

The input file is 335 MB in size and contains about 2 million log line entires in it. The perl script does a pretty decent job and finishes in about 3 seconds.

Now the interesting part. I decided to implememt this in Haskell (being my favorite language and all) and ended up with the following code:

--- begin haskell code ---

import Text.Regex.Posix ( (=~) )
import qualified Data.List as List  
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy.Char8 as LB

main = do
  contents <- LB.readFile "log_file"
  putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .
    -- create a Map grouping & counting matching tokens and sort based on the counts
    List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . Map.fromListWith (+) . filtertokens .
    LB.lines $ contents
  where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) []
          -- filter lines starting with " 08:"
          where f = (=~ " 08:") . LB.unpack
                -- extract tokens 6, 7 & 8 and create an association list like so ("GET /url2 HTTP/1.0]:", 1)
                g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] where xs = LB.words line

--- end haskell code ---

This haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2 and even went to the extent of fusing an adjacent map and filter using a foldr like so: map f (filter g) => foldr ( if g x then f x ... ), fusing adjacents maps etc. Still the same result.

I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback.

Thanks in advance.

P.S. For reference, here's my corresponding perl implementation:

--- start perl code ---

#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

my %urls;
open(FILE, '<', $ARGV[0]);
while(<FILE>) {
    if (/ 08:/) {
        my @words = split;
        my $key = join(" ", ($words[6], $words[7], $words[8]));
        if (exists $urls{$key}) { $urls{$key}++ }
        else { $urls{$key} = 1 }
    }
}
for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" }

--- end perl code ---

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20091107/108ec6a4/attachment.html