[Haskell-cafe] strict version of Haskell - does it exist?

Tue Jan 31 10:33:06 CET 2012

On Tue, Jan 31, 2012 at 6:05 AM, Marc Weber <marco-oweber at gmx.de> wrote:
> I didn't say that I tried your code. I gave enumerator package a try
> counting lines which I expected to behave similar to conduits
> because both serve a similar purpose.
> Then I hit the "sourceFile" returns chunked lines issue (reported
> it, got fixed) - ....
>
> Anyway: My log files are a json dictionary on each line:
>
>  { id : "foo", ... }
>  { id : "bar", ... }
>
> Now how do I use the conduit package to split a "chunked" file into lines?
> Or should I create a new parser "many json >> newline" ?

Currently there are two solutions.  The first one is what I wrote
earlier on this thread:

 jsonLines :: C.Resource m => C.Conduit B.ByteString m Value
 jsonLines = C.sequenceSink () $ \() -> do
   val <- CA.sinkParser json'
   CB.dropWhile isSpace_w8
   return $ C.Emit () [val]

This conduit will run the json' parser (from aeson) and then drop any
whitespace after that.  Note that it will correctly parse all of your
files but will also parse some files that don't conform to your
specification.  I assume that's fine.

The other solution is going to be released with conduit 0.2, probably
today.  There's a lines conduit that splits the file into lines, so
you could write jsonLines above as:

 mapJson :: C.Resource m => C.Conduit B.ByteString m Value
 mapJson = C.sequenceSink () $ \() -> do
   val <- CA.sinkParser json'
   return $ C.Emit () [val]

which doesn't need to care about newlines, and then change main to

 main = do
   ...
   ret <- forM fileList $ \fp -> do  -- forM, not forM_, so the results are kept
     C.runResourceT $
       CB.sourceFile fp C.$=
       CB.lines C.$=  -- new line is here
       mapJson C.$=
       CL.mapM processJson C.$$
       CL.consume
   print ret
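
To make the chunking problem concrete, here is a minimal pure sketch of
what a line-splitting stage has to do. It is not the implementation of
CB.lines; linesFromChunks and splitComplete are hypothetical names, and
plain lists stand in for the stream of chunks. The point is only that a
line straddling a chunk boundary must be buffered until its newline
arrives.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Hypothetical helper: re-split arbitrary chunks into complete lines.
-- The unfinished tail of each chunk is carried over to the next one.
linesFromChunks :: [B.ByteString] -> [B.ByteString]
linesFromChunks = go B.empty
  where
    -- No more input: emit the leftover as a final line, if there is one.
    go leftover []
      | B.null leftover = []
      | otherwise       = [leftover]
    -- Prepend the leftover to the new chunk and split off complete lines.
    go leftover (c:cs) =
      let (complete, rest) = splitComplete (leftover `B.append` c)
      in complete ++ go rest cs

    -- Returns all newline-terminated lines plus the unterminated remainder.
    splitComplete bs =
      case B.elemIndex '\n' bs of
        Nothing -> ([], bs)
        Just i  ->
          let (ls, rest) = splitComplete (B.drop (i + 1) bs)
          in (B.take i bs : ls, rest)
```

Feeding it the three chunks "{ id", " : \"foo\" }\n{ id : \"b" and
"ar\" }\n" yields the two lines { id : "foo" } and { id : "bar" },
regardless of where the chunk boundaries fall.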

I don't know which solution would be faster.  Either way, both
solutions will probably be faster with the new conduit 0.2.

> Except that I think my processJson for this test should look like this
> because I want to count how often the clients queried the server.
> Probably I should also be using CL.fold as shown in the test cases of
> conduit. If you tell me how you'd cope with the "one json dict on each
> line" issue I'll try to benchmark this solution as well.

This issue was already being coped with in my previous e-mail =).

> -- probably existing library functions can be used here ..
> processJson :: (M.Map T.Text Int) -> Value -> (M.Map T.Text Int)
> processJson m value = case value of
>                          Ae.Object hash_map ->
>                            case HMS.lookup (T.pack "id") hash_map of
>                              Just id_o ->
>                                case id_o of
>                                  Ae.String id -> M.insertWith' (+) id 1 m
>                                  _ -> m
>                              _ -> m
>                          _ -> m

Looks like the perfect job for CL.fold.  Just change the last three
lines in main from

  ... C.$=
  CL.mapM processJson C.$$
  CL.consume

into (note that CL.fold also takes the initial accumulator, here an
empty map)

  ... C.$$
  CL.fold processJson M.empty

and you should be ready to go.
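
For intuition, the same counting can be sketched as a pure strict left
fold over a list. This is only an analogue of the conduit pipeline, not
its code: countId and countIds are hypothetical names, Maybe String
stands in for "the id field, if present and a string", and
Data.Map.Strict.insertWith is used where the quoted code uses
M.insertWith' (newer versions of containers no longer provide
insertWith').

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- | One fold step: bump the counter for an id, ignore entries without one.
-- Same argument order as processJson: accumulator first, then the element.
countId :: M.Map String Int -> Maybe String -> M.Map String Int
countId m (Just i) = M.insertWith (+) i 1 m
countId m Nothing  = m

-- | Count how often each id occurs, starting from the empty map.
-- foldl' forces the accumulator at each step, so no chain of thunks builds up.
countIds :: [Maybe String] -> M.Map String Int
countIds = foldl' countId M.empty
```

For example, countIds [Just "foo", Just "bar", Nothing, Just "foo"]
gives a map with "foo" at 2 and "bar" at 1.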

Cheers!

-- 
Felipe.