[Haskell-cafe] strict version of Haskell - does it exist?
sseverance at alphaheavy.com
Tue Jan 31 21:19:16 CET 2012
I had a similar experience with a similar type of problem. The
application was analyzing web pages that our web crawler had collected,
well, not the pages themselves but metadata about when each page was crawled.
The basic query was:
Domain, Date, COUNT(*)
The webpage data was split across tens of thousands of compressed
binary files. I used enumerator to load these files and select the appropriate
columns. This step was performed in parallel using parMap and worked fine
once I figured out where to add the appropriate strictness annotations (!s).
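To make that concrete, here is a minimal sketch of the pattern (the Row
type, field names, and parseLine are my own invented stand-ins, not the
original code): strict fields plus bang patterns ensure parMap's sparks do
real work instead of returning unevaluated thunks.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Strict fields: both components are evaluated when a Row is built.
data Row = Row { rowDomain :: !String, rowDate :: !String }

-- Full evaluation, so rdeepseq can force each Row inside its spark.
instance NFData Row where
  rnf (Row d t) = rnf d `seq` rnf t

-- Hypothetical parser for "domain,date" lines.
parseLine :: String -> Row
parseLine s =
  let (d, rest) = break (== ',') s
      !date     = drop 1 rest  -- bang pattern: evaluate before building Row
  in Row d date

-- Each line is parsed in its own spark and forced to normal form there.
parseAll :: [String] -> [Row]
parseAll = parMap rdeepseq parseLine
```

Without the rdeepseq strategy (or the bangs), parMap would spark thunks
that only get evaluated later on the consuming thread, serializing the work.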
The second step was the group-by. I built some tools on top of monad-par
that provided the usual higher-level operators: map, groupBy, filter, and so
on. The typical pattern I followed was the map-reduce style used in
monad-par. I was hoping to someday share this work, although I have since
abandoned it.
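As a sketch of the reduce side of that "Domain, Date, COUNT(*)" query
(sequential here for clarity; the original layered this over monad-par, and
countBy is my own name): a strict map keeps the counts evaluated as they
accumulate instead of piling up as thunks.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

type Key = (String, String)  -- (domain, date)

-- insertWith on Data.Map.Strict evaluates the updated count eagerly,
-- avoiding the classic space leak of counters in a lazy Data.Map.
countBy :: [Key] -> Map.Map Key Int
countBy = foldl' bump Map.empty
  where bump m k = Map.insertWith (+) k 1 m
```

The same fold can serve as the combine step of a map-reduce: each parallel
chunk produces its own map, and Map.unionWith (+) merges them.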
It took me a couple of weeks to get the strictness mostly right. I say
mostly because it still randomly blows up: if I feed in a single 40 KB file,
maybe one time in ten it consumes all the memory on the machine within a
few seconds. There is obviously a laziness bug in there somewhere. After
working on it for a few days and failing to come up with a solid repro
case, I eventually rebuilt all the web page analysis tools in Scala, in
large part because I did not see a way forward and needed to tie off that
work and move on.
Combining laziness and parallelism made it very difficult to reason about
what was going on. Test cases became non-deterministic, not in their
output in the success case, but in whether they ran at all.
The tooling around laziness does not give enough information for
debugging complex problems. Because of this, when people ask "Is Haskell
good for parallel development?" I tell them the answer is complicated.
Haskell has excellent primitives for parallel development, like STM,
which I love, but it lacks a fully built-out PLINQ-like toolkit for
flexible parallel data processing.
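For readers who have not used it, here is a minimal STM sketch (runCounter
is my own name, not from any library): several threads bump one shared
counter, and atomically guarantees no increments are lost.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_)

-- Fork nThreads workers, each adding nIncr to a shared TVar, then
-- wait for all of them and read the final total.
runCounter :: Int -> Int -> IO Int
runCounter nThreads nIncr = do
  counter <- newTVarIO (0 :: Int)
  done    <- newEmptyMVar
  forM_ [1 .. nThreads] $ \_ -> forkIO $ do
    -- modifyTVar' is strict in the new value, so no thunk chain builds
    -- up inside the TVar.
    replicateM_ nIncr (atomically (modifyTVar' counter (+ 1)))
    putMVar done ()
  replicateM_ nThreads (takeMVar done)
  readTVarIO counter
```

Each atomically block is a transaction: conflicting updates simply retry,
so there is no lock ordering to reason about.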
The other thing is that deepseq is very important. IMHO it needs to be a
first-class language feature, with all major libraries shipping deepseq
(NFData) instances. There seems to have been some movement on this front,
but you can't do serious parallel development without it.
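A sketch of why (Stats and summarize are made-up illustrations): without
forcing to normal form, each spark evaluates its result only to weak head
normal form, and the real work leaks back onto the consuming thread.

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric  #-}
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parList, rdeepseq, using)
import GHC.Generics (Generic)

-- NFData derived via Generic (supported by deepseq >= 1.4), the kind of
-- instance every library type ought to ship with.
data Stats = Stats { hits :: Int, bytes :: Int }
  deriving (Show, Generic, NFData)

summarize :: [Int] -> Stats
summarize xs = Stats (length xs) (sum xs)

-- rdeepseq forces each Stats fully inside its spark; with a plain WHNF
-- strategy the length/sum thunks would survive past the parallel phase.
summaries :: [[Int]] -> [Stats]
summaries chunks = map summarize chunks `using` parList rdeepseq
```

This is exactly where a missing NFData instance on some library type stops
you cold: you cannot deep-force what you cannot traverse.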
Some ideas for things that might help: a plugin for vim that showed the
strictness of operations and data. I am going to take another crack at a
PLINQ-like library with GHC 7.4.1 in the next couple of months, using the
debug symbols that Peter has been working on.
Haskell was the wrong platform for this webpage analysis anyhow, not
because anything is wrong with the language, but simply because it does not
have the tooling that the JVM does. I moved all my work into Hadoop to take
advantage of multi-machine parallelism and higher-level tools like Hive.
There might be a future in building Haskell code that could be translated
into a Hive query.
With better tools I think Haskell can become the go-to language for
developing highly parallel software. We just need tools that help
developers better understand the laziness of their software. There also
seems to be a documentation gap on developing data analysis and data
transformation pipelines in Haskell.
Sorry for the length. I hope my experience is useful to someone.
On Tue, Jan 31, 2012 at 7:57 AM, Marc Weber <marco-oweber at gmx.de> wrote:
> Excerpts from Felipe Almeida Lessa's message of Tue Jan 31 16:49:52 +0100
> > Just out of curiosity: did you use conduit 0.1 or 0.2?
> I updated to 0.2 today because I was looking for a monad instance for
> SequenceSink - but didn't find it cause I tried using it the wrong way
> (\state -> see last mail)
> I also tried json' vs json (strict and non strict versions) - didn't
> seem to make a big difference.
> Marc Weber