Map folds in containers: why no worker/wrapper split?
Claus Reinke
claus.reinke at talk21.com
Thu Jun 24 04:52:47 EDT 2010
> I did quite a lot of benchmarking, strictness analysis and performance
> improvements to the containers package recently, see the draft of my
> paper http://fox.ucw.cz/papers/containers/containers.pdf.
>
> Next week I will start finalizing the patches and submit them. The 'not
> generating a worker/wrapper' issue is one of several on my radar, so
> this should be resolved.
That is good to hear. It is great that someone is doing the work
needed to pin down, measure, document, and perhaps even
cure some of the performance issues in containers. If nothing
else, a published summary of the performance issues that have
popped up over the years would be a start.
Btw, the first thing I'd look for in a paper on containers performance,
and the one issue that tends to trip up most users, is strictness. So I
was surprised not to see a discussion of this in your draft (at least
not in a quick browse). Could it be that your work has focussed on
the Sets rather than the Maps in containers?
Lots of performance problems with containers are about
pragmatics, not representation, eg, one knows that Maps are
strict in keys but available operations happen to be defined
in such a way that one cannot easily ensure other desirable
strictness properties.
For the worker/wrapper split, there are non-strict definitions
of higher-order functions, which the compiler could make strict
by using strictness information about the calling context if the
definitions were written differently.
For other operations, turning the non-strict definitions into
strict versions (when strict versions are more appropriate;
ideally, both variants should be well supported and
documented) is often difficult for users of containers.
The issues are also non-obvious, which is perhaps worse,
because Map/IntMap is often used to attempt to speed up
a loop accumulator, with the result that thunks pile up
behind the scene and performance goes down until large
problem sizes cannot be handled. At this point, users are
no longer aware that they built in the performance problem
right at the start, by using a supposedly efficient data structure
in an unfortunate way - they think it is about Haskell and
large datasets.
Some examples that I happen to recall (there are more in
the list archives of cafe and libraries, under various keywords,
as the original poster doesn't realize that the root cause is in
containers; there are also some related tickets in the GHC trac):
- IntMap and Map expose different APIs wrt strictness
http://www.haskell.org/pipermail/libraries/2009-March/011399.html
http://www.mail-archive.com/haskell-cafe@haskell.org/msg48925.html
- to work around the non-strict Map fold, users have to
resort to nested folds (non-strict fold to convert to list,
then strict foldl' over list) and other tricks:
http://www.haskell.org/pipermail/haskell-cafe/2010-June/079098.html
similarly non-trivial thinking is required to work around other
no-control-over-strictness issues in the APIs, bypassing the
obvious functions and using non-obvious ones based on
knowledge about their implementation
- strictness is used inconsistently:
http://hackage.haskell.org/trac/ghc/ticket/4109
- strictness is not documented: !!!
search for "strict" in the Data.Map/IntMap documentation..
this is really bad - the only way for users to become aware
of strictness problems is by running into them, and the only
way to fix strictness-related performance issues is by referring
to the implementation (in theory, an implementation change
could invalidate lots of performance fixes in production code)
It seems that strictness documentation and control would
make for a large section in a paper on containers performance.
Claus
More information about the Libraries
mailing list