[Haskell-cafe] Optimising ReadP
Andrew Kvapil
viluon at seznam.cz
Tue Sep 21 12:59:20 UTC 2021
Hi,
I'm trying to optimise a short Haskell program (inspired by Rust's
performance). From simple profiling, I discovered that 90+% of the time
and allocations are spent in a `ReadP`-based parser. The parser is
fairly small, so I assumed GHC would inline/optimise most of it. I
tried adding some `SPECIALISE` pragmas but found out they don't work:
they require `INLINE`/`INLINEABLE` pragmas or compiling with `-O`. I'm
confused by the latter requirement, as I thought dependencies of a
Stack project are compiled with optimisation enabled. To be able to
optimise this code locally, I copied the entirety of the `ReadP` module
into my project, along with what I used from `Text.Read.Lex`. I managed
to get some speedups by specialising the integer parsers. Cloning the
`ReadP` code also helped with more detailed profiling reports; I now
know that the majority of time and allocations are spent in `>>=`.
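For reference, the specialisation that helped looked roughly like this
(a minimal sketch, not my actual code; `decimal` is a hypothetical
stand-in for the copied `readDecP`):

    import Data.Char (digitToInt, isDigit)
    import Text.ParserCombinators.ReadP

    -- A polymorphic decimal parser, similar in shape to readDecP
    -- from Text.Read.Lex (hypothetical simplification):
    decimal :: Num a => ReadP a
    decimal = do
      ds <- munch1 isDigit
      return (foldl (\acc d -> 10 * acc + fromIntegral (digitToInt d)) 0 ds)
    {-# INLINEABLE decimal #-}

    -- With the definition copied locally, the pragma can sit at the
    -- declaration site (it still needs -O to fire):
    {-# SPECIALISE decimal :: ReadP Int #-}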
That's it for the long-winded intro; here are my questions:
- Why does GHC complain about missing `INLINE(ABLE)` pragmas and/or
`-O` in Stack dependencies?
- Are these things just never inlined unless library authors
specifically mark them as such?
- What profiling tools do people usually use in situations like this?
- I cannot specialise `>>=` even when I mark it `INLINEABLE` in the
`Monad ReadP` instance; apparently, the pragma should be placed at the
declaration site (at least that's what the warning says). So I run into
the same `INLINE(ABLE)`/`-O` issue as above. That seems pretty silly; I
feel like I should be able to inline/specialise this particular
instance without worrying about the global declaration of `>>=`. Is
there a workaround for this? (See the first sketch after this list.)
- I also used `BangPatterns` in my optimisations, but they've been
hit-and-miss. They (surprisingly) helped in small, local definitions
marked `INLINE` (I assumed strictness analysis would pick up on those),
but in other places, such as `ReadP`'s `>>=`, strictness annotations
slightly worsened performance (see the second sketch after this list).
How do people look for good places to annotate as strict?
- How do I find out to what extent laziness is slowing things down, or
look for places with unintended thunks?
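Regarding the `>>=` question: would something along these lines be the
right workaround? This is only a sketch of the idea, pulling the bind
out to a top-level definition so the pragma sits at a declaration site
(`P` is a placeholder for base's real parser type):

    {-# LANGUAGE RankNTypes #-}

    data P a  -- placeholder; base's real P carries the parse results

    -- The ReadP shape from base:
    newtype ReadP a = R (forall b. (a -> P b) -> P b)

    -- Bind pulled out of the instance, so INLINE/INLINEABLE and
    -- SPECIALISE pragmas attach at a declaration site:
    bindP :: ReadP a -> (a -> ReadP b) -> ReadP b
    bindP (R m) f = R (\k -> m (\a -> let R m' = f a in m' k))
    {-# INLINE bindP #-}

    instance Functor ReadP where
      fmap f (R m) = R (\k -> m (k . f))

    instance Applicative ReadP where
      pure x = R (\k -> k x)
      R f <*> R a = R (\k -> f (\g -> a (k . g)))

    instance Monad ReadP where
      (>>=) = bindP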
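And regarding `BangPatterns`, the kind of small definition where a bang
helped looks like this (a hypothetical example, not my actual code):

    {-# LANGUAGE BangPatterns #-}
    import Data.Char (digitToInt)

    -- Accumulator step of an integer parser; the bang keeps the
    -- accumulator evaluated instead of building a chain of thunks:
    step :: Int -> Char -> Int
    step !acc c = 10 * acc + digitToInt c
    {-# INLINE step #-}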
As these are general optimisation questions, I'm not looking for
suggestions of alternative parsing libraries. I'd like to write
high-level code and have it compose efficiently; this is a toy program
anyway, and I'm wondering how far I can get with such a simple parsing
library. I suspect the majority of allocations coming from the parser
are short-lived garbage that I should be able to avoid with
optimisations and maybe a little refactoring.
One thing that's specific to my implementation is that the parser first
produces a `[(Int, Int)]` and then converts it to an `IntMap (Set Int)`
with a `foldl'`. I would like to ensure the list is never built in the
first place; is there a way to do that?
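Concretely, the conversion step looks roughly like this (`toMap` is a
made-up name for the sketch):

    import Data.IntMap.Strict (IntMap)
    import qualified Data.IntMap.Strict as IntMap
    import Data.List (foldl')
    import Data.Set (Set)
    import qualified Data.Set as Set

    -- The post-parse conversion described above (sketch):
    toMap :: [(Int, Int)] -> IntMap (Set Int)
    toMap = foldl' step IntMap.empty
      where
        step m (k, v) = IntMap.insertWith Set.union k (Set.singleton v) m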
As for the code itself, it's only 90 lines, but I'm not sure what the
consensus on attachments is on this mailing list. I can provide the
whole Stack project, along with a test case and a reference solution in
Rust, upon request. (I'd post it on my GitHub, but there's a slight
chance that a participant in the competition this is a reference
solution for might find it there.)
Best regards,
Andrew