parallelizing ghc

Simon Marlow marlowsd at gmail.com
Mon Feb 27 10:33:49 CET 2012


On 17/02/2012 18:12, Evan Laforge wrote:
>> Sure, except that if the server is to be used by multiple clients, you will
>> get clashes in the PIT when say two clients both try to compile a module
>> with the same name.
>>
>> The PIT is indexed by Module, which is basically the pair
>> (package,modulename), and the package for the main program is always the
>> same: "main".
>>
>> This will work fine if you spin up a new server for each program you want to
>> build - maybe that's fine for your use case?
>
> Yep, I have a new server for each CPU.  So compiling one program will
> start up (say) 4 compilers and one server.  Then shake will start
> throwing source files at the server, in the proper dependency order,
> and the server will distribute the input files among the 4 servers.
> Each server is single-threaded so I don't have to worry about calling
> GHC functions reentrantly.
>
> But --make is single-threaded as well, so why does it bother with all
> that HPT stuff instead of just calling compileFile repeatedly?  Is it
> just for ghci?

That might be true, but I'm not completely sure.  The HPT stuff was 
added with a continuous edit-recompile cycle in mind (i.e. for GHCi), 
and we added --make at the same time because it fitted nicely.  It might 
be that just calling compileFile repeatedly works, and it would end up 
storing the interfaces for the home-package modules in the 
PackageIfaceTable, but we never considered this use case.  One thing 
that worries me: will it be reading the .hi file for a module off the 
disk after compiling it?  I suspect it might, whereas the HPT method 
will be caching the iface in the HPT.
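
Roughly, I imagine the per-file path would look something like the
following through the API.  This is only a sketch, not the ghc-server
code: DriverPipeline.compileFile, DriverPhases and MonadUtils are the
7.4-era internal modules (other versions may differ), and libdir comes
from the ghc-paths package.

-- A rough sketch of driving one-shot compilation from the GHC API by
-- calling compileFile once per source file.  Module and signature
-- names are the GHC-7.4-era internals and an assumption for other
-- versions; libdir comes from the ghc-paths package.
module OneShotCompile (compileFiles) where

import GHC
import GHC.Paths (libdir)
import DriverPipeline (compileFile)
import DriverPhases (Phase(StopLn))
import MonadUtils (liftIO)

compileFiles :: [FilePath] -> IO [FilePath]
compileFiles srcs = runGhc (Just libdir) $ do
    dflags  <- getSessionDynFlags
    _       <- setSessionDynFlags dflags
    hsc_env <- getSession
    -- Each call is an independent one-shot compile (like plain "ghc -c"):
    -- interfaces of already-built home modules are found by reading
    -- their .hi files from disk, not from an in-memory HPT as --make
    -- would.
    liftIO $ mapM (\src -> compileFile hsc_env StopLn (src, Nothing)) srcs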

>>> The 'user' is low for the server because it doesn't count time spent
>>> by the subprocesses on the other end of the socket, but excluding
>>> linking it looks like I can shave about 25% off compile time.
>>> Unfortunately it winds up being just about the same speed as ghc --make,
>>> so that saving seems too low.
>>
>> But that's what you expect, isn't it?
>
> It's surprising to me that the serial --make is just about the same
> speed as a parallelized one.  The whole point was to compile faster!

Ah, so maybe the problem is that the compileFile method is re-reading 
.hi files off the disk (and typechecking them), and that is making it 
slower.
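
One quick way to check would be to turn on the interface-loading trace
(-ddump-if-trace) and see which .hi files get read per compile.  A
sketch of flipping it on from inside an API session, assuming the
7.4-era DynFlags names (dopt_set, Opt_D_dump_if_trace):

-- Sketch: switch on -ddump-if-trace from inside a GHC API session so
-- each interface file that gets (re)loaded is logged.  Flag and
-- function names are as in GHC 7.4's DynFlags (an assumption for
-- other versions).
import GHC
import DynFlags (dopt_set, DynFlag(Opt_D_dump_if_trace))

enableIfaceTrace :: GhcMonad m => m ()
enableIfaceTrace = do
    dflags <- getSessionDynFlags
    _ <- setSessionDynFlags (dopt_set dflags Opt_D_dump_if_trace)
    return ()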

> Granted, each interface has to be loaded once per processor while
> --make only needs to load it once, but once loaded they should stay
> loaded, and I'd expect the benefit of two processors to win out
> pretty quickly.
>
>> --make has a slight advantage for linking in that it knows which packages it
>> needs to link against, whereas plain ghc will link against all the packages
>> on the command line.
>
> Ohh, so maybe with --make it can omit some packages and do less work.
> Let me try minimizing the -packages and see if that helps.
>
> As an aside, it would be handy to be able to ask ghc "given this main
> module, which -packages should the final program get?" but not
> actually compile anything.  Is there a way to do that, short of
> writing my own with the ghc api?  Would it be a reasonable ghc flag,
> along the lines of -M but for packages?

I don't think we can calculate the package dependencies without knowing 
the ModIface, which is generated by compiling (or at least typechecking) 
each module.
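
Once the modules *have* been typechecked, though, the information is
sitting in their interfaces.  A sketch of pulling it out afterwards,
assuming the 7.4-era internals (hsc_HPT, hm_iface, mi_deps, dep_pkgs)
and the ghc-paths package for libdir:

-- Sketch: typecheck a program (HscNothing, no code generation) and
-- then read the package dependencies out of each home module's
-- ModIface.  Field and module names are GHC-7.4-era internals and an
-- assumption for other versions.
module PkgDeps (packageDeps) where

import GHC
import GHC.Paths (libdir)
import HscTypes (hsc_HPT, hm_iface, mi_deps, Dependencies(..))
import Module (packageIdString)
import UniqFM (eltsUFM)
import Data.List (nub)

packageDeps :: FilePath -> IO [String]
packageDeps mainSrc = runGhc (Just libdir) $ do
    dflags <- getSessionDynFlags
    _ <- setSessionDynFlags dflags { hscTarget = HscNothing }
    t <- guessTarget mainSrc Nothing
    setTargets [t]
    _ <- load LoadAllTargets          -- typechecks the whole program
    hsc_env <- getSession
    let ifaces = map hm_iface (eltsUFM (hsc_HPT hsc_env))
        -- in this era dep_pkgs pairs each PackageId with a Safe
        -- Haskell trust flag, hence the map fst
        pkgs   = concatMap (map fst . dep_pkgs . mi_deps) ifaces
    return (nub (map packageIdString pkgs))

It still has to typecheck the whole program first, which is exactly the
limitation above, and HscNothing won't work if any module needs
Template Haskell at compile time.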

Cheers,
	Simon


>
> BTW, in case anyone is interested, a darcs repo is at
> http://ofb.net/~elaforge/ghc-server/



