Potential GSoC proposal: Reduce the speed gap between 'ghc -c' and 'ghc --make'

Simon Marlow marlowsd at gmail.com
Wed Apr 25 10:00:19 CEST 2012


On 25/04/2012 08:57, Simon Marlow wrote:
> On 25/04/2012 03:17, Mikhail Glushenkov wrote:
>> Hello Simon,
>>
>> Sorry for the delay.
>>
>> On Tue, Apr 10, 2012 at 1:03 PM, Simon Marlow <marlowsd at gmail.com> wrote:
>>>
>>>> Questions:
>>>>
>>>> Would implementing this optimisation be a worthwhile/realistic GSoC
>>>> project?
>>>> What are other potential ways to bring 'ghc -c' performance up to par
>>>> with 'ghc --make'?
>>>
>>>
>>> My guess is that this won't have a significant impact on ghc -c compile
>>> times.
>>>
>>> The advantage of squashing the .hi files for a package together is
>>> that they could share a string table, which would save a bit of
>>> space and time, but I think the time saved is small compared to the
>>> cost of deserialising and typechecking the declarations from the
>>> interface, which still has to be done. In fact it might make things
>>> worse, if the string table for the whole base package is larger
>>> than the individual tables that would be read from .hi files. I
>>> don't think mmap() will buy very much over the current scheme of
>>> just reading the file into a ByteArray.
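
To make the string-table idea concrete: it is just interning -- each
distinct string is stored once and every occurrence is replaced by an
index. A minimal sketch of the idea (hypothetical names, not GHC's
actual .hi serialiser):

    import qualified Data.Map.Strict as M

    -- Each distinct string is stored once; occurrences become indices.
    data StringTable = StringTable
      { stNext    :: !Int                 -- next free index
      , stEntries :: !(M.Map String Int)  -- string -> index
      }

    emptyTable :: StringTable
    emptyTable = StringTable 0 M.empty

    intern :: String -> StringTable -> (Int, StringTable)
    intern s tbl@(StringTable next entries) =
      case M.lookup s entries of
        Just i  -> (i, tbl)               -- seen before: share the index
        Nothing -> (next, StringTable (next + 1) (M.insert s next entries))
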
>>
>> Thank you for the answer.
>> I'll be working on another project during the summer, but I'm still
>> interested in making interface files load faster.
>>
>> The idea that I currently like the most is to make it possible to save
>> and load objects in the "GHC heap format". That way, deserialisation
>> could be done with a simple fread() and a fast pointer fixup pass,
>> which would hopefully make running many 'ghc -c' processes as fast as
>> a single 'ghc --make'. This trick is commonly employed in the games
>> industry to speed up load times [1]. Given that Haskell is a
>> garbage-collected language, the implementation will be trickier than
>> in C++ and will have to be done at the RTS level.
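
To illustrate the fixup pass Mikhail describes: pointers in the saved
image are stored as offsets from the image base, and loading rebases
them by adding the address the image actually landed at. A toy sketch
of the idea only -- real RTS code would walk closures via their info
tables rather than a relocation list:

    import Data.Word (Word64)

    -- Words in the saved image are either plain data or internal
    -- pointers; which ones are pointers would come from object layout.
    fixup :: Word64      -- address the image was loaded at
          -> [Int]       -- indices of the words that are pointers
          -> [Word64]    -- image contents, pointers stored as offsets
          -> [Word64]
    fixup base relocs img =
      [ if i `elem` relocs then w + base else w
      | (i, w) <- zip [0 ..] img ]
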
>>
>> Is this a good idea? How hard would it be to implement this optimisation?
>
> I believe OCaml does something like this.
>
> I think the main difficulty is that the data structures in the heap are
> not the same every time, because we allocate unique identifiers
> sequentially as each Name is created. So to make this work you would
> have to make Names globally unique. Maybe using a 64-bit hash instead of
> the sequentially-allocated uniques would work, but that would entail
> quite a performance hit on 32-bit platforms (GHC uses IntMap everywhere
> with Unique as the key).
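
Roughly, the idea would be to derive the unique from the Name's stable
identity (package, module, occurrence name) rather than from a
counter. A sketch using FNV-1a -- hypothetical names, and of course
GHC's real Name/Unique machinery is more involved:

    import Data.Bits (xor)
    import Data.Char (ord)
    import Data.List (foldl')
    import Data.Word (Word64)

    -- FNV-1a: a simple 64-bit hash; any decent 64-bit hash would do.
    hash64 :: String -> Word64
    hash64 = foldl' step 0xcbf29ce484222325
      where
        step h c = (h `xor` fromIntegral (ord c)) * 0x100000001b3

    -- The unique becomes a pure function of the Name's identity, so it
    -- is the same in every GHC invocation -- at the cost of being 64
    -- bits wide even on 32-bit platforms.
    nameUnique :: String -> String -> String -> Word64
    nameUnique pkg mdl occ = hash64 (pkg ++ ":" ++ mdl ++ "." ++ occ)
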
>
> On top of this there will be a *lot* of other complications (e.g.
> handling sharing well, mapping info pointers somehow). Personally I
> think it's at best very ambitious, and at worst not at all practical.

Oh, I also meant to add: the best thing we could do initially is to 
profile GHC and see if there are improvements that could be made in the 
.hi file deserialisation/typechecking.
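
Concretely, that would mean a profiled stage-2 compiler (e.g. setting
GhcProfiled = YES in mk/build.mk for the make-based build) and then
something along these lines:

    $ inplace/bin/ghc-stage2 -c Foo.hs +RTS -p -RTS
    $ less ghc-stage2.prof

and looking for the interface-loading and typechecking cost centres in
the report.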

Cheers,
	Simon



>
> Cheers,
> Simon
>
>
>
>> Another idea (that I like less) is to implement a "build server" mode
>> for GHC. That way, instead of a single 'ghc --make' we could run
>> several ghc build servers in parallel. However, Evan Laforge's efforts
>> in this direction didn't bring the expected speedup. Perhaps it's
>> possible to improve on his work.
>>
>> [1]
>> http://www.gamasutra.com/view/feature/132376/delicious_data_baking.php?print=1
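
As for the build-server idea: the kernel of it is just a persistent
GHC API session, so package interfaces are read and typechecked once
per server rather than once per 'ghc -c' process. A rough sketch
(needs -package ghc and the ghc-paths package; a real server would add
an IPC front end around this loop):

    import GHC
    import GHC.Paths (libdir)   -- from the ghc-paths package
    import Control.Monad (forM_)

    -- One long-lived session compiling many files in turn: interfaces
    -- loaded from packages stay cached between compilations.
    compileAll :: [FilePath] -> IO ()
    compileAll files =
      runGhc (Just libdir) $ do
        dflags <- getSessionDynFlags
        _ <- setSessionDynFlags dflags
        forM_ files $ \f -> do
          t <- guessTarget f Nothing
          setTargets [t]
          _ <- load LoadAllTargets
          return ()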
>>
>



