[Haskell-cafe] Files and Modules

jo at durchholz.org jo at durchholz.org
Mon Dec 2 13:24:48 UTC 2024


On 02.12.24 10:16, julian getcontented.com.au wrote:
> In a recent project that compiles Haskell source from data (ie of type Text from the module Data.Text), it would be useful to be able to decouple the dependency between GHC’s notion of where modules are and the file system. This doesn’t seem to be programmatically controllable.
> 
> How tenable is this?

I can report how this works in the Java world, where this exists.


TL;DR: The real issues are programmer workflows, tool integration, and 
optimization, not so much semantic issues.


The mechanism itself is a non-problem there, as it was designed into 
the ecosystem from the start.
Even so, it came with a number of significant downsides.

It places pretty hefty constraints on global optimizations, inlining in 
particular: you can't usefully inline if you don't know whether a call 
will be polymorphic because somebody added a subclass.
You either have to prevent subclassing, or forfeit cross-module 
inlining, or keep track of dependencies so you can undo inlining 
whenever assumptions about polymorphism are broken by new code.
Now Haskell's polymorphism is different from Java's, but I'd expect 
similar issues.
In the Java world, this meant integrating the optimization phase into 
the runtime system, increasing the code and memory footprint of the 
JVM, and paying a heavy runtime cost during start-up, when the bulk of 
optimizations is run.
The Java world is currently swinging from dynamic code loading to 
static precompilation (look for references to GraalVM if you want to 
know more); however, this burdens the application programmer with 
defining the dynamic loading behaviour of the class system.
That definition lives in the build specifications rather than in the 
code itself, which comes with its own set of problems, such as having 
to cross-reference code and build specs when reasoning about code, 
though that affects only those who want to control compiler behaviour 
tightly.



> Would it be useful for anyone else to have compilation itself be
> more first class in the language? If I think about languages such as
> LISP/Racket/Clojure, there’s a certain flexibility there that
> Haskell lacks, but it’s not apparent why, other than historical
> reasons?

These languages are extremely hard to optimize, so at least the GHC 
people won't be able to follow that route.

If that's fine, then your suggestion seems doable.

> Would this imply changing compiling and linking into a different
> Monad than IO?

I can't say with confidence but I wouldn't expect that to be an issue.
A compiler typically maps special operations like this to machine code 
or possibly intermediate code, as part of the optimization phase.
The selection of the IO mechanism is more of a programmer-facing issue.
The exact conditions under which a given optimization is applicable do 
depend on the details of the IO mechanism's semantics, so I guess 
nobody will want to touch that part of the mechanism and will instead 
say that IO is fine; they'd rather modify other systems to make IO 
work, if there are problems with it.

> At the moment to compile some source that exists in Text, my system
> has to write a bunch of temp files including the Text that contains
> the main module, and then put other modules in directories named a
> certain way, run the compiler across them via some exec command to
> call GHC or stack externally, then read the resulting executable
> back off disk to store it in its final destination.
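For concreteness, that workflow can be sketched in a few lines (a 
minimal sketch only; the "gen" directory, the module name, and the 
choice of calling plain "ghc" are my assumptions, not your setup):

```haskell
import System.Directory (createDirectoryIfMissing)
import System.Process (callProcess)

-- Hypothetical sketch of the temp-file workflow: write the generated
-- main module to disk, shell out to GHC, and leave the executable at
-- outPath. Real code would also write the other modules into suitably
-- named directories and read the binary back for storage.
compileFromText :: String -> FilePath -> IO ()
compileFromText mainSrc outPath = do
  createDirectoryIfMissing True "gen"
  writeFile "gen/Main.hs" mainSrc
  callProcess "ghc" ["gen/Main.hs", "-o", outPath]
```

The round trip through the file system is exactly the coupling you're 
describing.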

And now we're in the area of application programmer downsides.
If you make these files temporary and unavailable to the programmer 
for debugging, stack traces and such become meaningless.
You'll have to add tooling for that, i.e. code that takes the stack 
traces and maps them back to the specification language that you're 
generating code from.
You'll have to treat the messages in those stack traces as low-level 
and add a translation step that lifts their low-level semantics to 
what the programmer specified; that means knowing exactly which 
Haskell code patterns can exist, having a full list of possible 
errors, and writing the code that does the translation.
Similar considerations apply to debuggers, profiling tools, and 
whatever other programming tools have a connection to code lines.
That's a pretty tall order, not only because of the translation step, 
but because you have to integrate that translation with a multitude of 
tools, some (most?) of them under active development, i.e. moving 
targets.
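The core of that translation step is small; the hard part is feeding 
it. A sketch (the line-map format is an illustration I made up, not an 
existing GHC feature; the generator would have to emit it alongside 
the code):

```haskell
import qualified Data.Map as Map

-- A location in the original specification the code was generated from.
data SpecLoc = SpecLoc { specFile :: FilePath, specLine :: Int }
  deriving (Show, Eq)

-- Map from lines of the generated Haskell back to spec locations.
type LineMap = Map.Map Int SpecLoc

-- Rewrite one stack-trace frame; keep unmapped frames as-is so the
-- programmer at least sees the low-level location.
translateFrame :: LineMap -> (String, Int) -> String
translateFrame lm (fun, line) =
  case Map.lookup line lm of
    Just (SpecLoc f l) -> fun ++ " (from " ++ f ++ ":" ++ show l ++ ")"
    Nothing            -> fun ++ " (generated line " ++ show line ++ ")"
```

Doing the same for every tool that reports code locations is where the 
effort explodes.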

Code generators like yours typically use another technique: generate 
the code into a directory that's not under version control but is part 
of the module paths of all tools, and generate it with comments that 
refer back to the original specification, relying on the programmer's 
knowledge to do the backwards translation.
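A minimal sketch of that second point (the spec format and the comment 
shape are invented for illustration):

```haskell
-- Emit a Haskell binding for each (name, value) entry of a
-- hypothetical specification, with a comment pointing back at the
-- spec entry so the programmer can do the backwards translation
-- by eye when a message mentions generated code.
genModule :: FilePath -> [(String, Int)] -> String
genModule specFile entries =
  unlines $
    "module Generated where"
      : ""
      : concat
          [ [ "-- generated from " ++ specFile ++ ", entry " ++ show i
            , name ++ " :: Int"
            , name ++ " = " ++ show val
            , ""
            ]
          | (i, (name, val)) <- zip [1 :: Int ..] entries
          ]
```

No tool integration is needed; the comments carry the back-references.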

In the Java world, this kind of thing was recently integrated into the 
toolchains. There's a mechanism called an "annotation processor" 
(ignore the "annotation" part, it's just the trigger for the 
mechanism) which is run by the Java compiler and generates code into a 
generated-code subdirectory; the toolchains know to include this 
directory in their module paths, fairly recently even by default.

> It might be useful to be able to do this from within Haskell code
> directly, partly similarly to how the hint library works. Though, in
> this case it would almost certainly also require being able to have
> two versions of GHC loaded at once, which would also imply being
> able to simultaneously have multiple or different versions of
> libraries loaded at once, too, and possibly also just from data, ie
> not from disk. It feels like a massive, massive project at that
> point, though, like we’d be putting an entire dependency system into
> a first-class programmable context. I’m still interested in what
> folks think about these ideas, though, even though this may
> never eventuate.
It will be less massive if you start by integrating code generation 
into the toolchains, I think.
But yeah, I think it's still a massive project.
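For comparison, the hint library you mention already gives in-process 
evaluation within a single GHC version (a minimal sketch; it assumes 
the hint package is installed and says nothing about the 
multiple-GHC-versions part):

```haskell
import Language.Haskell.Interpreter
  (as, interpret, runInterpreter, setImports)

-- Evaluate a Haskell expression held in a String at run time, inside
-- the running process, via the GHC API wrapped by hint. The result
-- type is fixed to [Int] here just to keep the sketch monomorphic.
evalIntList :: String -> IO (Either String [Int])
evalIntList expr = do
  r <- runInterpreter $ do
    setImports ["Prelude"]
    interpret expr (as :: [Int])
  pure (either (Left . show) Right r)
```

Everything beyond this (two compiler versions at once, multiple 
library versions, compiling from data rather than disk) is where the 
massive part begins.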

> Does it seem to anyone else like abstracting the library and module-
> access capabilities of compilation so that it’s polymorphic over
> where it gets its data from might be useful?
You can't usefully work with generated code unless you automate the 
translation of messages from the Haskell level to the new combined 
Haskell+specifications language.
I am not sure that such a thing is even realistically doable; even the 
Java ecosystem has been shying away from that approach, despite having 
much more manpower than Haskell's, and despite having a JVM that was 
designed and built for integrating anonymous code.
E.g. I hit bugs^W unexpected behaviour in Hibernate-generated code 
that I couldn't diagnose. The behaviour remained a mystery; the code 
generation was hidden too deeply below layers of polymorphic library 
code. So I started the bytecode disassembler and found, to my 
incredulous amazement, that Hibernate would add attributes, a 
behaviour Hibernate never documented; I had accidentally defined an 
attribute with a conflicting name, so Hibernate interfered with the 
application logic and vice versa.

Sorry for the wall of text, but it's a pretty big topic.

HTH
Jo


More information about the Haskell-Cafe mailing list