[Haskell-cafe] Files and Modules
jo at durchholz.org
Mon Dec 2 13:24:48 UTC 2024
On 02.12.24 10:16, julian getcontented.com.au wrote:
> In a recent project that compiles Haskell source from data (ie of type Text from the module Data.Text), it would be useful to be able to decouple the dependency between GHC’s notion of where modules are and the file system. This doesn’t seem to be programmatically controllable.
>
> How tenable is this?
I can report how this works in the Java world, where this exists.
TL;DR: The real issues are programmer workflows, tool integration, and
optimization, not so much semantic issues.
The mechanism itself is a non-problem there, as it was designed into
the ecosystem right from the start.
Even there, it came with a number of significant downsides.
It places pretty hefty constraints on global optimizations, inlining in
particular: You can't usefully inline if you don't know if a call will
be polymorphic because somebody added a subclass.
You either have to prevent subclassing, or forfeit cross-module
inlining, or keep track of dependencies so you can undo inlining
whenever assumptions about polymorphism are broken by new code.
Now Haskell's polymorphism is different from Java's, but I'd expect
similar issues.
In the Java world, this meant integrating the optimization phase into
the runtime system, which increases the code and memory footprint of the
JVM and incurs a heavy runtime cost during start-up, when the bulk of
the optimisations is run.
The Java world is currently swinging from dynamic code loading to static
precompilation (look for references to GraalVM if you want to know
more); however, this burdens the application programmer with defining
the dynamic loading behaviour of the class system, even though that's
done in the build specifications and not in the code itself (which comes
with its own set of problems, such as having to cross-reference code and
build specs when reasoning about code, though that affects only those
who want to control compiler behaviour tightly).
> Would it be useful for anyone else to have compilation itself be
> more first class in the language? If I think about languages such as
> LISP/Racket/Clojure, there’s a certain flexibility there that
> Haskell lacks, but it’s not apparent why, other than historical
> reasons?
These languages are extremely hard to optimize, so at least the GHC
people won't be able to follow that route.
If that's fine, then your suggestion seems doable.
> Would this imply changing compiling and linking into a different
> Monad than IO?
I can't say with confidence but I wouldn't expect that to be an issue.
A compiler typically maps special operations like this to machine code
or possibly intermediate code, as part of the optimization phase.
The selection of IO mechanism is more a programmer-facing issue.
Exactly which optimizations are applicable under which conditions does
depend on the details of the IO mechanism's semantics, so I guess nobody
will want to even touch that part of the mechanism, and people will say
that IO is fine; they'd rather modify other systems to make IO work, if
there are problems with it.
> At the moment to compile some source that exists in Text, my system
> has to write a bunch of temp files including the Text that contains
> the main module, and then put other modules in directories named a
> certain way, run the compiler across them via some exec command to
> call GHC or stack externally, then read the resulting executable
> back off disk to store it in its final destination.
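Spelled out, that workflow might look roughly like the sketch below.
It assumes a `ghc` executable on the PATH and a POSIX-style layout (no
`.exe` suffix); `compileFromText`, `moduleFile` and the `work`
directory are made-up names for illustration, not anything from your
project.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (createDirectoryIfMissing)
import System.Exit (ExitCode (..))
import System.FilePath ((</>), (<.>), takeDirectory)
import System.Process (readProcessWithExitCode)

-- | Module name and its source text (hypothetical representation).
type ModuleSource = (T.Text, T.Text)

-- | The file GHC looks for when resolving a module name:
-- dots in the name become directory separators.
moduleFile :: FilePath -> T.Text -> FilePath
moduleFile root name =
  root </> T.unpack (T.replace "." "/" name) <.> "hs"

-- | Write all modules below workDir, run GHC on Main, and read the
-- resulting executable back. Returns GHC's stderr on failure.
compileFromText :: FilePath -> [ModuleSource] -> IO (Either String BS.ByteString)
compileFromText workDir mods = do
  let srcDir = workDir </> "src"
      exe    = workDir </> "program"
  mapM_ (writeModule srcDir) mods
  (code, _out, err) <-
    readProcessWithExitCode "ghc"
      [ "-i" ++ srcDir                 -- import search path
      , "-outputdir", workDir </> "build"
      , "-o", exe
      , moduleFile srcDir "Main"
      ]
      ""
  case code of
    ExitSuccess   -> Right <$> BS.readFile exe
    ExitFailure _ -> pure (Left err)
  where
    writeModule dir (name, src) = do
      let path = moduleFile dir name
      createDirectoryIfMissing True (takeDirectory path)
      TIO.writeFile path src

main :: IO ()
main = do
  let mainSrc = "main :: IO ()\nmain = putStrLn \"hello from generated code\"\n"
  result <- compileFromText "work" [("Main", mainSrc)]
  case result of
    Left err  -> putStrLn ("compilation failed:\n" ++ err)
    Right exe -> putStrLn ("executable size: " ++ show (BS.length exe) ++ " bytes")
```

Note that the sketch deliberately keeps the `work` directory around
instead of deleting it, which matters for what follows.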
And now we're in the area of application programmer downsides.
If you make these files temporary and unavailable to the programmer for
debugging, stack traces and such become meaningless.
You'll have to add tooling for that. I.e. code that takes the stack
traces and maps them back to the specification language that you're
generating code from.
You'll have to consider the messages from that stack trace low-level,
and add a translation step that transforms the low-level semantics to
what the programmer specified, i.e. you have to know exactly what
Haskell code patterns can exist, have a full list of possible errors,
and code that does the translation.
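The location half of that translation can be sketched as a lookup
table from generated lines back to specification lines. This is only
an illustration under assumptions: `SpecLoc`, `SourceMap`,
`toSpecLoc`, `describe` and the file name `orders.spec` are all
hypothetical, and it says nothing about the much harder part, which is
translating the meaning of the messages.

```haskell
import qualified Data.Map.Strict as Map

-- | A position in the specification the code was generated from.
data SpecLoc = SpecLoc
  { specFile :: FilePath
  , specLine :: Int
  } deriving (Eq, Show)

-- | For each chunk of generated code: its first generated line,
-- mapped to the spec location the chunk was produced from.
type SourceMap = Map.Map Int SpecLoc

-- | Find the spec location of the chunk containing a generated line,
-- i.e. the entry with the greatest starting line <= the given line.
toSpecLoc :: SourceMap -> Int -> Maybe SpecLoc
toSpecLoc sourceMap genLine = snd <$> Map.lookupLE genLine sourceMap

-- | Rewrite a line reference from a tool into spec terms, falling
-- back to the raw generated line when nothing is known about it.
describe :: SourceMap -> Int -> String
describe sourceMap genLine =
  case toSpecLoc sourceMap genLine of
    Just loc -> specFile loc ++ ":" ++ show (specLine loc)
    Nothing  -> "generated code, line " ++ show genLine

main :: IO ()
main = do
  let sourceMap = Map.fromList
        [ (1,  SpecLoc "orders.spec" 3)
        , (10, SpecLoc "orders.spec" 7)
        ]
  putStrLn (describe sourceMap 12)  -- prints "orders.spec:7"
  putStrLn (describe sourceMap 0)   -- prints "generated code, line 0"
```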
Similar considerations apply to debuggers, profiling tools and whatever
other programming tools have a connection to code lines.
That's a pretty tall order, not only because of the translation step,
but because you have to integrate that translation with a multitude of
tools, some (most?) of them under active development, i.e. moving targets.
Code generators like yours typically use another technique: Generate the
code into a directory that's not under version control but part of the
module paths of all tools.
Generate code with comments that refer back to the original
specification, utilizing the programmer's knowledge to do the backwards
translation.
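A minimal sketch of such back-references, assuming a line-oriented
generator: each generated declaration gets a plain comment for the
human reader. In addition (this goes beyond plain comments), the
sketch emits GHC's LINE pragma, which makes the compiler report
locations in terms of the named file and line. `GenLine`,
`renderLine`, `renderModule` and `orders.spec` are hypothetical names.

```haskell
-- | One line of generated Haskell and the spec line it was derived from.
data GenLine = GenLine
  { fromSpecLine :: Int
  , haskellCode  :: String
  }

-- | Emit a generated line preceded by a back-reference comment for
-- the reader and a LINE pragma for GHC.
renderLine :: FilePath -> GenLine -> [String]
renderLine spec g =
  [ "-- generated from " ++ spec ++ ", line " ++ show (fromSpecLine g)
  , "{-# LINE " ++ show (fromSpecLine g) ++ " " ++ show spec ++ " #-}"
  , haskellCode g
  ]

-- | Assemble a whole module from its header and generated lines.
renderModule :: String -> FilePath -> [GenLine] -> String
renderModule modName spec gens =
  unlines (("module " ++ modName ++ " where") : concatMap (renderLine spec) gens)

main :: IO ()
main = putStr (renderModule "Orders" "orders.spec"
  [ GenLine 3 "orderLimit :: Int"
  , GenLine 3 "orderLimit = 100"
  ])
```

The pragma only helps with locations that GHC itself reports; other
tools may or may not honour it.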
In the Java world, this kind of stuff was recently integrated into the
toolchains. There's a mechanism called an "annotation processor" (please
ignore the "annotation" part, it's just the trigger for the mechanism)
which will be run by the Java compiler and generate the code into a
generated-code subdirectory; the toolchains know to include this
directory into their module paths, since pretty recently even by default.
> It might be useful to be able to do this from within Haskell code
> directly, partly similarly to how the hint library works. Though, in
> this case it would almost certainly also require being able to have
> two versions of GHC loaded at once, which would also imply being
> able to simultaneously have multiple or different versions of
> libraries loaded at once, too, and possibly also just from data, ie
> not from disk. It feels like a massive, massive project at that
> point, though, like we’d be putting an entire dependency system into
> a first-class programmable context. I’m still interested in what
> folks think about these ideas, though, even though this may
> never eventuate.
It will be less massive if you start with integrating code generation
into the toolchains, I think.
But yeah, I think it's still a massive project.
> Does it seem to anyone else like abstracting the library and module-
> access capabilities of compilation so that it’s polymorphic over
> where it gets its data from might be useful?
You can't usefully work with generated code unless you automate the
translation of messages from the Haskell level to the new combined
Haskell+specifications language.
I am not sure that such a thing is even realistically doable; even the
Java ecosystem has been shying away from that approach, despite having
much more manpower than Haskell's, and despite having a JVM that was
designed and built for integrating anonymous code.
E.g. I hit bugs^W unexpected behaviour in Hibernate-generated code that
I couldn't diagnose. The behaviour remained a mystery, the code
generation was too deeply hidden below layers of polymorphic library
code, so I started the bytecode disassembler and found, to my
incredulous amazement, that Hibernate would add attributes, a behaviour
that Hibernate never documented; I had accidentally defined an
attribute with a conflicting name, so Hibernate would interfere with
application logic and vice versa.
Sorry for the wall of text, but it's a pretty big topic.
HTH
Jo