[Haskell-cafe] Correspondence between libraries and modules

Wed Apr 25 06:57:45 CEST 2012

On 4/23/12 3:06 PM, Alvaro Gutierrez wrote:
> I see. The first thing that comes to mind is the notion of module
> granularity, which of course is subjective, so whether a single module or
> multiple ones should handle e.g. doubles and integrals is a good question;
> are there guidelines as to how those choices are made?

I'm not sure if there are any guidelines per se; that's more of a 
general software engineering problem. If you browse around on Hackage 
you'll get a fairly good idea what the norms are though. Everyone seems 
to have settled on a common range of scope, with notable exceptions 
like the containers library with far too many functions per module, and 
some of Ed Kmett's work on category theory which tends towards very few 
declarations per module.

> At any rate, why do these modules, with sufficiently-different
> functionality, live in the same library -- is it that they share some
> common bits of implementation, or to ease the management of source code?

I contacted Don Stewart (the former maintainer) to see whether he 
thought I should release the integral stuff on its own, or integrate it 
into bytestring-lexing. We agreed that it made more sense to try to 
build up a core library for lexing various common data types, rather 
than having a bunch of little libraries. He'd just never had time to get 
around to developing bytestring-lexing further; so I took over.

Eventually I plan to add rendering functions for floating point, and to 
split up the parsers for different floating point formats[1], so that it 
more closely resembles the integral stuff. But that won't be until this 
fall or later, unless someone requests it sooner.

[1] Having an omni-parser can be helpful when you want to be liberal 
about your input. But when you're writing a parser for a specified 
format, the format usually isn't that liberal, so we need to offer 
restricted lexers that such parsers can reuse.
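
To make the footnote concrete, here is a minimal sketch of the idea. It 
is not the bytestring-lexing API: the Lexer type and the names decimal, 
hexadecimal, and liberal are hypothetical, it works on String rather 
than ByteString, and it uses integers rather than floating point to stay 
short. The restricted lexers each accept exactly one format, and the 
liberal one is assembled from them.

```haskell
import Data.Char (digitToInt, isDigit, isHexDigit)
import Data.List (foldl')

-- Hypothetical lexer type for illustration: consume a prefix of the
-- input and return the value together with the unconsumed rest.
type Lexer a = String -> Maybe (a, String)

-- Restricted lexer: one or more decimal digits, nothing else.
decimal :: Lexer Integer
decimal s = case span isDigit s of
  ("", _)    -> Nothing
  (ds, rest) -> Just (foldl' step 0 ds, rest)
  where step n d = n * 10 + toInteger (digitToInt d)

-- Restricted lexer: one or more hexadecimal digits, no "0x" prefix.
hexadecimal :: Lexer Integer
hexadecimal s = case span isHexDigit s of
  ("", _)    -> Nothing
  (ds, rest) -> Just (foldl' step 0 ds, rest)
  where step n d = n * 16 + toInteger (digitToInt d)

-- Liberal lexer built from the restricted ones: "0x" followed by hex
-- digits is read as hexadecimal, anything else is tried as decimal.
liberal :: Lexer Integer
liberal s@('0':'x':t) = case hexadecimal t of
  Nothing -> decimal s
  result  -> result
liberal s = decimal s

main :: IO ()
main = do
  print (decimal "42abc")      -- Just (42,"abc")
  print (hexadecimal "ff xyz") -- Just (255," xyz")
  print (liberal "0x10")       -- Just (16,"")
```

A parser for a format that only permits decimal would call decimal 
directly and never accept "0x10" by accident, while still sharing the 
digit-handling code with the liberal version.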

>> When dealing with FFI code, because of the impedance mismatch between
>> Haskell and imperative languages like C, it's clear that there's going to
>> be some massaging of the API beyond simply declaring FFI calls. As such,
>> clearly we'd like to have separate modules for doing the low-level binding
>> vs presenting a high-level API. Moreover, depending on what you're
>> interfacing with, you may be forced to have multiple low-level modules.
>
> Ah, that's a good use case. Is the lower-level module usually made "public"
> as well, or is it only an implementation detail?

Depends on the project. For ByteStrings, most of that is hidden away as 
implementation details. For binding to C libraries, I think the current 
advice is to offer the low-level interface so that if there's something 
the high-level interface can't handle well, people have some easy recourse.
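
As a rough sketch of that split (not taken from any particular package), 
here is a binding to C's strlen. The names c_strlen and cStringLength 
are made up for illustration, and in a real package the two layers would 
normally live in separate modules, with the low-level one either exposed 
or hidden depending on the project.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CSize (..))

-- Low-level layer: a direct binding that mirrors the C signature.
-- Callers must manage the CString themselves.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- High-level layer: marshals a Haskell String to a temporary C string,
-- calls the binding, and converts the result to an ordinary Int.
cStringLength :: String -> IO Int
cStringLength s = withCString s (fmap fromIntegral . c_strlen)

main :: IO ()
main = cStringLength "hello" >>= print
```

If the high-level wrapper does the wrong thing for some caller (say, 
they already hold a CString and don't want the extra copy), exporting 
c_strlen gives them a way out without forking the package.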

>> On the other hand, the main purpose of packages or libraries is as unit of
>> distribution, code reuse, and separate compilation. Even with the Haskell
>> culture of making small libraries, most worthwhile units of
>> distribution/reuse/compilation tend to be larger than a single
>> namespace/concern. Thus, it makes sense to have more than one module per
>> package, because otherwise we'd need some higher level mechanism in order
>> to manage the collections of package-modules which should be considered a
>> single unit (i.e., clients will almost always want the whole bunch of them).
>
> This is the part that I'm trying to get a better sense of. I can see how in
> some cases, it makes sense for more than one module to form a unit, because
> they are tightly coupled semantically or implementation-wise -- so clients
> will indeed want the whole bunch. On the other hand, several libraries
> provide modules that are all over the place, in a way that doesn't form a
> "unit" of any kind (e.g. MissingH), and it's not clear that you would want
> any Network stuff when all you need is String utilities.

Yeah, MissingH and similar libraries are just grab-bags full of stuff. 
Usually grab-bag libraries think of themselves as place-holders, with 
the intention of breaking things out once there's something of a large 
enough size to warrant being its own package. (Whether the breaking out 
actually happens is another matter.) But to get the general sense of 
things, you should ignore them.

Instead, consider one of the parsing libraries like uu-parsinglib, 
attoparsec, parsec, frisby. There are lots of pieces to a parsing 
framework, but it makes sense to distribute them together.

Or, consider one of the base libraries for iteratees, enumerators, 
pipes, conduits, etc. Like parsing, these offer a whole framework. You 
won't usually need 100% of it, but everyone needs a different 80%.

Or to mention some more of my own packages, consider stm-chans, 
unification-fd, or unix-bytestring. In unification-fd, the stuff 
outside of Control.Unification.* could be moved elsewhere, but the stuff 
within it makes sense to split up yet distribute together. For 
stm-chans because of the similarity in interfaces, use cases, etc, it'd 
be peculiar to want to separate them into different packages. In 
unix-bytestring I separated off the Iovec stuff (FFI implementation 
details) from the main API, but clearly they must go together.

> But the way you describe it, it seems that despite centralization having
> those disadvantages, it is more or less the way the system works, socially
> (egos, bad form, etc.) and technically (because of the lack of compiler
> support)

There's a difference between centralization and communalization.

With centralization there's a central authority who makes all the rules 
and (usually) enforces them. This is the benevolent dictator model 
common in open-source. The problem is: what do you do if the dictator 
goes missing (gets hit by a bus, is too busy this semester, etc)?

With communalization, there's no central authority that writes/enforces 
the laws; instead, the community as a whole will come to agree on the 
norms. This is the way societies often operate (i.e., societies as 
cultures, rather than as governments). In virtue of the social 
interaction, things come to be a particular way, but there isn't 
necessarily any person or committee that decided it should be that way. 
Moreover, in order to disrupt the norms it's not enough to dispose of a 
dictator; you need some wide-scale way of disrupting the network of 
social interaction. The problem here is that it can be very hard to 
steer a community. If you've identified a problem, it's not clear how to 
get it fixed (whereas a dictator could just issue a fiat).

In practice, every organization has a bit of both models; it's just a 
question of how much of each, and in what contexts. The Haskell 
community is more centralized when it comes to things like the Haskell 
Report and the Haskell Platform, because you really need it there. 
Whereas Hackage and the Cafe are more of your standard social community.

> except that it is ad-hoc instead of mechanically enforced. In
> other words, I don't see what the advantages of allowing ambiguity
> currently are.

If you mechanically enforce things then you will find clashes. That's 
not the problem: clashes exist, you find them, whatever. The problem is: 
now that you've found it, how are you going to resolve it?

You can't just make Hackage refuse packages which would cause a module 
name conflict. If you try then you'll get angry developers who just 
leave or who badmouth Haskell (or both), which does no good for anyone. 
You have to have an escape hatch, some way for people to raise 
legitimate issues such as "the conflictor hasn't been maintained in five 
years and has no users", or "I wrote the old package and this new 
package is meant to supersede it", etc. But now you need to have a group 
of people who work on resolving those issues and making those 
case-by-case decisions about how conflicts should be resolved.

Allowing clashes saves you from needing that group of people. If you 
allow clashes, there are no developer complaints to be resolved. A lot 
of resources are tied up in making those central authority groups, and 
by not having such a central authority we free up those resources to be 
used elsewhere.

In cases like Perl's CPAN and Linux distros, they have enough resources 
that they can afford the overhead cost to create and maintain such 
groups. In addition, they're large enough that the resources for that 
group don't necessarily diminish the resources for other things. E.g., 
some members of the Linux developer community are no good at 
programming, but they're great at social organization. If you have a 
central authority group, they can contribute to that and thereby provide 
resources; vs, if there's no such group, they're unlikely to offer 
programming time or other resources instead.

Whereas for small communities: overhead costs are higher proportionally, 
and small communities aren't able to gather as many resources to cover 
them. In addition, the person who could offer social organization is 
probably already offering other resources which she wouldn't be able to 
offer if she moved over to helping the central authority; so you're 
closer to a zero-sum game of needing to decide how to allocate your 
scarce resources.

> Ah, interesting. So, perhaps I misunderstand, but this seems like an
> argument in favor of having uniquely-named modules (e.g. Foo.FD and
> Foo.TF) instead of overlapping ones, right?

Yeah, probably.

I mean, ideally I'd like to see GHC retooled so that both fundeps and 
type families actually compile down to the same code, and one is just 
sugar for the other (or both are sugar for some third thing). Then we'd 
get rid of the real problem of there being multiple incompatible ways of 
doing the same thing. Until then, it's probably better to just pick one 
approach for each project, rather than trying to maintain parallel forks 
for each approach. But if you're going to maintain parallel forks, then 
it's probably best to not do the module punning thing.
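
For anyone who hasn't seen the two side by side, here is a small sketch 
of the same interface written both ways. The class and method names 
(ContainerFD, ContainerTF, Elem, and so on) are invented for this 
example and don't come from any library. The point is only that the two 
encodings express the same relation between a container type and its 
element type, yet client code written against one doesn't work against 
the other.

```haskell
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FunctionalDependencies #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies #-}

-- Fundep encoding: the container type c determines the element type e.
class ContainerFD c e | c -> e where
  emptyFD  :: c
  insertFD :: e -> c -> c
  toListFD :: c -> [e]

instance ContainerFD [a] a where
  emptyFD  = []
  insertFD = (:)
  toListFD = id

-- Type family encoding: the element type is computed from c by Elem.
class ContainerTF c where
  type Elem c
  emptyTF  :: c
  insertTF :: Elem c -> c -> c
  toListTF :: c -> [Elem c]

instance ContainerTF [a] where
  type Elem [a] = a
  emptyTF  = []
  insertTF = (:)
  toListTF = id

main :: IO ()
main = do
  print (toListFD (insertFD 'x' (emptyFD :: String)))
  print (toListTF (insertTF 'x' (emptyTF :: String)))
```

If these lived in two packages, giving the modules distinct names along 
the lines of Foo.FD and Foo.TF would at least let a client see which 
encoding it is importing.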

-- 
Live well,
~wren