[Haskell-cafe] Re: Seeking advice on a style question

Sun Dec 31 14:11:47 EST 2006

>> In summary, I think that the dependencies on the pagemaster are not
>> adequate, he mixes too many concerns that should be separated.
>
> True, but then that's even more miscellaneous bits and pieces to carry
> around. I guess what makes me uncomfortable is that when I'm writing
> down a function like process1 (not its real name, as you might imagine),
> I want to concentrate on the high-level data flow and the steps of the
> transformation. I don't want to have to exposes all of the little bits
> and pieces that aren't really relevant to the high-level picture.
> Obviously, in the definitions of the functions that make up process1,
> those details become important, but all of that should be internal to
> those function definitions.

Yes, we want to get rid of the bits and pieces. Your actual code is
between two extremes that both manage to get rid of them. One extreme is
the "universal" structure like you already noted:

> Alternatively, I can wrap all of the state up into a single universal
> structure that holds everything I will ever need at every step, but
> doing so seems to me to fly in the face of strong typing; at the early
> stages of processing, the structure will have "holes" in it that don't
> contain useful values and shouldn't be accessed.

Currently, (pagemaster) has tendencies to become such a universal beast.
The other extreme is the one I favor: the whole pipeline is expressible
as a chain of function compositions via (.). One should be able to write

  process = rectangles2pages . questions2rectangles

This means that (rectangles2pages) comes from a (self written) layout
library and that (questions2rectangles) comes from a question formatting
library and both concern are completely separated from each other. If
such a factorization can be achieved, you get clear semantics, bug
reduction and code reuse for free.

Of course, the main problem is: the factorization does not arise by
coding, only by thinking. Often the situation is as following and I for
myself encounter it again and again: one starts with an abstraction
along function composition but it quickly turns out, as you noted, that
"there are some complicated reasons why that doesn't work". To get
working code, one creates some miniature "universal structure" that
incorporates all the missing data that makes the thing work. After some
time, the different concerns get more and more intertwined and soon,
every data depends on everything else until the code finally gets
unmaintainable, it became "monolithic".

What can be done? The original problem was that the solutions to the
originally separated concerns (layout library and questions2rectangles)
simply were not powerful, not general enough. The remedy is to
separately increase the power and expressiveness of both libraries until
the intended result can be achieved by plugging them together.
Admittedly, this is not an easy task. But the outcome is
rewarding: by thinking about the often ill-specified problems, one
understands them much better and it most often turns out that some
implementation details were wrong and so on. In contrast, the ad-hoc
approach that introduces miniature "universal structures" does not make
the libraries more general, but tries to fit them together by appealing
to the special case, the special problem at hand. In my experience, this
only makes things worse.
The point is: you have to implement the functionality anyway, so you may
as well grab some free generalizations and implement it once and for all
in an independent and reusable library.

I think that the following toy example (inspired by a discussion from
this mailing list) shows how to break intertwined data dependencies:

  foo :: Keyvalue -> (Blueprint, Map') -> (Blueprint', Map)
  foo x (bp,m') = (insert x bp, uninsert x bp m')

The type for (foo) is much too general: it says that foo may mix the
(Blueprint) and the (Map') to generate (Blueprint'). But this is not the
case, the type for foo introduces data dependencies that are not present
at all. A better version would be

  foo' :: Keyvalue -> Blueprint -> (Blueprint', Map' -> Map)
  foo' x bp = (insert x bp, \m' -> uninsert x bp m')

Here, it is clear that the resulting (Map) depends on (blueprint) and
(Map'), but that the resulting (Blueprint') does not depend on (map').
The point relevant to your problem is that one can use (foo') in more
compositional ways than (foo) simply because the type allows it. For
instance, you can recover (insert) from (foo'):

  insert :: Keyvalue -> Blueprint -> Blueprint'
  insert x bp = fst $ foo' x bp

but this is impossible with (foo).*

In the original problem, the type signature for (foo') was that best one
could get. But here, the best type signature is of course

  foo'' :: ( Keyvalue -> Blueprint -> Blueprint'
           , Keyvalue -> Blueprint -> Map' -> Map )
  foo'' = (insert, uninsert)

because in essence, (foo) is just the pair (insert, uninsert).

One morale from the above example is that functions returned as result
(as in the signature of (foo')) are your friends when tackling the
problem of making libraries more expressive while keeping them independent.

In summary, I think that your question about style of pipelines roots in
questions far deeper and I think that the "high level only" wish is an
illusion: you simply have to write down every dependency you introduce,
there is no way around this law of nature. But IMHO and compared to
imperative languages, Haskell is the first programming language that
really offers the possibility to specify data dependencies exactly as
they are because Haskell is pure, higher order and has a powerful type
system.

Concerning your code, I wish to thank you for its detailed
explanation. The post already got quite long, so I'm adding only some
remarks. Of course, they are my personal opinion and you don't need to
incorporate or comment on them, because it's your code after all.

>>>> process :: Item -> MediaKind -> MediaSize -> Language -> SFO
> The reason it's just "Item" is that it can be a number of different
> things. It can be a full-blown questionnaire, composed of a number of
> questions, but it could also be just one question (sometimes the users
> want to see what a question layout looks like before okaying its
> inclusion into the questionnaire stream). The functions are overloaded
> to handle the various different kinds of Items.

If there are only the cases of some single question or a full
questionnaire, you could always do

    blowup :: SingleQuestion -> FullQuestionaire
    preview = process (blowup a_question) ...

In general, I think that it's the task of (process) to inspect (Item)
and to plug together the right steps. For instance, a single question
does not need page breaks or similar. I would avoid overloading the
(load*) functions and (paginate) on (Item).

> A pagemaster defines the sizes and locations of the various parts of the
> page (top and bottom margins, left and right sidebars, body region), as
> well as the content of everything except the body region (which is where
> the questions go). [...]
> 
> The pagemaster also contains a couple of other bits of information that
> don't fit neatly anywhere else (discussed below).

As you may guess, I'd throw out these other bits from (pagemaster) and
reserve him for arranging rectangles on a page only. I suspect that he
can be fully absorbed by (paginate) afterwards for (buildLayout) does
not use it (?).

>> Maybe one should write
>>  filter willBeDisplayedQuestion $
>> instead, but I think the name 'stripUndisplayedQuestions' says it all.
> 
> Sure. "stripUndisplayedQuestions" is indeed just a simple filter.

Writing (filter willBeDisplayedQuestion) has the minor advantage that it
is absolutely clear that this step in the pipeline will only filter
stuff. The name (stripUndisplayedQuestions) suggests that, too, but
names are no proofs and the type does not prove it either in this case.

>>>> appendEndQuestions :: Item -> Pagemaster -> [Question] -> [Question]
> End questions are questions that are inserted automagically at the end
> of (almost) every questionnaire. [...] It may seem like it would
> be better stored in the questionnaire itself, but there are some
> complicated reasons why that doesn't work. Obviously, it would be
> possible to rearrange the data after it is retrieved from the database,
> although I'm not sure that there would be a net simplification.

I'd go for a rearrange because my experience is that while taking over
foreign data structures eases import, it most often makes the actual
algorithm extremely cumbersome. The algorithm dictates the data structure.

Btw, the special place "end" suggests that the "question markup
language" does not incorporate all of: "conditional questions",
"question groups", "group templates"? Otherwise, I'd just let the user
insert

   <if media="print">
      <template-instance ref="endquestions.xml" />
   </if>

at the end of every questionnaire. If you use such a tiny macro language
(preferably with sane and simple semantics), you can actually merge
(stripUndisplayedQuestions) and (appendEndQuestions) into a function
(evalMacros) without much fuss. I think that this will even make the
code simpler. Numbering and cross-references could be implemented as
macro expansion, too. Perhaps it is also advisable to do
(validateQuestionContent) before macro expansion. And, best of all, the
macro language is completely independent of the question formatting
task, you can easily outsource this into a library.

>>>> coalesceParentedQuestions :: [Question] -> [Question]
> [...]
> Some questions are composed of multiple sub-questions that are treated
> as separate questions in the database. Because the people who created
> and maintain the database have difficulty fully grasping the concept of
> trees (or hierarchies in general, actually), I have to jump through a
> few hoops here and there to massage the data into something meaningful.
> 
> While it's true that a parent question looks superficially like a tree
> of child questions, there's more to it than that; the visual layout of
> the parent question is not generated by a simple traversal over its
> children, for example. So, for all of the processing that follows, a
> parent question (one with child questions) looks just like any other
> question, and any parent question-specific details remain hidden inside.

Again, I'd say that the algorithm and now more than ever the meaning
dictates the data structure. Assuming that processing children of
different parents is independent and that processing children of the
same parent is *not* independent, I'd group "families" right together in
a data structure. Whether it's a simple traversal (I interpret this as
"independent"?) or not, at some point you have to mess with the whole
group at once anyway, so you can put it together right now.

>>>> validateQuestionContent :: [Question] -> [Question]
>> Uh, I think the type is plain wrong. Doesn't the name suggest 'Question
>> -> Bool' and a fatal error when a question content is invalid?
> 
> No. The idea is to never fail to assemble the questionnaire. If there is
> a question with invalid content, then it is replaced by a dummy question > [...]

Ah, of course you are right, I didn't think of enhanced error
processing. I guess that (validateQuestionContent) is not a filter,
because you have to check "non-local" parent-child relations as well? If
so, then I suggest grouping them beforehand to make it a filter.

>>>>      (numberedQuestions,questionCategories) = numberQuestions pagemaster questions;
> Another piece of miscellaneous information contained within the
> pagemaster is the starting question number.

You can still automatically "number" questions in dependence of a first
number by overloading the (Num) class:

   newtype RelativeInteger = RI { unRI :: Integer -> Integer }
   instance (Num RelativeInteger) where ...

   mkAbsolute :: Integer -> RelativeInteger -> Integer
   mkAbsolute pointOfReference relint = unRI relint pointOfReference

> (Some questionnaires start
> with a question number other than 1 because there is a post-processing
> step where various "front ends" are pasted onto variable "back
> ends"--another example of where a hierarchical approach would have made
> more sense, but couldn't be adopted because the database people couldn't
> cope.)

Uh, that doesn't sound good. I assume that the post-processing is not
implemented in Haskell? Otherwise, you could incorporate this stuff into
(process) and choose suitable interfaces. IMHO, dealing with some
modestly expressive interface which still only offers medium abstraction
(like object orientation) is a pain in a type system as powerful as
Haskell's.

>  bands' = resolveCrossReferences bands questionCategories;
>
> Questions are cross-referenced by question number. For example, question
> 4 might be in the "Sales" category, while question 22 might be "Detailed
> Sales." The last item of question 22 might be "Total; should equal the
> value reported in (4)." In order to make the layouts as reusable as
> possible, rather than hard-coding "(4)" in that last item in (22), there
> is a tag that looks something like this:
> 
>> <text>Total; should equal the value reported in <question-ref category="Sales"/>.</text>

Fine, though I don't see exactly why this isn't done before after the
questions have been transformed to printable things but before there are
distributed across pages. So the references cannot refer to page
numbers, yet must be processed after transforming questions to rectangles?

>>>>      groupedBands = groupBands bands';
>> (can't guess on that)
> 
> In order to implement widow/orphan control, not every band is allowed to
> start a new page ("keep with previous" and "keep with next," in effect).
> Before being handed off to the paginator, the bands are grouped so that
> each group of bands begins with a band that _is_ allowed to start a
> page, followed by the next n bands that aren't allowed to start a page.
> Each grouped band is then treated by the paginator as an indivisible
> entity. (At this point, the grouped bands could be coalesced into single
> bands, but doing so adds a bit of unnecessary overhead to the rendering
> phase.)

Maybe (paginate) can be given a type along the lines of

   paginate :: Rectangle a => [a] -> Pages a

and perhaps you could merge several bands into a single rectangle simply
by saying

   instance Rectangle [Band] where ...

To conclude, I think that (process) can be roughly factorized as follows:

   process = buildPages . questions2rectangles . expandMacros

Now, you get 2/3 of TeX or another desktop publishing system for free,
you only have to replace (questions2rectangles) by (text2rectangles).

Regards,
apfelmus

Footnote:
* Well, it is possible to "recover" insert, but only by introducing a
contradiction into the logic of types with the help of (undefined):
    insert x bp = foo x (bp, (undefined :: map'))
This is clearly unsafe and heavily depends on the implicit knowledge
that the returned (BluePrint') ignores its arguments.