Summary and call for discussion on text proposal

Sun Nov 7 09:36:35 EST 2010

Thomas and I would like to summarise the current point of contention
in the text library proposal with the aim of resolving the issue and
getting the package accepted.

It seems clear that we all want the package accepted; the disagreement
is over details of the API. The problem here is not the amount of work
needed to make the changes some people have been suggesting; the
problem is disagreement over whether change is necessary and, if so,
what change.

There is essentially just one point of contention, over about 10 out
of the 80+ functions in the Data.Text module. The issue is about which
functions should get the nice names and about consistency between
modules. (There is one other minor issue that Ross raised, but we will
deal with the most substantive issue first.)

There are two axes along which Text functions are generalised:
  * character predicate  (e.g. searching for first char matching a predicate)
  * substring            (e.g. searching for a substring)

These are orthogonal directions of generalisation. There is no simple
way to encompass both (regular expressions are not simple, and naive
generalisations cannot be implemented efficiently).
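
One possible reading of "naive generalisation" is a single function
whose predicate inspects the whole remaining input. The sketch below is
only an illustration under that assumption: it uses String from base as
a stand-in for Text, and the names breakWhere, breakOnChar and
breakOnSub are hypothetical, not part of any proposed API.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical generalisation: split at the first position where the
-- predicate holds for the remaining input.
breakWhere :: (String -> Bool) -> String -> (String, String)
breakWhere p = go []
  where
    go acc rest
      | p rest = (reverse acc, rest)
    go acc []       = (reverse acc, [])
    go acc (c:rest) = go (c : acc) rest

-- Character predicate form: look only at the head of the remaining input.
breakOnChar :: (Char -> Bool) -> String -> (String, String)
breakOnChar p = breakWhere headMatches
  where
    headMatches (c:_) = p c
    headMatches []    = False

-- Substring form: test the needle as a prefix at every position.
breakOnSub :: String -> String -> (String, String)
breakOnSub needle = breakWhere (needle `isPrefixOf`)
```

Both forms can be recovered from breakWhere, but the predicate is
opaque to the implementation. It has to be re-run at every position,
and the needle cannot be preprocessed the way a dedicated substring
search can.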

The fact that there are these two forms of most functions is different
from the List library which only has the element predicate direction,
not the sub-sequence direction. This is the prelude to the problem,
because the List library has already taken the common names for the
character predicate versions.

The design of the Text library encourages the use of substring
operations because these are expected to be more commonly used and
because correct handling of Unicode often requires substring
operations (due to issues with combining characters).
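
A small illustration of the combining-character point, again using
String from base as a stand-in for Text so that the sketch needs no
extra packages. The helper breakSub is hypothetical. The word "entrée"
is written in decomposed form, with the accented letter stored as 'e'
followed by U+0301 (combining acute accent, '\769' in decimal).

```haskell
import Data.List (isPrefixOf)

-- Hypothetical substring break on String: split at the first occurrence
-- of the needle, keeping the needle at the start of the second part.
breakSub :: String -> String -> (String, String)
breakSub needle = go []
  where
    go acc rest
      | needle `isPrefixOf` rest = (reverse acc, rest)
    go acc []       = (reverse acc, [])
    go acc (c:rest) = go (c : acc) rest

-- "entrée" with the accent as a separate combining character.
decomposed :: String
decomposed = "entre\769e"

-- A predicate for the precomposed U+00E9 ('\233') never matches here.
byPrecomposedChar :: (String, String)
byPrecomposedChar = break (== '\233') decomposed
-- ("entre\769e", "")

-- A predicate for plain 'e' stops at the first unaccented 'e', because
-- a Char predicate sees one code point at a time.
byBaseChar :: (String, String)
byBaseChar = break (== 'e') decomposed
-- ("", "entre\769e")

-- The substring version can name the whole two-code-point sequence.
bySubstring :: (String, String)
bySubstring = breakSub "e\769" decomposed
-- ("entr", "e\769e")
```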

There are a number of options. To illustrate them let us pick an
example function that breaks a text into two. There are two versions:
  * break based on a character predicate
  * break based on a substring

Option 1 (current Text lib design)
----------------------------------

break   :: Text           -> Text -> (Text, Text)
breakBy :: (Char -> Bool) -> Text -> (Text, Text)

This gives the short name 'break' to the substring version, and the
longer name 'breakBy' to the character predicate version.

The argument for doing this is that the substring version should be
the common encouraged one and so it should get the nice name.

The argument against is that this is inconsistent with the List
library which gives the name 'break' to the element predicate version:

break :: (a -> Bool) -> [a] -> ([a], [a])
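
To make the inconsistency concrete, here is a minimal sketch of the
Option 1 names modelled on String rather than Text. The definitions are
stand-ins written for this illustration, not the text package's
implementation.

```haskell
import Prelude hiding (break)
import qualified Data.List as L
import Data.Char (isSpace)

-- Option 1 naming: the short name is the substring version.
break :: String -> String -> (String, String)
break needle = go []
  where
    go acc rest
      | needle `L.isPrefixOf` rest = (reverse acc, rest)
    go acc []       = (reverse acc, [])
    go acc (c:rest) = go (c : acc) rest

-- The character predicate version takes the longer name.
breakBy :: (Char -> Bool) -> String -> (String, String)
breakBy = L.break

keyValue :: (String, String)
keyValue = break "=>" "key=>value"
-- ("key", "=>value")

firstWord :: (String, String)
firstWord = breakBy isSpace "hello world"
-- ("hello", " world")
```

With these names the list idiom `break isSpace "hello world"` no longer
type-checks, since 'break' now expects a needle rather than a
predicate. Someone carrying over habits from the List library has to
reach for 'breakBy' instead.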

Option 2
--------

breakSubstring :: Text           -> Text -> (Text, Text)
break          :: (Char -> Bool) -> Text -> (Text, Text)

This gives the short name 'break' to the character predicate version
and the longer 'breakSubstring' to the substring version.

The argument for doing this is that it is consistent with the List
library in its use of the name 'break'.

The argument against is that the short name is now given to the
version that is discouraged, and the version that is encouraged now
has a very long and ugly name: this API is encouraging users to make
the wrong choices.

Decisions
---------

There appears to be no consensus over which of these two options to
pick. If this situation persists then the default position is for the
package not to be accepted at all. We think there is consensus that
the package should go into the platform in some form -- that the worst
of all the options is for the package not to go in at all.

We are now at the third stage of the consensus protocol. At this stage
discussion should be limited to resolving one concern at a time.
Anyone may contribute to this discussion. The people required to take
part are the proposal author (Don) and anyone who has concerns with
the package going in as is.

People with concerns should restate those concerns and if necessary
questions should be asked to clarify the concerns. In particular, if
the summary above is not an accurate expression of people's concerns
then they should say so.

Don will update the wiki proposal page with the details of the
remaining concerns (or simply the summary above if this is accurate).

The steering committee (in this case Thomas and I) will follow the
discussion. If we are still stuck in one week (14th Nov) then the
steering committee will re-evaluate the situation.

To kick off the discussion focused on this narrow issue, Thomas and I
would like to suggest a third option:

Option 3
--------

breakStr :: Text           -> Text -> (Text, Text)
breakChr :: (Char -> Bool) -> Text -> (Text, Text)

This gives neither version the short name 'break', but gives both
reasonably short names with a suffix to indicate the character
predicate vs substring.
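
As a rough sketch of how call sites might read under Option 3, the two
names are modelled below on String rather than Text. The bodies are
stand-ins for illustration only, and the example inputs are made up.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- Option 3 naming: substring version.
breakStr :: String -> String -> (String, String)
breakStr needle = go []
  where
    go acc rest
      | needle `isPrefixOf` rest = (reverse acc, rest)
    go acc []       = (reverse acc, [])
    go acc (c:rest) = go (c : acc) rest

-- Option 3 naming: character predicate version, the same shape as
-- the List library's 'break'.
breakChr :: (Char -> Bool) -> String -> (String, String)
breakChr = break

-- Split a header line at the first ": " separator.
headerParts :: (String, String)
headerParts = breakStr ": " "Content-Type: text/plain"
-- ("Content-Type", ": text/plain")

-- Split at the first whitespace character.
firstWord :: (String, String)
firstWord = breakChr isSpace "text/plain; charset=utf-8"
-- ("text/plain;", " charset=utf-8")
```

Each call site states which form is meant, and neither name collides
with 'break' from the Prelude or Data.List.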

This addresses the complaint that a name from the List library is
being used but with an inconsistent type (because the name is not
being used at all).

It removes the problem that the character predicate versions are being
promoted over the substring versions by the use of the shorter names.

It makes explicit the fact that all the functions come in two forms,
whereas with List there is just one form.

There is still a strong connection with the List library by using the
same root names. So people can still carry over their experience of
the List library API to help them find the right Text functions. They
will have to make one choice between the character predicate and
substring versions, which is reasonable given that the substring
versions are preferred.

Duncan & Thomas
(with their platform steering committee hats on)