Haskell Platform Proposal: add the 'text' library

Don Stewart dons at galois.com
Tue Sep 7 11:26:36 EDT 2010


= Proposal: Add Data.Text to the Haskell Platform =

Maintainer: Bryan O'Sullivan (submitted with his approval)

== Introduction ==

This is a proposal for the 'text' package to be included in the next
major release of the Haskell platform.

An up-to-date copy of this text is kept at:

    http://trac.haskell.org/haskell-platform/wiki/Proposals/text

Everyone is invited to review this proposal, following the standard
procedure for proposing and reviewing packages.

    http://trac.haskell.org/haskell-platform/wiki/AddingPackages

Review comments should be sent to the libraries mailing list by
October 1 so that we have time to discuss and resolve issues
before the final deadline on November 1.

    http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable 

== Credits ==

Proposal author and package maintainer: Bryan O'Sullivan. The package
was originally written by Tom Harper, and is based on the ByteString and
Vector (fusion) packages.

The following individuals contributed to the review process: Don
Stewart, Johan Tibell.

== Abstract ==

The 'text' package provides an efficient packed, immutable Unicode text type
(both strict and lazy), with a powerful loop fusion optimization framework.

The 'Text' type represents Unicode character strings in a time- and
space-efficient manner. This package provides text processing
capabilities that are optimized for performance-critical use, both
in terms of large data quantities and high speed.

The package provides Unicode-aware, type-safe case conversion via
whole-string case conversion functions. It also provides a range of
functions for converting 'Text' values to and from 'ByteString's, using
several standard encodings (see the 'text-icu' package for a much larger
variety of encoding functions).

Efficient, locale-sensitive text IO is also supported.
 
This module is intended to be imported qualified, to avoid name
clashes with Prelude functions, e.g.
 
    import qualified Data.Text as T
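
As a rough illustration of the qualified import and the ByteString
conversions mentioned above, the sketch below round-trips a value through
UTF-8. The sample value 'greeting' is made up for this example; only
functions from Data.Text, Data.Text.Encoding and Data.ByteString are used.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical sample value. The non-ASCII characters are written as
-- escapes (U+00E9 and U+20AC) so the source file itself stays ASCII.
greeting :: T.Text
greeting = T.pack "caf\233 \8364 5"

-- Encode to UTF-8 bytes and decode again.
roundTrip :: T.Text -> T.Text
roundTrip = E.decodeUtf8 . E.encodeUtf8

main :: IO ()
main = do
  print (T.length greeting)                -- characters: 8
  print (B.length (E.encodeUtf8 greeting)) -- UTF-8 bytes: 11
  print (roundTrip greeting == greeting)   -- True
```

Note that 'decodeUtf8' throws an exception on invalid input, so it is
only appropriate here because the bytes were produced by 'encodeUtf8'.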

Documentation and a tarball are available from the Hackage page:

    http://hackage.haskell.org/package/text

Development repo:

    darcs get http://code.haskell.org/text/

== Rationale ==

While Haskell's Char type is capable of representing Unicode code points,
the String sequence of such Chars has some drawbacks that prevent its
general use:

 1. case conversion is Unicode-unaware (map toUpper is an unsafe case
    conversion)
 2. the representation is space inefficient
 3. the data structure is element-level lazy, whereas a number of
    applications require some level of additional strictness

An intermediate solution was 'Data.ByteString' (an efficient byte
sequence type that addresses points 2 and 3), which, when used in
conjunction with utf8-string, provides very simple non-latin1 encoding
support (though with significant drawbacks in terms of locale and
encoding range).

The 'text' package addresses these shortcomings in a number of ways:

 1. support for whole-string case conversion (thus, type-correct Unicode
    transformations)
 2. a space and time efficient representation, based on unboxed Word16
    arrays
 3. either fully strict, or chunk-level lazy data types (in the style of
    Data.ByteString)
 4. full support for locale-sensitive, encoding-aware IO.
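
To make the first point concrete, here is a small sketch contrasting
per-character conversion on String with whole-string conversion on Text.
The helper names 'upperString' and 'upperText' are invented for this
example. The sample word contains U+00DF (German sharp s), whose upper
case form is two characters long and so cannot be produced by a
Char -> Char function.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character conversion over String: each Char maps to exactly one Char.
upperString :: String -> String
upperString = map toUpper

-- Whole-string conversion over Text: one character may expand to several.
upperText :: T.Text -> T.Text
upperText = T.toUpper

main :: IO ()
main = do
  let word = "stra\223e"   -- \223 is U+00DF
  print (upperString word)                     -- "STRA\223E" (sharp s unchanged)
  print (T.unpack (upperText (T.pack word)))   -- "STRASSE"
  print (T.length (upperText (T.pack word)))   -- 7, one longer than the input
```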

The 'text' library has rapidly become popular for a number of
applications, and is used by more than 50 other Hackage packages. As of
Q2 2010, 'text' is ranked 27/2200 libraries (roughly the top 1% most
popular), and it is particularly popular in web programming. It is used by:

 * the blaze html pretty printing library
 * the hstringtemplate file templating library
 * *all* popular web frameworks: happstack, snap, salvia and yesod
 * the hexpat and libxml xml parsers

The design is based on experience from Data.Vector and Data.ByteString:
 
 * the underlying type is based on unpinned, packed arrays on the Haskell heap
   with an ST interface for memory effects.
 * pipelines of operations are optimized via conversion to and from the
   'stream' abstraction[1]
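
The kind of pipeline the second point refers to might look like the
sketch below. The function 'countLetters' is a made-up example. Whether
the intermediate Text values are actually fused away depends on the
library version and on compiling with optimization, so this shows only
the shape of code that the framework targets; the result is the same
either way.

```haskell
import Data.Char (isAlpha)
import qualified Data.Text as T

-- A pipeline of three Text operations. Written naively, each stage would
-- allocate a new Text; the fusion framework aims to run such a chain as a
-- single loop over the input.
countLetters :: T.Text -> Int
countLetters = T.length . T.filter isAlpha . T.toUpper

main :: IO ()
main = print (countLetters (T.pack "Hello, World! 123"))  -- 10
```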

== The API ==

The API is broken into several logical pieces, which are
self-explanatory:

 * combinators for operating on strict, abstract 'text's.
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text.html

 * an equivalent API for chunk-level lazy 'text's.
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy.html

 * encoding transformations, to and from bytestrings:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html

 * support for conversion to Ptr Word16:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Foreign.html

 * locale-aware IO layer:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-IO.html
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy-IO.html
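
A minimal sketch of how the strict, lazy and IO modules fit together is
given below. The values 'chunks' and 'lazyDoc' are invented for this
example; the functions used ('fromChunks', 'toChunks', 'concat',
'putStrLn') are assumed to behave as in the documentation linked above.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Lazy as TL

-- Hypothetical input: three strict chunks.
chunks :: [T.Text]
chunks = map T.pack ["Haskell ", "Platform ", "proposal"]

-- A lazy Text is, conceptually, a lazy list of strict chunks.
lazyDoc :: TL.Text
lazyDoc = TL.fromChunks chunks

main :: IO ()
main = do
  print (length (TL.toChunks lazyDoc))           -- 3
  print (TL.length lazyDoc)                      -- 25
  -- Collapse the chunks into one strict Text and write it using the
  -- locale-aware IO layer.
  TIO.putStrLn (T.concat (TL.toChunks lazyDoc))
```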

== Design decisions ==

 * IO and pure combinators are in separate modules.
 * Both a fully strict and a partially strict type are provided.
 * The underlying optimization framework is stream fusion (not foldr/build), and is hidden from the user.
 * Unpinned arrays are used, to prevent fragmentation.
 * Large numbers of additional encodings are delegated to the text-icu package.
 * An 'IsString' instance is provided for String literals.
 * The implementation is OS and architecture neutral (portable).
 * The implementation uses a number of language extensions:

    CPP
    MagicHash
    UnboxedTuples
    BangPatterns
    Rank2Types
    RecordWildCards
    ScopedTypeVariables
    ExistentialQuantification
    DeriveDataTypeable

 * The implementation is entirely Haskell (no additional C code or libraries).
 * The package provides a QuickCheck/HUnit testsuite, and coverage data.
 * The package adds no new dependencies to the HP.
 * The package builds with Cabal's Simple build type.
 * There is no existing functionality for packed unicode text in the HP.
 * The package has complexity annotations.

== Open issues ==

The text-icu package is not part of this proposal.

== Notes ==

The implementation consists of 30 modules, and relies on cabal's package
hiding mechanism to expose only 5 modules. The implementation is around
8000 lines of text total.

The public modules expose none of these (?).

The Python standard library provides both a string and a unicode
sequence type. These are somewhat analogous to the
ByteString/String/Text split.

= References =

[1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts,
     Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.

