Haskell Platform Proposal: add the 'text' library
dons at galois.com
Tue Sep 7 11:26:36 EDT 2010
= Proposal: Add Data.Text to the Haskell Platform =
Maintainer: Bryan O'Sullivan (submitted with his approval)
== Introduction ==
This is a proposal for the 'text' package to be included in the next
major release of the Haskell platform.
An up to date copy of this text is kept at:
Everyone is invited to review this proposal, following the standard
procedure for proposing and reviewing packages.
Review comments should be sent to the libraries mailing list by
October 1 so that we have time to discuss and resolve issues
before the final deadline on November 1.
== Credits ==
Proposal author and package maintainer: Bryan O'Sullivan. The package
was originally written by Tom Harper and is based on the ByteString and
Vector (fusion) packages.
The following individuals contributed to the review process: Don
Stewart, Johan Tibell
== Abstract ==
The 'text' package provides an efficient packed, immutable Unicode text type
(both strict and lazy), with a powerful loop fusion optimization framework.
The 'Text' type represents Unicode character strings, in a time and
space-efficient manner. This package provides text processing
capabilities that are optimized for performance critical use, both
in terms of large data quantities and high speed.
The package provides type-safe, Unicode-aware case conversion via
whole-string case conversion functions. It also provides a range of
functions for converting 'Text' values to and from 'ByteString's,
using several standard encodings (see the 'text-icu' package for a
much larger variety of encoding functions).
Efficient, locale-sensitive text IO is also supported.
This module is intended to be imported qualified, to avoid name
clashes with Prelude functions, e.g.
import qualified Data.Text as T
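As a minimal sketch of the qualified-import style, the following uses a
few combinators from the strict API. The helper name wordCount is
hypothetical, and the calls assume a reasonably recent release of 'text'.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: count whitespace-separated words in a Text.
wordCount :: T.Text -> Int
wordCount = length . T.words

main :: IO ()
main = do
  let t = T.pack "packed Unicode text"
  print (T.length t)     -- number of characters
  print (wordCount t)    -- number of words
  -- T.words and T.unwords shadow nothing, because of the T. prefix.
  putStrLn (T.unpack (T.unwords (reverse (T.words t))))
```

The T. prefix keeps names such as length, words and reverse from
clashing with their Prelude counterparts.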
Documentation and tarball from the hackage page:
darcs get http://code.haskell.org/text/
== Rationale ==
While Haskell's Char type is capable of representing Unicode code points, the
String sequence of such Chars has some drawbacks that prevent its general
use:
1. Unicode-unaware case conversion (map toUpper is an unsafe case conversion)
2. the representation is space inefficient.
3. the data structure is element-level lazy, whereas a number of
applications require some level of additional strictness
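The first drawback can be illustrated with a small sketch, assuming the
German sharp s (U+00DF) as the test character. The names upperString and
upperText are hypothetical wrappers introduced only for this example.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character conversion on String: each Char maps to exactly one
-- Char, so a character whose uppercase form is longer cannot be
-- converted correctly.
upperString :: String -> String
upperString = map toUpper

-- Whole-string conversion on Text: the result may change length.
upperText :: T.Text -> T.Text
upperText = T.toUpper

main :: IO ()
main = do
  let s = "stra\223e"   -- U+00DF written as a decimal escape
  print (length (upperString s))            -- 6: U+00DF is left unchanged
  print (T.length (upperText (T.pack s)))   -- 7: U+00DF becomes "SS"
```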
An intermediate solution to these problems was 'Data.ByteString' (an
efficient byte sequence type that addresses points 2 and 3), which,
when used in conjunction with utf8-string, provides very simple
non-latin1 encoding support (though with significant drawbacks in terms
of locale and encoding range).
The 'text' package addresses these shortcomings in a number of ways:
1. support for whole-string case conversion (thus, type-correct Unicode
case conversion)
2. a space and time efficient representation, based on unboxed Word16
arrays
3. either fully strict, or chunk-level lazy data types (in the style of
Data.ByteString)
4. full support for locale-sensitive, encoding-aware IO.
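The strict and chunk-level lazy types can be converted into one another.
The sketch below assumes the Data.Text.Lazy module of a recent 'text'
release; fromPieces is a hypothetical helper, not part of the package.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Build a lazy Text from strict chunks; each chunk is a strict Text.
fromPieces :: [String] -> TL.Text
fromPieces = TL.fromChunks . map T.pack

main :: IO ()
main = do
  let lazy = fromPieces ["chunk one, ", "chunk two"]
  print (length (TL.toChunks lazy))     -- number of strict chunks
  let strict = TL.toStrict lazy         -- concatenate into one strict Text
  print (T.length strict)
  print (TL.fromStrict strict == lazy)  -- equality ignores chunk boundaries
```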
The 'text' library has rapidly become popular for a number of
applications, and is used by more than 50 other Hackage packages. As of
Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular), and
it is particularly popular in web programming. It is used by:
* the blaze html pretty printing library
* the hstringtemplate file templating library
* *all* popular web frameworks: happstack, snap, salvia and yesod
* the hexpat and libxml xml parsers
The design is based on experience from Data.Vector and Data.ByteString:
* the underlying type is based on unpinned, packed arrays on the Haskell heap
with an ST interface for memory effects.
* pipelines of operations are optimized via conversion to and from a
stream representation (stream fusion).
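The idea of optimizing pipelines by converting to and from streams can
be sketched over ordinary lists, following the technique described in
the stream fusion paper listed in the references. This is an
illustration only and not the package's internal code; every name below
(Step, Stream, stream, unstream, mapS, filterS, mapL, filterL) is
defined here for the example.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: finished, no element yet, or one element.
data Step s a = Done | Skip s | Yield a s

-- A stream is a non-recursive step function plus an initial state.
data Stream a = forall s. Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []       = Done
    next (x : xs) = Yield x xs
{-# INLINE [1] stream #-}

unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'
{-# INLINE [1] unstream #-}

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done    -> Done
      Skip s' -> Skip s'
      Yield x s'
        | p x       -> Yield x s'
        | otherwise -> Skip s'

-- List functions expressed as stream transformers.
mapL :: (a -> b) -> [a] -> [b]
mapL f = unstream . mapS f . stream
{-# INLINE mapL #-}

filterL :: (a -> Bool) -> [a] -> [a]
filterL p = unstream . filterS p . stream
{-# INLINE filterL #-}

-- When compiling with optimization, this rule lets GHC remove the
-- intermediate list between adjacent stream-based functions.
{-# RULES "stream/unstream" forall s. stream (unstream s) = s #-}

main :: IO ()
main = print (mapL (+ 1) (filterL even [1 .. 10 :: Int]))
```

After inlining, mapL f . filterL p contains stream . unstream in the
middle, which the rewrite rule can eliminate, leaving a single loop over
the input.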
== The API ==
The API is broken into several logical pieces:
* combinators for operating on strict, abstract 'text's.
* an equivalent API for chunk-element lazy 'text's.
* encoding transformations, to and from bytestrings.
* support for conversion to Ptr Word16.
* locale-aware IO layer.
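The encoding and IO pieces can be combined as in the following sketch.
It assumes the Data.Text.Encoding and Data.Text.IO modules of a recent
'text' release (decodeUtf8' in particular is not in the oldest
releases); decodeSafely is a hypothetical helper. The output handle is
set to UTF-8 explicitly so the example does not depend on the locale.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO
import System.IO (hSetEncoding, stdout, utf8)

-- Decode bytes as UTF-8, reporting invalid input as a Left value
-- instead of throwing an exception.
decodeSafely :: B.ByteString -> Either String T.Text
decodeSafely bytes =
  case TE.decodeUtf8' bytes of
    Left err -> Left (show err)
    Right t  -> Right t

main :: IO ()
main = do
  hSetEncoding stdout utf8
  let original = T.pack "na\239ve caf\233"  -- non-ASCII chars as escapes
      bytes    = TE.encodeUtf8 original
  print (B.length bytes)     -- length in bytes
  print (T.length original)  -- length in characters
  case decodeSafely bytes of
    Left err -> putStrLn ("decode failed: " ++ err)
    Right t  -> TIO.putStrLn t   -- written using the handle's encoding
  -- A lone 0xff byte is not valid UTF-8, so this prints a Left value.
  print (decodeSafely (B.pack [0xff]))
```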
== Design decisions ==
* IO and pure combinators are in separate modules.
* Both a fully strict and a partially strict (chunk-level lazy) type are provided.
* The underlying optimization framework is stream fusion (not build/foldr), and is hidden from the user.
* Unpinned arrays are used, to prevent fragmentation.
* Large numbers of additional encodings are delegated to the text-icu package.
* An 'IsString' instance is provided for String literals.
* The implementation is OS and architecture neutral (portable).
* The implementation uses a number of language extensions:
* The implementation is entirely Haskell (no additional C code or libraries).
* The package provides a QuickCheck/HUnit testsuite, and coverage data.
* The package adds no new dependencies to the HP.
* The package builds with Cabal's Simple build type.
* There is no existing functionality for packed unicode text in the HP.
* The package has complexity annotations.
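The 'IsString' instance mentioned above lets string literals be used
directly as 'Text' values when GHC's OverloadedStrings extension is
enabled. A minimal sketch, with greeting as a hypothetical example
value:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- The literal is typed as Text via the IsString instance.
greeting :: T.Text
greeting = "hello, platform"

main :: IO ()
main = do
  print (T.length greeting)
  -- The literal "hello" is also inferred to be a Text here.
  print (T.isPrefixOf "hello" greeting)
```

Without the extension, the same values would be written with T.pack.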
== Open issues ==
The text-icu package is not part of this proposal.
== Notes ==
The implementation consists of 30 modules, and relies on cabal's package
hiding mechanism to expose only 5 modules. The implementation is around
8000 lines of text total.
The public modules expose none of these (?).
The Python standard library provides both a string and a unicode
sequence type. These are somewhat analogous to the
= References =
"Stream Fusion: From Lists to Streams to Nothing at All", Coutts,
Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.