Haskell Platform Proposal: add the 'text' library

Tue Sep 7 17:06:11 EDT 2010

I see that the text package provides its own encoding/decoding
functions. This overlaps with the Unicode API offered from the base
package. The API in base is oriented towards encoding/decoding of text
when doing file IO but definitely the conversion utils should be
reused. I implemented myself conversion functions using the internal
API:

http://code.haskell.org/gf/src/compiler/GF/Text/Coding.hs

This is String <-> ByteString conversion but it could work with Text as well.

This is mainly implementation issue but if we add text to Haskell
Platform then it will be harder to change the API later if that is
needed in order to reuse the API from base. For instance in base there
is a notion of TextEncoding which I don't see in text.

Regards,
  Krasimir

2010/9/7 Don Stewart <dons at galois.com>:
>
> = Proposal: Add Data.Text to the Haskell Platform =
>
> Maintainer: Bryan O'Sullivan (submitted with his approval)
>
> == Introduction ==
>
> This is a proposal for the 'text' package to be included in the next
> major release of the Haskell platform.
>
> An up to date copy of this text is kept at:
>
>    http://trac.haskell.org/haskell-platform/wiki/Proposals/text
>
> Everyone is invited to review this proposal, following the standard
> procedure for proposing and reviewing packages.
>
>    http://trac.haskell.org/haskell-platform/wiki/AddingPackages
>
> Review comments should be sent to the libraries mailing list by
> October 1 so that we have time to discuss and resolve issues
> before the final deadline on November 1.
>
>    http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable
>
> == Credits ==
>
> Proposal author and package maintainer: Bryan O'Sullivan, originally by
> Tom Harper, based on ByteString and Vector (fusion) packages.
>
> The following individuals contributed to the review process: Don
> Stewart, Johan Tibell
>
> == Abstract ==
>
> The 'text' package provides an efficient packed, immutable Unicode text type
> (both strict and lazy), with a powerful loop fusion optimization framework.
>
> The 'Text' type represents Unicode character strings, in a time and
> space-efficient manner. This package provides text processing
> capabilities that are optimized for performance critical use, both
> in terms of large data quantities and high speed.
>
> The 'Text' type provides character-encoding, type-safe case
> conversion via whole-string case conversion functions. It also
> provides a range of functions for converting Text values to and from
> 'ByteStrings', using several standard encodings (see the 'text-icu'
> package for a much larger variety of encoding functions).
>
> Efficient locale-sensitive support for text IO is also supported.
>
> This module is intended to be imported qualified, to avoid name
> clashes with Prelude functions, e.g.
>
>    import qualified Data.Text as T
>
> Documentation and tarball from the hackage page:
>
>    http://hackage.haskell.org/package/text
>
> Development repo:
>
>    darcs get http://code.haskell.org/text/
>
> == Rationale ==
>
> While Haskell's Char type is capable of reprenting Unicode code points, the
> String sequence of such Chars has some drawbacks that prevent is general
> use:
>
>  1. unicode-unaware case conversion (map toUpper is an unsafe case conversion)
>  2. the representation is space inefficient.
>  3. the data structure is element-level lazy, whereas a number of
>   applications require either some level of additional strictness
>
> An intermediate solution to these was via 'Data.ByteString' (an
> efficient byte sequence type, that addresses points 2 and 3), which,
> when used in conjunction with utf8-string, provides very simple
> non-latin1 encoding support (though with significant drawbacks in terms
> of locale and encoding range).
>
> The 'text' package addresses these shortcomings in a number of way:
>
>  1. support whole-string case conversion (thus, type correct unicode
>    transformations)
>  2. a space and time efficient representation, based on unboxed Word16
>    arrays
>  3. either fully strict, or chunk-level lazy data types (in the style of
>    Data.ByteString)
>  4. full support for locale-sensitive, encoding-aware IO.
>
> The 'text' library has rapidly become popular for a number of
> applications, and is used by more than 50 other Hackage packages. As of
> Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular),
> in particular, in web programming. It is used by:
>
>  * the blaze html pretty printing library
>  * the hstringtemplate file templating library
>  * *all* popular web frameworks: happstack, snap, salvia and yesod web frameworks
>  * the hexpat and libxml xml parsers
>
> The design is based on experience from Data.Vector and Data.ByteString:
>
>  * the underlying type is based on unpinned, packed arrays on the Haskell heap
>    with an ST interface for memory effects.
>  * pipelines of operations are optimized via converstion to and from the
>   'stream' abstraction[1]
>
> == The API ==
>
> The API is broken into several logical pieces, which are
> self-explanatory:
>
>  * combinators for operating on strict, abstract 'text's.
>        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text.html
>
>  * an equivalent API for chunk-element lazy 'text's.
>        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy.html
>
>  * encoding transformations, to and from bytestrings:
>        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html
>
>  * support for conversion to Ptr Word16:
>        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Foreign.html
>
>  * locale-aware IO layer:
>        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-IO.html
>        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy-IO.html
>
> == Design decisions ==
>
>  * IO and pure combinators are in separate modules.
>  * Both a fully strict, and partially-strict type are provided.
>  * The underlying optimization framework is stream fusion, (not build/foldr), and is hidden from the user.
>  * Unpinned arrays are used, to prevent fragmentation.
>  * Large numbers of additional encodings are delegated to the text-icu package.
>  * An 'IsString' instance is provided for String literals.
>  * The implementation is OS and architecture neutral (portable).
>  * The implementation uses a number of language extensions:
>
>    CPP
>    MagicHash
>    UnboxedTuples
>    BangPatterns
>    Rank2Types
>    RecordWildCards
>    ScopedTypeVariables
>    ExistentialQuantification
>    DeriveDataTypeable
>
>  * The implementation is entirely Haskell (no additional C code or libraries).
>  * The package provides a QuickCheck/HUnit testsuite, and coverage data.
>  * The package adds no new dependencies to the HP.
>  * The package builds with the Simple cabal way.
>  * There is no existing functionality for packed unicode text in the HP.
>  * The package has complexity annotations.
>
> == Open issues ==
>
> The text-icu package is not part of this propposal.
>
> == Notes ==
>
> The implementation consists of 30 modules, and relies on cabal's package
> hiding mechanism to expose only 5 modules. The implementation is around
> 8000 lines of text total.
>
> The public modules expose none of these (?).
>
> The Python standard library provides both a string and a unicode
> sequence type. These are somewhat analogous to the
> ByteString/String/Text split.
>
> = References =
>
> [1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts,
>     Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries
>