Haskell Platform Proposal: add the 'text' library
Duncan Coutts
duncan.coutts at googlemail.com
Wed Sep 8 15:28:29 EDT 2010
On 8 September 2010 19:43, Edward Kmett <ekmett at gmail.com> wrote:
>> Did you consider keeping the number of characters in the Text directly?
>> Is there a reason it couldn't be done?
>
>> There's little point. Knowing the length does not usually help you
>> save any other O(n) operations. It'd also only help for strict text,
>> not lazy. Just like lists, asking for the length is usually not a good
>> idea.
>
> Actually it _can_ help quite a bit to know the number of characters (not
> just words) present in the input:
> It can tell you of the absence of surrogate pairs if the length (in
> characters) is the same as the number of words. This then lets you do
> indexing as an O(1) operation. I used this in a simple fingertree of utf-8
> encoded bytestrings to enable faster indexing into leaves that didn't have
> utf-8 tailbytes present. Since a lot of UTF-8 text has huge swathes of ASCII
> content, this turned out to be a pretty big win for me. Since the vast
> majority of UTF-16 text probably contains all Plane 0 (UCS2) content, you
> can get similar wins. The cost is that you either have to have a sentinel
> 'don't know' value or pay to scan any arrays that you directly convert to
> Text.
Text does not need to fit every use case. Our impression is that most
applications do not need string indexing and for the few that do, a
specialised structure is perfectly reasonable (eg using arrays of code
points, or trees of arrays of code points).
Text should be a replacement for String: a type used for general
string manipulation a common type used in interfaces for passing
strings between components. In the general rage of use cases it should
offer reasonable memory use and performance. It does not need to be
the perfect choice for the buffer of a text editor or other special
internal use cases.
It's an important point in the API design of Text that indexing is
discouraged because, given a compact variable-length internal
representation, you cannot make any useful promises about the
performance of indexing. As you point out you can sometimes do better
than O(n) but generally you cannot and it's not good to encourage a
style that often performs well but occasionally is terrible.
> As an aside:
>
> I want to like and use Data.Text, but I haven't been able to. Too many of
> the operations devolve to O(n) that could be O(1) if I had some way to slice
> based on cursors into the data type, and the current API locks me out of the
> internals, so I can't write those functions myself.
Can you be more specific. The Text API design is that substring is the
"cursor" and you don't need any other index.
> With ByteString at
> least, if I must, I can go grab the Data.Bytestring.Internal module and have
> at it. I'm all for a clean API, but a clean API that hurts asymptotics
> invites a lot of re-invention on the fringes of the community.
>
> But really the last nail in the coffin for me, is that often I want to be
> able to just mmap data in from disk and use it, and rarely is my data on
> disk stored in UTF-16.
It's pretty important that the internal representation not be fixed by
being exposed in the API. In future if benchmarks demonstrate that
some different encoding offers a better balance of performance and
memory use then we need to be able to switch without breaking
everyone's programs.
Note that even if the internal encoding happened to match your
external encoding, the whole lot would still need to be forced into
memory to validate the encoding. So you would be able to share memory
with the page cache but not avoid any IO.
Duncan
More information about the Libraries
mailing list