utf8 strings: memory optimization and case-ignoring comparison

Simon Marlow simonmar at microsoft.com
Thu Dec 15 04:44:35 EST 2005


On 14 December 2005 20:35, Bulat Ziganshin wrote:

> I use UTF-8-packed strings in my program and have two questions
> about them:
> 
> 1. I need a function to do case-insensitive comparison of such strings.
> stricmp is not appropriate because it doesn't know about UTF-8. Can
> the existing Unicode support in Data.Char be used for this, or will
> the appropriate support be added?

You should be able to use toUpper/toLower from Data.Char in GHC 6.4.1.
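
As a rough sketch of how that could look: decode the UTF-8 bytes to Chars,
lower-case each Char with toLower, and compare the results. The names
decodeUtf8 and compareIgnoreCase are made up for illustration, the decoder
assumes well-formed input and does no validation, and toLower is a
per-character mapping rather than full Unicode case folding.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, toLower)
import Data.Word (Word8)

-- Hypothetical helper: decode UTF-8 bytes into Chars.
-- Assumption: the input is well-formed UTF-8; nothing is validated here.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decodeUtf8 bs   -- 1-byte (ASCII)
  | b < 0xE0  = multi 1 (fromIntegral b .&. 0x1F)      -- 2-byte sequence
  | b < 0xF0  = multi 2 (fromIntegral b .&. 0x0F)      -- 3-byte sequence
  | otherwise = multi 3 (fromIntegral b .&. 0x07)      -- 4-byte sequence
  where
    -- Fold n continuation bytes (6 payload bits each) into the code point.
    multi :: Int -> Int -> String
    multi n lead =
      let (cont, rest) = splitAt n bs
          step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
      in chr (foldl step lead cont) : decodeUtf8 rest

-- Hypothetical helper: compare two UTF-8 byte strings, ignoring case.
-- Laziness means decoding stops at the first differing character.
compareIgnoreCase :: [Word8] -> [Word8] -> Ordering
compareIgnoreCase a b =
  compare (map toLower (decodeUtf8 a)) (map toLower (decodeUtf8 b))
```

For example, compareIgnoreCase [0xC3, 0x89] [0xC3, 0xA9] compares "É" with
"é" and should give EQ, which a byte-wise stricmp would not.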

> 2. What is the most memory-efficient representation for such strings?
> Right now I use John Meacham's library
> (http://repetae.net/john/repos/jhc/PackedString.hs), which declares:
> 
> newtype PackedString = PS (UArray Int Word8)
> 
> but this uses two Ints just to hold index bounds:
> 
> data UArray i e = UArray !i !i ByteArray#

I don't know why an extra 8/16 bytes per string is that worrying - if
you have so many small strings perhaps you should be sharing them via a
hash table?
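
A minimal sketch of that kind of sharing, for illustration only: it uses
Data.Map from the containers package as a stand-in for a hash table, and
the names InternTable, newInternTable and intern are hypothetical. Equal
strings are stored once, and every later lookup returns the stored copy.

```haskell
import qualified Data.Map.Strict as Map
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- Hypothetical intern table: maps each string to its single shared copy.
-- Assumption: an ordered Map is used here in place of a real hash table.
type InternTable = IORef (Map.Map [Word8] [Word8])

newInternTable :: IO InternTable
newInternTable = newIORef Map.empty

-- Return the shared copy of a string, storing it on first sight.
intern :: InternTable -> [Word8] -> IO [Word8]
intern ref s = do
  table <- readIORef ref
  case Map.lookup s table of
    Just shared -> return shared        -- reuse the copy already stored
    Nothing -> do
      writeIORef ref (Map.insert s s table)
      return s
```

Callers would keep only the value returned by intern, so duplicate strings
can be garbage collected.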

> I want to use just a memory pointer and put a NUL at the end of the
> array (my strings never contain NUL chars). But what type must I use
> for this pointer: ByteArray/ByteArray#, ForeignPtr, StablePtr, or Ptr?
> And which function must I use to quickly allocate the memory I need?
> My packed strings will only be unpacked and passed to "unsafe" C
> functions: stricmp, strcpy, strcat; I don't plan to use any other
> operations.

ForeignPtr and mallocForeignPtr are the way to go these days.  In GHC
6.6 these will be much faster than before.
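
A small sketch of what that might look like, assuming the NUL-terminated
layout described in the question. The names PackedUtf8, pack, unpack,
withCStr and packedLength are invented for this example, and
mallocForeignPtrBytes is used because the size is given in bytes; strlen
stands in for whichever "unsafe" C function is actually needed.

```haskell
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.C.Types (CSize)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Marshal.Array (peekArray0, pokeArray)
import Foreign.Ptr (castPtr)
import Foreign.Storable (pokeByteOff)

-- Hypothetical type: a NUL-terminated byte buffer with no stored bounds.
newtype PackedUtf8 = PackedUtf8 (ForeignPtr Word8)

-- strlen is used only as an example of an "unsafe" C call.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Copy the bytes into freshly allocated memory and append the NUL.
-- Assumption: the input contains no NUL bytes.
pack :: [Word8] -> IO PackedUtf8
pack bytes = do
  let n = length bytes
  fp <- mallocForeignPtrBytes (n + 1)
  withForeignPtr fp $ \p -> do
    pokeArray p bytes
    pokeByteOff p n (0 :: Word8)
  return (PackedUtf8 fp)

-- Read bytes back up to (not including) the terminating NUL.
unpack :: PackedUtf8 -> IO [Word8]
unpack (PackedUtf8 fp) = withForeignPtr fp (peekArray0 0)

-- Hand the buffer to a C function as a CString.
withCStr :: PackedUtf8 -> (CString -> IO a) -> IO a
withCStr (PackedUtf8 fp) act = withForeignPtr fp (act . castPtr)

-- Length in bytes, computed by C.
packedLength :: PackedUtf8 -> IO Int
packedLength ps = fromIntegral <$> withCStr ps c_strlen
```

The memory is released by the garbage collector once the ForeignPtr is no
longer reachable, so no explicit free is needed.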

Cheers,
	Simon


More information about the Glasgow-haskell-users mailing list