utf8 strings: memory optimization and case-ignoring comparison

Thu Dec 15 04:44:35 EST 2005

On 14 December 2005 20:35, Bulat Ziganshin wrote:

> I use UTF-8 packed strings in my program and have two questions
> about them:
> 
> 1. I need a function for case-insensitive comparison of such strings.
> stricmp is not appropriate because it doesn't know about UTF-8. Can
> the existing Unicode support in Data.Char be used for this, or will
> appropriate support be added?

You should be able to use toUpper/toLower from Data.Char in GHC 6.4.1.
They operate on Char, so the UTF-8 bytes have to be decoded first.
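
A minimal sketch of that approach, using only base. The names decodeUtf8 and caseInsensitiveEq are hypothetical, not part of any library mentioned here, and the decoder assumes mostly well-formed input: broken sequences become U+FFFD and overlong encodings are not rejected. Note that toLower is a per-character mapping, not full Unicode case folding, so pairs such as "ß" and "SS" will not compare equal.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, toLower)
import Data.Word (Word8)

-- Hypothetical helper: decode UTF-8 bytes into Chars.
-- Malformed sequences are replaced by U+FFFD instead of being rejected.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80              = chr (fromIntegral b) : decodeUtf8 bs
  | b >= 0xC0 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b < 0xF8 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise             = '\xFFFD' : decodeUtf8 bs
  where
    -- Consume n continuation bytes and fold their low 6 bits into the code point.
    multi :: Int -> Int -> String
    multi n lead =
      let (cont, rest) = splitAt n bs
          code = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead cont
      in if length cont == n && all isCont cont && code <= 0x10FFFF
           then chr code : decodeUtf8 rest
           else '\xFFFD' : decodeUtf8 bs

    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80

-- Hypothetical helper: compare two UTF-8 byte strings, ignoring case.
caseInsensitiveEq :: [Word8] -> [Word8] -> Bool
caseInsensitiveEq a b =
  map toLower (decodeUtf8 a) == map toLower (decodeUtf8 b)
```

For ordering rather than equality, the same two lowered strings can be passed to compare.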

> 2. What is the most memory-efficient representation for such strings?
> Right now I use John Meacham's library
> (http://repetae.net/john/repos/jhc/PackedString.hs) which declares:
> 
> newtype PackedString = PS (UArray Int Word8)
> 
> but this uses two Ints just to hold index bounds:
> 
> data UArray i e = UArray !i !i ByteArray#

I don't know why an extra 8 or 16 bytes per string (two Ints on a 32-
or 64-bit machine) is that worrying - if you have so many small strings
perhaps you should be sharing them via a hash table?
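
A rough sketch of the sharing idea: keep one copy of each distinct string in a pool and hand that copy back to every caller. Pool, emptyPool, intern and poolSize are hypothetical names, and Data.Map (from the containers package) is used here only as a stand-in for a hash table; any key type with an Ord instance works, including a packed string type.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical intern pool: maps each distinct value to the single
-- shared copy that callers should hold on to.
newtype Pool a = Pool (Map.Map a a)

emptyPool :: Pool a
emptyPool = Pool Map.empty

-- Return the pooled copy of a value, adding it to the pool if it is new.
intern :: Ord a => a -> Pool a -> (a, Pool a)
intern s (Pool m) =
  case Map.lookup s m of
    Just shared -> (shared, Pool m)                 -- reuse the existing copy
    Nothing     -> (s, Pool (Map.insert s s m))     -- first occurrence

-- Number of distinct values currently held.
poolSize :: Pool a -> Int
poolSize (Pool m) = Map.size m
```

With many duplicate short strings, the per-string overhead is then paid once per distinct value rather than once per occurrence.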

> I want to use just a memory pointer and put a NUL at the end of the
> array (my strings never contain NUL chars). But which type must I use
> for this pointer: ByteArray/ByteArray#, ForeignPtr, StablePtr, or Ptr?
> And which function must I use to quickly allocate the memory I need?
> My packed strings will only be unpacked and passed to "unsafe" C
> functions (stricmp, strcpy, strcat); I plan not to use any other
> operations.

ForeignPtr and mallocForeignPtr are the way to go these days.  In GHC
6.6 these will be much faster than before.
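
As an illustration, here is a small sketch of a NUL-terminated packed string held in a ForeignPtr. It uses mallocForeignPtrBytes, the byte-count variant of mallocForeignPtr; CPacked, packBytes, unpackBytes and withCPacked are hypothetical names, and strlen stands in for the C functions mentioned in the question. It assumes, as stated there, that the strings never contain NUL.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.C.Types (CSize)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Marshal.Array (peekArray0, pokeArray)
import Foreign.Ptr (castPtr)
import Foreign.Storable (pokeElemOff)

-- Hypothetical packed string: a single pointer to NUL-terminated bytes,
-- with no stored bounds.
newtype CPacked = CPacked (ForeignPtr Word8)

-- Copy the bytes into freshly allocated memory and append the terminator.
-- Assumes the input contains no 0 bytes.
packBytes :: [Word8] -> IO CPacked
packBytes bytes = do
  let n = length bytes
  fp <- mallocForeignPtrBytes (n + 1)
  withForeignPtr fp $ \p -> do
    pokeArray p bytes
    pokeElemOff p n 0
  return (CPacked fp)

-- Read the bytes back, stopping at the terminating NUL.
unpackBytes :: CPacked -> IO [Word8]
unpackBytes (CPacked fp) = withForeignPtr fp (peekArray0 0)

-- Run an action on the string viewed as a C string; the memory is kept
-- alive for the duration of the action.
withCPacked :: CPacked -> (CString -> IO a) -> IO a
withCPacked (CPacked fp) act = withForeignPtr fp (act . castPtr)

-- Example of an "unsafe" C call on the packed representation.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize
```

For example, `packBytes [104, 105] >>= \s -> withCPacked s c_strlen` asks C for the length of the packed string. The memory is released by the garbage collector once the ForeignPtr becomes unreachable, so no explicit free is needed.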

Cheers,
	Simon