[Haskell-cafe] Grapheme length?

Viktor Dukhovni ietf-dane at dukhovni.org
Sat Feb 20 07:56:16 UTC 2021



> On Feb 20, 2021, at 3:59 AM, amindfv--- via Haskell-Cafe <haskell-cafe at haskell.org> wrote:
> 
>>> With the "Data.Text.ICU.Char" module, it may be possible to determine
>>> grapheme boundaries:
>>> 
>>>    https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Char.html#g:5
>> 
>> I'll look into this and report back.
>> 
> 
> I'm quite prepared to believe this is wrong/misguided, but I was able to hack something together that works for my uses so far:
> 
>    import qualified Data.Text as T
>    import Data.Text.ICU.Char
>    len = length . filter (==Nothing) . map (property GraphemeClusterBreak) . T.unpack
> 
> Example:
> 
> len ("🤣h👩🏻elloä❤️❤️👩❤️👩" :: Text)
> == 13

There's unfortunately at least one problem, which requires attention
from a text-icu maintainer, but AFAIK, there isn't one just at the
moment (see the libraries list archive).

The issue is that recent "icu" versions return GraphemeClusterBreak values
that are outside the range known to the "Char" module:

  https://github.com/haskell/text-icu/blob/36c2cf236da06cb3b08fa8e5c3981d784d4b9af2/Data/Text/ICU/Char.hsc#L853-L865

but it blithely calls "toEnum" on whatever the FFI call returns, and triggers an error:

  [Nothing,*** Exception: toEnum{GraphemeClusterBreak}: tag (16) is outside of enumeration's range (0,10)
  CallStack (from HasCallStack):
    error, called at Data/Text/ICU/Char.hsc:865:19 in text-icu-0.7.0.1-08bd532cd2c809ab3173b6766231a799217ecc9a166de7458474e8784471d168:Data.Text.ICU.Char
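
One way to avoid this kind of crash is to make the conversion total, so
that tags newer than the enumeration turn into Nothing rather than an
exception. The following is only a sketch of that idea, not the text-icu
code: the "KnownTag" type, its constructor names and their order, and
"safeToEnum" are all made up for illustration.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

-- Hypothetical stand-in for an enumeration that only knows tags 0..10.
-- The constructor names and their order are illustrative assumptions.
data KnownTag
  = Other | Control | CR | Extend | L | LF | LV | LVT | T | V | SpacingMark
  deriving (Show, Eq, Enum, Bounded)

-- Total alternative to 'toEnum': a tag outside the enumeration's range
-- yields Nothing instead of raising an exception.
safeToEnum :: forall a. (Enum a, Bounded a) => Int -> Maybe a
safeToEnum n
  | n >= fromEnum (minBound :: a) && n <= fromEnum (maxBound :: a) = Just (toEnum n)
  | otherwise = Nothing
```

In GHCi, "safeToEnum 3 :: Maybe KnownTag" gives "Just Extend", while
"safeToEnum 16 :: Maybe KnownTag" gives "Nothing". A caller still has to
decide what an unknown tag means, which is the part that needs a maintainer.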

But in fact, some of the new values are exactly the ones needed to detect
grapheme cluster boundaries (your algorithm looks too naïve), see:

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

When citing the Unicode definition of grapheme clusters, it must be clear which
of the two alternatives are being specified: extended versus legacy.

Break at the start and end of text, unless the text is empty.
GB1	sot	÷	Any
GB2	Any	÷	eot

Do not break between a CR and LF. Otherwise, break before and after controls.
GB3	CR	×	LF
GB4	(Control | CR | LF)	÷	 
GB5				÷	(Control | CR | LF)

Do not break Hangul syllable sequences.
GB6	L	×	(L | V | LV | LVT)
GB7	(LV | V)	×	(V | T)
GB8	(LVT | T)	×	T

Do not break before extending characters or ZWJ.
GB9	 	×	(Extend | ZWJ)

The GB9a and GB9b rules only apply to extended grapheme clusters: 
Do not break before SpacingMarks, or after Prepend characters.
GB9a	 	×	SpacingMark
GB9b	Prepend	×

Do not break within emoji modifier sequences or emoji zwj sequences.
GB11	\p{Extended_Pictographic} Extend* ZWJ	×	\p{Extended_Pictographic}

Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
GB12	sot (RI RI)* RI	×	RI
GB13	[^RI] (RI RI)* RI	×	RI

Otherwise, break everywhere.
GB999	Any	÷	Any

Notes:

	• Grapheme cluster boundaries can be transformed into simple regular expressions. For more information, see Section 6.3, State Machines.
	• The Grapheme_Base and Grapheme_Extend properties predated the development of the Grapheme_Cluster_Break property. The set of characters with Grapheme_Extend=Yes is used to derive the set of characters with Grapheme_Cluster_Break=Extend. However, the Grapheme_Base property proved to be insufficient for determining grapheme cluster boundaries. Grapheme_Base is no longer used by this specification.
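
To make the rules above concrete, here is a rough sketch of a cluster
counter that works on a list of already-classified code points. It is an
illustration under assumptions, not what ICU does: the "Gcb" type and all
function names are invented here, classifying real characters is left to
the caller, and Extended_Pictographic is folded in as one more class
("ExtPict") although it is really a separate property.

```haskell
import Data.List (foldl')

-- Break classes named in the rules, plus ExtPict standing in for
-- \p{Extended_Pictographic} (a simplifying assumption).
data Gcb
  = Other | CR | LF | Control | Extend | ZWJ | RI | Prepend | SpacingMark
  | L | V | T | LV | LVT | ExtPict
  deriving (Show, Eq)

-- Scanner state: the previous class, the length of the run of RI ending
-- at it, and whether the text so far ends in "ExtPict Extend*" or in
-- "ExtPict Extend* ZWJ".
data St = St Gcb Int Bool

start :: Gcb -> St
start c = St c (if c == RI then 1 else 0) (c == ExtPict)

advance :: St -> Gcb -> St
advance (St p ri pict) c = St c ri' pict'
  where
    ri' = if c == RI then ri + 1 else 0
    pict' = case c of
      ExtPict -> True
      Extend  -> pict && p /= ZWJ
      ZWJ     -> pict && p /= ZWJ
      _       -> False

-- True when there is a boundary between the previous class and c.
-- Guards are tried in rule order; the first match decides.
isBreak :: St -> Gcb -> Bool
isBreak (St p ri pict) c
  | p == CR && c == LF                   = False  -- GB3
  | p `elem` [Control, CR, LF]           = True   -- GB4
  | c `elem` [Control, CR, LF]           = True   -- GB5
  | p == L && c `elem` [L, V, LV, LVT]   = False  -- GB6
  | p `elem` [LV, V] && c `elem` [V, T]  = False  -- GB7
  | p `elem` [LVT, T] && c == T          = False  -- GB8
  | c `elem` [Extend, ZWJ]               = False  -- GB9
  | c == SpacingMark                     = False  -- GB9a
  | p == Prepend                         = False  -- GB9b
  | pict && p == ZWJ && c == ExtPict     = False  -- GB11
  | p == RI && c == RI && odd ri         = False  -- GB12, GB13
  | otherwise                            = True   -- GB999

-- Number of extended grapheme clusters in a classified sequence.
-- GB1 and GB2 are covered by starting the count at 1 for non-empty input.
countClusters :: [Gcb] -> Int
countClusters []        = 0
countClusters (c0 : cs) = snd (foldl' step (start c0, 1) cs)
  where
    step (st, n) c = (advance st c, if isBreak st c then n + 1 else n)
```

For example, "countClusters [ExtPict, ZWJ, ExtPict, Extend, ZWJ, ExtPict]"
is 1 (a single emoji zwj sequence), and "countClusters [RI, RI, RI, RI]"
is 2 (two flags). Counting code points whose class is "Other" cannot get
either of these right, since the answer depends on the neighbours.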

-- 
	Viktor.


