[Haskell-cafe] How to split this string.

Wed Jan 4 17:47:57 CET 2012

On 02/01/2012 11:12, Jon Fairbairn wrote:
> max<mk at mtw.ru>  writes:
>
>> I want to write a function whose behavior is as follows:
>>
>> foo "string1\nstring2\r\nstring3\nstring4" = ["string1",
>> "string2\r\nstring3", "string4"]
>>
>> Note the sequence "\r\n", which is ignored. How can I do this?
> cabal install split
>
> then do something like
>
>     import Data.List (groupBy)
>     import Data.List.Split (splitOn)
>
>     rn '\r' '\n' = True
>     rn _ _ = False
>
>     required_function = fmap concat . splitOn ["\n"] . groupBy rn
>
> (though that might be an abuse of groupBy)
>
Sadly, it turns out that not only is this an abuse of groupBy, but it 
has (I think) a subtle bug as a result.

I was inspired by this to try some other groupBy stuff, and it didn't 
work. After scratching my head a bit, I tried the following...

Prelude> import Data.List
Prelude Data.List> groupBy (<) [1,2,3,2,1,2,3,2,1]
[[1,2,3,2],[1,2,3,2],[1]]

That wasn't exactly the result I was expecting :-(

Explanation (best guess) - the function passed to groupBy, according to 
the docs, is meant to test whether two values are 'equal'. I'm guessing 
the assumption is that the function will effectively treat values as 
belonging to equivalence classes. That implies some rules such as...

   reflexivity  : (a == a)
   symmetry     : (a == b) => (b == a)
   transitivity : (a == b) && (b == c) => (a == c)

The third rule is probably the key here: if the test is transitive, 
groupBy doesn't need to compare adjacent items. When it starts a new 
group, it seems to always compare each following item against the first 
item in that new group until it finds a mismatch. 
In my test, that means it's always comparing with 1 - the second 2 is 
included in each group because although (3 < 2) is False, groupBy isn't 
testing that - it's testing (1 < 2).
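
To make that concrete, here is a minimal sketch of a groupBy that behaves 
this way. The name groupByFirst is mine, and I'm assuming the span-based 
definition given in the Haskell report rather than quoting the library 
source, so treat it as an illustration only.

```haskell
-- Sketch of a groupBy that always tests against the first element of
-- the current group (groupByFirst is a made-up name for illustration).
groupByFirst :: (a -> a -> Bool) -> [a] -> [[a]]
groupByFirst _  []     = []
groupByFirst eq (x:xs) = (x : ys) : groupByFirst eq zs
  where
    -- Every candidate is compared with x, never with its own neighbour.
    (ys, zs) = span (eq x) xs

-- The list from the GHCi session above.
example :: [[Int]]
example = groupByFirst (<) [1,2,3,2,1,2,3,2,1]
```

With this definition, example evaluates to [[1,2,3,2],[1,2,3,2],[1]], 
the same as the GHCi result above, because the tests performed for the 
first group are (1 < 2), (1 < 3), (1 < 2) and then (1 < 1).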

In the context of this \r\n test function, this behaviour will I guess 
result in \r\n\n being combined into one group. The second \n will 
therefore not be seen as a valid splitting point.
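
Here is a small sketch of that failure, plus one way to get the requested 
behaviour without groupBy. The rn test is the one from the quoted message; 
buggyGroups and splitKeepingCRLF are names I made up, and the hand-written 
recursion is just one possible approach, not the only fix.

```haskell
import Data.List (groupBy)

-- The pairing test from the quoted message.
rn :: Char -> Char -> Bool
rn '\r' '\n' = True
rn _    _    = False

-- groupBy compares each candidate with the '\r' that opened the group,
-- so both '\n' characters are absorbed into a single group.
buggyGroups :: [String]
buggyGroups = groupBy rn "a\r\n\nb"

-- Hypothetical alternative: split on '\n' unless it directly follows '\r'.
-- The accumulator holds the current piece in reverse order.
splitKeepingCRLF :: String -> [String]
splitKeepingCRLF = go ""
  where
    go acc []                 = [reverse acc]
    go acc ('\r':'\n':rest)   = go ('\n' : '\r' : acc) rest  -- keep "\r\n"
    go acc ('\n':rest)        = reverse acc : go "" rest     -- split here
    go acc (c:rest)           = go (c : acc) rest
```

buggyGroups comes out as ["a","\r\n\n","b"], so the lone "\n" group that 
splitOn is looking for never appears. splitKeepingCRLF applied to the 
string from the original question gives 
["string1","string2\r\nstring3","string4"].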

Personally, I think this is a tad disappointing. Given that groupBy 
cannot check or enforce that its test respects equivalence classes, it 
should ideally give results that make as much sense as possible either 
way. That said, even if the test was always given adjacent elements, 
there's still room for a different order of processing the list 
(left-to-right or right-to-left) to give different results - and in any 
case, maybe it's more efficient the way it is.
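
For comparison, here is a rough sketch of a variant that always hands the 
test two adjacent elements, working left to right. groupByAdjacent is a 
hypothetical name, not something in Data.List, and I haven't thought 
about how its laziness or efficiency compares with the real groupBy.

```haskell
-- Sketch: group elements while the test holds between each element and
-- its immediate predecessor (groupByAdjacent is a made-up name).
groupByAdjacent :: (a -> a -> Bool) -> [a] -> [[a]]
groupByAdjacent _ []     = []
groupByAdjacent p (x:xs) = go [x] x xs
  where
    -- acc is the current group in reverse; prev is its latest element.
    go acc _    []     = [reverse acc]
    go acc prev (y:ys)
      | p prev y  = go (y : acc) y ys         -- y extends the current group
      | otherwise = reverse acc : go [y] y ys -- y starts a new group

-- The same list as in the GHCi session, now split into ascending runs.
runs :: [[Int]]
runs = groupByAdjacent (<) [1,2,3,2,1,2,3,2,1]
```

Here runs is [[1,2,3],[2],[1,2,3],[2],[1]], which is closer to what I was 
expecting from groupBy (<) in the first place.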