[Haskell-cafe] How to split this string.
Steve Horne
sh006d3592 at blueyonder.co.uk
Wed Jan 4 17:47:57 CET 2012
On 02/01/2012 11:12, Jon Fairbairn wrote:
> max<mk at mtw.ru> writes:
>
>> I want to write a function whose behavior is as follows:
>>
>> foo "string1\nstring2\r\nstring3\nstring4" = ["string1",
>> "string2\r\nstring3", "string4"]
>>
>> Note the sequence "\r\n", which is ignored. How can I do this?
> cabal install split
>
> then do something like
>
> import Data.List (groupBy)
> import Data.List.Split (splitOn)
>
> rn '\r' '\n' = True
> rn _ _ = False
>
> required_function = fmap concat . splitOn ["\n"] . groupBy rn
>
> (though that might be an abuse of groupBy)
>
Sadly, it turns out that not only is this an abuse of groupBy, but it
has (I think) a subtle bug as a result.
I was inspired by this to try some other groupBy stuff, and it didn't
work. After scratching my head a bit, I tried the following...

Prelude> import Data.List
Prelude Data.List> groupBy (<) [1,2,3,2,1,2,3,2,1]
[[1,2,3,2],[1,2,3,2],[1]]

That wasn't exactly the result I was expecting :-(
Explanation (best guess) - the function passed to groupBy, according to
the docs, is meant to test whether two values are 'equal'. I'm guessing
the assumption is that the function will effectively treat values as
belonging to equivalence classes. That implies some rules such as...

reflexivity  : (a == a)
symmetry     : (a == b) => (b == a)
transitivity : (a == b) && (b == c) => (a == c)

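To make those rules concrete, here is a minimal sketch that checks each one by brute force over a finite sample list. The names isReflexive, isSymmetric and isTransitive are made up for this illustration and are not library functions, and passing on a sample is of course not a proof for the whole type.

```haskell
-- Hypothetical helpers: each checks one equivalence law for the
-- relation r, but only over the elements of the given sample list.

-- Every element must be related to itself.
isReflexive :: (a -> a -> Bool) -> [a] -> Bool
isReflexive r xs = and [ r a a | a <- xs ]

-- Whenever r a b holds, r b a must hold too.
isSymmetric :: (a -> a -> Bool) -> [a] -> Bool
isSymmetric r xs = and [ r b a | a <- xs, b <- xs, r a b ]

-- Whenever r a b and r b c hold, r a c must hold too.
isTransitive :: (a -> a -> Bool) -> [a] -> Bool
isTransitive r xs = and [ r a c | a <- xs, b <- xs, c <- xs, r a b, r b c ]
```

On the sample [1,2,3], (==) passes all three checks, while (<) passes only isTransitive: it is neither reflexive nor symmetric.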
The second and third rules are probably to blame here. For a true
equivalence they imply that comparing against any member of a group gives
the same answer, so groupBy doesn't need to compare adjacent items. When it
starts a new group, it seems to always use the first item in that new group
until it finds a mismatch.
In my test, that means it's always comparing with 1 - the second 2 is
included in each group because although (3 < 2) is False, groupBy isn't
testing that - it's testing (1 < 2).
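The behaviour described above is what you get from a groupBy built on span, which is how I understand the one in base to be written. The following is a sketch under that assumption; groupByFirst and sample are names invented here so the re-implementation can sit next to the library function.

```haskell
import Data.List (groupBy)

-- Hypothetical re-implementation in the style of Data.List.groupBy:
-- every candidate is compared against the FIRST element of the current
-- group (x), never against its direct predecessor.
groupByFirst :: (a -> a -> Bool) -> [a] -> [[a]]
groupByFirst _  []     = []
groupByFirst eq (x:xs) = (x : ys) : groupByFirst eq zs
  where
    -- ys: longest prefix whose elements all satisfy (eq x)
    -- zs: everything from the first mismatch onwards
    (ys, zs) = span (eq x) xs

-- The list from the GHCi session.
sample :: [Int]
sample = [1,2,3,2,1,2,3,2,1]
```

Both groupBy (<) sample and groupByFirst (<) sample give [[1,2,3,2],[1,2,3,2],[1]]: each group only ends at the next 1, because that is the first element for which (1 <) is False.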
In the context of this \r\n test function, this behaviour will, I guess,
result in "\r\n\n" being combined into one group. The second \n will
therefore not be seen as a valid splitting point.
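Here is a small sketch to check that guess and to show one way around it without groupBy. The rn function is the one from the quoted message; splitLF and its helper breakLF are hypothetical names for a hand-written splitter, not part of the split package.

```haskell
import Data.List (groupBy)

-- The pairing test from the quoted message.
rn :: Char -> Char -> Bool
rn '\r' '\n' = True
rn _    _    = False

-- Hypothetical direct splitter: break on a '\n' unless it is the second
-- half of a "\r\n" pair, which is copied through untouched.
splitLF :: String -> [String]
splitLF s = case breakLF s of
  (chunk, Nothing)   -> [chunk]
  (chunk, Just rest) -> chunk : splitLF rest
  where
    -- Returns the text before the next bare '\n', plus the remainder
    -- after it (Nothing when the input ends first).
    breakLF :: String -> (String, Maybe String)
    breakLF []             = ([], Nothing)
    breakLF ('\r':'\n':cs) = let (a, r) = breakLF cs in ('\r' : '\n' : a, r)
    breakLF ('\n':cs)      = ([], Just cs)
    breakLF (c:cs)         = let (a, r) = breakLF cs in (c : a, r)
```

groupBy rn "a\r\n\nb" gives ["a","\r\n\n","b"], so there is no standalone "\n" element left to split on. splitLF "a\r\n\nb" gives ["a\r\n","b"], and splitLF "string1\nstring2\r\nstring3\nstring4" gives ["string1","string2\r\nstring3","string4"] as originally requested.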
Personally, I think this is a tad disappointing. Given that groupBy
cannot check or enforce that its test respects equivalence classes, it
should ideally give results that make as much sense as possible either
way. That said, even if the test was always given adjacent elements,
there's still room for a different order of processing the list
(left-to-right or right-to-left) to give different results - and in any
case, maybe it's more efficient the way it is.
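For comparison, this is roughly what a version that always tests adjacent elements could look like. It is only a sketch: groupByAdjacent is a made-up name, not a function in Data.List, and because it folds from the right and inspects the groups built so far, it is not suitable for infinite lists.

```haskell
-- Hypothetical variant of groupBy: each element is tested against its
-- immediate successor (its neighbour), not the first element of a group.
groupByAdjacent :: (a -> a -> Bool) -> [a] -> [[a]]
groupByAdjacent rel = foldr step []
  where
    -- x joins the group to its right when it is related to that
    -- group's first element, which is x's direct neighbour.
    step x ((y : ys) : rest)
      | rel x y = (x : y : ys) : rest
    -- Otherwise (no group yet, or the test failed) x starts a new group.
    step x groups = [x] : groups
```

With this, groupByAdjacent (<) [1,2,3,2,1,2,3,2,1] gives the ascending runs [[1,2,3],[2],[1,2,3],[2],[1]], and using the rn test from the quoted message, groupByAdjacent rn "a\r\n\nb" gives ["a","\r\n","\n","b"], leaving the second \n as its own group.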