Proposal for Data.List.splitBy

Mon Feb 9 11:43:30 EST 2009

Marcus D. Gabriel wrote:
> Christian Maeder wrote:
[...]
>> splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>
> 
> to split on a list such as "\r\n", then you can use isPrefixOf
> whereupon the performance is good enough (actually, its not bad
> at all).

The special case for "\r\n" is actually trivial, because "\r" can simply
be filtered out first.

>> But a general splitting on sublists could be implemented via
>>
>>   splitBy isNothing (replace (map Just sl) Nothing (map Just l))
>>
>> (For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
> 
> Nice.  It took me a moment, but nice.

Good, see below my definition of subListSplit and unintercalate.

>> For a simple predicate "(a -> Bool)" it remains to discuss the output of
>> splitBy.
>>
>> You've proposed "[([a], [a])]", but for the simpler case
>> "[(a, [a])]" or "[([a], a)]" may do, 
> 
> Actually, although enticing, I do not believe that [(a, [a])] is
> possible due to the corner cases when there is no beginning
> non-delimiter or ending delimiter, that is, one needs
> [(Maybe a, [a])].  (Please check me on this.)

You're right here, there are several ways to accommodate all delimiters
and non-delimiters. I've done it as below.

>> but in order to capture the full
>> information of lists before and after delimiters, something like
>> "([a], [(a, [a])])" is needed.
>>
>> Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]",
>> "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a
>> _tail_ of non-empty element lists.
> 
> I unfortunately do not follow you here, sorry.  Be that as it may,
> I have come to appreciate the output [[a]].

In "([a], [(a, [a])])" the first component takes the (possibly empty)
part before the first delimiter. The second part of type [(a, [a])] are
the remaining delimiter with (possibly empty or longer) non-delimiter
pairs. Such pairs are then viewed as non-empty lists. After changing the
second component to [[a]] the resulting pair ([a], [[a]]) is then also
changed to a non-empty list of lists [[a]].

>>   wordsBy p = filter (not . null) . dropDelims . splitBy p
>>   linesBy p = dropDelims . dropFinalDelim . splitBy p

I'ld like to improve linesBy as follows:

  linesBy p = dropFinalNil . dropDelims . splitBy p

(dropFinalNil is simpler than dropFinalDelim and dropDelims can assume a
non-empty list from splitBy.)

> So, in summary, your idea would be to introduce two functions into
> Data.List:
> 
>> splitBy :: (a -> Bool) -> [a] -> [[a]]
>> replace :: Eq a => [a] -> a -> [a] -> [a]
> 
> Is this correct?

Personally, I'ld be content with wordsBy only, but adding linesBy,
splitBy and the combination of "dropDelims . splitBy p" (under some
suitable names) would make sense for me with or without replace. (In
fact, replace should be generalized further.)

>  If so, how would you define
> 
>> splitOnAList :: Eq a => [a] -> [a] -> [[a]]
> 
> using splitBy and replace.  For example,
> 
>> splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]

Again, for "\r\n" this is simply:

  concatMap (: ["\r\n"]) . linesBy (== '\n') . filter (/= '\r')

For the general case take my unintercalate below:

  splitOnAList sl = intercalate [sl] . map (: []) . unintercalate sl

This code only puts back identical delimiters that nobody needs because
the delimiter is fixed and known via the input argument.

For sublist matching, keeping delimiters is unnecessary in general!

> reasonably handle the task.  Actually, can even Data.List.Split
> reasonably handle the task?  (I only just recalled this common
> little problem that is almost trivial but never really so.)

Data.List.Split can handle all these tasks (and more), too, only less
elegant (I think).

>> import Data.Char (isSpace)
>>
>> splitBy :: (a -> Bool) -> [a] -> [[a]]
>> splitBy p l = let (fr, rt) = break p l in case rt of
>>   [] -> [fr]
>>   d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
>>
>> dropFinalDelim :: [[a]] -> [[a]]

forget dropFinalDelim one and take:

  dropFinalNil :: [[a]] -> [[a]]
  dropFinalNil ll@(_ : _) =
    if null (last ll) then init ll else ll

>> dropDelims :: [[a]] -> [[a]]

dropDelims does no longer need to work for empty inputs:

>> dropDelims ll = case ll of
>>   [] -> []
>>   l : ls -> l : map tail ls

  dropDelims (l : ls) = l : map tail ls

but a total variant makes also sense:

  dropDelims ll = let (ft, rt) = splitAt 1 ll in
    ft ++ map (drop 1) rt

>> wordsBy :: (a -> Bool) -> [a] -> [[a]]
>> wordsBy p = filter (not . null) . dropDelims . splitBy p
>>
>> linesBy :: (a -> Bool) -> [a] -> [[a]]

change linesBy to:

  linesBy p = dropFinalNil . dropDelims . splitBy p

>> replace :: Eq a => [a] -> a -> [a] -> [a]
>> replace sl@(_ : _) r l = case l of
>>   [] -> l
>>   x : xs -> case stripPrefix sl l of
>>     Nothing -> x : replace sl r xs
>>     Just rt -> r : replace sl r rt
>>
>> subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]]
>> subListSplit sl@(_ : _) l =
>>   splitBy isNothing (replace (map Just sl) Nothing (map Just l))
>>
>> unintercalate :: Eq a => [a] -> [a] -> [[a]]
>> unintercalate sl@(_ : _) =
>>   map (map fromJust) . dropDelims . subListSplit sl

unintercalate can also be simplified to:

   unintercalate sl@(_ : _) = map catMaybes . subListSplit sl

Cheers Christian