Proposal for Data.List.splitBy

Mon Feb 9 16:14:16 EST 2009

Thanks for the clarification.  I would not have guessed wordsBy
and linesBy as your preference over splitBy for additions to
Data.List.

I do not have an opinion about whether or not functional composition
(pipelining) is more elegant than Brent's Splitter data type.  I just
know that I am very comfortable with constructs such as (f . g . h)
or longer.

By the way, out or curiosity, how would you generalize your current
function replace?

Cheers,
- Marcus

Christian Maeder wrote:
> Marcus D. Gabriel wrote:
>   
>> Christian Maeder wrote:
>>     
> [...]
>   
>>> splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>
>>>       
>> to split on a list such as "\r\n", then you can use isPrefixOf
>> whereupon the performance is good enough (actually, its not bad
>> at all).
>>     
>
> The special case for "\r\n" is actually trivial, because "\r" can simply
> be filtered out first.
>
>   
>>> But a general splitting on sublists could be implemented via
>>>
>>>   splitBy isNothing (replace (map Just sl) Nothing (map Just l))
>>>
>>> (For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
>>>       
>> Nice.  It took me a moment, but nice.
>>     
>
> Good, see below my definition of subListSplit and unintercalate.
>
>   
>>> For a simple predicate "(a -> Bool)" it remains to discuss the output of
>>> splitBy.
>>>
>>> You've proposed "[([a], [a])]", but for the simpler case
>>> "[(a, [a])]" or "[([a], a)]" may do, 
>>>       
>> Actually, although enticing, I do not believe that [(a, [a])] is
>> possible due to the corner cases when there is no beginning
>> non-delimiter or ending delimiter, that is, one needs
>> [(Maybe a, [a])].  (Please check me on this.)
>>     
>
> You're right here, there are several ways to accommodate all delimiters
> and non-delimiters. I've done it as below.
>
>   
>>> but in order to capture the full
>>> information of lists before and after delimiters, something like
>>> "([a], [(a, [a])])" is needed.
>>>
>>> Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]",
>>> "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a
>>> _tail_ of non-empty element lists.
>>>       
>> I unfortunately do not follow you here, sorry.  Be that as it may,
>> I have come to appreciate the output [[a]].
>>     
>
> In "([a], [(a, [a])])" the first component takes the (possibly empty)
> part before the first delimiter. The second part of type [(a, [a])] are
> the remaining delimiter with (possibly empty or longer) non-delimiter
> pairs. Such pairs are then viewed as non-empty lists. After changing the
> second component to [[a]] the resulting pair ([a], [[a]]) is then also
> changed to a non-empty list of lists [[a]].
>
>   
>>>   wordsBy p = filter (not . null) . dropDelims . splitBy p
>>>   linesBy p = dropDelims . dropFinalDelim . splitBy p
>>>       
>
> I'ld like to improve linesBy as follows:
>
>   linesBy p = dropFinalNil . dropDelims . splitBy p
>
> (dropFinalNil is simpler than dropFinalDelim and dropDelims can assume a
> non-empty list from splitBy.)
>
>   
>> So, in summary, your idea would be to introduce two functions into
>> Data.List:
>>
>>     
>>> splitBy :: (a -> Bool) -> [a] -> [[a]]
>>> replace :: Eq a => [a] -> a -> [a] -> [a]
>>>       
>> Is this correct?
>>     
>
> Personally, I'ld be content with wordsBy only, but adding linesBy,
> splitBy and the combination of "dropDelims . splitBy p" (under some
> suitable names) would make sense for me with or without replace. (In
> fact, replace should be generalized further.)
>
>   
>>  If so, how would you define
>>
>>     
>>> splitOnAList :: Eq a => [a] -> [a] -> [[a]]
>>>       
>> using splitBy and replace.  For example,
>>
>>     
>>> splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]
>>>       
>
> Again, for "\r\n" this is simply:
>
>   concatMap (: ["\r\n"]) . linesBy (== '\n') . filter (/= '\r')
>
> For the general case take my unintercalate below:
>
>   splitOnAList sl = intercalate [sl] . map (: []) . unintercalate sl
>
> This code only puts back identical delimiters that nobody needs because
> the delimiter is fixed and known via the input argument.
>
> For sublist matching, keeping delimiters is unnecessary in general!
>
>   
>> reasonably handle the task.  Actually, can even Data.List.Split
>> reasonably handle the task?  (I only just recalled this common
>> little problem that is almost trivial but never really so.)
>>     
>
> Data.List.Split can handle all these tasks (and more), too, only less
> elegant (I think).
>
>   
>>> import Data.Char (isSpace)
>>>
>>> splitBy :: (a -> Bool) -> [a] -> [[a]]
>>> splitBy p l = let (fr, rt) = break p l in case rt of
>>>   [] -> [fr]
>>>   d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
>>>
>>> dropFinalDelim :: [[a]] -> [[a]]
>>>       
>
> forget dropFinalDelim one and take:
>
>   dropFinalNil :: [[a]] -> [[a]]
>   dropFinalNil ll@(_ : _) =
>     if null (last ll) then init ll else ll
>
>   
>>> dropDelims :: [[a]] -> [[a]]
>>>       
>
> dropDelims does no longer need to work for empty inputs:
>
>   
>>> dropDelims ll = case ll of
>>>   [] -> []
>>>   l : ls -> l : map tail ls
>>>       
>
>   dropDelims (l : ls) = l : map tail ls
>
> but a total variant makes also sense:
>
>   dropDelims ll = let (ft, rt) = splitAt 1 ll in
>     ft ++ map (drop 1) rt
>
>   
>>> wordsBy :: (a -> Bool) -> [a] -> [[a]]
>>> wordsBy p = filter (not . null) . dropDelims . splitBy p
>>>
>>> linesBy :: (a -> Bool) -> [a] -> [[a]]
>>>       
>
> change linesBy to:
>
>   linesBy p = dropFinalNil . dropDelims . splitBy p
>
>   
>>> replace :: Eq a => [a] -> a -> [a] -> [a]
>>> replace sl@(_ : _) r l = case l of
>>>   [] -> l
>>>   x : xs -> case stripPrefix sl l of
>>>     Nothing -> x : replace sl r xs
>>>     Just rt -> r : replace sl r rt
>>>
>>> subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]]
>>> subListSplit sl@(_ : _) l =
>>>   splitBy isNothing (replace (map Just sl) Nothing (map Just l))
>>>
>>> unintercalate :: Eq a => [a] -> [a] -> [[a]]
>>> unintercalate sl@(_ : _) =
>>>   map (map fromJust) . dropDelims . subListSplit sl
>>>       
>
> unintercalate can also be simplified to:
>
>    unintercalate sl@(_ : _) = map catMaybes . subListSplit sl
>
> Cheers Christian
>
>
>   

-- 
  Marcus D. Gabriel, Ph.D.                         Saint Louis, FRANCE
  http://www.marcus.gabriel.name            mailto:marcus at gabriel.name
  Tel: +33.3.89.69.05.06                   Portable: +33.6.34.56.07.75