Proposal for Data.List.splitBy

Sun Feb 8 16:18:42 EST 2009

Christian Maeder wrote:
> Marcus D. Gabriel wrote:
>   
>> If I were to write
>>
>> organizeBy :: ([a] -> Bool) -> [a] -> [([a], [a])]
>>     
>
> I quite like your idea,

Thanks.

> but I think the input predicate "([a] -> Bool)" is too ambitious, although it would nicely unify Brent's
>
> data Delimiter a where
>   DelimEltPred :: (a -> Bool) -> Delimiter a
>   DelimSublist :: Eq a => [a] -> Delimiter a
>
> With the predicate ([a] -> Bool) you have to check all inits of your
> input to detect a delimiter and only if all inits are no delimiter you
> know the head element is not part of a delimiter and repeat checking the
> inits of the tail. I think this is too inefficient in general.
>
>   

The name "organizeBy" was a metaphor.  I was trying to distance my
language from words such as split, delimiter, and separator to
something more general.  As for the use of inits, yes, this is a
performance hit.  If you use something like

> splitOnAList :: Eq a => [a] -> [a] -> <To Be Decided>

to split on a list such as "\r\n", then you can use isPrefixOf
whereupon the performance is good enough (actually, its not bad
at all).

(<http://www.haskell.org/pipermail/libraries/2009-January/011061.html>)

In retrospect, the idea of splitting, delimiters, separators, and
fields is what this is all about anyway.

> In order to keep things simple I would vote for a split function that
> only takes the simple predicate (a -> Bool) and leaves the more
> complicated splitting to _additional_ functions
> (following http://haskell.org/haskellwiki/Simple_to_complex)
>
> For instance
>
>   replace :: [a] -> a -> [a] -> [a]
>
> could replace a sublist (first argument) with a single element (second
> argument)
>
> This would help in simple cases, but leaves it to the user to choose a
> suitable delimiter element.
>
> But a general splitting on sublists could be implemented via
>
>   splitBy isNothing (replace (map Just sl) Nothing (map Just l))
>
> (For a fixed sublist sl, "Nothing" is enough to represent the delimiter)
>   

Nice.  It took me a moment, but nice.

> For a simple predicate "(a -> Bool)" it remains to discuss the output of
> splitBy.
>
> You've proposed "[([a], [a])]", but for the simpler case
> "[(a, [a])]" or "[([a], a)]" may do, 

Actually, although enticing, I do not believe that [(a, [a])] is
possible due to the corner cases when there is no beginning
non-delimiter or ending delimiter, that is, one needs
[(Maybe a, [a])].  (Please check me on this.)

> but in order to capture the full
> information of lists before and after delimiters, something like
> "([a], [(a, [a])])" is needed.
>
> Since a tuple "(a, [a])" can be viewed as non-empty list of type "[a]",
> "([a], [(a, [a])])" collapses to a non-empty list of type "[[a]]" with a
> _tail_ of non-empty element lists.

I unfortunately do not follow you here, sorry.  Be that as it may,
I have come to appreciate the output [[a]].

>  Therefore I propose the following
> splitBy as a "work horse".
>
>   splitBy :: (a -> Bool) -> [a] -> [[a]]
>   

This works for me.

> The implementation is simply using "break". The basic property is:
>
>   concat (splitBy p l) == l
>   

I would consider this a requirement, so this works for me.

> (This is almost in the spirit of Data.List, since groupBy or words also
> produces non-empty element lists.)
>
> Getting rid of a final delimiter or all of them can be implemented in
> O(n) (see below).
>
> And finally:
>
>   wordsBy p = filter (not . null) . dropDelims . splitBy p
>   linesBy p = dropDelims . dropFinalDelim . splitBy p
>
> with:
>
>   words s == wordsBy isSpace s
>   lines s == linesBy (== '\n') s
>
> Surely, one might prefer a more type-safe version wrt non-empty lists,
> but the main idea is to use function composition rather than composing a
> (non-haskell98) splitter input data type as Brent does in his split package.
>   

I am aligned with your point of view above in the sense of
additions to Data.List.

The more that I think about and play with your definition of the
function replace, the more I like it.  It captures an idiom that
I have used on the UNIX command line when hacking away on text, that is,
replace sequences with an unambiguous marker for later use.

So, in summary, your idea would be to introduce two functions into
Data.List:

> splitBy :: (a -> Bool) -> [a] -> [[a]]
> replace :: Eq a => [a] -> a -> [a] -> [a]

Is this correct?  If so, how would you define

> splitOnAList :: Eq a => [a] -> [a] -> [[a]]

using splitBy and replace.  For example,

> splitOnAList "\r\n" "abc\r\nxyz\r\n" == ["abc","\r\n","xyz","\r\n"]

Given this, maybe we should look at all of the many ways that
software makes CSV (TSV) files and see if splitBy and replace can
reasonably handle the task.  Actually, can even Data.List.Split
reasonably handle the task?  (I only just recalled this common
little problem that is almost trivial but never really so.)

Cheers,
- Marcus

> Cheers Christian
>
> import Data.Char (isSpace)
>
> splitBy :: (a -> Bool) -> [a] -> [[a]]
> splitBy p l = let (fr, rt) = break p l in case rt of
>   [] -> [fr]
>   d : tl -> let hd : tll = splitBy p tl in fr : (d : hd) : tll
>
> dropFinalDelim :: [[a]] -> [[a]]
> dropFinalDelim ll@(l : ls) =
>   if null ls then
>       if null l then [] else ll
>   else
>       if null (tail (last ls)) then init ll else ll
>
> dropDelims :: [[a]] -> [[a]]
> dropDelims ll = case ll of
>   [] -> []
>   l : ls -> l : map tail ls
>
> wordsBy :: (a -> Bool) -> [a] -> [[a]]
> wordsBy p = filter (not . null) . dropDelims . splitBy p
>
> linesBy :: (a -> Bool) -> [a] -> [[a]]
> linesBy p = dropDelims . dropFinalDelim . splitBy p
>
> prop_wordsBy :: String -> Bool
> prop_wordsBy s = words s == wordsBy isSpace s
>
> prop_linesBy :: String -> Bool
> prop_linesBy s = lines s == linesBy (== '\n') s
>
> replace :: Eq a => [a] -> a -> [a] -> [a]
> replace sl@(_ : _) r l = case l of
>   [] -> l
>   x : xs -> case stripPrefix sl l of
>     Nothing -> x : replace sl r xs
>     Just rt -> r : replace sl r rt
>
> subListSplit :: Eq a => [a] -> [a] -> [[Maybe a]]
> subListSplit sl@(_ : _) l =
>   splitBy isNothing (replace (map Just sl) Nothing (map Just l))
>
> unintercalate :: Eq a => [a] -> [a] -> [[a]]
> unintercalate sl@(_ : _) =
>   map (map fromJust) . dropDelims . subListSplit sl
>   
-- 
  Marcus D. Gabriel, Ph.D.                         Saint Louis, FRANCE
  http://www.marcus.gabriel.name            mailto:marcus at gabriel.name