[Haskell-cafe] How to split this string.

Mon Jan 2 11:36:26 CET 2012

On 02/01/2012 09:44, max wrote:
> I want to write a function whose behavior is as follows:
>
> foo "string1\nstring2\r\nstring3\nstring4" = ["string1",
> "string2\r\nstring3", "string4"]
>
> Note the sequence "\r\n", which is ignored. How can I do this?
Doing it probably the hard way (and getting it wrong) looks like the 
following...

--  Function to accept (normally) a single character. Special-cases
--  \r\n. Refuses to accept \n. Result is either an empty list, or
--  an (accepted, remaining) pair.
parseTok :: String -> [(String, String)]

parseTok "" = []
parseTok (c1:c2:cs) | ((c1 == '\r') && (c2 == '\n')) = [(c1:c2:[], cs)]
parseTok (c:cs)     | (c /= '\n')                    = [(c:[], cs)]
                     | True                           = []

--  Accept a sequence of those (mostly single) characters
parseItem :: String -> [(String, String)]

parseItem "" = [("","")]
parseItem cs = [(j1s ++ j2s, k2s)
                  | (j1s,k1s) <- parseTok  cs
                  , (j2s,k2s) <- parseItem k1s
                ]

--  Accept a whole list of strings
parseAll :: String -> [([String], String)]

parseAll [] = [([],"")]
parseAll cs = [(j1s:j2s,k2s)
                 | (j1s,k1s) <- parseItem cs
                 , (j2s,k2s) <- parseAll  k1s
               ]

--  Get the first valid result, which should have consumed the
--  whole string but this isn't checked. No check for existence either.
parse :: String -> [String]
parse cs = fst (head (parseAll cs))

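I got it wrong in that this never consumes the \n between items, so
it'll all go horribly wrong: parseItem can only finish at the end of
the input, so any string containing a bare \n gives no parses and head
fails. There's a good chance there's a typo or two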
as well. The basic idea should be clear, though - maybe I should fix it 
but I've got some other things to do at the moment. Think of the \n as a 
separator, or as a prefix to every "item" but the first. Alternatively, 
treat it as a prefix to *every* item, and artificially add an initial 
one to the string in the top-level parse function. Then use tail etc. to
remove that from the first item.
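
For what it's worth, here is a rough sketch of how the "\n as a
separator" idea might be fitted into the same list-of-successes style.
It is an untested-in-anger illustration rather than a polished fix: the
helper parseRest is a name I've made up for this sketch, and I've
assumed that parseItem should simply take the longest run of tokens it
can and stop (without failing) when it meets a bare \n.

```haskell
-- Accept one token: the pair "\r\n", or any single character
-- other than '\n'. An empty result list means "no token here".
parseTok :: String -> [(String, String)]
parseTok ('\r':'\n':cs)     = [("\r\n", cs)]
parseTok (c:cs) | c /= '\n' = [([c], cs)]
parseTok _                  = []

-- Accept the longest (possibly empty) run of tokens. Unlike the
-- version above, this succeeds with an empty item when no token
-- can be read, so it stops cleanly in front of a bare '\n'.
parseItem :: String -> [(String, String)]
parseItem cs = case parseTok cs of
  []   -> [("", cs)]
  toks -> [ (j1s ++ j2s, k2s)
          | (j1s, k1s) <- toks
          , (j2s, k2s) <- parseItem k1s
          ]

-- Accept an item followed by whatever parseRest accepts.
parseAll :: String -> [([String], String)]
parseAll cs = [ (j1s : j2s, k2s)
              | (j1s, k1s) <- parseItem cs
              , (j2s, k2s) <- parseRest k1s
              ]

-- (Hypothetical helper.) After an item: a '\n' separator followed by
-- more items, or nothing more to read.
parseRest :: String -> [([String], String)]
parseRest ('\n':cs) = parseAll cs
parseRest cs        = [([], cs)]

-- Take the first result that consumed the whole input; return an
-- empty list if there is none, rather than crashing in head.
parse :: String -> [String]
parse cs = case [ items | (items, "") <- parseAll cs ] of
  (items:_) -> items
  []        -> []
```

With these definitions, parse "string1\nstring2\r\nstring3\nstring4"
should give ["string1", "string2\r\nstring3", "string4"]. A lone \r
that is not followed by \n is treated as an ordinary character, which
is a choice I've made here and not something the original question
specifies.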

See http://channel9.msdn.com/Tags/haskell - there's a series of 13
videos by Dr. Erik Meijer. The eighth in the series covers this basic
technique. It calls these parsers monadic and uses do notation, which
confused me slightly at first: it's the *list* type which is monadic in
this case, and (as you can see) I prefer to use list comprehensions
rather than do notation.
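
To illustrate the point about the list type being the monad, here is a
small sketch showing the same two-step parser written once as a list
comprehension and once in do notation. The names Parser, item, twoLC
and twoDo are made up for this example and are not taken from the
videos, which may well wrap things up differently.

```haskell
-- A parser returns every way of splitting the input into a result
-- and the remaining text; an empty list means failure.
type Parser a = String -> [(a, String)]

-- Accept any single character.
item :: Parser Char
item (c:cs) = [(c, cs)]
item []     = []

-- Two characters in sequence, as a list comprehension.
twoLC :: Parser (Char, Char)
twoLC cs = [ ((a, b), rest2)
           | (a, rest1) <- item cs
           , (b, rest2) <- item rest1
           ]

-- The same thing in do notation, using the list monad directly.
twoDo :: Parser (Char, Char)
twoDo cs = do
  (a, rest1) <- item cs
  (b, rest2) <- item rest1
  return ((a, b), rest2)
```

Both versions thread the remaining input by hand; each generator in
the comprehension corresponds to one line of the do block.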

There may be a simpler way, though - there's still a fair bit of Haskell 
and its ecosystem I need to figure out. There's a lexer generator called
Alex, for instance, but I've not used it.
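
One candidate for a simpler way, sketched here with plain recursion and
pattern matching instead of parser-style code. The names splitOnLF and
breakLF are invented for this sketch, and it assumes the same reading
of the question: split on a bare \n, but keep any \r\n pair inside the
current item.

```haskell
-- Split on bare '\n', leaving "\r\n" pairs untouched inside items.
splitOnLF :: String -> [String]
splitOnLF s = case breakLF s of
  (item, Nothing)   -> [item]
  (item, Just rest) -> item : splitOnLF rest

-- Read up to the next bare '\n'. Returns the text before it and,
-- if a separator was found, the text after it.
breakLF :: String -> (String, Maybe String)
breakLF ""             = ("", Nothing)
breakLF ('\r':'\n':cs) = let (item, rest) = breakLF cs
                         in ('\r' : '\n' : item, rest)
breakLF ('\n':cs)      = ("", Just cs)
breakLF (c:cs)         = let (item, rest) = breakLF cs
                         in (c : item, rest)
```

The order of the breakLF clauses matters: the "\r\n" case has to come
before the bare '\n' case so that the pair is kept together.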