[Haskell-cafe] parsing currency amounts with parsec

Tue May 10 06:20:19 CEST 2011

I'll check out Attoparsec, thanks! My first attempt may work for this
particular task, but I'm warming up for a more intense parsing project and
it sounds like Attoparsec with Bytestrings may work best.

Also, just in case anyone reads this thread later and is looking for a quick
Parsec solution, I discovered that the code I posted initially was a bit
greedy in a bad way if the dollar amount was at the end of the line. I got
rid of the original currency, floatOrSep, and beforeSep functions and
replaced them with the code below (still verbose, but hopefully a better
starting point for now).

------------------------------------------------------------------------

double = do i <- integer
            return (fromIntegral i :: Double)

currency = try float <|> largeAmount

largeAmount = do first <- double
                 rest  <- many afterSep
                 let parts = first : rest
                 let result = combine orderedParts where
                     combine = sumWithFactor 1
                     orderedParts = reverse parts
                 return result

afterSep = do char ','
              try float <|> double

------------------------------------------------------------------------

On Mon, May 9, 2011 at 8:15 PM, wren ng thornton <wren at freegeek.org> wrote:

> On 5/9/11 10:04 PM, Antoine Latter wrote:
>
>> On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen<ericrasmussen at gmail.com>
>>  wrote:
>>
>>> Hi everyone,
>>>
>>> I am relatively new to Haskell and Parsec, and I couldn't find any
>>> articles
>>> on parsing numbers in the following format:
>>>
>>> Positive: $115.33
>>> Negative: ($1,323.42)
>>>
>>> I'm working on the parser for practical purposes (to convert a 3rd-party
>>> generated, most unhelpful format into one I can use), and I'd really
>>> appreciate any insight into a better way to do this, or if there are any
>>> built-in functions/established libraries that would be better suited to
>>> the
>>> task. My code below works, but doesn't seem terribly efficient.
>>>
>>
>> Why do you think it inefficient? Is it slow?
>>
>> I don't have any substantial suggestions, but from a style perspective:
>>
>> * I would question the use of IEEE binary-floating-point number types
>> for currency. Haskell ships with a fixed-point decimal library, but I
>> don't know how fast it is:
>>
>> http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed.html
>>
>
> There are also a few other options (that I can't seem to find links to at
> the moment), and some hints on how to do it yourself:
>
>
>
> http://augustss.blogspot.com/2007/04/overloading-haskell-numbers-part-3.html
>
>
> I'm not sure if your choice of parsing library is fixed or not, but you
> could probably speed things up significantly by using Attoparsec. In
> particular, Attoparsec's combinators for takeWhile, takeWhile1,scan,...
> return bytestrings, and you can then fold over the bytestring quite nicely.
> If your source is ASCII or anything ASCII compatible (Latin-1, Latin-9,
> UTF-8,...) then take a look at Data.Attoparsec.Char8.decimal[1][2].
>
> Unless you need to verify that commas occur exactly every third digit, I'd
> suggest (a) dropping commas while scanning the string, or (b) implicitly
> dropping commas while folding over the string. If you do need to verify
> this, then your best option is (a), since you can maintain a state machine
> about how many digits seen since the last comma. Once you're using the
> Attoparsec strategy of folding over the raw byte buffers, then the only room
> for improvement is going to be making the code as straight-line as possible.
>
> Some untested example code:
>
>    import qualified Data.Attoparsec       as A
>    import qualified Data.Attoparsec.Char8 as A8
>    import qualified Data.ByteString       as B
>
>    rawCurrency :: A.Parser (ByteString,ByteString)
>    rawCurrency = do
>        dollars <- A.scan 0 step
>        _       <- A.char '.'               -- Assuming it's required...
>        cents   <- A.takeWhile1 isDigit_w8  -- Assuming no commas...
>        return (dollars,cents)
>        where
>        step :: Int -> Word8 -> Maybe Int
>        step 3 0x2C             = Just 0
>        step s c | isDigit_w8 c = Just $! s+1
>        step _ _                = Nothing
>
>    -- Note: the order of comparisons is part of why it's fast.
>    -- | A fast digit predicate.
>    isDigit_w8 :: Word8 -> Bool
>    isDigit_w8 w = (w <= 0x39 && w >= 0x30)
>    {-# INLINE isDigit_w8 #-}
>
>    -- With the dots filled in by whatever representation you use.
>    currency :: A.Parser ...
>    currency = do
>        (dollars,cents) <- rawCurrency
>        let step a w = a * 10 + fromIntegral (w - 0x30)
>            d = B.foldl' step 0 (B.filter (/= 0x2C) dollars)
>            c = fromIntegral (B.foldl' step 0 cents)
>                / (10 ^ length cents)
>        return (... d ... c ...)
>
>    amount :: A.Parser ...
>    amount = pos <|> neg
>        where
>        pos = A8.char '$' *> currency
>        neg = do
>            _ <- A8.string "($"
>            a <- currency
>            _ <- A8.char ')'
>            return (negate a)
>
>
>
>
> [1] And if you're using Attoparsec itself, you may want to take a look at
> Data.Attoparsec.Zepto as well.
>
> [2] If you're basing code on Attoparsec, you may want to look at some of my
> pending patches which improve performance on this (already extremely fast)
> code:
>
>    https://bitbucket.org/winterkoninkje/attoparsec/changesets
>
> --
> Live well,
> ~wren
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110509/6791861c/attachment.htm>