[Haskell-cafe] parsing currency amounts with parsec
Eric Rasmussen
ericrasmussen at gmail.com
Tue May 10 06:20:19 CEST 2011
I'll check out Attoparsec, thanks! My first attempt may work for this
particular task, but I'm warming up for a more intense parsing project and
it sounds like Attoparsec with Bytestrings may work best.
Also, just in case anyone reads this thread later and is looking for a quick
Parsec solution, I discovered that the code I posted initially was a bit
greedy in a bad way if the dollar amount was at the end of the line. I got
rid of the original currency, floatOrSep, and beforeSep functions and
replaced them with the code below (still verbose, but hopefully a better
starting point for now).
------------------------------------------------------------------------
double = do i <- integer
return (fromIntegral i :: Double)
currency = try float <|> largeAmount
largeAmount = do first <- double
rest <- many afterSep
let parts = first : rest
let result = combine orderedParts where
combine = sumWithFactor 1
orderedParts = reverse parts
return result
afterSep = do char ','
try float <|> double
------------------------------------------------------------------------
On Mon, May 9, 2011 at 8:15 PM, wren ng thornton <wren at freegeek.org> wrote:
> On 5/9/11 10:04 PM, Antoine Latter wrote:
>
>> On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen<ericrasmussen at gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I am relatively new to Haskell and Parsec, and I couldn't find any
>>> articles
>>> on parsing numbers in the following format:
>>>
>>> Positive: $115.33
>>> Negative: ($1,323.42)
>>>
>>> I'm working on the parser for practical purposes (to convert a 3rd-party
>>> generated, most unhelpful format into one I can use), and I'd really
>>> appreciate any insight into a better way to do this, or if there are any
>>> built-in functions/established libraries that would be better suited to
>>> the
>>> task. My code below works, but doesn't seem terribly efficient.
>>>
>>
>> Why do you think it inefficient? Is it slow?
>>
>> I don't have any substantial suggestions, but from a style perspective:
>>
>> * I would question the use of IEEE binary-floating-point number types
>> for currency. Haskell ships with a fixed-point decimal library, but I
>> don't know how fast it is:
>>
>> http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed.html
>>
>
> There are also a few other options (that I can't seem to find links to at
> the moment), and some hints on how to do it yourself:
>
>
>
> http://augustss.blogspot.com/2007/04/overloading-haskell-numbers-part-3.html
>
>
> I'm not sure if your choice of parsing library is fixed or not, but you
> could probably speed things up significantly by using Attoparsec. In
> particular, Attoparsec's combinators for takeWhile, takeWhile1,scan,...
> return bytestrings, and you can then fold over the bytestring quite nicely.
> If your source is ASCII or anything ASCII compatible (Latin-1, Latin-9,
> UTF-8,...) then take a look at Data.Attoparsec.Char8.decimal[1][2].
>
> Unless you need to verify that commas occur exactly every third digit, I'd
> suggest (a) dropping commas while scanning the string, or (b) implicitly
> dropping commas while folding over the string. If you do need to verify
> this, then your best option is (a), since you can maintain a state machine
> about how many digits seen since the last comma. Once you're using the
> Attoparsec strategy of folding over the raw byte buffers, then the only room
> for improvement is going to be making the code as straight-line as possible.
>
> Some untested example code:
>
> import qualified Data.Attoparsec as A
> import qualified Data.Attoparsec.Char8 as A8
> import qualified Data.ByteString as B
>
> rawCurrency :: A.Parser (ByteString,ByteString)
> rawCurrency = do
> dollars <- A.scan 0 step
> _ <- A.char '.' -- Assuming it's required...
> cents <- A.takeWhile1 isDigit_w8 -- Assuming no commas...
> return (dollars,cents)
> where
> step :: Int -> Word8 -> Maybe Int
> step 3 0x2C = Just 0
> step s c | isDigit_w8 c = Just $! s+1
> step _ _ = Nothing
>
> -- Note: the order of comparisons is part of why it's fast.
> -- | A fast digit predicate.
> isDigit_w8 :: Word8 -> Bool
> isDigit_w8 w = (w <= 0x39 && w >= 0x30)
> {-# INLINE isDigit_w8 #-}
>
> -- With the dots filled in by whatever representation you use.
> currency :: A.Parser ...
> currency = do
> (dollars,cents) <- rawCurrency
> let step a w = a * 10 + fromIntegral (w - 0x30)
> d = B.foldl' step 0 (B.filter (/= 0x2C) dollars)
> c = fromIntegral (B.foldl' step 0 cents)
> / (10 ^ length cents)
> return (... d ... c ...)
>
> amount :: A.Parser ...
> amount = pos <|> neg
> where
> pos = A8.char '$' *> currency
> neg = do
> _ <- A8.string "($"
> a <- currency
> _ <- A8.char ')'
> return (negate a)
>
>
>
>
> [1] And if you're using Attoparsec itself, you may want to take a look at
> Data.Attoparsec.Zepto as well.
>
> [2] If you're basing code on Attoparsec, you may want to look at some of my
> pending patches which improve performance on this (already extremely fast)
> code:
>
> https://bitbucket.org/winterkoninkje/attoparsec/changesets
>
> --
> Live well,
> ~wren
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110509/6791861c/attachment.htm>
More information about the Haskell-Cafe
mailing list