[Haskell-beginners] Defining custom parser using Parsec

Mon Oct 18 02:56:33 EDT 2010

On Sun, Oct 17, 2010 at 22:59, Jimmy Wylie <jwylie at uno.edu> wrote:
>  Hi everyone,
>
> I'm working on a digital forensics application that will take a file with
> lines of the following format:
>
> "MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime"
>
> This string represents the metadata associated with a particular file in the
> filesystem.
>
> I created a data type to represent the information that I will need to
> perform my analysis:
>
> data Event = Event {
>     fn     :: B.ByteString,
>     mftNum :: B.ByteString,
>     ft     :: B.ByteString,
>     fs     :: Integer,
>     time   :: Integer,
>     at     :: AccessType,
>     mt     :: AccessType,
>     ct     :: AccessType,
>     crt    :: AccessType
>     } deriving (Show)
>
> data AccessType = ATime | MTime | CTime | CrTime
>                  deriving (Show)
>
> I would like to create a function that takes the ByteString representing the
> file and returns a list of Events:
> createEvents :: ByteString -> [Event]
> (For now I'm creating a list, but depending on the type of analysis I decide
> to do, I may change this data structure)
>
> I understand that I can use the Parsec library to do this.  I read RWH, and
> noticed they have the endBy and sepBy combinators, but my issue with these
> is that using these functions performs too many transformations on the data.
> endBy will return a list of strings, which will then be used by sepBy, which
> will then return a [[ByteString]] that I will then have to iterate through
> to create the final [Event].
>
> What I would like to do is define a custom parser, that will go from the
> ByteString to the [Event] without the overhead of those intermediate steps.
> This function needs to be as fast as possible, as these files can be rather
> large, and I will be performing many different tests and analyses on the
> data.  I don't want the parsing to be a bottleneck.

This sounds an awful lot like premature optimisation, which, as we all
know, is the root of all evil :-)

Why do you think that using Parsec will result in fewer
transformations?  (It will most likely result in fewer transformations
*that you see*, but that doesn't mean much.)
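For comparison, here is a minimal sketch of the plain split-based
approach, using only Data.ByteString.Char8.  The Entry record and the
names parseEntry and parseEntries are made up for illustration and
cover only three of the eleven fields; they are not the Event type
from the question.  With strict ByteStrings, the pieces returned by
B.lines and B.split are slices that share the original buffer, so the
intermediate lists hold no copies of the data.  Whether this is fast
enough is something only profiling on real input can tell.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Maybe (mapMaybe)

-- Hypothetical record with a subset of the fields (illustration only).
data Entry = Entry
  { entryName  :: B.ByteString
  , entryInode :: B.ByteString
  , entrySize  :: Integer
  } deriving (Show, Eq)

-- Parse one line of the form
-- "MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime".
-- Returns Nothing unless there are exactly eleven fields and the size
-- field is a complete integer.
parseEntry :: B.ByteString -> Maybe Entry
parseEntry line =
  case B.split '|' line of
    [_md5, name, inode, _mode, _uid, _gid, size, _at, _mt, _ct, _crt] -> do
      (n, rest) <- B.readInteger size
      if B.null rest then Just (Entry name inode n) else Nothing
    _ -> Nothing

-- Parse a whole file, silently dropping malformed lines.
parseEntries :: B.ByteString -> [Entry]
parseEntries = mapMaybe parseEntry . B.lines
```

Note that this sketch assumes file names never contain a '|'
character; a name that does would yield more than eleven fields and
the line would be dropped.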

> I'm under the impression that the Parsec library will allow me to define a
> custom parser to do this, but I'm having problems understanding the library,
> and the documentation for it.
>
> A gentle shove in the right direction would be greatly appreciated.

AFAIK Parsec deals with String, not ByteString, so have a look at the
attoparsec library[1] instead.
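As a rough sketch of what a parser that goes straight from the
ByteString to a list of records might look like with attoparsec:
this assumes a recent attoparsec release that provides the
Data.Attoparsec.ByteString.Char8 module (the module layout of the
0.8 series differs), and the Entry record and the names field, entry,
entries and createEntries are hypothetical.  It also assumes that
every line, including the last, ends with a newline.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Attoparsec.ByteString.Char8
  (Parser, char, decimal, endOfInput, endOfLine, many', parseOnly, takeTill)

-- Hypothetical record with a subset of the fields (illustration only).
data Entry = Entry
  { entryName  :: B.ByteString
  , entryInode :: B.ByteString
  , entrySize  :: Integer
  } deriving (Show, Eq)

-- One raw field: everything up to the next separator or line end.
field :: Parser B.ByteString
field = takeTill (\c -> c == '|' || c == '\n' || c == '\r')

-- One line: eleven '|'-separated fields, of which three are kept.
entry :: Parser Entry
entry = do
  _md5  <- field   <* char '|'
  name  <- field   <* char '|'
  inode <- field   <* char '|'
  _mode <- field   <* char '|'
  _uid  <- field   <* char '|'
  _gid  <- field   <* char '|'
  size  <- decimal <* char '|'
  _at   <- field   <* char '|'
  _mt   <- field   <* char '|'
  _ct   <- field   <* char '|'
  _crt  <- field
  return (Entry name inode size)

-- Zero or more newline-terminated lines, then end of input.
entries :: Parser [Entry]
entries = many' (entry <* endOfLine) <* endOfInput

-- Unlike the [Event] signature in the question, this reports failure.
createEntries :: B.ByteString -> Either String [Entry]
createEntries = parseOnly entries
```

Here a malformed line makes the whole parse fail with a Left rather
than being skipped, which may or may not be what the analysis needs.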

There are numerous explanations of how to use parser combinators out
there.  Personally, I've found the Parsec documentation fairly easy to
understand.  A while ago I wrote a few posts on it myself, and I think
they should translate well to attoparsec (you will probably have to
keep the Haddock docs at hand though):

http://therning.org/magnus/archives/289
http://therning.org/magnus/archives/290
http://therning.org/magnus/archives/295
http://therning.org/magnus/archives/296

/M

[1]: http://hackage.haskell.org/package/attoparsec-0.8.1.1

-- 
Magnus Therning                        (OpenPGP: 0xAB4DFBA4)
magnus at therning.org          Jabber: magnus at therning.org
http://therning.org/magnus         identi.ca|twitter: magthe