[Haskell-cafe] Some Alex beginner questions regarding parsing Java

Thu Mar 18 01:11:50 UTC 2021

On Wed, Mar 17, 2021 at 9:02 PM Javran Cheng <javran.c at gmail.com> wrote:

> Few days after I wrote the OP, I realized I had some misconceptions and
> now I can answer few of my own questions (I still don't get the second one
> though):
>
> On Mon, Mar 15, 2021 at 3:42 PM Javran Cheng <javran.c at gmail.com> wrote:
>
>> Hi Cafe!
>>
>> Firstly, sorry for spamming if you already see this on Stack Overflow, I
>> do feel this is too much for a single SO question.
>>
>> I'm recently playing with Alex + Happy to try parsing Java (for now I'm
>> only working on the lexer / Alex, so whenever you see "parse" below, that
>> means tokenize). My reference is Java SE 15 Spec, the chapter of interest
>> is Chapter 3. Lexical Structure
>> <https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html> (side
>> note: my timing is a bit interesting as I just realized as of writing this,
>> Java SE 16 spec is out just few days ago, so I might switch to that) and
>> now that I've get my hand dirty a bit, I have few questions and hope if
>> someone can shed some light on them:
>>
>>    1. For now I'm using "monad-bytestring" wrapper for performance, but
>>    now I think maybe String-based wrapper is more appropriate, for it allows
>>    me to follow 1. and 2. in "3.2. Lexical Translations" properly before
>>    passing the input to Alex - namely I can pre-process input stream to (1) do
>>    Unicode escaping to turn the raw byte flow into a flow of Chars (2) I can
>>    normalize line terminators into just \n. But:
>>       1. Are those two passes (Unicode escape and line terminator
>>       normalization) possible within Alex framework?
>>       2. Is there any way that I can stick with a memory-compact string
>>       representation? (not sure why Alex doesn't provide any Text-based wrapper,
>>       as it did mention in its doc that it internally works on UTF-8 encoded byte
>>       sequence) I could probably consider to not use any wrapper, as GHC and Agda
>>       did, but that's sort of an undocumented territory so I'm hesitated to do so.
>>
>> I'm actually less worried about String-based parsing now - I'm not really
> keeping a chunk of String and moving it around like data, I'm consuming it,
> and this is fine as long as backtracking is minimized.
> And for whatever reason I totally missed "5.2 Basic interface" in Alex's
> user manual, which answers my "how to use Alex without wrapper" question -
> I can just run alex command and copy what's necessary from generated source
> code - now I know I just need the definition of AlexInput, alexGetByte,
> alexInputPrevChar and whatever transitively required.
>
>>
>>    1. The other trouble I have regarding "3.2. Lexical Translations" is
>>    the special rules applied to ">"s: "... There is one exception: if
>>    lexical translation occurs in a type context (§4.11) ..." - but how in the
>>    world am I going to do this? I mean the lexical analysis is not even done
>>    how am I going to tell whether it's a type context (and §4.11 is quite long
>>    that I won't even try to read it, unless forced)? Maybe I can just tokenize
>>    every ">" as an individual operatior, as if ">>", ">>=", ">>>", and ">>>="
>>    don't exist and worry about that later, but that doesn't sound right to me.
>>
>> This one I still don't get - any help is appreciated.
>
>>
>>    1. I realize there's a need for "irrecoverable failure": I have a
>>    test case with octal literal "012389", which is supposed to fail, but
>>    Alex happily tokenized that into [octal "0123", decimal "89"] - for
>>    now my workaround is for every number literal to check whether previous
>>    char is a digit and fail if it is indeed so, but I feel this is like
>>    ducktaping on a flawed approach and at some point it will fail on some edge
>>    cases. an ideal fix IMO would be to have some notion of irrecoverable
>>    failure - failing to parse a literal at full should be irrecoverable rather
>>    than trying to parse most of it and move on. In addition, as Java spec
>>    requires that if a numeric literal doesn't fit in the intended type, it's a
>>    compilation error - which can also be encoded as an irrecoverable failure
>>    as well. I'm not sure how to do that in Alex though, I can see few ways:
>>       1. encode irrecoverable failure by setting to a special startcode,
>>       which does not provide anyway to go back to startcode 0 - so an
>>       irrecoverable failure sets that special startcode, and at the start of
>>       every action, it checks whether startcode is the special "failure"
>>       startcode and fail accordingly
>>       2. this is similar to startcode, but use a wrapper that supports
>>       userstate.
>>       3. maybe this is another case that not using a wrapper would give
>>       me more control, but I don't have a concrete idea regarding this
>>       alternative.
>>
>> This is another thing that I totally misunderstood: I wrote the Alex rule
> for recognizing octals as "0(0-7)+", of course only "0123" bit of "012389"
> matches! Instead I should probably just accept a boarder pattern and deal
> with them in Alex actions (This is similar to just do regex "(\d+\.){3}\d+"
> and check whether numbers are in 0~255 range for parsing and validating
> IPv4 address), this way allow me to throw informative errors. And Alex is
> actually capable of doing "irrecoverable failures", as it's just a newtype
> around a state function that returns "Either String _". Although I'd like
> the error type to be finer than String, which now I'm happy to just roll my
> own without alex wrappers.
>
>
This sortah rule is akin to something agda does, roughly , which is treat
stuff as a single token unless there’s a white space separator.

For that matter, agdas parser may be a good example of a pretty fancy use
of happy and Alex. Though it itself was originally based on the happy/Alex
parsers in ghc.

Any thoughts on this is appreciated.
>>
>> Thanks!
>> Javran (Fang) Cheng
>>
>
> Cheers,
> --
> Javran (Fang) Cheng
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20210317/d7616776/attachment.html>