[Haskell-cafe] Some Alex beginner questions regarding parsing Java

Javran Cheng javran.c at gmail.com
Thu Mar 18 01:00:00 UTC 2021

Few days after I wrote the OP, I realized I had some misconceptions and now
I can answer few of my own questions (I still don't get the second one

On Mon, Mar 15, 2021 at 3:42 PM Javran Cheng <javran.c at gmail.com> wrote:

> Hi Cafe!
> Firstly, sorry for spamming if you already see this on Stack Overflow, I
> do feel this is too much for a single SO question.
> I'm recently playing with Alex + Happy to try parsing Java (for now I'm
> only working on the lexer / Alex, so whenever you see "parse" below, that
> means tokenize). My reference is Java SE 15 Spec, the chapter of interest
> is Chapter 3. Lexical Structure
> <https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html> (side
> note: my timing is a bit interesting as I just realized as of writing this,
> Java SE 16 spec is out just few days ago, so I might switch to that) and
> now that I've get my hand dirty a bit, I have few questions and hope if
> someone can shed some light on them:
>    1. For now I'm using "monad-bytestring" wrapper for performance, but
>    now I think maybe String-based wrapper is more appropriate, for it allows
>    me to follow 1. and 2. in "3.2. Lexical Translations" properly before
>    passing the input to Alex - namely I can pre-process input stream to (1) do
>    Unicode escaping to turn the raw byte flow into a flow of Chars (2) I can
>    normalize line terminators into just \n. But:
>       1. Are those two passes (Unicode escape and line terminator
>       normalization) possible within Alex framework?
>       2. Is there any way that I can stick with a memory-compact string
>       representation? (not sure why Alex doesn't provide any Text-based wrapper,
>       as it did mention in its doc that it internally works on UTF-8 encoded byte
>       sequence) I could probably consider to not use any wrapper, as GHC and Agda
>       did, but that's sort of an undocumented territory so I'm hesitated to do so.
> I'm actually less worried about String-based parsing now - I'm not really
keeping a chunk of String and moving it around like data, I'm consuming it,
and this is fine as long as backtracking is minimized.
And for whatever reason I totally missed "5.2 Basic interface" in Alex's
user manual, which answers my "how to use Alex without wrapper" question -
I can just run alex command and copy what's necessary from generated source
code - now I know I just need the definition of AlexInput, alexGetByte,
alexInputPrevChar and whatever transitively required.

>    1. The other trouble I have regarding "3.2. Lexical Translations" is
>    the special rules applied to ">"s: "... There is one exception: if
>    lexical translation occurs in a type context (§4.11) ..." - but how in the
>    world am I going to do this? I mean the lexical analysis is not even done
>    how am I going to tell whether it's a type context (and §4.11 is quite long
>    that I won't even try to read it, unless forced)? Maybe I can just tokenize
>    every ">" as an individual operatior, as if ">>", ">>=", ">>>", and ">>>="
>    don't exist and worry about that later, but that doesn't sound right to me.
> This one I still don't get - any help is appreciated.

>    1. I realize there's a need for "irrecoverable failure": I have a test
>    case with octal literal "012389", which is supposed to fail, but Alex
>    happily tokenized that into [octal "0123", decimal "89"] - for now my
>    workaround is for every number literal to check whether previous char is a
>    digit and fail if it is indeed so, but I feel this is like ducktaping on a
>    flawed approach and at some point it will fail on some edge cases. an ideal
>    fix IMO would be to have some notion of irrecoverable failure - failing to
>    parse a literal at full should be irrecoverable rather than trying to parse
>    most of it and move on. In addition, as Java spec requires that if a
>    numeric literal doesn't fit in the intended type, it's a compilation error
>    - which can also be encoded as an irrecoverable failure as well. I'm not
>    sure how to do that in Alex though, I can see few ways:
>       1. encode irrecoverable failure by setting to a special startcode,
>       which does not provide anyway to go back to startcode 0 - so an
>       irrecoverable failure sets that special startcode, and at the start of
>       every action, it checks whether startcode is the special "failure"
>       startcode and fail accordingly
>       2. this is similar to startcode, but use a wrapper that supports
>       userstate.
>       3. maybe this is another case that not using a wrapper would give
>       me more control, but I don't have a concrete idea regarding this
>       alternative.
> This is another thing that I totally misunderstood: I wrote the Alex rule
for recognizing octals as "0(0-7)+", of course only "0123" bit of "012389"
matches! Instead I should probably just accept a boarder pattern and deal
with them in Alex actions (This is similar to just do regex "(\d+\.){3}\d+"
and check whether numbers are in 0~255 range for parsing and validating
IPv4 address), this way allow me to throw informative errors. And Alex is
actually capable of doing "irrecoverable failures", as it's just a newtype
around a state function that returns "Either String _". Although I'd like
the error type to be finer than String, which now I'm happy to just roll my
own without alex wrappers.

> Any thoughts on this is appreciated.
> Thanks!
> Javran (Fang) Cheng

Javran (Fang) Cheng
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20210317/3da357db/attachment.html>

More information about the Haskell-Cafe mailing list