[Haskell-cafe] Some Alex beginner questions regarding parsing Java
Javran Cheng
javran.c at gmail.com
Mon Mar 15 22:42:58 UTC 2021
Hi Cafe!
Firstly, sorry for spamming if you already see this on Stack Overflow, I do
feel this is too much for a single SO question.
I'm recently playing with Alex + Happy to try parsing Java (for now I'm
only working on the lexer / Alex, so whenever you see "parse" below, that
means tokenize). My reference is Java SE 15 Spec, the chapter of interest
is Chapter 3. Lexical Structure
<https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html> (side note:
my timing is a bit interesting as I just realized as of writing this, Java
SE 16 spec is out just few days ago, so I might switch to that) and now
that I've get my hand dirty a bit, I have few questions and hope if someone
can shed some light on them:
1. For now I'm using "monad-bytestring" wrapper for performance, but now
I think maybe String-based wrapper is more appropriate, for it allows me to
follow 1. and 2. in "3.2. Lexical Translations" properly before passing the
input to Alex - namely I can pre-process input stream to (1) do Unicode
escaping to turn the raw byte flow into a flow of Chars (2) I can normalize
line terminators into just \n. But:
1. Are those two passes (Unicode escape and line terminator
normalization) possible within Alex framework?
2. Is there any way that I can stick with a memory-compact string
representation? (not sure why Alex doesn't provide any
Text-based wrapper,
as it did mention in its doc that it internally works on UTF-8
encoded byte
sequence) I could probably consider to not use any wrapper, as
GHC and Agda
did, but that's sort of an undocumented territory so I'm
hesitated to do so.
2. The other trouble I have regarding "3.2. Lexical Translations" is the
special rules applied to ">"s: "... There is one exception: if lexical
translation occurs in a type context (§4.11) ..." - but how in the world am
I going to do this? I mean the lexical analysis is not even done how am I
going to tell whether it's a type context (and §4.11 is quite long that I
won't even try to read it, unless forced)? Maybe I can just tokenize every
">" as an individual operatior, as if ">>", ">>=", ">>>", and ">>>=" don't
exist and worry about that later, but that doesn't sound right to me.
3. I realize there's a need for "irrecoverable failure": I have a test
case with octal literal "012389", which is supposed to fail, but Alex
happily tokenized that into [octal "0123", decimal "89"] - for now my
workaround is for every number literal to check whether previous char is a
digit and fail if it is indeed so, but I feel this is like ducktaping on a
flawed approach and at some point it will fail on some edge cases. an ideal
fix IMO would be to have some notion of irrecoverable failure - failing to
parse a literal at full should be irrecoverable rather than trying to parse
most of it and move on. In addition, as Java spec requires that if a
numeric literal doesn't fit in the intended type, it's a compilation error
- which can also be encoded as an irrecoverable failure as well. I'm not
sure how to do that in Alex though, I can see few ways:
1. encode irrecoverable failure by setting to a special startcode,
which does not provide anyway to go back to startcode 0 - so an
irrecoverable failure sets that special startcode, and at the start of
every action, it checks whether startcode is the special "failure"
startcode and fail accordingly
2. this is similar to startcode, but use a wrapper that supports
userstate.
3. maybe this is another case that not using a wrapper would give me
more control, but I don't have a concrete idea regarding this alternative.
Any thoughts on this is appreciated.
Thanks!
Javran (Fang) Cheng
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20210315/2bf0d5d5/attachment.html>
More information about the Haskell-Cafe
mailing list