<div dir="ltr">Hi Cafe!<div><br></div><div>Firstly, sorry for the spam if you have already seen this on Stack Overflow; I do feel this is too much for a single SO question.</div><div><br></div><div><div>I've recently been playing with Alex + Happy to try parsing Java (for now I'm only working on the lexer / Alex, so whenever you see "parse" below, read it as "tokenize"). My reference is the Java SE 15 spec, and the chapter of interest is <a href="https://docs.oracle.com/javase/specs/jls/se15/html/jls-3.html">Chapter 3. Lexical Structure</a> (side note: my timing is a bit interesting, as I just realized while writing this that the Java SE 16 spec came out just a few days ago, so I might switch to that). Now that I've gotten my hands dirty a bit, I have a few questions and hope someone can shed some light on them:<br><ol><li>For now I'm using the "monad-bytestring" wrapper for performance, but I now think a String-based wrapper may be more appropriate, since it would let me follow steps 1 and 2 of "3.2. Lexical Translations" properly before passing the input to Alex. Namely, I can pre-process the input stream to (1) translate Unicode escapes, turning the raw byte stream into a stream of Chars, and (2) normalize line terminators into just <font face="monospace">\n</font>. But:</li><ol><li>Are those two passes (Unicode escape translation and line terminator normalization) possible within the Alex framework?</li><li>Is there any way I can stick with a memory-compact string representation? (I'm not sure why Alex doesn't provide a Text-based wrapper, given that its documentation mentions it internally works on a UTF-8 encoded byte sequence.) I could consider not using any wrapper, as GHC and Agda do, but that's somewhat undocumented territory, so I'm hesitant to do so.</li></ol><li>The other trouble I have with "3.2. Lexical Translations" is the special rule applied to <font face="monospace">">"</font> characters: "... There is one exception: if lexical translation occurs in a type context (§4.11) ..." 
- but how in the world am I going to do this? Lexical analysis isn't even finished at that point, so how can I tell whether I'm in a type context (and §4.11 is so long that I won't even try to read it unless forced)? Maybe I can just tokenize every ">" as an individual operator, as if ">>", ">>=", ">>>", and ">>>=" didn't exist, and worry about that later, but that doesn't sound right to me.</li><li>I realize there's a need for "irrecoverable failure": I have a test case with the octal literal <font face="monospace">"012389"</font>, which is supposed to fail, but Alex happily tokenizes it into <font face="monospace">[octal "0123", decimal "89"]</font>. For now my workaround is, for every numeric literal, to check whether the previous char is a digit and fail if so, but this feels like duct-taping a flawed approach, and at some point it will fail on some edge case. An ideal fix, IMO, would be some notion of irrecoverable failure: failing to parse a literal in full should be irrecoverable, rather than parsing most of it and moving on. In addition, the Java spec requires that a numeric literal that doesn't fit in its intended type be a compilation error, which could also be encoded as an irrecoverable failure. 
I'm not sure how to do that in Alex, though. I can see a few ways:</li><ol><li>Encode irrecoverable failure by switching to a special startcode that provides no way back to startcode 0: an irrecoverable failure sets that startcode, and every action begins by checking whether the current startcode is the special "failure" startcode, failing accordingly if it is.</li><li>Similar to the startcode approach, but record the failure using a wrapper that supports userstate.</li><li>Maybe this is another case where not using a wrapper would give me more control, but I don't have a concrete idea for this alternative.</li></ol></ol><div>Any thoughts on this are appreciated.</div></div></div><div><br></div><div>Thanks!</div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Javran (Fang) Cheng<br></div></div></div></div>